I have in mind the ability to push rows to an underlying DB without any transactional support.
On Mon, May 27, 2019 at 2:16 PM Paul Rogers <[email protected]> wrote:

> Hi Ted,
>
> From item 3, it sounds like you are focusing on using Drill to front a
> DB system, rather than proposing to use Drill to update files in a
> distributed file system (DFS).
>
> It turns out that, for the DFS case, the former Hortonworks put quite a
> bit of work into viable insert/update semantics in Hive with the Hive
> ACID support. [1], [2] This was a huge amount of work done in
> conjunction with various partners, and is on its third version as Hive
> learns the semantics and how to make ACID perform well under load.
> Adding ACID support to Drill would be a "non-trivial" exercise (unless
> Drill could actually borrow Hive's code, but even that might not be
> simple).
>
> Drill is far simpler than Hive because Drill has long exploited the
> fact that data is read-only. Once data can change, we must revisit
> various aspects to account for that fact. Since changes can occur
> concurrently with queries (and with other changes), some kind of
> concurrency control is needed. Hive has worked out a way to ensure that
> only completed transactions are included in a query by using delta
> files. Hive delta files can include inserts, updates and deletes.
>
> If insert is all that is needed, there may be simpler solutions: just
> track which files are newly added. If the underlying file system is
> atomic, even this can be simplified down to noticing that a file
> exists when planning a query. If a file is visible before it is
> complete, some mechanism is needed to detect in-progress files. Of
> course, Drill must already handle this case for files created outside
> of Drill, so it may "just work" for the DFS case.
>
> And, if the goal is simply to push inserts into a DB, then the DB
> itself can handle transactions and concurrency. Generally, most DBs
> manage transactions as part of a session.
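[The "file is visible atomically" idea above is commonly implemented as write-to-a-temporary-name, then rename into place. A minimal sketch, assuming an atomic rename on the target file system; the class, file names, and JSON payload are invented for illustration and are not Drill code:]

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicWrite {
    // Write data under a temporary ".inprogress" name, then atomically
    // rename it into place. A concurrent reader (e.g. a query planner
    // scanning the directory) either sees the complete file or no file
    // at all -- never a partially written one.
    public static Path writeAtomically(Path target, byte[] data) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".inprogress");
        Files.write(tmp, data);
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        return target;
    }
}
```

[Readers that ignore `*.inprogress` names then get the "detect in-progress files" behavior Paul mentions for free; ATOMIC_MOVE can throw if the temp and target live on different stores, so both should be in the same directory, as here.]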
> To ensure Drill does a consistent insert, Drill would need to push the
> update through a single client (a single minor fragment). A distributed
> insert (using multiple minor fragments, each inserting a subset of
> rows) would require two-phase commit, or would have to forgo
> consistency. (The CAP problem.) Further, Drill would have to handle
> insert failures (deadlock detection, duplicate keys, etc.) reported by
> the target DB and return those errors to the Drill client (hopefully in
> a form other than a long Java stack trace...).
>
> All this said, I suspect you have in mind a specific use case that is
> far simpler than the general case. Can you explain a bit more what you
> have in mind?
>
> Thanks,
> - Paul
>
> [1]
> https://hortonworks.com/tutorial/using-hive-acid-transactions-to-insert-update-and-delete-data/
> [2]
> https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/using-hiveql/content/hive_3_internals.html
>
> On Monday, May 27, 2019, 1:15:36 PM PDT, Ted Dunning
> <[email protected]> wrote:
>
> I would like to start a discussion about how to add insert capabilities
> to Drill.
>
> It seems that the basic outline is:
>
> 1) making sure Calcite will parse it (almost certain)
> 2) defining an upsert operator in the logical plan
> 3) pushing rules into Drill from the DB driver to allow Drill to push
> down the upsert into the DB
>
> Are these generally correct?
>
> Can anybody point me to analogous operations?
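[The two-phase commit Paul refers to for distributed inserts can be sketched with toy participants. This is a minimal illustration under invented names (`Participant`, `Shard`, `insertAll`), not Drill or DB-driver code, and it omits what makes real 2PC hard: participant crashes, coordinator recovery, and durable logging of the prepare/commit decision:]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class TwoPhaseCommit {
    // A participant votes in phase 1 (prepare) and then either commits
    // or aborts in phase 2, as directed by the coordinator.
    public interface Participant {
        boolean prepare(List<String> rows); // phase 1: vote yes/no
        void commit();                      // phase 2a: make rows durable
        void abort();                       // phase 2b: discard staged rows
    }

    // Toy shard that stages rows on prepare; "unhealthy" simulates a
    // failed insert (duplicate key, deadlock, lost connection, ...).
    public static class Shard implements Participant {
        public final List<String> committed = new ArrayList<>();
        private List<String> pending;
        private final boolean healthy;
        public Shard(boolean healthy) { this.healthy = healthy; }
        public boolean prepare(List<String> rows) {
            if (!healthy) return false;
            pending = rows;
            return true;
        }
        public void commit() { committed.addAll(pending); pending = null; }
        public void abort() { pending = null; }
    }

    // Coordinator: every participant must vote yes before any of them
    // commits; a single "no" aborts everywhere, keeping the distributed
    // insert all-or-nothing. Returns whether the insert committed.
    public static boolean insertAll(Map<Participant, List<String>> plan) {
        boolean allPrepared = plan.entrySet().stream()
                .allMatch(e -> e.getKey().prepare(e.getValue()));
        for (Participant p : plan.keySet()) {
            if (allPrepared) p.commit(); else p.abort();
        }
        return allPrepared;
    }
}
```

[The alternative Paul names, forgoing this protocol and letting each minor fragment commit independently, avoids the blocking and failure-handling cost but can leave a partially applied insert visible, which is the consistency trade-off.]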
