I have in mind the ability to push rows to an underlying DB without any transactional support.
On Mon, May 27, 2019 at 2:16 PM Paul Rogers <[email protected]> wrote:

> Hi Ted,
>
> From item 3, it sounds like you are focusing on using Drill to front a
> DB system, rather than proposing to use Drill to update files in a
> distributed file system (DFS).
>
> It turns out that, for the DFS case, the former Hortonworks put quite a
> bit of work into viable insert/update semantics in Hive with the Hive
> ACID support. [1], [2] This was a huge amount of work done in
> conjunction with various partners, and is on its third version as Hive
> learns the semantics and how to make ACID perform well under load.
> Adding ACID support to Drill would be a "non-trivial" exercise (unless
> Drill could actually borrow Hive's code, but even that might not be
> simple).
>
> Drill is far simpler than Hive because Drill has long exploited the
> fact that data is read-only. Once data can change, we must revisit
> various aspects to account for that fact. Since changes can occur
> concurrently with queries (and with other changes), some kind of
> concurrency control is needed. Hive has worked out a way to ensure that
> only completed transactions are included in a query by using delta
> files. Hive delta files can include inserts, updates and deletes.
>
> If insert is all that is needed, there may be simpler solutions: just
> track which files are newly added. If the underlying file system is
> atomic, even this can be simplified down to noticing that a file
> exists when planning a query. If a file is visible before it is
> complete, some mechanism is needed to detect in-progress files. Of
> course, Drill must already handle this case for files created outside
> of Drill, so it may "just work" for the DFS case.
>
> And, if the goal is simply to push inserts into a DB, then the DB
> itself can handle transactions and concurrency. Generally, most DBs
> manage transactions as part of a session.
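[The "file is visible atomically" idea above is commonly implemented as write-to-a-temporary-name, then rename into place. A minimal sketch, assuming an atomic rename on the target file system; the class, file names, and JSON payload are invented for illustration and are not Drill code:]

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicWrite {
    // Write data under a temporary ".inprogress" name, then atomically
    // rename it into place. A concurrent reader (e.g. a query planner
    // scanning the directory) either sees the complete file or no file
    // at all -- never a partially written one.
    public static Path writeAtomically(Path target, byte[] data) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".inprogress");
        Files.write(tmp, data);
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        return target;
    }
}
```

[Readers that ignore `*.inprogress` names then get the "detect in-progress files" behavior Paul mentions for free; ATOMIC_MOVE can throw if the temp and target live on different stores, so both should be in the same directory, as here.]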
> To ensure Drill does a consistent insert, Drill would need to push the
> update through a single client (a single minor fragment). A distributed
> insert (using multiple minor fragments, each inserting a subset of
> rows) would require two-phase commit, or would have to forgo
> consistency. (The CAP problem.) Further, Drill would have to handle
> insert failures (deadlock detection, duplicate keys, etc.) reported by
> the target DB and return those errors to the Drill client (hopefully in
> a form other than a long Java stack trace...).
>
> All this said, I suspect you have in mind a specific use case that is
> far simpler than the general case. Can you explain a bit more what you
> have in mind?
>
> Thanks,
> - Paul
>
> [1]
> https://hortonworks.com/tutorial/using-hive-acid-transactions-to-insert-update-and-delete-data/
> [2]
> https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/using-hiveql/content/hive_3_internals.html
>
> On Monday, May 27, 2019, 1:15:36 PM PDT, Ted Dunning
> <[email protected]> wrote:
>
> I would like to start a discussion about how to add insert capabilities
> to Drill.
>
> It seems that the basic outline is:
>
> 1) making sure Calcite will parse it (almost certain)
> 2) defining an upsert operator in the logical plan
> 3) pushing rules into Drill from the DB driver to allow Drill to push
> down the upsert into the DB
>
> Are these generally correct?
>
> Can anybody point me to analogous operations?
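[The two-phase commit Paul refers to for distributed inserts can be sketched with toy participants. This is a minimal illustration under invented names (`Participant`, `Shard`, `insertAll`), not Drill or DB-driver code, and it omits what makes real 2PC hard: participant crashes, coordinator recovery, and durable logging of the prepare/commit decision:]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class TwoPhaseCommit {
    // A participant votes in phase 1 (prepare) and then either commits
    // or aborts in phase 2, as directed by the coordinator.
    public interface Participant {
        boolean prepare(List<String> rows); // phase 1: vote yes/no
        void commit();                      // phase 2a: make rows durable
        void abort();                       // phase 2b: discard staged rows
    }

    // Toy shard that stages rows on prepare; "unhealthy" simulates a
    // failed insert (duplicate key, deadlock, lost connection, ...).
    public static class Shard implements Participant {
        public final List<String> committed = new ArrayList<>();
        private List<String> pending;
        private final boolean healthy;
        public Shard(boolean healthy) { this.healthy = healthy; }
        public boolean prepare(List<String> rows) {
            if (!healthy) return false;
            pending = rows;
            return true;
        }
        public void commit() { committed.addAll(pending); pending = null; }
        public void abort() { pending = null; }
    }

    // Coordinator: every participant must vote yes before any of them
    // commits; a single "no" aborts everywhere, keeping the distributed
    // insert all-or-nothing. Returns whether the insert committed.
    public static boolean insertAll(Map<Participant, List<String>> plan) {
        boolean allPrepared = plan.entrySet().stream()
                .allMatch(e -> e.getKey().prepare(e.getValue()));
        for (Participant p : plan.keySet()) {
            if (allPrepared) p.commit(); else p.abort();
        }
        return allPrepared;
    }
}
```

[The alternative Paul names, forgoing this protocol and letting each minor fragment commit independently, avoids the blocking and failure-handling cost but can leave a partially applied insert visible, which is the consistency trade-off.]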
