Re: Bringing ACID into ORC

Alan Gates Wed, 13 Sep 2017 14:26:19 -0700

It seems wasteful and error-prone to have multiple copies of the same
functionality.  But that’s an argument we can have on the Hive lists;
whether Hive chooses to adopt this will be up to them.


Alan.

On Wed, Sep 13, 2017 at 12:40 PM, Eugene Koifman <[email protected]>
wrote:

> I disagree with number 4.  I work on Acid a lot, and
> VectorizedOrcAcidRowBatchReader and OrcRawRecordMerger (as well as the
> write path) are modified quite often.  Moving this logic into a separate
> project will add the large burden of having to release the ORC project to
> make progress on Hive features.
>
> If Hive is refactored to have a pluggable Acid reader such that Hive
> contains an implementation (exactly as it does now), then Hive ACID’s
> dependency on ORC is not increased.  ORC can create its own implementation
> so that it can be used by projects using ORC directly, without Hive.
>
> Eugene
>
>
>
> On 9/13/17, 11:34 AM, "Alan Gates" <[email protected]> wrote:
>
>     When ORC moved out of Hive, it didn’t bring the ACID work along.  I’d
>     like to start working to remedy that.  I wanted to give an outline of
>     how I am thinking of approaching it.
>
>     In general, I plan to focus on supporting the new split update (aka
>     ACID 2.0) layout, where delta files contain either all inserts or all
>     deletes (updates are accomplished by putting in a delete and an
>     insert).  This is what Hive supports in its trunk (but not in Hive 1
>     or 2).
>
>     I also plan to follow the ORC pattern of focusing on vectorized row
>     batches first, and then building the row by row readers and writers
>     as shims on top of this.
>
>     Proposed plan:
>     1) Build a version of RecordReader that can handle ACID files.  This
>     would be roughly analogous to Hive’s VectorizedOrcAcidRowBatchReader.
>
>     2) I haven’t looked into the details here yet, but I assume I will need
>     some changes on the Writer side as well to handle writing out base
>     versus delta files as well as insert versus delete delta files.
>
>     3) Put the shims in place to support ORC equivalents to Hive’s
>     AcidInputFormat and AcidOutputFormat.
>
>     4) Change Hive to use the code now in ORC rather than duplicating this
>     code in Hive.
>
>     Seem reasonable?
>
>     Should I do this in master or in a branch?  In general I prefer to
>     work in master when possible.  But I see a couple of reasons to branch:
>     1) This will require changes in Hive, some of which aren’t released
>     yet.  For example, this will depend on moving ValidTxnList to
>     storage-api (which I plan to do anyway, but haven’t yet).  It would be
>     convenient to be able to depend on SNAPSHOT versions of storage-api
>     rather than forcing a bunch of releases.  But I don’t want to do that
>     in master because it can make it hard for people to build and it makes
>     releases impossible.
>
>     2) This is going to take a while and I suspect ORC will want to release
>     multiple times before it’s done.  I’m not sure we want to have
>     half-baked features in the releases.
>
>     Alan.
>
>
>
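
For readers unfamiliar with the split update layout described in the quoted proposal, here is a minimal sketch of the idea: insert deltas carry rows, delete deltas carry only the keys of removed rows, and an update is a delete plus an insert. This is not the Hive or ORC implementation. The class names, the RowKey fields (writeId, bucket, rowId) and the visibleRows helper are all hypothetical and chosen only for illustration, and the sketch ignores transaction visibility (ValidTxnList) entirely.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

public class SplitUpdateSketch {

    /** Hypothetical row identifier; the field names are illustrative only. */
    static final class RowKey {
        final long writeId;
        final int bucket;
        final long rowId;

        RowKey(long writeId, int bucket, long rowId) {
            this.writeId = writeId;
            this.bucket = bucket;
            this.rowId = rowId;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof RowKey)) {
                return false;
            }
            RowKey k = (RowKey) o;
            return writeId == k.writeId && bucket == k.bucket && rowId == k.rowId;
        }

        @Override
        public int hashCode() {
            return Objects.hash(writeId, bucket, rowId);
        }
    }

    /** One row from a base file or an insert delta: a key plus its payload. */
    static final class InsertEvent {
        final RowKey key;
        final String value;

        InsertEvent(RowKey key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /**
     * Returns the payloads of all inserted rows whose key does not appear
     * in any delete delta. Assumption: every event passed in is already
     * visible to the reader, so no transaction filtering is done here.
     */
    static List<String> visibleRows(List<InsertEvent> inserts, List<RowKey> deletes) {
        Set<RowKey> deleted = new HashSet<>(deletes);
        List<String> out = new ArrayList<>();
        for (InsertEvent e : inserts) {
            if (!deleted.contains(e.key)) {
                out.add(e.value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<InsertEvent> inserts = new ArrayList<>();
        inserts.add(new InsertEvent(new RowKey(1, 0, 0), "alice"));
        inserts.add(new InsertEvent(new RowKey(1, 0, 1), "bob"));
        // An update of "bob" is a delete of key (1,0,1) plus a new insert.
        inserts.add(new InsertEvent(new RowKey(2, 0, 0), "bob-updated"));

        List<RowKey> deletes = new ArrayList<>();
        deletes.add(new RowKey(1, 0, 1));

        System.out.println(visibleRows(inserts, deletes)); // prints [alice, bob-updated]
    }
}
```

Because delete deltas hold only keys, a reader in this simplified model can load them into a set and filter insert events in a single pass, which is one reason the layout suits batch-oriented readers.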
