It seems wasteful and error-prone to have multiple copies of the same functionality. But that’s an argument we can have on the Hive lists; whether Hive chooses to adopt this will be up to them.

Alan.

On Wed, Sep 13, 2017 at 12:40 PM, Eugene Koifman <[email protected]> wrote:

> I disagree with number 4. I work on Acid a lot and
> VectorizedOrcAcidRowBatchReader and OrcRawRecordMerger (as well as the
> Write path) are modified quite often. Moving this logic into a separate
> project will add large burden of having to release ORC project to make
> progress on Hive features.
>
> If Hive is refactored to have a pluggable Acid reader such that Hive
> contains an implementation (exactly as it does now) then Hive ACID’s
> dependency on ORC is not increased. ORC can create its own
> implementation so that it can be used by projects using ORC directly,
> w/o Hive.
>
> Eugene
>
> On 9/13/17, 11:34 AM, "Alan Gates" <[email protected]> wrote:
>
>     When ORC moved out of Hive, it didn’t bring the ACID work along.
>     I’d like to start working to remedy that. I wanted to give an
>     outline of how I am thinking of approaching it.
>
>     In general, I plan to focus on supporting the new split update
>     (aka ACID 2.0) layout, where delta files contain either all
>     inserts or all deletes (updates are accomplished by putting in a
>     delete and an insert). This is what Hive supports in its trunk
>     (but not in Hive 1 or 2).
>
>     I also plan to follow the ORC pattern of focusing on vectorized
>     row batches first, and then building the row by row readers and
>     writers as shims on top of this.
>
>     Proposed plan:
>     1) Build a version of RecordReader that can handle ACID files.
>     This would be roughly analogous to Hive’s
>     VectorizedOrcAcidRowBatchReader.
>
>     2) I haven’t looked into the details here yet, but I assume I will
>     need some changes on the Writer side as well to handle writing out
>     base versus delta files as well as insert versus delete delta
>     files.
>
>     3) Put the shims in place to support ORC equivalents to Hive’s
>     AcidInputFormat and AcidOutputFormat.
>
>     4) Change Hive to use the code now in ORC rather than duplicating
>     this code in Hive.
>
>     Seem reasonable?
>
>     Should I do this in master or in a branch? In general I prefer to
>     work in master when possible. But I see a couple of reasons to
>     branch:
>     1) This will require changes in Hive, some that aren’t released
>     yet. For example, this will depend on moving ValidTxnList to
>     storage-api (which I plan to do anyway, but haven’t yet). It would
>     be convenient to be able to depend on SNAPSHOT versions of
>     storage-api rather than forcing a bunch of releases. But I don’t
>     want to do that in master because it can make it hard for people
>     to build and it makes releases impossible.
>
>     2) This is going to take a while and I suspect ORC will want to
>     release multiple times before it’s done. I’m not sure we want have
>     half baked features in the releases.
>
>     Alan.
