When ORC moved out of Hive, it didn’t bring the ACID work along. I’d like to start working to remedy that. I wanted to give an outline of how I am thinking of approaching it.
In general, I plan to focus on supporting the new split update (aka ACID 2.0) layout, where delta files contain either all inserts or all deletes (updates are accomplished by putting in a delete and an insert). This is what Hive supports in its trunk (but not in Hive 1 or 2). I also plan to follow the ORC pattern of focusing on vectorized row batches first, and then building the row-by-row readers and writers as shims on top of this.

Proposed plan:

1) Build a version of RecordReader that can handle ACID files. This would be roughly analogous to Hive’s VectorizedOrcAcidRowBatchReader.

2) I haven’t looked into the details here yet, but I assume I will need some changes on the Writer side as well to handle writing out base versus delta files as well as insert versus delete delta files.

3) Put the shims in place to support ORC equivalents to Hive’s AcidInputFormat and AcidOutputFormat.

4) Change Hive to use the code now in ORC rather than duplicating this code in Hive.

Seem reasonable?

Should I do this in master or in a branch? In general I prefer to work in master when possible. But I see a couple of reasons to branch:

1) This will require changes in Hive, some of which aren’t released yet. For example, this will depend on moving ValidTxnList to storage-api (which I plan to do anyway, but haven’t yet). It would be convenient to be able to depend on SNAPSHOT versions of storage-api rather than forcing a bunch of releases. But I don’t want to do that in master because it can make it hard for people to build and it makes releases impossible.

2) This is going to take a while and I suspect ORC will want to release multiple times before it’s done. I’m not sure we want to have half-baked features in the releases.

Alan.
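
To make the split update layout mentioned above concrete, here is a minimal sketch of how a reader could combine insert and delete deltas. It is not Hive’s or ORC’s actual code: the class names (SplitUpdateSketch, RowKey, Row) and the method visibleValues are hypothetical, the row identifier is assumed to be (originalTransaction, bucket, rowId), and filtering by transaction validity (the job of ValidTxnList) is left out.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

public class SplitUpdateSketch {

  /** Hypothetical row identifier, assumed to be (originalTransaction, bucket, rowId). */
  static final class RowKey {
    final long originalTransaction;
    final int bucket;
    final long rowId;

    RowKey(long originalTransaction, int bucket, long rowId) {
      this.originalTransaction = originalTransaction;
      this.bucket = bucket;
      this.rowId = rowId;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof RowKey)) {
        return false;
      }
      RowKey k = (RowKey) o;
      return originalTransaction == k.originalTransaction
          && bucket == k.bucket
          && rowId == k.rowId;
    }

    @Override
    public int hashCode() {
      return Objects.hash(originalTransaction, bucket, rowId);
    }
  }

  /** Hypothetical inserted row: its key plus a single payload column. */
  static final class Row {
    final RowKey key;
    final String value;

    Row(RowKey key, String value) {
      this.key = key;
      this.value = value;
    }
  }

  /**
   * Returns the payloads of all inserted rows (base plus insert deltas)
   * whose key does not appear in any delete delta.
   */
  static List<String> visibleValues(List<Row> inserts, List<RowKey> deletes) {
    Set<RowKey> deleted = new HashSet<>(deletes);
    List<String> result = new ArrayList<>();
    for (Row row : inserts) {
      if (!deleted.contains(row.key)) {
        result.add(row.value);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<Row> inserts = new ArrayList<>();
    List<RowKey> deletes = new ArrayList<>();

    // Transaction 1 inserts two rows.
    inserts.add(new Row(new RowKey(1, 0, 0), "a"));
    inserts.add(new Row(new RowKey(1, 0, 1), "b"));

    // Transaction 2 updates "b" to "B": one delete event plus one new insert.
    deletes.add(new RowKey(1, 0, 1));
    inserts.add(new Row(new RowKey(2, 0, 0), "B"));

    System.out.println(visibleValues(inserts, deletes)); // prints [a, B]
  }
}
```

The point of the sketch is that no file ever holds an in-place update: the reader only needs the set of deleted keys to decide which inserted rows are still visible.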
