Review Request 50934: HIVE-14233 Improve vectorization for ACID by eliminating row-by-row stitching

Saket Saurabh Tue, 09 Aug 2016 13:53:14 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/50934/
-----------------------------------------------------------


Review request for hive and Eugene Koifman.


Repository: hive-git


Description
-------

https://issues.apache.org/jira/browse/HIVE-14233


Diffs
-----

  ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java 
334cb31c5406f500c122a11eccef25b92d357cd4 
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java 
e46ca51eff9c230147166e9428d7f462d2f9e772 
  
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
 PRE-CREATION 
  ql/src/test/queries/clientpositive/acid_vectorization.q 
832909bdb1bc79e01163373beed03eaaffcefd3d 
  ql/src/test/results/clientpositive/acid_vectorization.q.out 
1792979156ec361c85882ac8b6968e93d42b5f31 

Diff: https://reviews.apache.org/r/50934/diff/


Testing
-------

This JIRA proposes to improve vectorization for ACID by eliminating row-by-row 
stitching when reading back ACID files. In the current implementation, a 
vectorized row batch is created by populating the batch one row at a time, 
before the vectorized batch is passed up along the operator pipeline. This 
row-by-row stitching limitation was because of the fact that the ACID 
insert/update/delete events from various delta files needed to be merged 
together before the actual version of a given row was found out. HIVE-14035 has 
enabled us to break away from that limitation by splitting ACID update events 
into a combination of delete+insert. In fact, it has now enabled us to create 
splits on delta files.
Building on top of HIVE-14035, this JIRA proposes to solve this earlier 
bottleneck in the vectorized code path for ACID by now directly reading row 
batches from the underlying ORC files and avoiding any stitching altogether. 
Once a row batch is read from the split (which may be on a base/delta file), 
the deleted rows will be found by cross-referencing them against a data 
structure that will just keep track of deleted events (found in the 
deleted_delta files). This will lead to a large performance gain when reading 
ACID files in vectorized fashion, while enabling further optimizations in 
future that can be done on top of that.


Thanks,

Saket Saurabh

Review Request 50934: HIVE-14233 Improve vectorization for ACID by eliminating row-by-row stitching

Reply via email to