[jira] [Updated] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching

Saket Saurabh (JIRA) Thu, 21 Jul 2016 16:21:46 -0700

     [ 
https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Saket Saurabh updated HIVE-14233:
---------------------------------
    Attachment: HIVE-14233.02.patch

Second version of the patch with optimized code path to remove deleted rows 
from a given vectorized row batch. This is done by loading all the delete 
events into memory at once and using an optimized binary search algorithm.
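
As a rough illustration of the lookup described above (not the code in the attached patch), the sketch below keeps delete events in sorted parallel arrays and answers "is this row deleted?" with a binary search. The class name DeleteEventIndex, the method isDeleted, and the simplified (transaction id, row id) key are assumptions made for the example.

```java
import java.util.Arrays;

/** Hypothetical sketch; these names do not come from the HIVE-14233 patch. */
final class DeleteEventIndex {
    // Parallel arrays sorted by (txnId, rowId). The two-part key is a
    // simplification assumed for this example.
    private final long[] txnIds;
    private final long[] rowIds;

    /** Each element of events is {transactionId, rowId}; order does not matter. */
    DeleteEventIndex(long[][] events) {
        long[][] sorted = events.clone();
        Arrays.sort(sorted, (a, b) -> {
            int c = Long.compare(a[0], b[0]);
            return c != 0 ? c : Long.compare(a[1], b[1]);
        });
        txnIds = new long[sorted.length];
        rowIds = new long[sorted.length];
        for (int i = 0; i < sorted.length; i++) {
            txnIds[i] = sorted[i][0];
            rowIds[i] = sorted[i][1];
        }
    }

    /** Binary search over the sorted keys; O(log n) per row. */
    boolean isDeleted(long txnId, long rowId) {
        int lo = 0;
        int hi = txnIds.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int c = Long.compare(txnIds[mid], txnId);
            if (c == 0) {
                c = Long.compare(rowIds[mid], rowId);
            }
            if (c < 0) {
                lo = mid + 1;
            } else if (c > 0) {
                hi = mid - 1;
            } else {
                return true;
            }
        }
        return false;
    }
}
```

Primitive arrays are used here instead of a collection of objects so that holding all delete events in memory stays compact; whether the patch makes the same choice is not stated in this message.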

> Improve vectorization for ACID by eliminating row-by-row stitching
> ------------------------------------------------------------------
>
>                 Key: HIVE-14233
>                 URL: https://issues.apache.org/jira/browse/HIVE-14233
>             Project: Hive
>          Issue Type: New Feature
>          Components: Transactions, Vectorization
>            Reporter: Saket Saurabh
>            Assignee: Saket Saurabh
>         Attachments: HIVE-14233.01.patch, HIVE-14233.02.patch
>
>
> This JIRA proposes to improve vectorization for ACID by eliminating 
> row-by-row stitching when reading back ACID files. In the current 
> implementation, a vectorized row batch is created by populating the batch one 
> row at a time, before the vectorized batch is passed up along the operator 
> pipeline. This row-by-row stitching was needed because the ACID 
> insert/update/delete events from various delta files had to be merged 
> together before the actual version of a given row could be determined. 
> HIVE-14035 has enabled us to break away from that limitation by splitting 
> ACID update events into a combination of delete+insert. In fact, it has now 
> enabled us to create splits on delta files.
> Building on top of HIVE-14035, this JIRA proposes to solve this earlier 
> bottleneck in the vectorized code path for ACID by now directly reading row 
> batches from the underlying ORC files and avoiding any stitching altogether. 
> Once a row batch is read from the split (which may be on a base/delta file), 
> the deleted rows will be found by cross-referencing them against a data 
> structure that will just keep track of deleted events (found in the 
> deleted_delta files). This will lead to a large performance gain when reading 
> ACID files in vectorized fashion, while enabling further optimizations to be 
> built on top of it in the future.
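
One way to picture "avoiding any stitching" is to leave the column data of a batch untouched and only shrink the list of selected row positions. The sketch below does that with a minimal stand-in batch type; Batch, dropDeleted, and the single rowIds column are hypothetical and are not Hive's actual classes or the patch's code.

```java
import java.util.function.LongPredicate;

/** Hypothetical sketch of filtering deleted rows out of a whole batch at once. */
final class BatchDeleteFilter {

    /** Minimal stand-in for a vectorized row batch (assumed for this example). */
    static final class Batch {
        final long[] rowIds;      // one row-identifier column
        final int[] selected;     // positions of live rows when selectedInUse is true
        boolean selectedInUse;
        int size;                 // number of live rows

        Batch(long[] rowIds) {
            this.rowIds = rowIds;
            this.selected = new int[rowIds.length];
            this.size = rowIds.length;
        }
    }

    /**
     * Rewrites the selected positions in place so that only rows for which
     * isDeleted returns false remain. Column data is never copied or moved.
     */
    static void dropDeleted(Batch batch, LongPredicate isDeleted) {
        int kept = 0;
        for (int i = 0; i < batch.size; i++) {
            int row = batch.selectedInUse ? batch.selected[i] : i;
            if (!isDeleted.test(batch.rowIds[row])) {
                // kept <= i, so this never overwrites an unread entry.
                batch.selected[kept++] = row;
            }
        }
        batch.selectedInUse = true;
        batch.size = kept;
    }
}
```

The isDeleted predicate stands for whatever structure tracks the delete events; downstream operators would then visit only the positions listed in selected.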



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
