[
https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Saket Saurabh updated HIVE-14233:
---------------------------------
Attachment: HIVE-14233.02.patch
The second version of the patch adds an optimized code path for removing
deleted rows from a given vectorized row batch. It loads all the delete
events into memory at once and looks up rows with an optimized binary search.
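
To illustrate the idea, here is a minimal sketch of filtering a batch against
an in-memory, sorted set of delete events. It is not the code from the patch.
The class name, the method name, and the use of a single long as the row key
are assumptions made for brevity (real ACID row identifiers have more than one
component). The output array of surviving indices plays a role similar to the
selected array of a VectorizedRowBatch.

```java
import java.util.Arrays;

// Hypothetical sketch, not the patch itself: drop deleted rows from a batch
// by binary-searching a sorted array of delete events held in memory.
public class DeleteFilterSketch {

    /**
     * Writes the batch indices of surviving rows into selected and returns
     * how many rows survived.
     *
     * rowKeys          key of each row in the batch (assumed: one long per row)
     * size             number of rows in the batch
     * sortedDeleteKeys all delete events, sorted ascending
     * selected         output array, at least size entries long
     */
    static int filterDeleted(long[] rowKeys, int size,
                             long[] sortedDeleteKeys, int[] selected) {
        int kept = 0;
        for (int i = 0; i < size; i++) {
            // Arrays.binarySearch returns a non-negative index only when
            // the key is present, i.e. when the row has been deleted.
            if (Arrays.binarySearch(sortedDeleteKeys, rowKeys[i]) < 0) {
                selected[kept++] = i;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Delete events are loaded once and sorted before any batch is read.
        long[] deleteKeys = {9L, 3L, 7L};
        Arrays.sort(deleteKeys);

        long[] batchKeys = {1L, 3L, 5L, 7L, 8L};
        int[] selected = new int[batchKeys.length];
        int kept = filterDeleted(batchKeys, batchKeys.length, deleteKeys, selected);

        System.out.println(kept + " rows survive: "
            + Arrays.toString(Arrays.copyOf(selected, kept)));
    }
}
```

Each lookup costs O(log d) for d delete events, so a batch of n rows is
filtered in O(n log d) without touching the column data. When the rows in a
batch are themselves sorted by key, a merge-style scan over both arrays would
be a possible alternative.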
> Improve vectorization for ACID by eliminating row-by-row stitching
> ------------------------------------------------------------------
>
> Key: HIVE-14233
> URL: https://issues.apache.org/jira/browse/HIVE-14233
> Project: Hive
> Issue Type: New Feature
> Components: Transactions, Vectorization
> Reporter: Saket Saurabh
> Assignee: Saket Saurabh
> Attachments: HIVE-14233.01.patch, HIVE-14233.02.patch
>
>
> This JIRA proposes to improve vectorization for ACID by eliminating
> row-by-row stitching when reading back ACID files. In the current
> implementation, a vectorized row batch is created by populating the batch one
> row at a time, before the vectorized batch is passed up along the operator
> pipeline. This row-by-row stitching was needed because the ACID
> insert/update/delete events from various delta files had to be merged
> together before the actual version of a given row could be determined.
> HIVE-14035 has enabled us to break away from that limitation by splitting
> ACID update events into a combination of delete+insert. In fact, it has now
> enabled us to create splits on delta files.
> Building on top of HIVE-14035, this JIRA proposes to solve this earlier
> bottleneck in the vectorized code path for ACID by now directly reading row
> batches from the underlying ORC files and avoiding any stitching altogether.
> Once a row batch is read from the split (which may be on a base/delta file),
> the deleted rows will be found by cross-referencing them against a data
> structure that tracks only the delete events (found in the delete_delta
> files). This should yield a significant performance gain when reading ACID
> files in vectorized fashion, and it enables further optimizations to be
> built on top of it in the future.