[ 
https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HIVE-9660:
--------------------------------
    Attachment: HIVE-9660.patch

This patch does the following:
* Implements a PositionedOutputStream.Callback to track when compression blocks 
and RLE runs are finished.
* Adds lengths to the OrcProto.RowIndexEntry.
* Uses the lengths when determining the number of bytes to read when doing 
predicate push down.
* Creates a callback for RowIndexEntry in the WriterImpl such that the entry 
isn't finalized until all of the streams have made their callbacks. To ensure 
that the entry isn't finalized before all of the streams are added, there is an 
activation step after the last stream has been added to the RowIndexEntry.
* When an ispresent stream is removed, its positions and lengths are removed 
from the RowIndexEntry softly, so that the remaining callbacks aren't affected.
* The code dealing with the string columns and the dictionary vs direct 
encoding has been significantly cleaned up.
* TreeWriter.writeStripe has been split, with a separate flush method that 
finalizes all of the streams.
* Lots of test case updates for the changed ORC file sizes.
* A new test case that tests the callbacks.
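
The deferred finalization described above can be sketched roughly as follows. 
This is only an illustration of the idea, not the code in the patch: 
PendingIndexEntry, StreamCallback, addStream, activate and lengthOf are 
hypothetical names, and the real PositionedOutputStream.Callback and WriterImpl 
signatures may well differ. Each stream gets a callback when it is added, the 
entry counts outstanding callbacks, and it only finalizes once it has been 
activated and every stream has reported its end offset.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a row index entry that waits on its streams. */
final class PendingIndexEntry {
    /** Invoked by a stream once the block/run holding its position is complete. */
    interface StreamCallback {
        void finished(long endOffset);
    }

    // One {startOffset, length} pair per stream; length is -1 until known.
    private final List<long[]> ranges = new ArrayList<>();
    private int outstanding = 0;       // callbacks that have not fired yet
    private boolean activated = false; // true once the last stream was added
    private boolean finalized = false;

    /** Registers a stream and returns the callback it must invoke when done. */
    StreamCallback addStream(long startOffset) {
        if (activated) {
            throw new IllegalStateException("entry already activated");
        }
        final long[] range = {startOffset, -1L};
        ranges.add(range);
        outstanding++;
        return endOffset -> {
            if (range[1] >= 0) {
                throw new IllegalStateException("callback fired twice");
            }
            range[1] = endOffset - startOffset;
            outstanding--;
            maybeFinalize();
        };
    }

    /** Signals that no more streams will be added to this entry. */
    void activate() {
        activated = true;
        maybeFinalize();
    }

    // Without the activation flag, an early callback could finalize the
    // entry while later streams are still waiting to be added.
    private void maybeFinalize() {
        if (activated && outstanding == 0) {
            finalized = true;
        }
    }

    boolean isFinalized() {
        return finalized;
    }

    long lengthOf(int streamIndex) {
        return ranges.get(streamIndex)[1];
    }
}
```

In this sketch, a stream that finishes before its siblings are registered 
leaves the outstanding count at zero for a moment, which is why finalization is 
gated on activate() as well as on the count.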

> store end offset of compressed data for RG in RowIndex in ORC
> -------------------------------------------------------------
>
>                 Key: HIVE-9660
>                 URL: https://issues.apache.org/jira/browse/HIVE-9660
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HIVE-9660.01.patch, HIVE-9660.02.patch, 
> HIVE-9660.03.patch, HIVE-9660.04.patch, HIVE-9660.05.patch, 
> HIVE-9660.06.patch, HIVE-9660.07.patch, HIVE-9660.07.patch, 
> HIVE-9660.08.patch, HIVE-9660.09.patch, HIVE-9660.10.patch, 
> HIVE-9660.10.patch, HIVE-9660.11.patch, HIVE-9660.patch, HIVE-9660.patch, 
> HIVE-9660.patch, owen-hive-9660.patch
>
>
> Right now the end offset is estimated, which in some cases results in tons of 
> extra data being read.
> We can add a separate array to RowIndex (positions_v2?) that stores number of 
> compressed buffers for each RG, or end offset, or something, to remove this 
> estimation magic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
