[ 
https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15267179#comment-15267179
 ] 

Sergey Shelukhin commented on HIVE-9660:
----------------------------------------

{noformat}
The run length encoder doesn't perform the callback, but when its RLE block is 
finished passes the same callback to the OutStream for when the OutStream 
finishes the next compression block. Thus it is easy to guarantee that you only 
get called back when compression block finishes after the RLE finishes, which 
is the required condition. Obviously, for cases where there isn't an RLE, it 
just puts the callback directly on the OutStream and it works exactly the same 
way.
{noformat}
An RG can span several RLE blocks, and a single RLE block can contain several 
RGs. Moreover, in the case of a boolean writer, there are two levels of 
buffering: the current byte, and the RLE buffer in the underlying byte writer.
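
To make the second case concrete, here is a toy sketch of the callback handoff described in the quote above. It is not the ORC writer code: the class names, the (count, value) run format, the 4-byte "compression block" and the use of uncompressed offsets are all assumptions made for illustration.

```python
class OutStream:
    """Toy stream (hypothetical): a 'compression block' closes every block_size bytes."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.current = bytearray()
        self.offset = 0      # end offset of all finished blocks (uncompressed here)
        self.pending = []    # callbacks waiting for the current block to finish

    def write(self, data):
        for b in data:
            self.current.append(b)
            if len(self.current) == self.block_size:
                self._finish_block()

    def on_block_end(self, callback):
        self.pending.append(callback)

    def _finish_block(self):
        self.offset += len(self.current)
        self.current.clear()
        callbacks, self.pending = self.pending, []
        for cb in callbacks:
            cb(self.offset)

    def flush(self):
        if self.current or self.pending:
            self._finish_block()


class RunLengthWriter:
    """Toy byte RLE (hypothetical): buffers one run, emits (count, value) pairs."""

    def __init__(self, out):
        self.out = out
        self.value = None
        self.count = 0
        self.deferred = []   # RG callbacks held until the open run is written

    def write(self, value):
        if self.count and (value != self.value or self.count == 255):
            self._flush_run()
        self.value = value
        self.count += 1

    def on_row_group_end(self, callback):
        if self.count:
            self.deferred.append(callback)    # run still open: hold the callback
        else:
            self.out.on_block_end(callback)   # nothing buffered: hand it over now

    def _flush_run(self):
        self.out.write(bytes([self.count, self.value]))
        self.count = 0
        for cb in self.deferred:
            self.out.on_block_end(cb)         # fires at the next block end
        self.deferred = []

    def flush(self):
        if self.count:
            self._flush_run()
        self.out.flush()


ends = {}  # row group index -> reported end offset


def record(rg):
    def callback(offset):
        ends[rg] = offset
    return callback


out = OutStream(block_size=4)
rle = RunLengthWriter(out)
values = [7] * 6 + [9] * 6           # 12 rows, 3 RGs of 4 rows each
for i, v in enumerate(values):
    rle.write(v)
    if (i + 1) % 4 == 0:             # RG boundary, possibly in the middle of a run
        rle.on_row_group_end(record(i // 4))
rle.flush()
print(ends)
```

With this input every RG boundary falls inside an open run, so all three callbacks are deferred and all three RGs end up reporting the same end offset (4). The handoff itself is simple; the coordination is in deciding what that offset means for each RG.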

There's also the issue of dictionaries and strings, where isPresent is written 
normally but the entries cannot be finalized.
In general, I feel like all the coordination complexity will still be 
necessary; it would just end up moving around a bit.

> store end offset of compressed data for RG in RowIndex in ORC
> -------------------------------------------------------------
>
>                 Key: HIVE-9660
>                 URL: https://issues.apache.org/jira/browse/HIVE-9660
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HIVE-9660.01.patch, HIVE-9660.02.patch, 
> HIVE-9660.03.patch, HIVE-9660.04.patch, HIVE-9660.05.patch, 
> HIVE-9660.06.patch, HIVE-9660.07.patch, HIVE-9660.07.patch, 
> HIVE-9660.08.patch, HIVE-9660.09.patch, HIVE-9660.10.patch, 
> HIVE-9660.10.patch, HIVE-9660.11.patch, HIVE-9660.patch, HIVE-9660.patch
>
>
> Right now the end offset is estimated, which in some cases results in tons of 
> extra data being read.
> We can add a separate array to RowIndex (positions_v2?) that stores the number 
> of compressed buffers for each RG, or the end offset, or something, to remove 
> this estimation magic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
