[ 
https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15267179#comment-15267179
 ] 

Sergey Shelukhin edited comment on HIVE-9660 at 5/2/16 6:22 PM:
----------------------------------------------------------------

{quote}
The run length encoder doesn't perform the callback, but when its RLE block is 
finished passes the same callback to the OutStream for when the OutStream 
finishes the next compression block. Thus it is easy to guarantee that you only 
get called back when compression block finishes after the RLE finishes, which 
is the required condition. Obviously, for cases where there isn't an RLE, it 
just puts the callback directly on the OutStream and it works exactly the same 
way.
{quote}
An RG can span several RLE blocks, so the run length encoder would need to know 
when to pass the callback (assuming the callback maps to an RG; otherwise, how 
does the WriterImpl know which RG is done after a callback?). An RLE block can 
also contain several RGs. Moreover, in the case of the boolean writer, there 
are two levels of buffering - the current byte, and the RLE buffer in the 
underlying byte writer.
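
To make the two levels of buffering concrete, here is a minimal toy sketch. It 
is not the ORC writer code: the class names, the 4-byte buffer size, and the 
flush rule are all assumptions made up for illustration, and real run length 
encoding is left out. It only shows that, at an arbitrary row count, some 
values can still sit in the current byte while others sit in the byte writer's 
buffer.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only; these are hypothetical classes, not the ORC writer classes. */
public class BufferedBooleanSketch {

    /** Second level: a byte writer that holds bytes until its buffer fills. */
    static class ToyByteWriter {
        static final int BUFFER_SIZE = 4;   // hypothetical size, tiny for illustration
        final List<Byte> pending = new ArrayList<>();
        int flushedBytes = 0;               // bytes that have reached the output stream

        void write(byte b) {
            pending.add(b);
            if (pending.size() == BUFFER_SIZE) {
                flushedBytes += pending.size();
                pending.clear();
            }
        }
    }

    /** First level: packs eight booleans into one byte before handing it down. */
    static class ToyBooleanWriter {
        final ToyByteWriter bytes = new ToyByteWriter();
        int current = 0;
        int bitsInCurrent = 0;

        void write(boolean value) {
            current = (current << 1) | (value ? 1 : 0);
            if (++bitsInCurrent == 8) {
                bytes.write((byte) current);
                current = 0;
                bitsInCurrent = 0;
            }
        }
    }

    /** Writes rowCount values and returns {bits in current byte, pending bytes, flushed bytes}. */
    static int[] stateAfter(int rowCount) {
        ToyBooleanWriter w = new ToyBooleanWriter();
        for (int i = 0; i < rowCount; i++) {
            w.write(i % 2 == 0);
        }
        return new int[] { w.bitsInCurrent, w.bytes.pending.size(), w.bytes.flushedBytes };
    }

    public static void main(String[] args) {
        int[] s = stateAfter(45);
        System.out.println("bits in current byte: " + s[0]);
        System.out.println("bytes pending in byte writer: " + s[1]);
        System.out.println("bytes flushed to stream: " + s[2]);
    }
}
```

In this toy model, a row group ending after 45 values leaves 5 bits in the 
current byte and 1 byte in the byte writer's buffer, with only 4 bytes actually 
flushed, so a callback tied to either level alone would not mark the boundary.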

There's also the issue of dictionaries and strings, where isPresent is written 
normally but the entries cannot be finalized.
In general, I feel like all the coordination complexity will still be 
necessary; it would just end up moving around a bit.

For uncompressed data, if the exact boundary had to be determined, a callback 
would need to be called after every RLE buffer, and in some cases, such as the 
boolean writer, that could be as often as every few bytes.


was (Author: sershe):
{quote}
The run length encoder doesn't perform the callback, but when its RLE block is 
finished passes the same callback to the OutStream for when the OutStream 
finishes the next compression block. Thus it is easy to guarantee that you only 
get called back when compression block finishes after the RLE finishes, which 
is the required condition. Obviously, for cases where there isn't an RLE, it 
just puts the callback directly on the OutStream and it works exactly the same 
way.
{quote}
An RG can have several RLE blocks; an RLE block can contain several RGs. 
Moreover, in the case of the boolean writer, there are two levels of 
buffering - the current byte, and the RLE buffer in the underlying byte writer.

There's also the issue of dictionaries and strings, where isPresent is written 
normally but the entries cannot be finalized.
In general, I feel like all the coordination complexity will still be 
necessary; it would just end up moving around a bit.

For uncompressed data, if the exact boundary had to be determined, a callback 
would need to be called after every RLE buffer, and in some cases, such as the 
boolean writer, that could be as often as every few bytes.

> store end offset of compressed data for RG in RowIndex in ORC
> -------------------------------------------------------------
>
>                 Key: HIVE-9660
>                 URL: https://issues.apache.org/jira/browse/HIVE-9660
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HIVE-9660.01.patch, HIVE-9660.02.patch, 
> HIVE-9660.03.patch, HIVE-9660.04.patch, HIVE-9660.05.patch, 
> HIVE-9660.06.patch, HIVE-9660.07.patch, HIVE-9660.07.patch, 
> HIVE-9660.08.patch, HIVE-9660.09.patch, HIVE-9660.10.patch, 
> HIVE-9660.10.patch, HIVE-9660.11.patch, HIVE-9660.patch, HIVE-9660.patch
>
>
> Right now the end offset is estimated, which in some cases results in tons of 
> extra data being read.
> We can add a separate array to RowIndex (positions_v2?) that stores the 
> number of compressed buffers for each RG, or the end offset, or something 
> similar, to remove this estimation magic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
