[jira] [Comment Edited] (HIVE-9660) store end offset of compressed data for RG in RowIndex in ORC

Sergey Shelukhin (JIRA) Sat, 19 Mar 2016 00:38:41 -0700

    [ https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200794#comment-15200794 ]


Sergey Shelukhin edited comment on HIVE-9660 at 3/18/16 1:45 AM:
-----------------------------------------------------------------

The fundamental problem with this patch is that logical writers (e.g. the RLE 
writer) buffer the data. For some writers, such as the bit writer, we cannot 
even force a flush at the end of the RG, which would have solved this problem 
at some small size cost (all the encoding segments would have to terminate at 
RG boundaries). Combined with the buffering inside the output streams (before 
compression), this makes it practically impossible to know when a particular 
RG is fully written (i.e. when its last rows have been added to some concrete 
compression buffer). 
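
To illustrate the buffering issue, here is a minimal sketch in Java. It uses a 
hypothetical ToyBitWriter, not ORC's actual bit writer, that packs 8 booleans 
per byte and only emits a byte once it is full. The class name, its methods, 
and the row-group size of 10 are all assumptions made for the example. 

```java
import java.io.ByteArrayOutputStream;

// Hypothetical toy writer for illustration only; this is NOT ORC's bit writer.
// It packs 8 booleans into one byte and emits the byte only when it is full.
class ToyBitWriter {
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private int current = 0;   // partially filled byte
    private int bitsUsed = 0;  // number of values buffered in `current`

    void write(boolean value) {
        current = (current << 1) | (value ? 1 : 0);
        if (++bitsUsed == 8) {
            out.write(current);
            current = 0;
            bitsUsed = 0;
        }
    }

    // Bytes that have actually reached the underlying stream.
    int emittedBytes() {
        return out.size();
    }

    // Values still held back in the partial byte.
    int bufferedBits() {
        return bitsUsed;
    }

    // Records the stream offset observed at each row-group (RG) boundary.
    // Any rows left over after the last full RG are written but not recorded.
    static int[] offsetsAtRowGroupBoundaries(int rows, int rowsPerGroup) {
        ToyBitWriter writer = new ToyBitWriter();
        int[] offsets = new int[rows / rowsPerGroup];
        for (int i = 0; i < rows; i++) {
            writer.write(i % 2 == 0);
            if ((i + 1) % rowsPerGroup == 0) {
                offsets[(i + 1) / rowsPerGroup - 1] = writer.emittedBytes();
            }
        }
        return offsets;
    }
}
```

With 10 rows per RG, the offset observed at the end of the first RG is 1 byte, 
while 2 of its values are still buffered. Those values only reach the stream 
in the second byte, which they share with the first rows of the next RG, so 
the offset seen at the boundary understates where the RG's data really ends. 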



was (Author: sershe):
The fundamental problem with this patch is that logical writers (e.g. RLE 
writer) buffer the data. And for some writers like bit writer, we cannot even 
force the flush at the end of the RG, which would have solved this problem at 
some small size cost (all the encoding segments would have to terminate at RG 
boundaries). 

> store end offset of compressed data for RG in RowIndex in ORC
> -------------------------------------------------------------
>
>                 Key: HIVE-9660
>                 URL: https://issues.apache.org/jira/browse/HIVE-9660
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HIVE-9660.WIP2.patch
>
>
> Right now the end offset is estimated, which in some cases results in tons of 
> extra data being read.
> We can add a separate array to RowIndex (positions_v2?) that stores the number 
> of compressed buffers for each RG, or the end offset, or something similar, to 
> remove this estimation magic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
