[
https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200794#comment-15200794
]
Sergey Shelukhin edited comment on HIVE-9660 at 3/18/16 1:45 AM:
-----------------------------------------------------------------
The fundamental problem with this patch is that logical writers (e.g. the RLE
writer) buffer data. For some writers, such as the bit writer, we cannot even
force a flush at the end of the RG, which would have solved this problem at
a small size cost (all the encoding segments would have to terminate at RG
boundaries). Combined with the buffering inside the output streams
(before compression), this makes it practically impossible to know when a
particular RG is fully written (i.e. when its last rows have been added to
some concrete compression buffer).
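
To illustrate the buffering issue, here is a minimal sketch in Python. It is
not the ORC writer code: ToyRleWriter, its (count, value) byte format, and
offsets_at_rg_boundaries are all hypothetical names made up for this example.
It only shows that an offset sampled at an RG boundary can miss rows still
held in the writer, and that forcing a flush at each boundary fixes the offset
at some size cost.

```python
import io


class ToyRleWriter:
    """Hypothetical run-length writer; NOT the real ORC RLE writer.

    A run is emitted as two bytes (count, value) only when it ends,
    so rows of the current run live in the writer, not in the stream.
    """

    def __init__(self, out):
        self.out = out
        self.run_value = None
        self.run_length = 0

    def write(self, value):
        # Extend the current run while the value repeats (max run of 255).
        if self.run_length and value == self.run_value and self.run_length < 255:
            self.run_length += 1
            return
        self.flush()
        self.run_value, self.run_length = value, 1

    def flush(self):
        # Emit the pending run, if any, as (count, value).
        if self.run_length:
            self.out.write(bytes([self.run_length, self.run_value]))
            self.run_length = 0


def offsets_at_rg_boundaries(values, rows_per_rg, flush_at_boundary):
    """Return (stream offset seen at each RG boundary, final stream bytes)."""
    out = io.BytesIO()
    writer = ToyRleWriter(out)
    offsets = []
    for row_number, value in enumerate(values, start=1):
        writer.write(value)
        if row_number % rows_per_rg == 0:
            if flush_at_boundary:
                writer.flush()  # terminates the encoding segment at the RG boundary
            offsets.append(out.tell())
    writer.flush()
    return offsets, out.getvalue()


# Six identical rows, two RGs of three rows each.
buffered_offsets, buffered_data = offsets_at_rg_boundaries([7] * 6, 3, False)
flushed_offsets, flushed_data = offsets_at_rg_boundaries([7] * 6, 3, True)
```

Without the flush, both boundaries report offset 0 because the single run is
still buffered; with the flush, the offsets are accurate but the stream is
twice as long. A writer that cannot be flushed mid-stream (the bit writer case
above) does not have even this option.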
was (Author: sershe):
The fundamental problem with this patch is that logical writers (e.g. RLE
writer) buffer the data. And for some writers like bit writer, we cannot even
force the flush at the end of the RG, which would have solved this problem at
some small size cost (all the encoding segments would have to terminate at RG
boundaries).
> store end offset of compressed data for RG in RowIndex in ORC
> -------------------------------------------------------------
>
> Key: HIVE-9660
> URL: https://issues.apache.org/jira/browse/HIVE-9660
> Project: Hive
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Assignee: Sergey Shelukhin
> Attachments: HIVE-9660.WIP2.patch
>
>
> Right now the end offset is estimated, which in some cases results in a lot of
> extra data being read.
> We can add a separate array to RowIndex (positions_v2?) that stores the number
> of compressed buffers for each RG, or the end offset, or something similar, to
> remove this estimation magic.
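
As a rough sketch of why a stored end offset helps, the Python below compares
an estimated read range with an exact one. Everything here is an assumption
for illustration: the function names, the fixed "slop" padding policy, and the
example offsets are hypothetical and are not taken from the ORC reader.

```python
def estimated_read_range(starts, rg, stream_length, slop):
    """Guess where an RG ends when only start offsets are known.

    Assumed policy (hypothetical): read up to the next RG's start plus a
    fixed slop, because the RG's last rows may sit in a compressed buffer
    that extends past that point.
    """
    start = starts[rg]
    if rg + 1 < len(starts):
        end = min(stream_length, starts[rg + 1] + slop)
    else:
        end = stream_length
    return start, end


def exact_read_range(starts, ends, rg):
    """Use a stored per-RG end offset (the proposed extra RowIndex data)."""
    return starts[rg], ends[rg]


# Made-up offsets in bytes for a stream with three RGs.
starts = [0, 1000, 2100]
ends = [1040, 2100, 3000]

estimated = estimated_read_range(starts, 0, stream_length=3000, slop=512)
exact = exact_read_range(starts, ends, 0)
extra_bytes = estimated[1] - exact[1]
```

With these made-up numbers the estimate reads 472 bytes more than needed for
the first RG; the actual overhead depends on how the real reader chooses its
padding.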