Norbert Luksa reassigned ORC-614:

    Assignee: Norbert Luksa

> Implement efficient seek() in decompression streams
> ---------------------------------------------------
>                 Key: ORC-614
>                 URL: https://issues.apache.org/jira/browse/ORC-614
>             Project: ORC
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Csaba Ringhofer
>            Assignee: Norbert Luksa
>            Priority: Major
> The current implementation of 
> ZlibDecompressionStream/BlockDecompressionStream::seek resets the state of 
> the decompressor and the underlying file reader and throws away their 
> buffers. The buffers can still have usable data in the following cases:
> 1. If the new row group's start position is in the same compressed chunk we 
> were reading, then we just jumped to another position within the same 
> uncompressed buffer, so both the original compressed buffer and the 
> decompressed buffer can be reused. This is a very common scenario with the
> default ORC configs of unaligned chunks of at most 256KB and 10K row groups,
> e.g. a chunk can contain 3 full row groups of 8 byte int without any encoding.
> 2. If the new row group's start position is in another compressed chunk, but
> it starts in the current compressed buffer (as we have read ahead during
> file reading), then the compressed buffer can be kept and only the
> uncompressed buffer needs to be dropped. This is the usual case in Apache
> Impala, as an 8 MB block size is used, which leads to reading the whole
> stream into the buffer for typical columns.
> The lack of these optimizations led to a regression during the testing of
> https://github.com/apache/orc/pull/476, which uses seek() when a row group is 
> skipped due to predicate push down, as all seeks caused the whole stream to 
> be read again.
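
The two cases above can be sketched as a small decision function. This is only an illustration under stated assumptions, not the code in the ORC C++ reader: StreamState, SeekAction, classifySeek and fullRowGroupsPerChunk are hypothetical names, and the sketch assumes a seek target is identified by the offset of its compressed chunk within the stream.

```cpp
#include <cassert>
#include <cstdint>

// All names below are hypothetical; none of them are part of the ORC C++ API.
enum class SeekAction {
  ReuseBoth,        // case 1: target lies in the chunk that is already decompressed
  ReuseCompressed,  // case 2: target chunk starts inside the compressed read-ahead buffer
  FullReset         // otherwise: drop both buffers and read from the file again
};

struct StreamState {
  uint64_t bufferStart;        // compressed bytes held in memory: [bufferStart, bufferEnd)
  uint64_t bufferEnd;
  uint64_t currentChunkStart;  // stream offset of the chunk that is currently decompressed
  bool hasDecompressedChunk;   // false until the first chunk has been decompressed
};

// Decide how much state a seek to the chunk at targetChunkStart can keep.
SeekAction classifySeek(const StreamState& s, uint64_t targetChunkStart) {
  assert(s.bufferStart <= s.bufferEnd);
  if (s.hasDecompressedChunk && targetChunkStart == s.currentChunkStart) {
    return SeekAction::ReuseBoth;
  }
  if (targetChunkStart >= s.bufferStart && targetChunkStart < s.bufferEnd) {
    return SeekAction::ReuseCompressed;
  }
  return SeekAction::FullReset;
}

// Number of whole row groups of fixed-width, unencoded values that fit in one chunk.
constexpr uint64_t fullRowGroupsPerChunk(uint64_t chunkBytes,
                                         uint64_t rowsPerGroup,
                                         uint64_t bytesPerValue) {
  return chunkBytes / (rowsPerGroup * bytesPerValue);
}

// 256KB chunk, 10K rows per group, 8 byte values: 3 full row groups per chunk.
static_assert(fullRowGroupsPerChunk(256 * 1024, 10000, 8) == 3,
              "matches the example in the description");
```

In case 1 the seek would only move the read position inside the decompressed buffer. In case 2 the decompressor state would be reset and decompression restarted from the target chunk's header, without touching the file reader.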

This message was sent by Atlassian Jira
