Csaba Ringhofer created ORC-614:

             Summary: Implement efficient seek() in decompression streams
                 Key: ORC-614
                 URL: https://issues.apache.org/jira/browse/ORC-614
             Project: ORC
          Issue Type: Improvement
          Components: C++
            Reporter: Csaba Ringhofer

The current implementation of 
ZlibDecompressionStream/BlockDecompressionStream::seek resets the state of the 
decompressor and the underlying file reader and throws away their buffers. The 
buffers can still hold usable data in the following cases:
1. If the new row group's start position is in the same compressed chunk we 
were reading, then the seek only moves to another position within the same 
uncompressed buffer, so both the original compressed buffer and the 
decompressed buffer can be reused. This is a very common scenario with the 
default ORC configs of unaligned chunks of at most 256 KB and 10K-row row 
groups. For example, a chunk can contain 3 full row groups of 8-byte ints 
without any encoding (3 x 10,000 x 8 bytes = 240,000 bytes).
2. If the new row group's start position is in another compressed chunk, but 
that chunk starts in the current compressed buffer (because we read ahead 
during file reading), then the compressed buffer can be kept and only the 
uncompressed buffer needs to be dropped. This is the usual case in Apache 
Impala, where an 8 MB block size is used, so the whole stream is read into the 
buffer for typical columns.
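
The two cases above, plus the current reset-everything behaviour as a 
fallback, can be sketched as a small decision function. This is only an 
illustration of the idea, not the ORC C++ reader's code: StreamState, 
SeekAction and classifySeek are hypothetical names, and the sketch assumes the 
seek target identifies a chunk by the file offset of its header.

```cpp
#include <cstdint>

// Hypothetical types for illustration; not part of the ORC C++ API.
enum class SeekAction {
  ReuseBoth,           // case 1: target is in the chunk already decompressed
  KeepCompressedOnly,  // case 2: target chunk starts inside the read-ahead buffer
  ResetAll             // fallback: drop both buffers, as seek() does today
};

struct StreamState {
  uint64_t bufferStart;       // file offset of the first byte in the compressed buffer
  uint64_t bufferLength;      // number of valid bytes in the compressed buffer
  uint64_t chunkStart;        // file offset of the header of the current chunk
  bool hasDecompressedChunk;  // true if the decompressed buffer holds that chunk
};

// Decide which buffers survive a seek to the chunk whose header starts at
// targetChunkStart (assumed to be the first component of the seek position).
inline SeekAction classifySeek(const StreamState& s, uint64_t targetChunkStart) {
  // Case 1: same compressed chunk, so only the offset within the
  // decompressed buffer has to change.
  if (s.hasDecompressedChunk && targetChunkStart == s.chunkStart) {
    return SeekAction::ReuseBoth;
  }
  // Case 2: a different chunk whose header is already in the compressed
  // buffer, so no file read is needed, only a fresh decompression.
  if (targetChunkStart >= s.bufferStart &&
      targetChunkStart - s.bufferStart < s.bufferLength) {
    return SeekAction::KeepCompressedOnly;
  }
  // Otherwise nothing buffered is usable.
  return SeekAction::ResetAll;
}
```

With a compressed buffer covering file offsets [1000, 9192) and the chunk at 
offset 1000 decompressed, a seek to chunk 1000 falls under case 1, a seek to a 
chunk at 5000 falls under case 2, and a seek to 9192 or to 500 needs a full 
reset.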

The lack of these optimizations led to a regression during the testing of 
https://github.com/apache/orc/pull/476, which uses seek() when a row group is 
skipped due to predicate pushdown, because every seek caused the whole stream 
to be read again.

This message was sent by Atlassian Jira