[ 
https://issues.apache.org/jira/browse/ORC-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17473187#comment-17473187
 ] 

Dongjoon Hyun commented on ORC-1087:
------------------------------------

BTW, does this affect only ORC 1.7.x? Could you link the ORC JIRA issue that 
introduced this in Apache ORC 1.7.0?

> Seek overflow in an uncompressed chunk
> --------------------------------------
>
>                 Key: ORC-1087
>                 URL: https://issues.apache.org/jira/browse/ORC-1087
>             Project: ORC
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 1.7.0, 1.7.1, 1.7.2
>            Reporter: Quanlong Huang
>            Assignee: Quanlong Huang
>            Priority: Critical
>         Attachments: scan_with_sarg.cc, seek-issue-snappy-500k.orc
>
>
> Reading the attached ORC file with SearchArgument "{{sr_return_amt > 
> 10000}}" using the C++ reader will fail with
> {code:java}
> Corrupt PATCHED_BASE encoded data (pl==0)!{code}
> Reading it without the SearchArgument works. The Java reader is able to 
> read it with the same SearchArgument.
> The attached source code (scan_with_sarg.cc) reproduces the issue. 
> Build the ORC lib and compile it with
> {code:bash}
> g++ scan_with_sarg.cc -o scan_with_sarg -I../c++/include -Ic++/include 
> -Lc++/src/ -Lsnappy_ep-prefix/src/snappy_ep-build/ 
> -Llz4_ep-prefix/src/lz4_ep-build/ -Lzlib_ep-prefix/src/zlib_ep-build/ 
> -Lzstd_ep-prefix/src/zstd_ep-build/lib/ 
> -Lprotobuf_ep-prefix/src/protobuf_ep-build/ -lorc -lz -lsnappy -llz4 -lzstd 
> -lprotobuf
> {code}
> Run it as
> {code:bash}
> $ LD_LIBRARY_PATH="$LD_LIBRARY_PATH:zstd_ep-prefix/src/zstd_ep-build/lib/" 
> ./scan_with_sarg 
> leaf-0 = (column(id=17) <= 10000), expr = (not leaf-0)
> terminate called after throwing an instance of 'orc::ParseError'
>   what():  Corrupt PATCHED_BASE encoded data (pl==0)!
> Aborted (core dumped)
> {code}
> *RCA*
> The sarg introduces a seek to RowGroup 42. The following code in 
> {{DecompressionStream::seek}} doesn't handle the case where 
> uncompressedBufferLength < posInChunk. It then seeks to an illegal position 
> and the length computation overflows.
> {code:cpp}
> if (headerPosition == seekedPosition
>     && inputBufferStartPosition <= headerPosition + 3 && inputBufferStart) {
>   position.next(); // Skip the input level position.
>   size_t posInChunk = position.next(); // Chunk level position.
>   // Overflow here! uncompressedBufferLength=30950, posInChunk=39498
>   outputBufferLength = uncompressedBufferLength - posInChunk;
>   outputBuffer = outputBufferStart + posInChunk;
>   return;
> }{code}
> That chunk is an uncompressed chunk, and the whole chunk is read in pieces. 
> The bytes up to that position (posInChunk) haven't been read yet. We need to 
> handle this case.
> I think this only happens on uncompressed chunks. Compressed chunks are 
> decompressed as a whole, so posInChunk will always be valid in the output 
> buffer.
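
The length overflow described in the RCA can be illustrated in isolation. The following is a minimal sketch, not the actual ORC fix: `uncheckedRemaining` and `checkedRemaining` are hypothetical helper names that do not exist in the ORC code base, and the sketch only assumes that both values are `size_t`, as in the quoted snippet.

```cpp
#include <cstddef>

// Mirrors the unchecked subtraction from the quoted snippet. std::size_t is
// unsigned, so when posInChunk is larger than uncompressedBufferLength the
// result wraps around to a very large value instead of going negative.
inline std::size_t uncheckedRemaining(std::size_t uncompressedBufferLength,
                                      std::size_t posInChunk) {
  return uncompressedBufferLength - posInChunk;
}

// Hypothetical guard (an assumption, not ORC's implementation): report
// failure when the target position lies beyond the bytes buffered so far,
// so the caller can take a slower path instead of using a wrapped length.
inline bool checkedRemaining(std::size_t uncompressedBufferLength,
                             std::size_t posInChunk,
                             std::size_t& remaining) {
  if (posInChunk > uncompressedBufferLength) {
    return false;  // position has not been buffered yet
  }
  remaining = uncompressedBufferLength - posInChunk;
  return true;
}
```

With the values from the report (uncompressedBufferLength=30950, posInChunk=39498), `uncheckedRemaining` yields a wrapped length far larger than the buffer, while `checkedRemaining` returns false and leaves the decision to the caller.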



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
