[ https://issues.apache.org/jira/browse/ORC-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17473187#comment-17473187 ]

Dongjoon Hyun commented on ORC-1087:
------------------------------------

BTW, does this affect only ORC 1.7.x? Could you link the ORC JIRA issue that introduced this in Apache ORC 1.7.0?

> Seek overflow in an uncompressed chunk
> --------------------------------------
>
>                 Key: ORC-1087
>                 URL: https://issues.apache.org/jira/browse/ORC-1087
>             Project: ORC
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 1.7.0, 1.7.1, 1.7.2
>            Reporter: Quanlong Huang
>            Assignee: Quanlong Huang
>            Priority: Critical
>         Attachments: scan_with_sarg.cc, seek-issue-snappy-500k.orc
>
>
> Reading the attached ORC file with SearchArgument "{{sr_return_amt > 10000}}" using the C++ reader fails with
> {code:java}
> Corrupt PATCHED_BASE encoded data (pl==0)!{code}
> Reading it without the SearchArgument works. The Java reader is able to read it with the same SearchArgument.
> The attached source code (scan_with_sarg.cc) reproduces the issue. Build the ORC lib and compile it with
> {code:bash}
> g++ scan_with_sarg.cc -o scan_with_sarg -I../c++/include -Ic++/include -Lc++/src/ -Lsnappy_ep-prefix/src/snappy_ep-build/ -Llz4_ep-prefix/src/lz4_ep-build/ -Lzlib_ep-prefix/src/zlib_ep-build/ -Lzstd_ep-prefix/src/zstd_ep-build/lib/ -Lprotobuf_ep-prefix/src/protobuf_ep-build/ -lorc -lz -lsnappy -llz4 -lzstd -lprotobuf
> {code}
> Run it as
> {code:bash}
> $ LD_LIBRARY_PATH="$LD_LIBRARY_PATH:zstd_ep-prefix/src/zstd_ep-build/lib/" ./scan_with_sarg
> leaf-0 = (column(id=17) <= 10000), expr = (not leaf-0)
> terminate called after throwing an instance of 'orc::ParseError'
>   what():  Corrupt PATCHED_BASE encoded data (pl==0)!
> Aborted (core dumped)
> {code}
> *RCA*
> The sarg introduces a seek to RowGroup 42. The following code in {{DecompressionStream::seek}} doesn't handle the case when uncompressedBufferLength < posInChunk. It then seeks to an illegal position and the length overflows.
> {code:cpp}
>   if (headerPosition == seekedPosition
>       && inputBufferStartPosition <= headerPosition + 3 && inputBufferStart) {
>     position.next(); // Skip the input level position.
>     size_t posInChunk = position.next(); // Chunk level position.
>     // Overflow here! uncompressedBufferLength=30950, posInChunk=39498
>     outputBufferLength = uncompressedBufferLength - posInChunk;
>     outputBuffer = outputBufferStart + posInChunk;
>     return;
>   }{code}
> That chunk is an uncompressed chunk, and the whole chunk is read in pieces. The position (posInChunk) hasn't been read out yet. We need to handle this case.
> I think this only happens on uncompressed chunks. Compressed chunks are decompressed as a whole, so posInChunk will always be valid in the output buffer.
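
For readers following the RCA, the wrap-around can be reproduced in isolation. The sketch below is only an illustration, not the ORC code or the eventual fix: `isBuffered` and `remainingUnchecked` are hypothetical helper names, and the only values taken from the report are uncompressedBufferLength=30950 and posInChunk=39498. It shows that subtracting a larger `size_t` from a smaller one yields a huge length, and that a simple bounds check is enough to detect the case where the seek target has not been buffered yet.

```cpp
#include <cstddef>

// Hypothetical helper (not in the ORC code base): true when the seek target
// lies inside the bytes already held in the output buffer.
bool isBuffered(std::size_t uncompressedBufferLength, std::size_t posInChunk) {
  return posInChunk <= uncompressedBufferLength;
}

// The unguarded length computation, mirroring the subtraction in the quoted
// snippet. size_t is unsigned, so the result wraps around when posInChunk is
// larger than uncompressedBufferLength.
std::size_t remainingUnchecked(std::size_t uncompressedBufferLength,
                               std::size_t posInChunk) {
  return uncompressedBufferLength - posInChunk;
}
```

With the reported values, `isBuffered(30950, 39498)` is false, so a caller could fall back to the regular seek path instead of computing a wrapped length. How the reader should handle that fallback is left open here.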