[
https://issues.apache.org/jira/browse/ORC-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gopal Vijayaraghavan reassigned ORC-1078:
-----------------------------------------
Assignee: Yu-Wen Lai
> Row group end offset doesn't accommodate all the blocks
> -------------------------------------------------------
>
> Key: ORC-1078
> URL: https://issues.apache.org/jira/browse/ORC-1078
> Project: ORC
> Issue Type: Bug
> Reporter: Yu-Wen Lai
> Assignee: Yu-Wen Lai
> Priority: Major
>
> The error message in current master:
> {code:java}
> java.lang.IllegalArgumentException
> at java.nio.Buffer.position(Buffer.java:244)
> at org.apache.orc.impl.InStream$CompressedStream.setCurrent(InStream.java:453)
> at org.apache.orc.impl.InStream$CompressedStream.readHeaderByte(InStream.java:462)
> at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:474)
> at org.apache.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:528)
> at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:515){code}
> The same error can look slightly different in older versions:
> {code:java}
> java.io.IOException: Seek outside of data in compressed stream Stream for
> column 15 kind DATA position: 111956 length: 383146 range: 0 offset: 36674
> limit: 36674 range 0 = 75
> 282 to 36674; range 1 = 151666 to 40267; range 2 = 228805 to 41623
> uncompressed: 1024 to 1024 to 111956{code}
> Here is the info extracted from the problematic orc file:
> {code:java}
> Compression: ZLIB
> Compression size: 1024
> Calendar: Julian/Gregorian
> Type: struct<col:timestamp>
> Row group indices:
> Entry 0: count: 10000 hasNull: false min: 2010-01-01 00:00:00.0 max: 2021-12-06 21:09:56.000999999 positions: 0,0,0,0,0,0
> Entry 1: count: 10000 hasNull: false min: 2016-03-09 11:36:36.0 max: 2021-12-06 21:53:33.000999999 positions: 37509,683,98,2630,296,3
> Entry 2: count: 10000 hasNull: false min: 2010-01-01 00:00:00.0 max: 2026-05-29 00:53:10.000999999 positions: 75265,256,229,5902,370,8
> Entry 3: count: 10000 hasNull: false min: 2010-01-01 00:00:00.0 max: 2021-12-06 18:14:52.000999999 positions: 109907,934,398,8581,322,18{code}
> The failure occurs when entry 2 is selected and read, because the end offset
> estimated for that row group is too small. More specifically, when the
> compression size is smaller than 2048, there is an edge case in which a
> factor of 2 does not accommodate all the blocks (see the code snippet below).
> {code:java}
> public static long estimateRgEndOffset(boolean isCompressed,
>                                        int bufferSize,
>                                        boolean isLast,
>                                        long nextGroupOffset,
>                                        long streamLength) {
>   // figure out the worst case last location
>   // if adjacent groups have the same compressed block offset then stretch the slop
>   // by factor of 2 to safely accommodate the next compression block.
>   // One for the current compression block and another for the next compression block.
>   long slop = isCompressed ?
>       2 * (OutStream.HEADER_SIZE + bufferSize) : WORST_UNCOMPRESSED_SLOP;
>   return isLast ? streamLength : Math.min(streamLength, nextGroupOffset + slop);
> }{code}
> In our case, we need slop > 934 (buffer) + 398 * 4 + header bytes
> = 2526 + header bytes, but the computed slop is only 1027 * 2 = 2054.
> As a result, the read seeks outside of the range.
> In the worst case, a compressed stream may contain an uncompressed block.
> For compression size C, the required factor is
> 1 (buffer) + (511 * 4 + header bytes) / C, rounded up:
> C = 1024 -> factor should be 3
> C = 512 -> factor should be 5 ... and so forth.
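
The arithmetic above can be checked with a small sketch. It is not ORC code: the class and method names are hypothetical, the header size of 3 bytes is inferred from the issue's own figure (1027 = 1024 + 3), and the division is assumed to round up, which matches the two factors quoted (3 and 5). The lower bound for entry 3 also assumes a single header.

```java
// Illustrative sketch only; names here are hypothetical and not part of ORC's API.
class SlopFactorSketch {
  // Header size implied by the issue's arithmetic (1027 = 1024 + 3).
  static final int HEADER_SIZE = 3;
  // Worst-case run position and bytes per value, taken from the issue's "511 * 4".
  static final int MAX_RUN_POSITION = 511;
  static final int BYTES_PER_VALUE = 4;

  // factor = 1 (buffer) + ceil((511 * 4 + header bytes) / C)
  static int requiredFactor(int compressionSize) {
    int extra = MAX_RUN_POSITION * BYTES_PER_VALUE + HEADER_SIZE;
    return 1 + (extra + compressionSize - 1) / compressionSize;
  }

  // Slop used by the snippet quoted in the issue for a compressed stream.
  static long currentSlop(int bufferSize) {
    return 2L * (HEADER_SIZE + bufferSize);
  }

  // Lower bound on the slop for a given index position, following the issue's
  // inequality: uncompressed offset + run position * 4 + header bytes.
  // Assumption: exactly one header is counted.
  static long slopLowerBound(int uncompressedOffset, int runPosition) {
    return uncompressedOffset + (long) runPosition * BYTES_PER_VALUE + HEADER_SIZE;
  }

  public static void main(String[] args) {
    System.out.println("C = 1024 -> factor " + requiredFactor(1024));
    System.out.println("C = 512  -> factor " + requiredFactor(512));
    // Entry 3 positions from the problematic file: 934 (buffer), 398 (run).
    System.out.println("needed > " + slopLowerBound(934, 398)
        + ", current = " + currentSlop(1024));
  }
}
```

With the entry 3 positions, the lower bound exceeds the current slop of 2054, which is consistent with the seek landing outside the range.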