thexiay opened a new issue, #2619: URL: https://github.com/apache/orc/issues/2619
## Problem

The `estimateRgEndOffset` method in `RecordReaderUtils.java` uses a `stretchFactor` to estimate how much compressed data to read ahead for a row group. The current formula:

```java
int stretchFactor = 2 + (MAX_VALUES_LENGTH * MAX_BYTE_WIDTH - 1) / bufferSize;
```

does not account for the 2-byte RLEv2 DIRECT run header. The worst-case uncompressed payload is therefore `MAX_VALUES_LENGTH * MAX_BYTE_WIDTH + 2` bytes (512 * 8 + 2 = 4098), not `MAX_VALUES_LENGTH * MAX_BYTE_WIDTH` (4096).

## Impact

When data is incompressible (e.g., random bytes), each compression block expands to `HEADER_SIZE + bufferSize` bytes. With `bufferSize = 1024`, the old formula gives `stretchFactor = 5`, allocating space for 5 compressed blocks. However, 4098 bytes of uncompressed data spans `ceil(4098 / 1024) = 5` blocks of payload. Because `2 + (n - 1) / bufferSize` is the same as `1 + ceil(n / bufferSize)`, the estimate needs one block on top of those 5, for a total of 6 blocks. The old estimate falls short by one block, causing `IllegalArgumentException: Buffer size too small` when reading a full RLEv2 DIRECT run at the estimated boundary.

## Fix

Include the RLEv2 header size in the worst-case calculation:

```java
int maxRleDirectRunSize = MAX_VALUES_LENGTH * MAX_BYTE_WIDTH + 2;
int stretchFactor = 2 + (maxRleDirectRunSize - 1) / bufferSize;
```

This correctly yields `stretchFactor = 6` for `bufferSize = 1024`, ensuring enough space is allocated.
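To make the arithmetic easy to check, here is a minimal standalone sketch that evaluates both formulas side by side. It is not the code in `RecordReaderUtils.java`: the class name `StretchFactorDemo`, the method names, and the constant `RLE_V2_DIRECT_HEADER_SIZE` are hypothetical, and the values 512 and 8 are taken from the figures quoted in this issue rather than from the ORC source.

```java
// Standalone illustration only; names below are assumptions, not ORC APIs.
final class StretchFactorDemo {
    // Values quoted in the issue: up to 512 values per run, 8 bytes per value.
    static final int MAX_VALUES_LENGTH = 512;
    static final int MAX_BYTE_WIDTH = 8;
    // Hypothetical constant name for the 2-byte RLEv2 DIRECT run header.
    static final int RLE_V2_DIRECT_HEADER_SIZE = 2;

    // Formula as it stands today: ignores the run header.
    static int oldStretchFactor(int bufferSize) {
        return 2 + (MAX_VALUES_LENGTH * MAX_BYTE_WIDTH - 1) / bufferSize;
    }

    // Proposed formula: worst-case run size includes the header bytes.
    static int newStretchFactor(int bufferSize) {
        int maxRleDirectRunSize =
            MAX_VALUES_LENGTH * MAX_BYTE_WIDTH + RLE_V2_DIRECT_HEADER_SIZE;
        return 2 + (maxRleDirectRunSize - 1) / bufferSize;
    }

    public static void main(String[] args) {
        // Compare the two estimates for a few sample buffer sizes.
        for (int bufferSize : new int[] {1024, 2048, 4096, 8192}) {
            System.out.printf("bufferSize=%d old=%d new=%d%n",
                bufferSize,
                oldStretchFactor(bufferSize),
                newStretchFactor(bufferSize));
        }
    }
}
```

Under these assumptions the two formulas disagree by one block at buffer sizes 1024, 2048, and 4096, and agree at 8192, which suggests the shortfall only appears when the 4096-byte payload lands exactly on a block boundary.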
