[I] estimateRgEndOffset slop calculation is insufficient for incompressible data [orc]

via GitHub Thu, 07 May 2026 08:55:59 -0700


thexiay opened a new issue, #2619:
URL: https://github.com/apache/orc/issues/2619


   ## Problem
   
   The `estimateRgEndOffset` method in `RecordReaderUtils.java` uses a 
`stretchFactor` to estimate how much compressed data to read ahead for a row 
group. The current formula:
   
   ```java
   int stretchFactor = 2 + (MAX_VALUES_LENGTH * MAX_BYTE_WIDTH - 1) / bufferSize;
   ```
   
   does not account for the 2-byte RLEv2 DIRECT run header. This means the 
worst-case uncompressed payload is actually `MAX_VALUES_LENGTH * MAX_BYTE_WIDTH 
+ 2` bytes (512 * 8 + 2 = 4098), not `MAX_VALUES_LENGTH * MAX_BYTE_WIDTH` 
(4096).
   
   ## Impact
   
   When data is incompressible (e.g., random bytes), each compression block 
   expands to `HEADER_SIZE + bufferSize` bytes. With `bufferSize = 1024`, the 
   current formula gives `stretchFactor = 2 + 4095 / 1024 = 5` (integer 
   division), allocating space for 5 compressed blocks. However, a full 
   4098-byte run needs `(4098 - 1) / 1024 = 4` blocks on top of the 2 blocks 
   from the base factor, for a total of 6. The current estimate falls short by 
   one block, causing `IllegalArgumentException: Buffer size too small` when 
   reading a full RLEv2 DIRECT run at the estimated boundary.
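   
   The arithmetic above can be checked with a small standalone sketch. It is 
   not the ORC source: the class and method names are hypothetical, and 
   `HEADER_SIZE = 3` is an assumption about the compression block header size. 
   The other constants are the values quoted in this issue.
   
   ```java
   // Simplified sketch of the slop arithmetic; not the ORC implementation.
   final class SlopEstimateSketch {
       static final int MAX_VALUES_LENGTH = 512; // values in a full run (from the issue)
       static final int MAX_BYTE_WIDTH = 8;      // bytes per value (from the issue)
       static final int HEADER_SIZE = 3;         // ASSUMED compression block header size

       // Number of blocks reserved for a worst-case run of maxRunBytes bytes.
       static int stretchFactor(int maxRunBytes, int bufferSize) {
           return 2 + (maxRunBytes - 1) / bufferSize; // integer division
       }

       // Read-ahead in bytes when every block occupies HEADER_SIZE + bufferSize.
       static long slopBytes(int maxRunBytes, int bufferSize) {
           return (long) stretchFactor(maxRunBytes, bufferSize) * (HEADER_SIZE + bufferSize);
       }

       public static void main(String[] args) {
           int bufferSize = 1024;
           int payloadOnly = MAX_VALUES_LENGTH * MAX_BYTE_WIDTH; // 4096
           int withRunHeader = payloadOnly + 2;                  // 4098

           // Payload only: 5 blocks; with the 2-byte run header: 6 blocks.
           System.out.println("payload only:    " + stretchFactor(payloadOnly, bufferSize)
               + " blocks, " + slopBytes(payloadOnly, bufferSize) + " bytes");
           System.out.println("with run header: " + stretchFactor(withRunHeader, bufferSize)
               + " blocks, " + slopBytes(withRunHeader, bufferSize) + " bytes");
       }
   }
   ```
   
   Under the assumed 3-byte block header, the two estimates differ by exactly 
   one block, i.e. `HEADER_SIZE + bufferSize = 1027` bytes of read-ahead.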
   
   ## Fix
   
   Include the RLEv2 header size in the worst-case calculation:
   
   ```java
   int maxRleDirectRunSize = MAX_VALUES_LENGTH * MAX_BYTE_WIDTH + 2;
   int stretchFactor = 2 + (maxRleDirectRunSize - 1) / bufferSize;
   ```
   
   This correctly yields `stretchFactor = 6` for `bufferSize = 1024`, ensuring 
enough space is allocated.
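   
   To see which buffer sizes the change would affect, the two formulas can be 
   compared side by side. This is only an illustrative sketch: the class name, 
   method names, and the list of candidate buffer sizes are hypothetical and 
   are not taken from the ORC code base or its defaults.
   
   ```java
   import java.util.ArrayList;
   import java.util.List;

   // Compares the current and proposed stretch factor formulas; illustration only.
   final class AffectedBufferSizes {
       static final int PAYLOAD_BYTES = 512 * 8;       // MAX_VALUES_LENGTH * MAX_BYTE_WIDTH
       static final int RUN_BYTES = PAYLOAD_BYTES + 2; // plus the 2-byte RLEv2 DIRECT header

       static int currentFactor(int bufferSize) {
           return 2 + (PAYLOAD_BYTES - 1) / bufferSize;
       }

       static int proposedFactor(int bufferSize) {
           return 2 + (RUN_BYTES - 1) / bufferSize;
       }

       // Returns the candidates for which the two formulas disagree.
       static List<Integer> affected(int[] candidates) {
           List<Integer> result = new ArrayList<>();
           for (int bufferSize : candidates) {
               if (currentFactor(bufferSize) != proposedFactor(bufferSize)) {
                   result.add(bufferSize);
               }
           }
           return result;
       }

       public static void main(String[] args) {
           // Hypothetical candidate sizes chosen for illustration.
           int[] candidates = {1024, 2048, 4096, 8192, 65536, 262144};
           System.out.println(affected(candidates)); // prints [1024, 2048, 4096]
       }
   }
   ```
   
   For these candidates, the formulas disagree only when `bufferSize` is at 
   most 4096; for the larger sizes both yield `stretchFactor = 2`.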

