Re: [PR] ORC-2619: Fix estimateRgEndOffset slop calculation for incompressible data [orc]

via GitHub Sat, 09 May 2026 04:58:14 -0700


Copilot commented on code in PR #2620:
URL: https://github.com/apache/orc/pull/2620#discussion_r3213110862



##########
java/core/src/test/org/apache/orc/impl/TestInStream.java:
##########
@@ -1000,4 +1005,67 @@ public void testStreamResetWithoutIncreasedLength() 
throws IOException {
     byte[] inBuffer = new byte[5];
     assertEquals(5, inStream.read(inBuffer));
   }
+
+  /**
+   * Demonstrates that the old estimateRgEndOffset slop calculation is 
insufficient.
+   * When a compressed stream is truncated at the old estimated end offset,
+   * reading a full RLE v2 DIRECT run fails because the estimated slop doesn't
+   * account for enough compressed blocks.
+   */
+  @Test
+  public void testTruncatedRleV2DirectRunAtEstimatedEndFails() throws 
Exception {
+    final int bufferSize = 1024;
+    final int chunkSize = OutStream.HEADER_SIZE + bufferSize;
+    final int nextGroupOffset = bufferSize;
+    final int oldStretchFactor =
+        2 + (MAX_VALUES_LENGTH * MAX_BYTE_WIDTH - 1) / bufferSize;
+    final int oldEstimatedEnd = nextGroupOffset + oldStretchFactor * chunkSize;
+
+    TestInStream.OutputCollector receiver = new TestInStream.OutputCollector();
+    CompressionCodec codec = new ZlibCodec();
+    StreamOptions streamOptions = new StreamOptions(bufferSize)
+        .withCodec(codec, codec.getDefaultOptions());
+    byte[] data = new byte[bufferSize * 6];
+    new Random(42).nextBytes(data);
+    try (OutStream out = new OutStream("test", streamOptions, receiver)) {
+      out.write(data);
+      out.flush();
+    }
+
+    byte[] encoded = receiver.buffer.get();
+    assertEquals(nextGroupOffset + 5 * chunkSize, oldEstimatedEnd);
+    assertTrue(encoded.length > oldEstimatedEnd);
+
+    InStream stream = InStream.create("test",
+        new BufferChunk(ByteBuffer.wrap(encoded, 0, oldEstimatedEnd), 0),
+        0, oldEstimatedEnd,
+        InStream.options().withCodec(codec).withBufferSize(bufferSize));
+    byte[] rleDirectRun = new byte[MAX_VALUES_LENGTH * MAX_BYTE_WIDTH
+        + RecordReaderUtils.RLE_V2_HEADER_SIZE];
+
+    stream.seek(new SimplePositionProvider(nextGroupOffset, 0));
+    IllegalArgumentException error = assertThrows(
+        IllegalArgumentException.class, () -> {
+          int offset = 0;
+          while (offset < rleDirectRun.length) {
+            offset += stream.read(
+                rleDirectRun, offset, rleDirectRun.length - offset);

Review Comment:
   The read loop assumes `stream.read(...)` always returns a positive count; 
however `InStream#read(byte[],...)` returns `-1` at EOF. If the truncated 
stream hits EOF without throwing, `offset += -1` will lead to an infinite loop 
inside `assertThrows`. Guard against `-1` (e.g., fail/break) to keep the test 
from hanging and to produce a deterministic assertion failure.
   



##########
java/core/src/java/org/apache/orc/impl/RecordReaderUtils.java:
##########
@@ -300,6 +302,8 @@ public static int 
getIndexPosition(OrcProto.ColumnEncoding.Kind columnEncoding,
   // the maximum byte width for each value
   static final int MAX_BYTE_WIDTH =
       
SerializationUtils.decodeBitWidth(SerializationUtils.FixedBitSizes.SIXTYFOUR.ordinal())
 / 8;
+  // RLEv2 DIRECT run header size in bytes
+  public static final int RLE_V2_HEADER_SIZE = 2;

Review Comment:
   `RLE_V2_HEADER_SIZE` is declared `public`, while related internal sizing 
constants (`MAX_VALUES_LENGTH`, `MAX_BYTE_WIDTH`) are package-private. Since 
this is an impl/internal class and current usages are within the same package, 
consider making this constant package-private as well to avoid unintentionally 
expanding the public surface area.
   



##########
java/core/src/test/org/apache/orc/impl/TestInStream.java:
##########
@@ -1000,4 +1005,67 @@ public void testStreamResetWithoutIncreasedLength() 
throws IOException {
     byte[] inBuffer = new byte[5];
     assertEquals(5, inStream.read(inBuffer));
   }
+
+  /**
+   * Demonstrates that the old estimateRgEndOffset slop calculation is 
insufficient.
+   * When a compressed stream is truncated at the old estimated end offset,
+   * reading a full RLE v2 DIRECT run fails because the estimated slop doesn't
+   * account for enough compressed blocks.
+   */
+  @Test
+  public void testTruncatedRleV2DirectRunAtEstimatedEndFails() throws 
Exception {
+    final int bufferSize = 1024;
+    final int chunkSize = OutStream.HEADER_SIZE + bufferSize;
+    final int nextGroupOffset = bufferSize;
+    final int oldStretchFactor =
+        2 + (MAX_VALUES_LENGTH * MAX_BYTE_WIDTH - 1) / bufferSize;
+    final int oldEstimatedEnd = nextGroupOffset + oldStretchFactor * chunkSize;

Review Comment:
   `nextGroupOffset` is set to `bufferSize`, but for compressed streams ORC 
row-index positions are in *compressed* bytes (see `OutStream#getPosition`), 
which include the 3-byte chunk header. With `bufferSize=1024`, the first chunk 
boundary is typically `OutStream.HEADER_SIZE + bufferSize` (=1027) when stored 
as original. Using 1024 seeks into the middle of a chunk and can trigger 
"Buffer size too small" for the wrong reason. Derive `nextGroupOffset` from a 
real position (e.g., `chunkSize` or by recording `OutStream` positions) so the 
test targets the slop-estimation behavior.



##########
java/core/src/java/org/apache/orc/impl/RecordReaderUtils.java:
##########
@@ -217,7 +217,9 @@ public static long estimateRgEndOffset(boolean isCompressed,
     // Stretch the slop by a factor to safely accommodate following 
compression blocks.
     // We need to calculate the maximum number of blocks(stretchFactor) by 
bufferSize accordingly.
     if (isCompressed) {
-      int stretchFactor = 2 + (MAX_VALUES_LENGTH * MAX_BYTE_WIDTH - 1) / 
bufferSize;
+      // RLEv2 DIRECT runs can need a 2-byte header in addition to their value 
payload.
+      int maxRleDirectRunSize = MAX_VALUES_LENGTH * MAX_BYTE_WIDTH + 2;
+      int stretchFactor = 2 + (maxRleDirectRunSize - 1) / bufferSize;

Review Comment:
   `estimateRgEndOffset` introduces `RLE_V2_HEADER_SIZE`, but the calculation 
still hard-codes `+ 2`. Use the constant (e.g., `MAX_VALUES_LENGTH * 
MAX_BYTE_WIDTH + RLE_V2_HEADER_SIZE`) to avoid duplicated magic numbers and 
keep the method in sync with the constant.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] ORC-2619: Fix estimateRgEndOffset slop calculation for incompressible data [orc]

Reply via email to