Copilot commented on code in PR #2620:
URL: https://github.com/apache/orc/pull/2620#discussion_r3213110862
##########
java/core/src/test/org/apache/orc/impl/TestInStream.java:
##########
@@ -1000,4 +1005,67 @@ public void testStreamResetWithoutIncreasedLength()
throws IOException {
byte[] inBuffer = new byte[5];
assertEquals(5, inStream.read(inBuffer));
}
+
+ /**
+ * Demonstrates that the old estimateRgEndOffset slop calculation is
insufficient.
+ * When a compressed stream is truncated at the old estimated end offset,
+ * reading a full RLE v2 DIRECT run fails because the estimated slop doesn't
+ * account for enough compressed blocks.
+ */
+ @Test
+ public void testTruncatedRleV2DirectRunAtEstimatedEndFails() throws
Exception {
+ final int bufferSize = 1024;
+ final int chunkSize = OutStream.HEADER_SIZE + bufferSize;
+ final int nextGroupOffset = bufferSize;
+ final int oldStretchFactor =
+ 2 + (MAX_VALUES_LENGTH * MAX_BYTE_WIDTH - 1) / bufferSize;
+ final int oldEstimatedEnd = nextGroupOffset + oldStretchFactor * chunkSize;
+
+ TestInStream.OutputCollector receiver = new TestInStream.OutputCollector();
+ CompressionCodec codec = new ZlibCodec();
+ StreamOptions streamOptions = new StreamOptions(bufferSize)
+ .withCodec(codec, codec.getDefaultOptions());
+ byte[] data = new byte[bufferSize * 6];
+ new Random(42).nextBytes(data);
+ try (OutStream out = new OutStream("test", streamOptions, receiver)) {
+ out.write(data);
+ out.flush();
+ }
+
+ byte[] encoded = receiver.buffer.get();
+ assertEquals(nextGroupOffset + 5 * chunkSize, oldEstimatedEnd);
+ assertTrue(encoded.length > oldEstimatedEnd);
+
+ InStream stream = InStream.create("test",
+ new BufferChunk(ByteBuffer.wrap(encoded, 0, oldEstimatedEnd), 0),
+ 0, oldEstimatedEnd,
+ InStream.options().withCodec(codec).withBufferSize(bufferSize));
+ byte[] rleDirectRun = new byte[MAX_VALUES_LENGTH * MAX_BYTE_WIDTH
+ + RecordReaderUtils.RLE_V2_HEADER_SIZE];
+
+ stream.seek(new SimplePositionProvider(nextGroupOffset, 0));
+ IllegalArgumentException error = assertThrows(
+ IllegalArgumentException.class, () -> {
+ int offset = 0;
+ while (offset < rleDirectRun.length) {
+ offset += stream.read(
+ rleDirectRun, offset, rleDirectRun.length - offset);
Review Comment:
The read loop assumes `stream.read(...)` always returns a positive count;
however `InStream#read(byte[],...)` returns `-1` at EOF. If the truncated
stream hits EOF without throwing, `offset += -1` will lead to an infinite loop
inside `assertThrows`. Guard against `-1` (e.g., fail/break) to keep the test
from hanging and to produce a deterministic assertion failure.
##########
java/core/src/java/org/apache/orc/impl/RecordReaderUtils.java:
##########
@@ -300,6 +302,8 @@ public static int
getIndexPosition(OrcProto.ColumnEncoding.Kind columnEncoding,
// the maximum byte width for each value
static final int MAX_BYTE_WIDTH =
SerializationUtils.decodeBitWidth(SerializationUtils.FixedBitSizes.SIXTYFOUR.ordinal())
/ 8;
+ // RLEv2 DIRECT run header size in bytes
+ public static final int RLE_V2_HEADER_SIZE = 2;
Review Comment:
`RLE_V2_HEADER_SIZE` is declared `public`, while related internal sizing
constants (`MAX_VALUES_LENGTH`, `MAX_BYTE_WIDTH`) are package-private. Since
this is an impl/internal class and current usages are within the same package,
consider making this constant package-private as well to avoid unintentionally
expanding the public surface area.
##########
java/core/src/test/org/apache/orc/impl/TestInStream.java:
##########
@@ -1000,4 +1005,67 @@ public void testStreamResetWithoutIncreasedLength()
throws IOException {
byte[] inBuffer = new byte[5];
assertEquals(5, inStream.read(inBuffer));
}
+
+ /**
+ * Demonstrates that the old estimateRgEndOffset slop calculation is
insufficient.
+ * When a compressed stream is truncated at the old estimated end offset,
+ * reading a full RLE v2 DIRECT run fails because the estimated slop doesn't
+ * account for enough compressed blocks.
+ */
+ @Test
+ public void testTruncatedRleV2DirectRunAtEstimatedEndFails() throws
Exception {
+ final int bufferSize = 1024;
+ final int chunkSize = OutStream.HEADER_SIZE + bufferSize;
+ final int nextGroupOffset = bufferSize;
+ final int oldStretchFactor =
+ 2 + (MAX_VALUES_LENGTH * MAX_BYTE_WIDTH - 1) / bufferSize;
+ final int oldEstimatedEnd = nextGroupOffset + oldStretchFactor * chunkSize;
Review Comment:
`nextGroupOffset` is set to `bufferSize`, but for compressed streams ORC
row-index positions are in *compressed* bytes (see `OutStream#getPosition`),
which include the 3-byte chunk header. With `bufferSize=1024`, the first chunk
boundary is typically `OutStream.HEADER_SIZE + bufferSize` (=1027) when stored
as original. Using 1024 seeks into the middle of a chunk and can trigger
"Buffer size too small" for the wrong reason. Derive `nextGroupOffset` from a
real position (e.g., `chunkSize` or by recording `OutStream` positions) so the
test targets the slop-estimation behavior.
##########
java/core/src/java/org/apache/orc/impl/RecordReaderUtils.java:
##########
@@ -217,7 +217,9 @@ public static long estimateRgEndOffset(boolean isCompressed,
// Stretch the slop by a factor to safely accommodate following
compression blocks.
// We need to calculate the maximum number of blocks(stretchFactor) by
bufferSize accordingly.
if (isCompressed) {
- int stretchFactor = 2 + (MAX_VALUES_LENGTH * MAX_BYTE_WIDTH - 1) /
bufferSize;
+ // RLEv2 DIRECT runs can need a 2-byte header in addition to their value
payload.
+ int maxRleDirectRunSize = MAX_VALUES_LENGTH * MAX_BYTE_WIDTH + 2;
+ int stretchFactor = 2 + (maxRleDirectRunSize - 1) / bufferSize;
Review Comment:
`estimateRgEndOffset` introduces `RLE_V2_HEADER_SIZE`, but the calculation
still hard-codes `+ 2`. Use the constant (e.g., `MAX_VALUES_LENGTH *
MAX_BYTE_WIDTH + RLE_V2_HEADER_SIZE`) to avoid duplicated magic numbers and
keep the method in sync with the constant.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]