MaxGekk commented on code in PR #55922:
URL: https://github.com/apache/spark/pull/55922#discussion_r3458352302


##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java:
##########
@@ -1000,11 +1000,15 @@ private boolean readNextGroup() {
             this.currentBuffer = new int[this.currentCount];
           }
           currentBufferIdx = 0;
+          // Slice all packed bytes in one call (numGroups groups x bitWidth 
bytes each)
+          // instead of one slice per group, avoiding per-group ByteBuffer 
allocation.
+          int totalBytes = numGroups * bitWidth;
+          ByteBuffer packed = in.slice(totalBytes);

Review Comment:
   Non-blocking (test-coverage hardening, not a defect): this single-slice 
rewrite is provably equivalent to the old per-group `slice(bitWidth)` loop, but 
it introduces a distinct runtime path. When `in` is a `MultiBufferInputStream` 
and the packed run spans a buffer boundary, `slice()` returns a 
freshly-allocated contiguous buffer with `position() == 0`, so `pos` is based 
at 0 rather than at the absolute offset of the view case. 
`VectorizedRleValuesReaderSuite` only ever feeds a single buffer 
(`ByteBufferInputStream.wrap(ByteBuffer.wrap(...))` -> 
`SingleBufferInputStream`), so the base-0 path is exercised by reasoning only.
   
   Consider adding a test that feeds a multi-buffer stream whose packed run 
crosses a boundary, to cover the base-0 branch directly. (The equivalence holds 
regardless; this just hardens the proof.)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to