iemejia commented on code in PR #55919:
URL: https://github.com/apache/spark/pull/55919#discussion_r3345494876
##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedDeltaBinaryPackedReader.java:
##########
@@ -215,6 +217,114 @@ public void skipLongs(int total) {
skipValues(total);
}
+ // ---- Bulk read helpers for readIntegers / readLongs
----------------------------
+ //
+ // The generic readValues() path dispatches a lambda per value. For the two
most
+ // common callers (readIntegers, readLongs) we can do much better: compute a
prefix
+ // sum over the unpacked deltas in-place, then bulk-copy the result into the
column
+ // vector with putInts / putLongs (backed by System.arraycopy on-heap).
+
+ /**
+ * Callback for writing a chunk of prefix-summed absolute values from
+ * {@code unpackedValuesBuffer} into a column vector. Called once per
mini-block
+ * (not per value), so lambda overhead is negligible.
+ */
+ @FunctionalInterface
+ private interface BulkWriter {
+ void write(WritableColumnVector c, int rowId, long[] values, int start,
int count);
+ }
+
+ /** Narrows long[] -> int[] scratch and bulk-writes via putInts. */
+ private void bulkWriteInts(WritableColumnVector c, int rowId,
+ long[] buf, int start, int count) {
+ if (intScratchBuffer == null) {
Review Comment:
Done. Moved `intScratchBuffer` allocation to `initFromPage` alongside
`unpackedValuesBuffer` and removed the lazy null-check from `bulkWriteInts`.
##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedReaderBase.java:
##########
@@ -27,6 +30,44 @@
*/
public class VectorizedReaderBase extends ValuesReader implements
VectorizedValuesReader {
+ /**
+ * Encodes an unsigned long as a minimal big-endian two's-complement byte
array
+ * compatible with {@link java.math.BigInteger} encoding. The result is
written into
+ * the backing array of {@code buf} (which must have capacity >= 9). Returns
the
+ * start offset; the valid bytes are {@code buf.array()[start .. 8]} (length
= 9 - start).
+ *
+ * <p>This avoids the per-value overhead of
+ * {@code new BigInteger(Long.toUnsignedString(v)).toByteArray()} which
allocates a
+ * String, a BigInteger, and a byte[] on every call.
+ */
+ static int encodeUnsignedLongBigEndian(long v, ByteBuffer buf) {
+ byte[] scratch = buf.array();
+ // ByteBuffer is big-endian by default; writes 8 bytes MSB-first at offset
1.
+ // Always write before the zero-check so that the buffer is current even
when reused.
+ buf.putLong(1, v);
+ if (v == 0L) {
+ scratch[0] = 0;
+ return 0;
Review Comment:
Good catch. Changed to `return 8` so the caller writes the single `0x00`
byte at `scratch[8]` (already zeroed by `putLong`), consistent with the minimal
encoding contract.
##########
sql/core/benchmarks/VectorizedDeltaReaderBenchmark-results.txt:
##########
@@ -2,92 +2,92 @@
DELTA_BINARY_PACKED INT32
================================================================================================
-OpenJDK 64-Bit Server VM 17.0.19+10-LTS on Linux 6.17.0-1011-azure
-AMD EPYC 7763 64-Core Processor
+OpenJDK 64-Bit Server VM 17.0.19+10 on Linux 7.0.0-1004-azure
+AMD EPYC 9V45 96-Core Processor
Review Comment:
Thanks for the instructions! I had to rebase onto upstream master to pick up
the updated benchmark workflow with JDK 25 support and the `create-commit`
option. Triggered all three runs (JDK 17, 21, 25) -- they should commit the
results directly to the branch once complete.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]