Re: [PR] [SPARK-56892][SQL] Bulk read optimization for Parquet DELTA_BINARY_PACKED decoding [spark]

via GitHub Thu, 11 Jun 2026 13:13:03 -0700


iemejia commented on code in PR #55919:
URL: https://github.com/apache/spark/pull/55919#discussion_r3398876801



##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedReaderBase.java:
##########
@@ -27,6 +30,43 @@
  */
 public class VectorizedReaderBase extends ValuesReader implements 
VectorizedValuesReader {
 
+  /**
+   * Encodes an unsigned long as a minimal big-endian two's-complement byte 
array
+   * compatible with {@link java.math.BigInteger} encoding. The result is 
written into
+   * the backing array of {@code buf} (which must have capacity >= 9). Returns 
the
+   * start offset; the valid bytes are {@code buf.array()[start .. 8]} (length 
= 9 - start).
+   *
+   * <p>This avoids the per-value overhead of
+   * {@code new BigInteger(Long.toUnsignedString(v)).toByteArray()} which 
allocates a
+   * String, a BigInteger, and a byte[] on every call.
+   */
+  static int encodeUnsignedLongBigEndian(long v, ByteBuffer buf) {

Review Comment:
   Done. Adopted this implementation — eliminates the ByteBuffer entirely and 
uses `Long.numberOfLeadingZeros` to compute the start offset directly. Also 
updated all call sites to pass `byte[]` instead of `ByteBuffer`.



##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedReaderBase.java:
##########
@@ -27,6 +30,43 @@
  */
 public class VectorizedReaderBase extends ValuesReader implements 
VectorizedValuesReader {
 
+  /**
+   * Encodes an unsigned long as a minimal big-endian two's-complement byte 
array
+   * compatible with {@link java.math.BigInteger} encoding. The result is 
written into
+   * the backing array of {@code buf} (which must have capacity >= 9). Returns 
the
+   * start offset; the valid bytes are {@code buf.array()[start .. 8]} (length 
= 9 - start).
+   *
+   * <p>This avoids the per-value overhead of
+   * {@code new BigInteger(Long.toUnsignedString(v)).toByteArray()} which 
allocates a
+   * String, a BigInteger, and a byte[] on every call.
+   */
+  static int encodeUnsignedLongBigEndian(long v, ByteBuffer buf) {
+    byte[] scratch = buf.array();
+    // ByteBuffer is big-endian by default; writes 8 bytes MSB-first at offset 
1.
+    // Always write before the zero-check so that the buffer is current even 
when reused.
+    buf.putLong(1, v);
+    if (v == 0L) {
+      return 8;  // scratch[8] is already 0x00; caller writes 9 - 8 = 1 byte: 
[0x00]
+    }
+    // Find the first non-zero byte (minimal encoding).
+    int start = 1;
+    while (scratch[start] == 0) start++;
+    // Prepend 0x00 sign byte if MSB of the first significant byte is set.
+    if ((scratch[start] & 0x80) != 0) scratch[--start] = 0;
+    return start;
+  }
+
+  /**
+   * Convenience: encodes an unsigned long and returns a new byte[] with the 
minimal
+   * BigInteger-compatible encoding. Use {@link 
#encodeUnsignedLongBigEndian(long, ByteBuffer)}
+   * with a reusable buffer in hot paths to avoid allocation.
+   */
+  static byte[] unsignedLongToBytesBigEndian(long v) {
+    ByteBuffer buf = ByteBuffer.allocate(9);

Review Comment:
   Done. Replaced with this approach — allocates only the minimal-length array 
directly without going through a 9-byte scratch buffer.



##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedDeltaBinaryPackedReader.java:
##########
@@ -328,4 +436,12 @@ private void skipValues(int total) {
     readValues(total, null, -1, (w, r, v) -> {});
   }
 
+  // Reusable buffer for unsigned long -> BigInteger encoding (9 bytes: sign + 
8 value bytes).
+  private final ByteBuffer unsignedLongBuf = ByteBuffer.allocate(9);

Review Comment:
   Done.



##########
sql/core/benchmarks/VectorizedDeltaReaderBenchmark-jdk21-results.txt:
##########
@@ -2,92 +2,92 @@
 DELTA_BINARY_PACKED INT32
 
================================================================================================
 
-OpenJDK 64-Bit Server VM 21.0.11+10-LTS on Linux 6.17.0-1011-azure
-AMD EPYC 7763 64-Core Processor
+OpenJDK 64-Bit Server VM 21.0.11+10-LTS on Linux 6.17.0-1015-azure
+AMD EPYC 9V74 80-Core Processor

Review Comment:
   Oh that's a mistake, let me retrigger the benchmark again to see if we match 
the AMD EPYC 7763 CPU for the diff.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-56892][SQL] Bulk read optimization for Parquet DELTA_BINARY_PACKED decoding [spark]

Reply via email to