iemejia commented on code in PR #55919:
URL: https://github.com/apache/spark/pull/55919#discussion_r3398876801
##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedReaderBase.java:
##########
@@ -27,6 +30,43 @@
*/
public class VectorizedReaderBase extends ValuesReader implements
VectorizedValuesReader {
+ /**
+ * Encodes an unsigned long as a minimal big-endian two's-complement byte
array
+ * compatible with {@link java.math.BigInteger} encoding. The result is
written into
+ * the backing array of {@code buf} (which must have capacity >= 9). Returns
the
+ * start offset; the valid bytes are {@code buf.array()[start .. 8]} (length
= 9 - start).
+ *
+ * <p>This avoids the per-value overhead of
+ * {@code new BigInteger(Long.toUnsignedString(v)).toByteArray()} which
allocates a
+ * String, a BigInteger, and a byte[] on every call.
+ */
+ static int encodeUnsignedLongBigEndian(long v, ByteBuffer buf) {
Review Comment:
Done. Adopted this implementation — eliminates the ByteBuffer entirely and
uses `Long.numberOfLeadingZeros` to compute the start offset directly. Also
updated all call sites to pass `byte[]` instead of `ByteBuffer`.
##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedReaderBase.java:
##########
@@ -27,6 +30,43 @@
*/
public class VectorizedReaderBase extends ValuesReader implements
VectorizedValuesReader {
+ /**
+ * Encodes an unsigned long as a minimal big-endian two's-complement byte
array
+ * compatible with {@link java.math.BigInteger} encoding. The result is
written into
+ * the backing array of {@code buf} (which must have capacity >= 9). Returns
the
+ * start offset; the valid bytes are {@code buf.array()[start .. 8]} (length
= 9 - start).
+ *
+ * <p>This avoids the per-value overhead of
+ * {@code new BigInteger(Long.toUnsignedString(v)).toByteArray()} which
allocates a
+ * String, a BigInteger, and a byte[] on every call.
+ */
+ static int encodeUnsignedLongBigEndian(long v, ByteBuffer buf) {
+ byte[] scratch = buf.array();
+ // ByteBuffer is big-endian by default; writes 8 bytes MSB-first at offset
1.
+ // Always write before the zero-check so that the buffer is current even
when reused.
+ buf.putLong(1, v);
+ if (v == 0L) {
+ return 8; // scratch[8] is already 0x00; caller writes 9 - 8 = 1 byte:
[0x00]
+ }
+ // Find the first non-zero byte (minimal encoding).
+ int start = 1;
+ while (scratch[start] == 0) start++;
+ // Prepend 0x00 sign byte if MSB of the first significant byte is set.
+ if ((scratch[start] & 0x80) != 0) scratch[--start] = 0;
+ return start;
+ }
+
+ /**
+ * Convenience: encodes an unsigned long and returns a new byte[] with the
minimal
+ * BigInteger-compatible encoding. Use {@link
#encodeUnsignedLongBigEndian(long, ByteBuffer)}
+ * with a reusable buffer in hot paths to avoid allocation.
+ */
+ static byte[] unsignedLongToBytesBigEndian(long v) {
+ ByteBuffer buf = ByteBuffer.allocate(9);
Review Comment:
Done. Replaced with this approach — allocates only the minimal-length array
directly without going through a 9-byte scratch buffer.
##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedDeltaBinaryPackedReader.java:
##########
@@ -328,4 +436,12 @@ private void skipValues(int total) {
readValues(total, null, -1, (w, r, v) -> {});
}
+ // Reusable buffer for unsigned long -> BigInteger encoding (9 bytes: sign +
8 value bytes).
+ private final ByteBuffer unsignedLongBuf = ByteBuffer.allocate(9);
Review Comment:
Done.
##########
sql/core/benchmarks/VectorizedDeltaReaderBenchmark-jdk21-results.txt:
##########
@@ -2,92 +2,92 @@
DELTA_BINARY_PACKED INT32
================================================================================================
-OpenJDK 64-Bit Server VM 21.0.11+10-LTS on Linux 6.17.0-1011-azure
-AMD EPYC 7763 64-Core Processor
+OpenJDK 64-Bit Server VM 21.0.11+10-LTS on Linux 6.17.0-1015-azure
+AMD EPYC 9V74 80-Core Processor
Review Comment:
Oh that's a mistake, let me retrigger the benchmark again to see if we match
the AMD EPYC 7763 CPU for the diff.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]