wombatu-kun opened a new pull request, #16718: URL: https://github.com/apache/iceberg/pull/16718
## What

`FixedSizeBinaryReader` in the vectorized Parquet reader (the path for Iceberg `UUID`, `fixed[N]` and high-precision `decimal(p>18)`, all stored as Parquet `FIXED_LEN_BYTE_ARRAY`) materialized values one at a time: `readBinary()`, wrap in a `ByteBuffer`, copy into a scratch array, then `FixedSizeBinaryVector.set()`.

The numeric readers beside it already blit an entire non-null run into the Arrow data buffer with a single `ArrowBuf.setBytes` memcpy (via `readLongs`/`readValues`). Plain `FIXED_LEN_BYTE_ARRAY` is stored contiguously with no length prefix, byte-identical to a `FixedSizeBinaryVector` data buffer, so the per-value loop was pure overhead.

This adds the missing fixed-width sibling of those bulk reads: `readFixedLengthBytes` on `VectorizedValuesReader`, plus a `nextRleBatch` fast path in `FixedSizeBinaryReader` that copies a non-null run in one memcpy. A `supportsBulkFixedLengthRead()` capability flag (default false, true only for the plain reader) keeps non-contiguous encoders such as byte-stream-split on their existing per-value path.

## Benchmark

`VectorizedReadFixedWidthBinaryBenchmark` (new): 1.5M required rows of `uuid` + `decimal(38,0)` + `fixed[16]`, read via `VectorizedTableScanIterable`. JMH AverageTime, `-prof gc`.

| compression | baseline | this PR | speedup |
| --- | --- | --- | --- |
| gzip | 275.98 ms/op | 170.96 ms/op | 1.6x |
| uncompressed | 171.30 ms/op | 65.91 ms/op | 2.6x |

Heap allocation per op is unchanged: the per-value `Binary`/`ByteBuffer` wrappers were already being scalar-replaced by the JIT, so the win is CPU (the bulk memcpy replaces millions of per-value calls, bounds checks and copies), not allocation.

## Testing

Output is byte-identical. Covered by the existing `TestArrowReader` (required and nullable `uuid`/`fixed`/`decimal` columns exercise both the bulk run and the per-value null fallback).
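## Illustration

To make the per-value versus bulk distinction concrete, here is a minimal sketch that uses only `java.nio` and plain byte arrays. It is not the code in this PR: the class `FixedWidthBulkCopySketch` and its methods `copyPerValue`, `copyBulk` and `samplePage` are hypothetical names, and a `byte[]` stands in for the Arrow data buffer. It only shows why a contiguous run of fixed-width values with no length prefix can be moved with one copy and still yield the same bytes as the per-value loop.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical illustration only; none of these names exist in Iceberg or Arrow.
final class FixedWidthBulkCopySketch {

  // Per-value path: read one fixed-width value into a scratch array,
  // then copy the scratch array into its slot in the destination.
  static void copyPerValue(
      ByteBuffer page, byte[] dest, int destValueIndex, int numValues, int typeWidth) {
    byte[] scratch = new byte[typeWidth];
    for (int i = 0; i < numValues; i++) {
      page.get(scratch, 0, typeWidth);
      System.arraycopy(scratch, 0, dest, (destValueIndex + i) * typeWidth, typeWidth);
    }
  }

  // Bulk path: the values are contiguous and have no length prefix, so the
  // whole non-null run is a single range of numValues * typeWidth bytes.
  static void copyBulk(
      ByteBuffer page, byte[] dest, int destValueIndex, int numValues, int typeWidth) {
    page.get(dest, destValueIndex * typeWidth, numValues * typeWidth);
  }

  // Builds a stand-in for a plain-encoded page: numValues values of typeWidth
  // bytes each, laid out back to back.
  static byte[] samplePage(int numValues, int typeWidth) {
    byte[] page = new byte[numValues * typeWidth];
    for (int i = 0; i < page.length; i++) {
      page[i] = (byte) (i * 31 + 7);
    }
    return page;
  }

  public static void main(String[] args) {
    int typeWidth = 16; // e.g. a UUID or fixed[16]
    int numValues = 1_000;
    byte[] page = samplePage(numValues, typeWidth);

    byte[] perValue = new byte[numValues * typeWidth];
    byte[] bulk = new byte[numValues * typeWidth];
    copyPerValue(ByteBuffer.wrap(page), perValue, 0, numValues, typeWidth);
    copyBulk(ByteBuffer.wrap(page), bulk, 0, numValues, typeWidth);

    System.out.println("identical output: " + Arrays.equals(perValue, bulk));
  }
}
```

Both methods advance the source buffer by the same number of bytes, so a caller could alternate between them (bulk for non-null runs, per-value where nulls are interleaved) without losing its position. That mirrors the fallback described under Testing, though the real reader's control flow may differ.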
