[PR] Arrow: bulk-copy fixed-size-binary vectorized reads [iceberg]

via GitHub Sun, 07 Jun 2026 23:37:32 -0700


wombatu-kun opened a new pull request, #16718:
URL: https://github.com/apache/iceberg/pull/16718


   ## What
   
   `FixedSizeBinaryReader` in the vectorized Parquet reader (the path for 
Iceberg `UUID`, `fixed[N]` and high-precision `decimal(p>18)`, all stored as 
Parquet `FIXED_LEN_BYTE_ARRAY`) materialized values one at a time: 
`readBinary()`, wrap in a `ByteBuffer`, copy into a scratch array, then 
`FixedSizeBinaryVector.set()`. The numeric readers beside it already blit an 
entire non-null run into the Arrow data buffer with a single 
`ArrowBuf.setBytes` memcpy (via `readLongs`/`readValues`). Plain 
`FIXED_LEN_BYTE_ARRAY` is stored contiguously with no length prefix, 
byte-identical to a `FixedSizeBinaryVector` data buffer, so the per-value loop 
was pure overhead.
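
   To make the difference concrete, here is a minimal sketch of the two copy strategies using only `java.nio`. It is not the PR's code: `FixedWidthCopySketch`, `copyPerValue` and `copyBulk` are hypothetical names, a heap `ByteBuffer` stands in for the Parquet page, and a `byte[]` stands in for the Arrow data buffer.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class FixedWidthCopySketch {

  // Per-value path (illustrative): one scratch copy and one destination copy per value.
  static void copyPerValue(ByteBuffer page, byte[] dest, int destOffset, int typeWidth, int count) {
    byte[] scratch = new byte[typeWidth];
    for (int i = 0; i < count; i++) {
      page.get(scratch, 0, typeWidth); // advances the page position by one value
      System.arraycopy(scratch, 0, dest, destOffset + i * typeWidth, typeWidth);
    }
  }

  // Bulk path (illustrative): the non-null run is contiguous with no length
  // prefix, so count * typeWidth bytes can be copied in a single call.
  static void copyBulk(ByteBuffer page, byte[] dest, int destOffset, int typeWidth, int count) {
    page.get(dest, destOffset, count * typeWidth);
  }

  public static void main(String[] args) {
    int typeWidth = 16; // e.g. a UUID
    int count = 4;
    byte[] pageBytes = new byte[typeWidth * count];
    for (int i = 0; i < pageBytes.length; i++) {
      pageBytes[i] = (byte) i;
    }

    byte[] perValue = new byte[pageBytes.length];
    byte[] bulk = new byte[pageBytes.length];
    copyPerValue(ByteBuffer.wrap(pageBytes), perValue, 0, typeWidth, count);
    copyBulk(ByteBuffer.wrap(pageBytes), bulk, 0, typeWidth, count);

    // Both strategies should produce the same destination bytes.
    System.out.println("identical: " + Arrays.equals(perValue, bulk));
  }
}
```

   Both methods fill the destination with the same bytes; the bulk variant simply does it in one call per run instead of one call per value.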
   
   This adds the missing fixed-width sibling of those bulk reads: 
`readFixedLengthBytes` on `VectorizedValuesReader`, plus a `nextRleBatch` fast 
path in `FixedSizeBinaryReader` that copies a non-null run in one memcpy. A 
`supportsBulkFixedLengthRead()` capability flag (default false, true only for 
the plain reader) keeps non-contiguous encoders such as byte-stream-split on 
their existing per-value path.
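
   The capability-flag dispatch can be sketched as follows. Only the method names `supportsBulkFixedLengthRead` and `readFixedLengthBytes` come from this PR; their signatures here, the `ValuesReader` interface, both reader classes and `readRun` are assumptions made for illustration, and a `byte[]` again stands in for the Arrow buffer.

```java
import java.nio.ByteBuffer;

public class CapabilityFlagSketch {

  // Hypothetical stand-in for the values-reader interface.
  interface ValuesReader {
    byte[] readValue(int typeWidth);

    // Default is false, so readers must opt in to the bulk path.
    default boolean supportsBulkFixedLengthRead() {
      return false;
    }

    // Assumed signature; the real method may differ.
    default void readFixedLengthBytes(int count, int typeWidth, byte[] dest, int destOffset) {
      throw new UnsupportedOperationException("bulk fixed-length read not supported");
    }
  }

  // Plain encoding: values are contiguous, so a bulk copy is valid.
  static final class PlainReader implements ValuesReader {
    private final ByteBuffer page;

    PlainReader(byte[] bytes) {
      this.page = ByteBuffer.wrap(bytes);
    }

    @Override
    public byte[] readValue(int typeWidth) {
      byte[] value = new byte[typeWidth];
      page.get(value);
      return value;
    }

    @Override
    public boolean supportsBulkFixedLengthRead() {
      return true;
    }

    @Override
    public void readFixedLengthBytes(int count, int typeWidth, byte[] dest, int destOffset) {
      page.get(dest, destOffset, count * typeWidth);
    }
  }

  // Byte-stream-split stand-in: byte k of every value lives in stream k,
  // so values are not contiguous and must be reassembled one at a time.
  static final class StreamSplitReader implements ValuesReader {
    private final byte[] streams;
    private final int numValues;
    private int next = 0;

    StreamSplitReader(byte[] streams, int numValues) {
      this.streams = streams;
      this.numValues = numValues;
    }

    @Override
    public byte[] readValue(int typeWidth) {
      byte[] value = new byte[typeWidth];
      for (int k = 0; k < typeWidth; k++) {
        value[k] = streams[k * numValues + next];
      }
      next++;
      return value;
    }
  }

  // Copies a non-null run, taking the bulk path only when the reader allows it.
  static void readRun(ValuesReader reader, int count, int typeWidth, byte[] dest, int destOffset) {
    if (reader.supportsBulkFixedLengthRead()) {
      reader.readFixedLengthBytes(count, typeWidth, dest, destOffset);
    } else {
      for (int i = 0; i < count; i++) {
        byte[] value = reader.readValue(typeWidth);
        System.arraycopy(value, 0, dest, destOffset + i * typeWidth, typeWidth);
      }
    }
  }
}
```

   Defaulting the flag to false means a new encoding cannot reach the bulk path by accident; it has to declare that its layout is contiguous.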
   
   ## Benchmark
   
   `VectorizedReadFixedWidthBinaryBenchmark` (new): 1.5M required rows of 
`uuid` + `decimal(38,0)` + `fixed[16]`, read via `VectorizedTableScanIterable`. 
JMH AverageTime, `-prof gc`.
   
   | compression | baseline | this PR | speedup |
   | --- | --- | --- | --- |
   | gzip | 275.98 ms/op | 170.96 ms/op | 1.6x |
   | uncompressed | 171.30 ms/op | 65.91 ms/op | 2.6x |
   
   Heap allocation per op is unchanged: the per-value `Binary`/`ByteBuffer` 
wrappers were already being scalar-replaced by the JIT, so the win is CPU (the 
bulk memcpy replaces millions of per-value calls, bounds checks and copies), 
not allocation.
   
   ## Testing
   
   Output is byte-identical to the previous per-value path. Covered by the 
existing `TestArrowReader`: its required and nullable `uuid`/`fixed`/`decimal` 
columns exercise both the bulk run and the per-value null fallback.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

