[PR] Arrow: avoid per-value byte[] in dictionary-encoded fixed-binary decode [iceberg]

via GitHub Sun, 07 Jun 2026 23:38:00 -0700


wombatu-kun opened a new pull request, #16719:
URL: https://github.com/apache/iceberg/pull/16719


   ## What
   
   The dictionary-encoded fixed-binary readers 
(`FixedSizeBinaryDictEncodedReader` and the packed branch of 
`FixedSizeBinaryReader.nextDictEncodedVal`) allocated a `byte[typeWidth]` per 
value, copied the decoded bytes into it, then called 
`FixedSizeBinaryVector.set()`. The numeric dictionary readers in the same file 
write straight into the vector's data buffer. This switches the fixed-binary 
ones to the same pattern: `decodeToBinary(id).toByteBuffer()` plus 
`ArrowBuf.setBytes`, dropping the per-value array. `toByteBuffer()` also 
positions at the dictionary entry's backing offset (the same thing the 
var-width dictionary reader already does), so the decode is correct regardless 
of how the dictionary backs its `Binary` entries.
   
   This is an allocation cleanup, not a bug fix: with the current Parquet 
(1.17.1) the previous `getBytesUnsafe()` returns an exact-size copy, so the 
result was already correct.
   
   ## Benchmark
   
   `VectorizedDictFixedBinaryDecodeBenchmark` (new) drives the eager-dictionary 
fixed-binary decode directly, since end-to-end the path is masked by page 
reading. JMH AverageTime, `-prof gc`, 4096 16-byte values per op:
   
   | per value | before | after |
   | --- | --- | --- |
   | allocation | 32 B | 0 B |
   | time | 15.9 ns | 13.6 ns |
   
   The old path allocated a runtime-sized `byte[]` per value (which the JIT 
cannot scalar-replace); the `toByteBuffer()` plus `setBytes` form lets the JIT 
scalar-replace the wrapper, so per-value heap allocation drops to zero (about 
15% faster).
   
   ## Testing
   
   Adds two unit tests to `TestVectorizedParquetDefinitionLevelReader` that 
drive the eager-dictionary fixed-binary decode (the batch reader and the 
packed-inline branch) with a dictionary entry backed at a non-zero offset. This 
path previously had no direct coverage; the end-to-end tests only hit the lazy 
dictionary path.
   
   ## Note
   
   Companion to #16718 (the bulk-copy fixed-binary read optimization). Both 
touch `VectorizedParquetDefinitionLevelReader.java` in different methods and 
both add `iceberg-arrow` to the JMH `jmhProjects` list, so whichever merges 
second needs a trivial rebase.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] Arrow: avoid per-value byte[] in dictionary-encoded fixed-binary decode [iceberg]

Reply via email to