pchintar opened a new issue, #10006:
URL: https://github.com/apache/arrow-rs/issues/10006
## Description
In `parquet/src/encodings/decoding/byte_stream_split_decoder.rs`,
`BYTE_STREAM_SPLIT` decoding currently reconstructs values using a nested
scalar loop over values and byte streams in `join_streams_const` and
`join_streams_variable`.
The current implementation iterates value-by-value while performing strided
reads across the split byte streams, which results in poor memory locality
during reconstruction of the original value layout.
This impacts the `BYTE_STREAM_SPLIT` decoding benchmarks in
`parquet/benches/encoding.rs`, particularly for floating-point and fixed-length
byte array decoding.
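
For context, the layout the decoder has to undo can be sketched as follows. This is an illustration only: `byte_stream_split_f32` is a hypothetical helper, not a function from the `parquet` crate. It assumes little-endian `f32` values and places byte `j` of value `i` at offset `j * num_values + i`, so the bytes of any single value sit `num_values` apart in the encoded buffer.

```rust
/// Hypothetical helper (not part of arrow-rs): encode `f32` values into the
/// BYTE_STREAM_SPLIT layout, i.e. four back-to-back byte streams.
fn byte_stream_split_f32(values: &[f32]) -> Vec<u8> {
    let num_values = values.len();
    let mut out = vec![0u8; num_values * 4];
    for (i, value) in values.iter().enumerate() {
        // Byte `j` of value `i` goes into stream `j` at position `i`.
        for (j, byte) in value.to_le_bytes().iter().enumerate() {
            out[j * num_values + i] = *byte;
        }
    }
    out
}
```

Decoding reverses this mapping, which is why reading one value at a time touches every stream once.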
---
## Root Cause
The current reconstruction logic processes values in a scalar, value-major
order:
```rust
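// Outer loop: one iteration per reconstructed value.
for i in 0..dst.len() / TYPE_SIZE {
    // Inner loop: one strided read per byte stream.
    for j in 0..TYPE_SIZE {
        dst[i * TYPE_SIZE + j] = sub_src[i + j * stride];
    }
}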
```
This results in:
* strided memory access patterns across byte streams
* reduced cache locality
* limited compiler vectorization opportunities
* increased memory access overhead during reconstruction
The issue becomes more pronounced as the type width grows, because each value
is gathered from `TYPE_SIZE` separate streams: 4 strided reads for `f32` and
8 for `f64`.
---
## Proposed Solution
Rework the reconstruction logic to process values in contiguous blocks
instead of value-by-value scalar iteration.
This improves cache locality by reading contiguous regions from each byte
stream before writing reconstructed values back into the destination buffer.
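
One possible shape for such a blocked loop is sketched below. It is not the arrow-rs implementation: the function names, the `BLOCK` constant and its value of 64, and the explicit `stride` parameter are all assumptions made for illustration. The scalar version mirrors the loop quoted under Root Cause so the two can be compared.

```rust
/// Hypothetical block size in values; an assumption, not taken from arrow-rs.
const BLOCK: usize = 64;

/// Value-major reference version, mirroring the loop quoted in the issue.
fn join_streams_scalar<const TYPE_SIZE: usize>(src: &[u8], dst: &mut [u8], stride: usize) {
    for i in 0..dst.len() / TYPE_SIZE {
        for j in 0..TYPE_SIZE {
            dst[i * TYPE_SIZE + j] = src[i + j * stride];
        }
    }
}

/// Blocked sketch: for each block of values, read a contiguous run from each
/// byte stream in turn and scatter it into that block's output bytes.
fn join_streams_blocked<const TYPE_SIZE: usize>(src: &[u8], dst: &mut [u8], stride: usize) {
    let num_values = dst.len() / TYPE_SIZE;
    let mut start = 0;
    while start < num_values {
        // The last block may be shorter than BLOCK.
        let len = BLOCK.min(num_values - start);
        let out = &mut dst[start * TYPE_SIZE..(start + len) * TYPE_SIZE];
        for j in 0..TYPE_SIZE {
            // Contiguous run of `len` bytes from stream `j`.
            let stream = &src[j * stride + start..j * stride + start + len];
            for (value, &byte) in out.chunks_exact_mut(TYPE_SIZE).zip(stream) {
                value[j] = byte;
            }
        }
        start += len;
    }
}
```

Both functions produce identical output. The blocked one keeps the destination block small enough to stay in cache while each stream is read sequentially. Whether this is actually faster, and which block size works best, would need to be confirmed with the benchmarks in `parquet/benches/encoding.rs`.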