pchintar opened a new pull request, #10007:
URL: https://github.com/apache/arrow-rs/pull/10007
# Which issue does this PR close?
- Closes #10006.
# Rationale for this change
`BYTE_STREAM_SPLIT` decoding currently reconstructs values using a nested
scalar loop with strided reads across byte streams in:
```text
parquet/src/encodings/decoding/byte_stream_split_decoder.rs
```
Current logic:
```rust
for i in 0..dst.len() / TYPE_SIZE {
    for j in 0..TYPE_SIZE {
        dst[i * TYPE_SIZE + j] = sub_src[i + j * stride];
    }
}
```
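Each iteration of the inner loop reads from a different byte stream, so
consecutive reads are `stride` bytes apart. This results in poor cache
locality and inefficient memory access patterns during reconstruction of the
original value layout.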
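To make the layout concrete, here is a minimal, self-contained sketch of a `BYTE_STREAM_SPLIT` round trip for `f32` values. The helpers `split_streams`, `join_scalar`, and `roundtrip_f32` are hypothetical names used only for illustration and are not part of the parquet crate; `join_scalar` mirrors the loop quoted above, with `stride` assumed to equal the number of values.

```rust
/// Hypothetical helper (not from the parquet crate): scatter byte `j` of
/// every value into stream `j`, with the streams stored back to back.
fn split_streams<const TYPE_SIZE: usize>(src: &[u8]) -> Vec<u8> {
    let values = src.len() / TYPE_SIZE;
    let mut out = vec![0u8; src.len()];
    for i in 0..values {
        for j in 0..TYPE_SIZE {
            out[j * values + i] = src[i * TYPE_SIZE + j];
        }
    }
    out
}

/// Scalar join mirroring the loop quoted above. The inner loop reads one
/// byte from each stream, so successive reads are `stride` bytes apart.
fn join_scalar<const TYPE_SIZE: usize>(src: &[u8], dst: &mut [u8], stride: usize) {
    for i in 0..dst.len() / TYPE_SIZE {
        for j in 0..TYPE_SIZE {
            dst[i * TYPE_SIZE + j] = src[i + j * stride];
        }
    }
}

/// Encode `values` into byte streams, then decode them again.
fn roundtrip_f32(values: &[f32]) -> Vec<f32> {
    let raw: Vec<u8> = values.iter().flat_map(|v| v.to_le_bytes()).collect();
    let encoded = split_streams::<4>(&raw);
    let mut decoded = vec![0u8; raw.len()];
    // Assumption: the stride equals the number of encoded values.
    join_scalar::<4>(&encoded, &mut decoded, values.len());
    decoded
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}
```

With three input values, stream 0 holds the first byte of each value, stream 1 the second byte, and so on, which is why the join has to gather one byte per stream for every reconstructed value.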
# What changes are included in this PR?
This PR changes the reconstruction logic in `join_streams_const` and
`join_streams_variable` to process values in contiguous blocks instead of
value-by-value scalar iteration.
The updated implementation reads contiguous regions from each byte stream
before writing reconstructed values back into the destination buffer.
Example:
```rust
for base in (0..values).step_by(BLOCK) {
    let len = (values - base).min(BLOCK);
    for byte_idx in 0..TYPE_SIZE {
        let src_start = byte_idx * stride + base;
        let src_block = &src[src_start..src_start + len];
        for (idx, value) in src_block.iter().copied().enumerate() {
            dst[(base + idx) * TYPE_SIZE + byte_idx] = value;
        }
    }
}
```
```
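As a sanity check on the blocked loop, the sketch below wraps it in a standalone function and compares its output with the scalar loop on the same input. This is an illustration under stated assumptions, not the code in the PR: `join_blocked`, `join_reference`, and `joins_agree` are hypothetical names, and the `BLOCK` value of 64 is assumed because the description does not state the size actually used.

```rust
/// Assumed block size for illustration; the PR description does not say
/// which value the real implementation uses.
const BLOCK: usize = 64;

/// Blocked join: read a contiguous run from one byte stream at a time and
/// scatter it into the destination at a `TYPE_SIZE` step.
fn join_blocked<const TYPE_SIZE: usize>(src: &[u8], dst: &mut [u8], stride: usize) {
    let values = dst.len() / TYPE_SIZE;
    for base in (0..values).step_by(BLOCK) {
        // The last block may be shorter than BLOCK.
        let len = (values - base).min(BLOCK);
        for byte_idx in 0..TYPE_SIZE {
            let src_start = byte_idx * stride + base;
            let src_block = &src[src_start..src_start + len];
            for (idx, value) in src_block.iter().copied().enumerate() {
                dst[(base + idx) * TYPE_SIZE + byte_idx] = value;
            }
        }
    }
}

/// Value-by-value scalar join, used here only as a reference result.
fn join_reference<const TYPE_SIZE: usize>(src: &[u8], dst: &mut [u8], stride: usize) {
    for i in 0..dst.len() / TYPE_SIZE {
        for j in 0..TYPE_SIZE {
            dst[i * TYPE_SIZE + j] = src[i + j * stride];
        }
    }
}

/// Returns true when both joins produce identical output for `values`
/// four-byte values filled with a deterministic byte pattern.
fn joins_agree(values: usize) -> bool {
    const TYPE_SIZE: usize = 4;
    let src: Vec<u8> = (0..values * TYPE_SIZE).map(|i| (i % 251) as u8).collect();
    let mut expected = vec![0u8; src.len()];
    let mut actual = vec![0u8; src.len()];
    join_reference::<TYPE_SIZE>(&src, &mut expected, values);
    join_blocked::<TYPE_SIZE>(&src, &mut actual, values);
    expected == actual
}
```

Value counts that are zero, smaller than `BLOCK`, exactly `BLOCK`, and not a multiple of `BLOCK` are the cases worth checking, since they exercise the short final block.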
# Are these changes tested?
Existing parquet tests pass:
```bash
cargo test -p parquet byte_stream_split -- --nocapture
cargo test -p parquet encoding -- --nocapture
```
Benchmarks from `parquet/benches/encoding.rs` show considerable improvements
for `BYTE_STREAM_SPLIT` decoding:
```bash
cargo bench -p parquet --bench encoding --all-features -- "decoding: dtype=f32, encoding=BYTE_STREAM_SPLIT"
cargo bench -p parquet --bench encoding --all-features -- "decoding: dtype=f64, encoding=BYTE_STREAM_SPLIT"
```
# Are there any user-facing changes?
No.