pchintar opened a new pull request, #10007:
URL: https://github.com/apache/arrow-rs/pull/10007

   # Which issue does this PR close?
   
   - Closes #10006.
   
   # Rationale for this change
   
   `BYTE_STREAM_SPLIT` decoding currently reconstructs values using a nested 
scalar loop with strided reads across byte streams in:
   
   ```text
   parquet/src/encodings/decoding/byte_stream_split_decoder.rs
   ```
   
   Current logic:
   
   ```rust
   for i in 0..dst.len() / TYPE_SIZE {
       for j in 0..TYPE_SIZE {
           dst[i * TYPE_SIZE + j] = sub_src[i + j * stride];
       }
   }
   ```
   
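   For each output value, the inner loop reads one byte from each of the `TYPE_SIZE` streams, so consecutive reads are `stride` bytes apart. This gives poor cache locality and an inefficient memory access pattern while reconstructing the original value layout.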
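   
   To make the layout concrete, here is a minimal standalone sketch of the split and the scalar join. `split_streams` and `join_streams_scalar` are hypothetical helpers written for illustration, not the parquet crate's API, and `stride` is assumed to equal the number of encoded values (the distance between the starts of consecutive byte streams).
   
   ```rust
   // Hypothetical helper (not the parquet crate's API): place byte j of every
   // value into stream j, with the streams laid out back to back.
   fn split_streams<const TYPE_SIZE: usize>(src: &[u8]) -> Vec<u8> {
       let values = src.len() / TYPE_SIZE;
       let mut out = vec![0u8; src.len()];
       for i in 0..values {
           for j in 0..TYPE_SIZE {
               out[j * values + i] = src[i * TYPE_SIZE + j];
           }
       }
       out
   }
   
   // Scalar join mirroring the loop quoted above. `stride` is assumed to be the
   // distance in bytes between the starts of consecutive streams.
   fn join_streams_scalar<const TYPE_SIZE: usize>(sub_src: &[u8], dst: &mut [u8], stride: usize) {
       for i in 0..dst.len() / TYPE_SIZE {
           for j in 0..TYPE_SIZE {
               // Each step of the inner loop jumps `stride` bytes in the source.
               dst[i * TYPE_SIZE + j] = sub_src[i + j * stride];
           }
       }
   }
   
   fn main() {
       let values: Vec<f32> = vec![1.0, 2.5, -3.75];
       let bytes: Vec<u8> = values.iter().flat_map(|v| v.to_le_bytes()).collect();
   
       let encoded = split_streams::<4>(&bytes);
       let mut decoded = vec![0u8; bytes.len()];
       join_streams_scalar::<4>(&encoded, &mut decoded, values.len());
   
       // Joining the streams restores the original byte layout.
       assert_eq!(decoded, bytes);
   }
   ```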
   
   # What changes are included in this PR?
   
   This PR changes the reconstruction logic in `join_streams_const` and 
`join_streams_variable` to process values in contiguous blocks instead of 
value-by-value scalar iteration.
   
   The updated implementation reads contiguous regions from each byte stream 
before writing reconstructed values back into the destination buffer.
   
   Example:
   
   ```rust
   for base in (0..values).step_by(BLOCK) {
       let len = (values - base).min(BLOCK);
   
       for byte_idx in 0..TYPE_SIZE {
           let src_start = byte_idx * stride + base;
           let src_block = &src[src_start..src_start + len];
   
           for (idx, value) in src_block.iter().copied().enumerate() {
               dst[(base + idx) * TYPE_SIZE + byte_idx] = value;
           }
       }
   }
   ```
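   
   As a rough way to check that the blocked traversal is equivalent to the scalar one, the sketch below wraps both loops in standalone functions and compares their output. The function names, the `BLOCK` value of 64, and the assumption that `stride` equals the number of values are all illustrative choices, not taken from the PR.
   
   ```rust
   // Assumed tuning constant; the PR excerpt does not state the value it uses.
   const BLOCK: usize = 64;
   
   // Reference implementation: the original value-by-value loop.
   fn join_streams_scalar<const TYPE_SIZE: usize>(src: &[u8], dst: &mut [u8], stride: usize) {
       for i in 0..dst.len() / TYPE_SIZE {
           for j in 0..TYPE_SIZE {
               dst[i * TYPE_SIZE + j] = src[i + j * stride];
           }
       }
   }
   
   // Blocked variant: read up to BLOCK contiguous bytes from one stream at a
   // time, then scatter them into the destination at TYPE_SIZE intervals.
   fn join_streams_blocked<const TYPE_SIZE: usize>(src: &[u8], dst: &mut [u8], stride: usize) {
       let values = dst.len() / TYPE_SIZE;
       for base in (0..values).step_by(BLOCK) {
           // The last block may be shorter than BLOCK.
           let len = (values - base).min(BLOCK);
           for byte_idx in 0..TYPE_SIZE {
               let src_start = byte_idx * stride + base;
               let src_block = &src[src_start..src_start + len];
               for (idx, value) in src_block.iter().copied().enumerate() {
                   dst[(base + idx) * TYPE_SIZE + byte_idx] = value;
               }
           }
       }
   }
   
   fn main() {
       const TYPE_SIZE: usize = 8;
       // 100 is deliberately not a multiple of BLOCK, so the tail block is exercised.
       let values: usize = 100;
       let src: Vec<u8> = (0..values * TYPE_SIZE).map(|i| (i % 251) as u8).collect();
   
       let mut expected = vec![0u8; values * TYPE_SIZE];
       let mut actual = vec![0u8; values * TYPE_SIZE];
       join_streams_scalar::<TYPE_SIZE>(&src, &mut expected, values);
       join_streams_blocked::<TYPE_SIZE>(&src, &mut actual, values);
   
       assert_eq!(expected, actual);
   }
   ```
   
   Both functions write every destination byte exactly once, so they differ only in traversal order: the blocked version walks each stream sequentially within a block instead of jumping `stride` bytes on every read.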
   
   # Are these changes tested?
   
   Existing parquet tests pass:
   
   ```bash
   cargo test -p parquet byte_stream_split -- --nocapture
   cargo test -p parquet encoding -- --nocapture
   ```
   
   Benchmarks from `parquet/benches/encoding.rs` show considerable improvements 
for `BYTE_STREAM_SPLIT` decoding:
   
   ```text
   cargo bench -p parquet --bench encoding --all-features -- "decoding: dtype=f32, encoding=BYTE_STREAM_SPLIT"
   
   cargo bench -p parquet --bench encoding --all-features -- "decoding: dtype=f64, encoding=BYTE_STREAM_SPLIT"
   ```
   
   # Are there any user-facing changes?
   
   No.

