pchintar opened a new issue, #10006:
URL: https://github.com/apache/arrow-rs/issues/10006

   ## Description
   
   In `parquet/src/encodings/decoding/byte_stream_split_decoder.rs`, 
`BYTE_STREAM_SPLIT` decoding currently reconstructs values using a nested 
scalar loop over values and byte streams in `join_streams_const` and 
`join_streams_variable`.
   
   The current implementation iterates value-by-value while performing strided 
reads across the split byte streams, which results in poor memory locality 
during reconstruction of the original value layout.
   
   This impacts the `BYTE_STREAM_SPLIT` decoding benchmarks in 
`parquet/benches/encoding.rs`, particularly for floating-point and fixed-length 
byte array decoding.
   
   ---
   
   ## Root Cause
   
   The current reconstruction logic processes values in a scalar, value-major 
order:
   
   ```rust
   for i in 0..dst.len() / TYPE_SIZE {
       for j in 0..TYPE_SIZE {
           dst[i * TYPE_SIZE + j] = sub_src[i + j * stride];
       }
   }
   ```
   
   This results in:
   
   * strided memory access patterns across byte streams
   * reduced cache locality
   * limited compiler vectorization opportunities
   * increased memory access overhead during reconstruction
   
   The issue becomes more pronounced as the type width grows, because each 
reconstructed value reads one byte from each of `TYPE_SIZE` streams (4 for 
`f32`, 8 for `f64`).
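   
   To make the access pattern concrete, here is a minimal, self-contained 
sketch of the value-major loop. The names `split_streams` and 
`join_streams_scalar` are illustrative and are not the crate's functions; the 
sketch assumes `stride` equals the number of values per stream and ignores the 
partial-decode offset handled in the real decoder.

```rust
/// Illustrative encoder-side layout: byte `j` of value `i` is stored at
/// `out[j * n + i]`, so each of the TYPE_SIZE streams is contiguous.
fn split_streams<const TYPE_SIZE: usize>(values: &[[u8; TYPE_SIZE]]) -> Vec<u8> {
    let n = values.len();
    let mut out = vec![0u8; n * TYPE_SIZE];
    for (i, value) in values.iter().enumerate() {
        for (j, byte) in value.iter().enumerate() {
            out[j * n + i] = *byte;
        }
    }
    out
}

/// Value-major reconstruction with the same loop shape as the snippet above.
/// For a fixed `i`, consecutive iterations of the inner loop read addresses
/// that are `stride` bytes apart, one from each stream.
fn join_streams_scalar<const TYPE_SIZE: usize>(src: &[u8], dst: &mut [u8], stride: usize) {
    for i in 0..dst.len() / TYPE_SIZE {
        for j in 0..TYPE_SIZE {
            dst[i * TYPE_SIZE + j] = src[i + j * stride];
        }
    }
}
```

   With three `f32` values, `stride` is 3, and rebuilding value 0 reads `src` 
at offsets 0, 3, 6, and 9.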
   
   ---
   
   ## Proposed Solution
   
   Rework the reconstruction logic to process values in contiguous blocks 
instead of value-by-value scalar iteration.
   
   This improves cache locality by reading contiguous regions from each byte 
stream before writing reconstructed values back into the destination buffer.
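   
   One possible shape for such a block-wise loop is sketched below. This is 
an assumption about how the idea could look, not the patch proposed for 
arrow-rs: the function name `join_streams_blocked` and the `BLOCK` size of 64 
are hypothetical and untuned, `TYPE_SIZE` is assumed to be non-zero, and 
whether it is actually faster would need to be confirmed with the benchmarks 
in `parquet/benches/encoding.rs`.

```rust
/// Number of values reconstructed per block (hypothetical, untuned).
const BLOCK: usize = 64;

/// Block-wise reconstruction: within each block of values, walk one stream
/// at a time so that every read is a short contiguous run.
fn join_streams_blocked<const TYPE_SIZE: usize>(src: &[u8], dst: &mut [u8], stride: usize) {
    let num_values = dst.len() / TYPE_SIZE;
    let mut start = 0;
    while start < num_values {
        let len = BLOCK.min(num_values - start);
        // Destination bytes for values `start..start + len`.
        let dst_block = &mut dst[start * TYPE_SIZE..(start + len) * TYPE_SIZE];
        for j in 0..TYPE_SIZE {
            // Contiguous slice of stream `j` covering the same values.
            let stream_start = j * stride + start;
            let stream = &src[stream_start..stream_start + len];
            // Scatter the run into byte position `j` of each value.
            for (value, &byte) in dst_block.chunks_exact_mut(TYPE_SIZE).zip(stream) {
                value[j] = byte;
            }
        }
        start += len;
    }
}
```

   The block keeps the destination region small enough to stay cache-resident 
while each stream is scanned, and the inner loop has a fixed write stride of 
`TYPE_SIZE`, which may give the compiler an easier pattern to vectorize.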

