xuzifu666 commented on issue #9670:
URL: https://github.com/apache/arrow-rs/issues/9670#issuecomment-4258090019

   I'm very interested in this issue and tried it out following your suggestions. It showed at least a 10% improvement in my benchmarks. Below is a before/after comparison of the benchmark results, along with the way I modified the code:
   ```
   # Parquet Dictionary Decoding Performance Optimization Benchmark Results
   
   ## Test Environment
   - **Date**: 2026-04-15
   - **Test Command**: `cargo bench -p parquet --features experimental --bench dict_gather_compare`
   - **Optimizations**:
     1. Dictionary Gather/Scatter Loop Unrolling Optimization (`parquet/src/encodings/rle.rs`)
     2. BitReader Code Generation Optimization (`parquet/src/util/bit_util.rs`)
   
   ## Performance Comparison Results
   
   | Test Scenario | Original Version | Optimized Version | Performance Improvement |
   |---------|---------|---------|---------|
   | dict16_vals65536 | 22.85 µs | 20.32 µs | **~11.1%** |
   | dict256_vals65536 | 23.53 µs | 21.35 µs | **~9.3%** |
   | dict4096_vals65536 | 27.24 µs | 24.52 µs | **~10.0%** |
   | dict256_vals1048576_large | 369.3 µs | 337.3 µs | **~8.7%** |
   
   ### Detailed Data
   
   #### 1. dict16_vals65536 (Dictionary Size: 16, Value Count: 65536)
   ```
   Original Version:  time:   [22.504 µs 22.855 µs 23.353 µs]
   Optimized Version:  time:   [20.236 µs 20.317 µs 20.406 µs]
   Improvement: ~11.1%
   ```
   
   #### 2. dict256_vals65536 (Dictionary Size: 256, Value Count: 65536)
   ```
   Original Version:  time:   [23.251 µs 23.526 µs 23.987 µs]
   Optimized Version:  time:   [21.124 µs 21.354 µs 21.645 µs]
   Improvement: ~9.3%
   ```
   
   #### 3. dict4096_vals65536 (Dictionary Size: 4096, Value Count: 65536)
   ```
   Original Version:  time:   [27.004 µs 27.236 µs 27.513 µs]
   Optimized Version:  time:   [24.426 µs 24.519 µs 24.638 µs]
   Improvement: ~10.0%
   ```
   
   #### 4. dict256_vals1048576_large (Dictionary Size: 256, Value Count: 1,048,576)
   ```
   Original Version:  time:   [368.21 µs 369.30 µs 370.68 µs]
   Optimized Version:  time:   [335.43 µs 337.28 µs 339.46 µs]
   Improvement: ~8.7%
   ```
   
   ## Optimization Details
   
   ### 1. Dictionary Gather/Scatter Loop Optimization
   
   **File**: `parquet/src/encodings/rle.rs`
   
   **Changes**:
   - Increased loop unrolling from 8-element batches to 16-element batches
   - Moved bounds checking from inside the loop to outside (using `debug_assert!`)
   - Used `get_unchecked` to avoid repeated bounds checking
   - Separated into three levels: 16-element, 8-element, and remainder processing
   
   **Code Example**:
   ```rust
   // Before Optimization
   for (out_chunk, idx_chunk) in out_chunks.by_ref().zip(idx_chunks) {
       let dict_len = dict.len();
       assert!(idx_chunk.iter().all(|&i| (i as usize) < dict_len));
       for (b, i) in out_chunk.iter_mut().zip(idx_chunk.iter()) {
           b.clone_from(unsafe { dict.get_unchecked(*i as usize) });
       }
   }
   
   // After Optimization
   debug_assert!(idx.iter().all(|&i| (i as usize) < dict_len));
   
   for (out_chunk, idx_chunk) in out_chunks.by_ref().zip(idx_chunks.by_ref()) {
       unsafe {
           let i0 = *idx_chunk.get_unchecked(0) as usize;
           // ... Unroll 16 elements
           out_chunk.get_unchecked_mut(0).clone_from(dict.get_unchecked(i0));
           // ... Unroll 16 assignments
       }
   }
   ```
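   A self-contained sketch of the gather pattern above (names are illustrative, not the actual `parquet` internals): bounds are validated once up front, then a fixed-length inner loop — which the compiler can fully unroll — copies 16 entries per iteration with no per-element checks, and a scalar tail handles the remainder:

   ```rust
   /// Hypothetical gather: copy `dict[idx[k]]` into `out[k]` for all k.
   fn gather_dict(dict: &[u64], idx: &[u16], out: &mut [u64]) {
       assert_eq!(idx.len(), out.len());
       let dict_len = dict.len();
       // Hoisted bounds check: validate every index once, before the
       // unchecked hot loop (the PR uses `debug_assert!` in release).
       assert!(idx.iter().all(|&i| (i as usize) < dict_len));

       let mut out_chunks = out.chunks_exact_mut(16);
       let mut idx_chunks = idx.chunks_exact(16);
       for (out_chunk, idx_chunk) in out_chunks.by_ref().zip(idx_chunks.by_ref()) {
           // Fixed-length inner loop over a 16-element chunk; with the
           // bounds check hoisted, the compiler can unroll/vectorize it.
           for k in 0..16 {
               unsafe {
                   *out_chunk.get_unchecked_mut(k) =
                       *dict.get_unchecked(*idx_chunk.get_unchecked(k) as usize);
               }
           }
       }
       // Scalar tail for the final < 16 elements, with normal checks.
       for (o, &i) in out_chunks
           .into_remainder()
           .iter_mut()
           .zip(idx_chunks.remainder())
       {
           *o = dict[i as usize];
       }
   }
   ```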
   
   ### 2. BitReader Code Generation Optimization
   
   **File**: `parquet/src/util/bit_util.rs`
   
   **Changes**:
   - Used `T::from_u64()` to directly construct values, avoiding buffer allocation and slice copying
   - Reduced temporary variable creation
   
   **Code Example**:
   ```rust
   // Before Optimization
   for out in out_buf {
       let mut out_bytes = T::Buffer::default();
       out_bytes.as_mut()[..4].copy_from_slice(&out.to_le_bytes());
       batch[i] = T::from_le_bytes(out_bytes);
       i += 1;
   }
   
   // After Optimization
   for out in out_buf {
       batch[i] = T::from_u64(out as u64);
       i += 1;
   }
   ```
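   To make the contrast concrete, here is a minimal standalone sketch using a hypothetical `FromU64` trait modeled on the idea above (the real `parquet` trait machinery differs in detail): the old path stages every value through a little-endian byte buffer, the new path constructs the value directly.

   ```rust
   // Hypothetical stand-in for the parquet value trait.
   trait FromU64: Sized {
       fn from_u64(v: u64) -> Self;
   }

   impl FromU64 for u32 {
       fn from_u64(v: u64) -> Self {
           v as u32
       }
   }

   impl FromU64 for i64 {
       fn from_u64(v: u64) -> Self {
           v as i64
       }
   }

   /// Old path: round-trip each value through a temporary byte buffer.
   fn decode_via_bytes(out_buf: &[u32]) -> Vec<u32> {
       out_buf
           .iter()
           .map(|out| {
               let mut out_bytes = [0u8; 4];
               out_bytes.copy_from_slice(&out.to_le_bytes());
               u32::from_le_bytes(out_bytes)
           })
           .collect()
   }

   /// New path: construct the target type directly, no buffer or copy.
   fn decode_direct<T: FromU64>(out_buf: &[u32]) -> Vec<T> {
       out_buf.iter().map(|&out| T::from_u64(out as u64)).collect()
   }
   ```

   Both paths decode to the same values; the direct construction simply removes the temporary buffer and slice copy from the per-value loop.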
   
   ## Conclusion
   
   The two optimizations combined bring an **8-12%** performance improvement, a meaningful gain on the hot path of dictionary decoding. The improvement remains stable across different dictionary sizes and data volumes.
   
   ---
   *Generated by benchmark test on 2026-04-15*
   ```
   
   If there are no problems, I can submit a PR later. @Dandandan
   
   

