jhorstmann opened a new issue, #1829:
URL: https://github.com/apache/arrow-rs/issues/1829

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   Related to extending the tests for different feature flags #1822, I wanted 
to take another look at the avx512 feature and its performance. Benchmarks were 
run on an i9-11900KB @ 3 GHz (turbo disabled) with
   
   ```
   RUSTFLAGS="-Ctarget-cpu=native -Ctarget-feature=-prefer-256-bit"
   ```
   
   (the second flag might need some explanation: it **disables** the 
`prefer-256-bit` feature, which makes llvm use the full 512-bit vectors)
   
   For some reason the second benchmark is always significantly slower when 
both are run together; running them separately gives the same (higher) 
performance for both, and the assembly looks identical except for the and/or 
instruction. I'm guessing this is branch-predictor or allocator related.
   
   ## Default features
   
   ```
   $ cargo +nightly bench --bench buffer_bit_ops
    buffer_bit_ops/buffer_bit_ops and
                            time:   [134.57 ns 134.90 ns 135.28 ns]
                            thrpt:  [105.74 GiB/s 106.04 GiB/s 106.30 GiB/s]
    buffer_bit_ops/buffer_bit_ops or
                            time:   [275.55 ns 276.22 ns 277.03 ns]
                            thrpt:  [51.637 GiB/s 51.789 GiB/s 51.914 GiB/s]
   ```
   
   ## Simd feature
   
   ```
   $ cargo +nightly bench --features simd --bench buffer_bit_ops
    buffer_bit_ops/buffer_bit_ops and
                            time:   [168.90 ns 169.10 ns 169.32 ns]
                            thrpt:  [84.486 GiB/s 84.596 GiB/s 84.697 GiB/s]
    buffer_bit_ops/buffer_bit_ops or
                            time:   [303.13 ns 303.27 ns 303.45 ns]
                            thrpt:  [47.142 GiB/s 47.169 GiB/s 47.192 GiB/s]
   ```
   
   ## Avx512 feature
   
   ```
   $ cargo +nightly bench --features avx512 --bench buffer_bit_ops -- 
    buffer_bit_ops/buffer_bit_ops and
                            time:   [165.46 ns 165.95 ns 166.83 ns]
                            thrpt:  [85.745 GiB/s 86.203 GiB/s 86.458 GiB/s]
    buffer_bit_ops/buffer_bit_ops or
                            time:   [310.63 ns 311.32 ns 312.04 ns]
                            thrpt:  [45.844 GiB/s 45.950 GiB/s 46.052 GiB/s]
   ```
   
   Generated assembly for `simd` and `avx512` looks identical: the loop 
processes 512 bits / 64 bytes per iteration. The auto-vectorized version 
instead gets unrolled 4 times, which reduces the loop overhead, so each 
iteration processes 4×512 bits.
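   The shape of the loop the optimizer sees can be illustrated with a word-at-a-time sketch (again hypothetical names, not the actual buffer code): given whole `u64` words, the compiler can both vectorize the body to 512-bit registers and unroll it, so one iteration of the generated loop covers several vectors of input.

   ```rust
   // Sketch: a word-at-a-time AND loop that LLVM can vectorize *and* unroll.
   // Illustrative only, not the arrow-rs implementation.
   fn and_words(left: &[u64], right: &[u64], out: &mut [u64]) {
       for ((o, l), r) in out.iter_mut().zip(left).zip(right) {
           *o = l & r;
       }
   }

   fn main() {
       let l = vec![u64::MAX; 256];
       let r = vec![0x0F0F_0F0F_0F0F_0F0Fu64; 256];
       let mut out = vec![0u64; 256];
       and_words(&l, &r, &mut out);
       assert!(out.iter().all(|&w| w == 0x0F0F_0F0F_0F0F_0F0F));
       println!("ok");
   }
   ```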
   
   **Describe the solution you'd like**
   
   Given these benchmark results, it seems we can remove the `avx512` feature 
and simplify the buffer code.
   
   **Describe alternatives you've considered**
   
   An `avx512` feature could still be very useful for other kernels. AVX-512, 
for example, has compress instructions that essentially implement the filter 
kernel for primitives in a single instruction, and it is unlikely that these 
will be supported in a portable way soon 
(https://github.com/rust-lang/portable-simd/issues/240).
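   In scalar terms, what an AVX-512 compress instruction (e.g. `vpcompressd`) does is keep the elements whose mask bit is set, packed contiguously; the sketch below shows that semantics with illustrative names, not an arrow-rs or std API:

   ```rust
   // Scalar sketch of the compress/filter semantics: keep values where the
   // corresponding mask entry is set, packed into a contiguous output.
   fn compress(values: &[u32], mask: &[bool]) -> Vec<u32> {
       values
           .iter()
           .zip(mask)
           .filter(|(_, &keep)| keep)
           .map(|(&v, _)| v)
           .collect()
   }

   fn main() {
       let vals = [10, 20, 30, 40];
       let mask = [true, false, true, true];
       assert_eq!(compress(&vals, &mask), vec![10, 30, 40]);
       println!("ok");
   }
   ```

   AVX-512 performs this whole operation in one instruction per vector, which is why a dedicated code path for the filter kernel could still pay off.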
   

