jorgecarleitao opened a new pull request #8191:
URL: https://github.com/apache/arrow/pull/8191


   This PR speeds up arithmetic ops by leveraging vectorization of non-divide 
operations (in non-SIMD), as well as by removing an unneeded operation in SIMD 
division.
   
   For non-SIMD, this reduces run time by roughly 30% to 45% across all operations (`+-*/`).
   For SIMD, this reduces run time by about 30% on division.
   
   The culprit in non-SIMD was that we required the operation to return 
`Result<T::Native>`, which prevented the compiler from vectorizing the 
operation. Only division requires `Result`. For divide, removing the 
operator further sped up the operation (I do not know the reason).
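
To illustrate the difference, here is a minimal sketch and not the actual Arrow kernels. The names `math_op`, `try_math_op` and `safe_div` are hypothetical, and plain slices stand in for Arrow arrays. The infallible version has no per-element error branch, so the compiler is typically free to auto-vectorize it. The fallible version must check every result and may exit early.

```rust
/// Infallible element-wise kernel (hypothetical name): the closure returns a
/// plain value, so the loop body contains no error branch.
fn math_op<T: Copy, F: Fn(T, T) -> T>(left: &[T], right: &[T], op: F) -> Vec<T> {
    assert_eq!(left.len(), right.len(), "inputs must have the same length");
    left.iter().zip(right.iter()).map(|(l, r)| op(*l, *r)).collect()
}

/// Fallible element-wise kernel (hypothetical name): every element may fail,
/// so collecting stops at the first `Err`.
fn try_math_op<T: Copy, E, F: Fn(T, T) -> Result<T, E>>(
    left: &[T],
    right: &[T],
    op: F,
) -> Result<Vec<T>, E> {
    assert_eq!(left.len(), right.len(), "inputs must have the same length");
    left.iter().zip(right.iter()).map(|(l, r)| op(*l, *r)).collect()
}

/// Division is the one operation that needs `Result`: the divisor may be zero.
/// `checked_div` also rejects the `i32::MIN / -1` overflow case.
fn safe_div(l: i32, r: i32) -> Result<i32, String> {
    l.checked_div(r)
        .ok_or_else(|| "division by zero or overflow".to_string())
}

fn main() {
    let a = [10, 20, 30, 40];
    let b = [1, 2, 3, 4];

    // add, subtract and multiply can use the infallible path.
    let sum = math_op(&a, &b, |l, r| l + r);
    println!("sum = {:?}", sum);

    // Only divide goes through the fallible path.
    match try_math_op(&a, &b, safe_div) {
        Ok(quotient) => println!("quotient = {:?}", quotient),
        Err(e) => println!("error: {}", e),
    }
}
```

Whether a given loop actually gets vectorized depends on the compiler version and target, so benchmark numbers remain the real test.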
   
   The culprit in SIMD was primarily one `simd_load` too many, which was not 
doing anything.
   
   ## Benchmarks
   
   The benchmark used:
   
   ```
   set -e
   # Non-SIMD: run the baseline commit first, then this PR's branch.
   # Criterion reports each run's change relative to the previous run.
   git checkout 0852869d1a9b7da4a1b91fa7cb7d4ef48e99cdec
   cargo bench --bench arithmetic_kernels
   git checkout divide_simd_faster
   cargo bench --bench arithmetic_kernels
   echo "##################################"
   # SIMD: same comparison with the `simd` feature enabled.
   git checkout 0852869d1a9b7da4a1b91fa7cb7d4ef48e99cdec
   cargo bench --bench arithmetic_kernels --features simd
   git checkout divide_simd_faster
   cargo bench --bench arithmetic_kernels --features simd
   ```
   
   Below are the results of the second `bench` run in each pair, which is 
the one that reports the differential, on my machine:
   
   ### Non-SIMD
   
   ```
   Previous HEAD position was 0852869d1 Improved benches for arithmetic.
   Switched to branch 'divide_simd_faster'
      Compiling arrow v2.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
       Finished bench [optimized] target(s) in 37.24s
        Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/arithmetic_kernels-d281862a43faaf38
   Gnuplot not found, using plotters backend
   add 512                 time:   [1.4714 us 1.4758 us 1.4803 us]
                           change: [-44.446% -43.969% -43.522%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 5 outliers among 100 measurements (5.00%)
     5 (5.00%) high severe
   
   subtract 512            time:   [1.4825 us 1.4844 us 1.4866 us]
                           change: [-45.351% -45.018% -44.686%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 9 outliers among 100 measurements (9.00%)
     5 (5.00%) high mild
     4 (4.00%) high severe
   
   multiply 512            time:   [1.4895 us 1.4936 us 1.4990 us]
                           change: [-44.822% -44.135% -43.479%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 9 outliers among 100 measurements (9.00%)
     4 (4.00%) high mild
     5 (5.00%) high severe
   
   divide 512              time:   [1.9742 us 1.9773 us 1.9810 us]
                           change: [-33.273% -32.688% -32.052%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 14 outliers among 100 measurements (14.00%)
     7 (7.00%) high mild
     7 (7.00%) high severe
   
   limit 512, 512          time:   [374.66 ns 375.64 ns 376.53 ns]
                           change: [-0.1000% +0.4442% +0.9503%] (p = 0.10 > 0.05)
                           No change in performance detected.
   Found 8 outliers among 100 measurements (8.00%)
     2 (2.00%) low severe
     2 (2.00%) low mild
     2 (2.00%) high mild
     2 (2.00%) high severe
   
   add_nulls_512           time:   [1.4880 us 1.4982 us 1.5115 us]
                           change: [-44.084% -43.116% -42.111%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 16 outliers among 100 measurements (16.00%)
     3 (3.00%) high mild
     13 (13.00%) high severe
   
   divide_nulls_512        time:   [1.9731 us 1.9758 us 1.9790 us]
                           change: [-33.404% -32.570% -31.416%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 8 outliers among 100 measurements (8.00%)
     2 (2.00%) high mild
     6 (6.00%) high severe
   ```
   
   ### SIMD
   
   Division is the only relevant operation here:
   
   ```
   Previous HEAD position was 0852869d1 Improved benches for arithmetic.
   Switched to branch 'divide_simd_faster'
      Compiling arrow v2.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
       Finished bench [optimized] target(s) in 38.63s
        Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/arithmetic_kernels-b8dc1739cfb5ae36
   Gnuplot not found, using plotters backend
   add 512                 time:   [879.31 ns 883.95 ns 889.17 ns]
                           change: [-0.2041% +0.6502% +1.5484%] (p = 0.15 > 0.05)
                           No change in performance detected.
   Found 16 outliers among 100 measurements (16.00%)
     5 (5.00%) high mild
     11 (11.00%) high severe
   
   subtract 512            time:   [864.99 ns 866.95 ns 868.95 ns]
                           change: [-4.8531% -4.1561% -3.5163%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 7 outliers among 100 measurements (7.00%)
     2 (2.00%) high mild
     5 (5.00%) high severe
   
   multiply 512            time:   [862.85 ns 864.87 ns 867.71 ns]
                           change: [-3.8532% -3.1774% -2.4459%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 5 outliers among 100 measurements (5.00%)
     5 (5.00%) high severe
   
   divide 512              time:   [1.9703 us 1.9771 us 1.9843 us]
                           change: [-30.046% -29.457% -28.903%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high severe
   
   limit 512, 512          time:   [368.89 ns 369.96 ns 370.96 ns]
                           change: [-1.9574% -1.0063% -0.0347%] (p = 0.04 < 0.05)
                           Change within noise threshold.
   Found 26 outliers among 100 measurements (26.00%)
     5 (5.00%) low severe
     6 (6.00%) low mild
     9 (9.00%) high mild
     6 (6.00%) high severe
   
   add_nulls_512           time:   [871.97 ns 876.99 ns 883.57 ns]
                           change: [-5.1106% -3.6889% -2.3080%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 8 outliers among 100 measurements (8.00%)
     2 (2.00%) high mild
     6 (6.00%) high severe
   
   divide_nulls_512        time:   [1.9582 us 1.9625 us 1.9678 us]
                           change: [-34.188% -33.161% -32.136%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 8 outliers among 100 measurements (8.00%)
     2 (2.00%) high mild
     6 (6.00%) high severe
   ```
   

