jorgecarleitao opened a new pull request #8191:
URL: https://github.com/apache/arrow/pull/8191
This PR speeds up arithmetic ops by letting the compiler vectorize the
non-divide operations (in the non-SIMD path), and by removing an unneeded
operation in SIMD division.
For non-SIMD, this changes run time by about `[-30%,-45%]` across all operations (`+-*/`).
For SIMD, this changes run time by about `-30%` on division.
The culprit in non-SIMD was that we required the operation to return
`Result<T::Native>`, which prevented the compiler from vectorizing the
operation. Only the division requires `Result`. For divide, removing the
operator sped up the operation further (I do not know the reason).
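To illustrate the non-SIMD point, here is a minimal sketch over plain slices rather than Arrow arrays. The function names (`math_op_fallible`, `math_op`, `add`, `divide`) and the `String` error type are assumptions made for this example and are not the code in this PR. The idea is that a closure returning `Result` adds a possible early exit to every iteration, while a plain closure leaves a simple loop that the compiler is free to auto-vectorize.

```rust
/// Fallible kernel: the closure returns `Result`, so every iteration
/// carries a potential early exit on error.
fn math_op_fallible<F>(left: &[i32], right: &[i32], op: F) -> Result<Vec<i32>, String>
where
    F: Fn(i32, i32) -> Result<i32, String>,
{
    left.iter()
        .zip(right.iter())
        .map(|(&l, &r)| op(l, r))
        .collect()
}

/// Infallible kernel: the closure returns the value directly, leaving a
/// plain element-wise loop with no error branch.
fn math_op<F>(left: &[i32], right: &[i32], op: F) -> Vec<i32>
where
    F: Fn(i32, i32) -> i32,
{
    left.iter()
        .zip(right.iter())
        .map(|(&l, &r)| op(l, r))
        .collect()
}

/// Addition cannot fail here, so it uses the infallible kernel.
fn add(left: &[i32], right: &[i32]) -> Vec<i32> {
    math_op(left, right, |l, r| l + r)
}

/// Division keeps the `Result`, because a zero divisor must be reported.
fn divide(left: &[i32], right: &[i32]) -> Result<Vec<i32>, String> {
    math_op_fallible(left, right, |l, r| {
        if r == 0 {
            Err("Divide by zero".to_string())
        } else {
            Ok(l / r)
        }
    })
}
```

Whether a given loop is actually vectorized depends on the compiler version and target, so the sketch only shows the shape of the change, not a guaranteed speed-up.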
The culprit in SIMD was primarily one `simd_load` too many, which was not
doing anything.
## Benchmarks
The benchmark used:
```
set -e
git checkout 0852869d1a9b7da4a1b91fa7cb7d4ef48e99cdec
cargo bench --bench arithmetic_kernels
git checkout divide_simd_faster
cargo bench --bench arithmetic_kernels
echo "##################################"
git checkout 0852869d1a9b7da4a1b91fa7cb7d4ef48e99cdec
cargo bench --bench arithmetic_kernels --features simd
git checkout divide_simd_faster
cargo bench --bench arithmetic_kernels --features simd
```
Below are the results of the second `bench` run in each pair, which is
the one that reports the differential, on my machine:
### Non-SIMD
```
Previous HEAD position was 0852869d1 Improved benches for arithmetic.
Switched to branch 'divide_simd_faster'
Compiling arrow v2.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
Finished bench [optimized] target(s) in 37.24s
Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/arithmetic_kernels-d281862a43faaf38
Gnuplot not found, using plotters backend
add 512 time: [1.4714 us 1.4758 us 1.4803 us]
change: [-44.446% -43.969% -43.522%] (p = 0.00 < 0.05)
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
5 (5.00%) high severe
subtract 512 time: [1.4825 us 1.4844 us 1.4866 us]
change: [-45.351% -45.018% -44.686%] (p = 0.00 < 0.05)
Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
5 (5.00%) high mild
4 (4.00%) high severe
multiply 512 time: [1.4895 us 1.4936 us 1.4990 us]
change: [-44.822% -44.135% -43.479%] (p = 0.00 < 0.05)
Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
4 (4.00%) high mild
5 (5.00%) high severe
divide 512 time: [1.9742 us 1.9773 us 1.9810 us]
change: [-33.273% -32.688% -32.052%] (p = 0.00 < 0.05)
Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
7 (7.00%) high mild
7 (7.00%) high severe
limit 512, 512 time: [374.66 ns 375.64 ns 376.53 ns]
change: [-0.1000% +0.4442% +0.9503%] (p = 0.10 > 0.05)
No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
2 (2.00%) low severe
2 (2.00%) low mild
2 (2.00%) high mild
2 (2.00%) high severe
add_nulls_512 time: [1.4880 us 1.4982 us 1.5115 us]
change: [-44.084% -43.116% -42.111%] (p = 0.00 < 0.05)
Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
3 (3.00%) high mild
13 (13.00%) high severe
divide_nulls_512 time: [1.9731 us 1.9758 us 1.9790 us]
change: [-33.404% -32.570% -31.416%] (p = 0.00 < 0.05)
Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
2 (2.00%) high mild
6 (6.00%) high severe
```
### SIMD
Division is the only relevant operation here:
```
Previous HEAD position was 0852869d1 Improved benches for arithmetic.
Switched to branch 'divide_simd_faster'
Compiling arrow v2.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
Finished bench [optimized] target(s) in 38.63s
Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/arithmetic_kernels-b8dc1739cfb5ae36
Gnuplot not found, using plotters backend
add 512 time: [879.31 ns 883.95 ns 889.17 ns]
change: [-0.2041% +0.6502% +1.5484%] (p = 0.15 > 0.05)
No change in performance detected.
Found 16 outliers among 100 measurements (16.00%)
5 (5.00%) high mild
11 (11.00%) high severe
subtract 512 time: [864.99 ns 866.95 ns 868.95 ns]
change: [-4.8531% -4.1561% -3.5163%] (p = 0.00 < 0.05)
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
2 (2.00%) high mild
5 (5.00%) high severe
multiply 512 time: [862.85 ns 864.87 ns 867.71 ns]
change: [-3.8532% -3.1774% -2.4459%] (p = 0.00 < 0.05)
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
5 (5.00%) high severe
divide 512 time: [1.9703 us 1.9771 us 1.9843 us]
change: [-30.046% -29.457% -28.903%] (p = 0.00 < 0.05)
Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high severe
limit 512, 512 time: [368.89 ns 369.96 ns 370.96 ns]
change: [-1.9574% -1.0063% -0.0347%] (p = 0.04 < 0.05)
Change within noise threshold.
Found 26 outliers among 100 measurements (26.00%)
5 (5.00%) low severe
6 (6.00%) low mild
9 (9.00%) high mild
6 (6.00%) high severe
add_nulls_512 time: [871.97 ns 876.99 ns 883.57 ns]
change: [-5.1106% -3.6889% -2.3080%] (p = 0.00 < 0.05)
Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
2 (2.00%) high mild
6 (6.00%) high severe
divide_nulls_512 time: [1.9582 us 1.9625 us 1.9678 us]
change: [-34.188% -33.161% -32.136%] (p = 0.00 < 0.05)
Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
2 (2.00%) high mild
6 (6.00%) high severe
```