tyrelr commented on pull request #8973:
URL: https://github.com/apache/arrow/pull/8973#issuecomment-749581718


   Yep, I forgot that simd wasn't a default feature so I didn't test that 
locally.  I'll take a look at that tonight, hopefully.  This will touch a few 
lines already modified by PR #8975, but this change is simple enough it 
shouldn't cause concern.
   
   @Dandandan I ran master vs. current head overnight (twice, to try to 
distinguish environmental performance from code performance).  Using critcmp to 
filter out anything smaller than a 10% difference, it looks like this:
   ```
   group                                    head-2c823df03-ARROW-10989             head-2c823df03-ARROW-10989-run2        master-ARROW-10989                     master-ARROW-10989-run2
   -----                                    --------------------------             -------------------------------        ------------------                     -----------------------
   add 512                                  1.30    408.5±3.37ns        ? B/sec    1.00    315.0±1.97ns        ? B/sec    3.46   1088.8±5.62ns        ? B/sec    3.64   1146.1±4.42ns        ? B/sec
   add_nulls_512                            1.09    423.8±3.18ns        ? B/sec    1.00    389.3±2.58ns        ? B/sec    2.98   1158.3±6.43ns        ? B/sec    2.95   1149.9±6.95ns        ? B/sec
   array_from_vec 128                       1.08    444.3±3.26ns        ? B/sec    1.11    458.3±1.97ns        ? B/sec    1.00    411.4±2.09ns        ? B/sec    1.05    432.2±2.82ns        ? B/sec
   bench_primitive                          1.00   1109.8±9.80µs     3.5 GB/sec    1.03   1146.2±6.90µs     3.4 GB/sec    3.01      3.3±0.01ms  1198.7 MB/sec    2.94      3.3±0.01ms  1225.9 MB/sec
   cast float64 to float32 512              1.01      2.8±0.03µs        ? B/sec    1.00      2.8±0.02µs        ? B/sec    1.11      3.1±0.03µs        ? B/sec    1.00      2.8±0.02µs        ? B/sec
   cast int32 to int32 512                  1.00     26.9±0.38ns        ? B/sec    1.00     26.8±0.19ns        ? B/sec    0.99     26.8±0.15ns        ? B/sec    1.11     30.0±0.16ns        ? B/sec
   cast time32s to time32ms 512             1.00   965.9±11.12ns        ? B/sec    1.05   1012.6±9.16ns        ? B/sec    1.75   1687.8±8.08ns        ? B/sec    1.68   1621.7±7.75ns        ? B/sec
   cast time64ns to time32s 512             1.10     11.1±0.12µs        ? B/sec    1.00     10.1±0.16µs        ? B/sec    1.00     10.1±0.04µs        ? B/sec    1.00     10.1±0.11µs        ? B/sec
   cast timestamp_ms to timestamp_ns 512    1.14  1481.4±11.07ns        ? B/sec    1.00  1304.2±10.03ns        ? B/sec    1.41   1840.6±9.37ns        ? B/sec    1.45  1894.2±10.44ns        ? B/sec
   divide 512                               1.04  1830.0±12.34ns        ? B/sec    1.00   1752.3±9.17ns        ? B/sec    1.03   1797.1±9.00ns        ? B/sec    1.27      2.2±0.01µs        ? B/sec
   eq scalar Float32                        1.00     64.2±0.52µs        ? B/sec    1.01     64.9±0.26µs        ? B/sec    1.07     68.9±0.29µs        ? B/sec    1.11     71.3±0.26µs        ? B/sec
   filter context f32 low selectivity       1.10    129.4±2.24µs        ? B/sec    1.00    117.1±0.63µs        ? B/sec    1.01    118.7±0.56µs        ? B/sec    1.01    118.7±0.51µs        ? B/sec
   min nulls 512                            1.13      2.1±0.02µs        ? B/sec    1.01  1872.1±12.28ns        ? B/sec    1.00  1858.1±25.37ns        ? B/sec    1.12      2.1±0.02µs        ? B/sec
   multiply 512                             1.00    403.0±2.35ns        ? B/sec    1.21    487.6±4.20ns        ? B/sec    2.86   1152.5±9.71ns        ? B/sec    2.86   1151.0±6.00ns        ? B/sec
   subtract 512                             1.00    385.3±4.18ns        ? B/sec    1.09    418.9±2.99ns        ? B/sec    3.24   1246.6±6.37ns        ? B/sec    3.00   1157.5±6.40ns        ? B/sec
   take bool nulls 1024                     1.02      2.6±0.02µs        ? B/sec    1.00      2.6±0.02µs        ? B/sec    1.91      4.9±0.05µs        ? B/sec    1.91      4.9±0.05µs        ? B/sec
   take bool nulls 512                      1.03  1449.2±18.90ns        ? B/sec    1.00  1412.8±11.94ns        ? B/sec    1.27  1788.1±14.12ns        ? B/sec    1.34  1899.4±32.11ns        ? B/sec
   take i32 512                             1.00    924.2±5.01ns        ? B/sec    1.00    922.8±6.28ns        ? B/sec    1.11   1022.4±5.66ns        ? B/sec    1.00    925.5±9.49ns        ? B/sec
   take i32 nulls 512                       1.09   1067.0±6.65ns        ? B/sec    1.00    975.3±5.77ns        ? B/sec    1.09   1065.0±5.21ns        ? B/sec    1.11   1080.0±6.80ns        ? B/sec
   take str 1024                            1.00      4.6±0.02µs        ? B/sec    1.04      4.8±0.03µs        ? B/sec    1.14      5.3±0.04µs        ? B/sec    1.06      4.9±0.03µs        ? B/sec
   take str 512                             1.00      2.8±0.01µs        ? B/sec    1.04      2.9±0.02µs        ? B/sec    1.14      3.2±0.03µs        ? B/sec    1.01      2.8±0.02µs        ? B/sec
   take str null indices 512                1.00      2.8±0.01µs        ? B/sec    1.04      2.9±0.02µs        ? B/sec    1.16      3.2±0.03µs        ? B/sec    1.03      2.9±0.02µs        ? B/sec
   ```
   A few sum benchmarks are significantly faster (finished in half to a third 
of the time).
   I am surprised by a performance increase in take bool nulls 1024/512 and 
take str null indices 512...  I wouldn't expect those to use primitive arrays 
at all.  I'll look into why that changes at the same time I look into fixing 
the simd compilation.
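
For reading the table, here is a minimal Rust sketch of how the ratio column and the 10% filter can be understood. It assumes each ratio is a run's mean time divided by the fastest mean time in that row, and that a row is kept when the gap between its slowest and fastest run reaches the threshold. The names `Row`, `ratios`, `biggest_difference_pct`, and `filter_rows` are hypothetical and are not taken from critcmp's source, so treat this as an illustration of the arithmetic rather than the tool's actual implementation.

```rust
/// One benchmark measured under several baselines (mean times in nanoseconds).
/// Hypothetical type for illustration; not part of critcmp or arrow.
struct Row {
    name: String,
    times_ns: Vec<f64>,
}

/// Ratio of each time to the fastest time in the row, so the fastest run is 1.00.
fn ratios(times_ns: &[f64]) -> Vec<f64> {
    let best = times_ns.iter().copied().fold(f64::INFINITY, f64::min);
    times_ns.iter().map(|t| t / best).collect()
}

/// Percentage gap between the slowest and the fastest time in the row.
/// Assumption: the threshold filter is applied to this quantity.
fn biggest_difference_pct(times_ns: &[f64]) -> f64 {
    let best = times_ns.iter().copied().fold(f64::INFINITY, f64::min);
    let worst = times_ns.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    (worst - best) / best * 100.0
}

/// Keep only the rows whose slowest/fastest gap is at least `threshold_pct`.
fn filter_rows(rows: &[Row], threshold_pct: f64) -> Vec<&Row> {
    rows.iter()
        .filter(|r| biggest_difference_pct(&r.times_ns) >= threshold_pct)
        .collect()
}
```

Feeding the four `add 512` means from the table (408.5, 315.0, 1088.8, 1146.1 ns) to `ratios` gives values that round to the 1.30, 1.00, 3.46, and 3.64 shown in that row, which is why the second head run is the 1.00 reference there.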


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

