zanmato1984 commented on PR #43832:
URL: https://github.com/apache/arrow/pull/43832#issuecomment-2326646353

   > This is on my other desktop (Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz, Coffee Lake), with a similar symptom (possibly because it is also Coffee Lake, like my MBP).
   > 
   > The scalar version:
   > 
   > ```
   > ARROW_USER_SIMD_LEVEL=NONE ./arrow-acero-hash-join-benchmark --benchmark_filter="BM_RowArray"
   > 2024-09-01T00:32:49+08:00
   > Running ./arrow-acero-hash-join-benchmark
   > Run on (8 X 4900 MHz CPU s)
   > CPU Caches:
   >   L1 Data 32 KiB (x8)
   >   L1 Instruction 32 KiB (x8)
   >   L2 Unified 256 KiB (x8)
   >   L3 Unified 12288 KiB (x1)
   > Load Average: 0.46, 3.08, 2.34
   > ***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
   > -----------------------------------------------------------------------------------------------------------------------------------------------------------
   > Benchmark                                                                                                 Time             CPU   Iterations UserCounters...
   > -----------------------------------------------------------------------------------------------------------------------------------------------------------
   > BM_RowArray_Decode/"boolean"                                                                         345809 ns       345761 ns         1896 rows/sec=189.538M/s
   > BM_RowArray_Decode/"int8"                                                                            267577 ns       267553 ns         2678 rows/sec=244.942M/s
   > BM_RowArray_Decode/"int16"                                                                           237106 ns       237094 ns         2872 rows/sec=276.409M/s
   > BM_RowArray_Decode/"int32"                                                                           243701 ns       243697 ns         2874 rows/sec=268.92M/s
   > BM_RowArray_Decode/"int64"                                                                           239891 ns       239886 ns         2709 rows/sec=273.192M/s
   > BM_RowArray_DecodeFixedSizeBinary/fixed_size:3                                                       316511 ns       316471 ns         2260 rows/sec=207.081M/s
   > BM_RowArray_DecodeFixedSizeBinary/fixed_size:5                                                       310797 ns       310759 ns         2165 rows/sec=210.887M/s
   > BM_RowArray_DecodeFixedSizeBinary/fixed_size:6                                                       324059 ns       324020 ns         2251 rows/sec=202.256M/s
   > BM_RowArray_DecodeFixedSizeBinary/fixed_size:7                                                       311799 ns       311753 ns         2244 rows/sec=210.214M/s
   > BM_RowArray_DecodeFixedSizeBinary/fixed_size:9                                                       364401 ns       364346 ns         2016 rows/sec=179.87M/s
   > BM_RowArray_DecodeFixedSizeBinary/fixed_size:16                                                      349918 ns       349868 ns         1997 rows/sec=187.313M/s
   > BM_RowArray_DecodeFixedSizeBinary/fixed_size:42                                                      507058 ns       506962 ns         1427 rows/sec=129.27M/s
   > BM_RowArray_DecodeBinary/max_length:32                                                              1261872 ns      1261465 ns          554 rows/sec=51.9515M/s
   > BM_RowArray_DecodeBinary/max_length:64                                                              1585243 ns      1584698 ns          462 rows/sec=41.3549M/s
   > BM_RowArray_DecodeBinary/max_length:128                                                             1822727 ns      1822343 ns          384 rows/sec=35.962M/s
   > BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:0     379210 ns       379150 ns         1843 rows/sec=172.847M/s
   > BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:1     275680 ns       275657 ns         2693 rows/sec=237.741M/s
   > BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:2     599291 ns       599291 ns         1257 rows/sec=109.354M/s
   > BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:0                   506824 ns       506710 ns         1376 rows/sec=129.334M/s
   > BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:1                   360611 ns       360579 ns         2123 rows/sec=181.75M/s
   > BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:2                  1182248 ns      1181939 ns          603 rows/sec=55.447M/s
   > BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:3                  1395220 ns      1394817 ns          529 rows/sec=46.9847M/s
   > ```
   > 
   > The AVX2 version:
   > 
   > ```
   > ./arrow-acero-hash-join-benchmark --benchmark_filter="BM_RowArray"
   > 2024-09-01T00:33:14+08:00
   > Running ./arrow-acero-hash-join-benchmark
   > Run on (8 X 4900 MHz CPU s)
   > CPU Caches:
   >   L1 Data 32 KiB (x8)
   >   L1 Instruction 32 KiB (x8)
   >   L2 Unified 256 KiB (x8)
   >   L3 Unified 12288 KiB (x1)
   > Load Average: 0.64, 2.91, 2.31
   > ***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
   > -----------------------------------------------------------------------------------------------------------------------------------------------------------
   > Benchmark                                                                                                 Time             CPU   Iterations UserCounters...
   > -----------------------------------------------------------------------------------------------------------------------------------------------------------
   > BM_RowArray_Decode/"boolean"                                                                         262395 ns       262341 ns         2665 rows/sec=249.808M/s
   > BM_RowArray_Decode/"int8"                                                                            263405 ns       263397 ns         2716 rows/sec=248.807M/s
   > BM_RowArray_Decode/"int16"                                                                           248155 ns       248106 ns         2821 rows/sec=264.141M/s
   > BM_RowArray_Decode/"int32"                                                                           257523 ns       257519 ns         2825 rows/sec=254.486M/s
   > BM_RowArray_Decode/"int64"                                                                           245070 ns       245020 ns         2824 rows/sec=267.468M/s
   > BM_RowArray_DecodeFixedSizeBinary/fixed_size:3                                                       330801 ns       330759 ns         1980 rows/sec=198.135M/s
   > BM_RowArray_DecodeFixedSizeBinary/fixed_size:5                                                       327874 ns       327839 ns         2134 rows/sec=199.9M/s
   > BM_RowArray_DecodeFixedSizeBinary/fixed_size:6                                                       331278 ns       331242 ns         1947 rows/sec=197.846M/s
   > BM_RowArray_DecodeFixedSizeBinary/fixed_size:7                                                       328647 ns       328611 ns         2112 rows/sec=199.43M/s
   > BM_RowArray_DecodeFixedSizeBinary/fixed_size:9                                                       335129 ns       335101 ns         1937 rows/sec=195.568M/s
   > BM_RowArray_DecodeFixedSizeBinary/fixed_size:16                                                      347641 ns       347601 ns         2097 rows/sec=188.535M/s
   > BM_RowArray_DecodeFixedSizeBinary/fixed_size:42                                                      408356 ns       408265 ns         1731 rows/sec=160.521M/s
   > BM_RowArray_DecodeBinary/max_length:32                                                               985453 ns       985190 ns          716 rows/sec=66.5202M/s
   > BM_RowArray_DecodeBinary/max_length:64                                                              1250078 ns      1249727 ns          560 rows/sec=52.4394M/s
   > BM_RowArray_DecodeBinary/max_length:128                                                             1467264 ns      1466902 ns          474 rows/sec=44.6758M/s
   > BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:0     266468 ns       266456 ns         2365 rows/sec=245.95M/s
   > BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:1     246552 ns       246557 ns         2803 rows/sec=265.8M/s
   > BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:2     437251 ns       437236 ns         1504 rows/sec=149.885M/s
   > BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:0                   455065 ns       455005 ns         1603 rows/sec=144.031M/s
   > BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:1                   445927 ns       445798 ns         1560 rows/sec=147.006M/s
   > BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:2                  1033287 ns      1032913 ns          702 rows/sec=63.4468M/s
   > BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:3                  1193991 ns      1193373 ns          544 rows/sec=54.9158M/s
   > ```
   
   OK, got something new.
   
   The poor AVX2 gather performance appears to be strongly related to the mitigation [2] for the "Gather Data Sampling" vulnerability [1] (CVE-2022-40982, aka "Downfall").
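
As a rough aid for anyone reproducing this, the sketch below checks whether the mitigation is active on a machine. It assumes a Linux kernel recent enough to expose the `gather_data_sampling` entry under `/sys/devices/system/cpu/vulnerabilities/`; the status prefixes it matches ("Mitigation", "Vulnerable", "Not affected") are my reading of the kernel documentation, and the helper names are hypothetical, not part of Arrow.

```python
from pathlib import Path

# Assumption: Linux sysfs entry for Gather Data Sampling; absent on kernels
# that predate the GDS mitigation and on non-Linux systems.
GDS_SYSFS = Path("/sys/devices/system/cpu/vulnerabilities/gather_data_sampling")


def classify_gds_status(text):
    """Map a sysfs status string to a coarse label.

    The prefixes are assumed from the kernel's hw-vuln documentation,
    e.g. "Mitigation: Microcode", "Vulnerable", "Not affected".
    """
    status = text.strip()
    if status.startswith("Mitigation"):
        return "mitigated"
    if status.startswith("Vulnerable"):
        return "vulnerable"
    if status.startswith("Not affected"):
        return "not_affected"
    return "unknown"


def read_gds_status(path=GDS_SYSFS):
    """Read and classify the status file; return 'unknown' if it cannot be read."""
    try:
        return classify_gds_status(path.read_text())
    except OSError:
        return "unknown"


if __name__ == "__main__":
    print(read_gds_status())
```

A result of "mitigated" would suggest the slower gather path is in effect. On Linux the mitigation is typically controlled with the `gather_data_sampling=off` kernel boot parameter, which should only be used on machines where the security trade-off is acceptable.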
   
   The CPU in my quote is on the list of affected models, for which the mitigation is delivered as a microcode update that causes a significant performance drop. Fortunately, the mitigation can be easily disabled. The benchmark results below show that gather performance without the mitigation is much better, and beats the scalar version almost always:
   ```
    ./arrow-acero-hash-join-benchmark --benchmark_filter="BM_RowArray"
   2024-09-03T21:52:08+08:00
   Running ./arrow-acero-hash-join-benchmark
   Run on (8 X 4900 MHz CPU s)
   CPU Caches:
     L1 Data 32 KiB (x8)
     L1 Instruction 32 KiB (x8)
     L2 Unified 256 KiB (x8)
     L3 Unified 12288 KiB (x1)
   Load Average: 0.56, 0.22, 0.08
   ***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
   -----------------------------------------------------------------------------------------------------------------------------------------------------------
   Benchmark                                                                                                 Time             CPU   Iterations UserCounters...
   -----------------------------------------------------------------------------------------------------------------------------------------------------------
   BM_RowArray_Decode/"boolean"                                                                         204759 ns       204759 ns         3398 rows/sec=320.059M/s
   BM_RowArray_Decode/"int8"                                                                            198094 ns       198094 ns         3476 rows/sec=330.827M/s
   BM_RowArray_Decode/"int16"                                                                           199424 ns       199445 ns         3490 rows/sec=328.587M/s
   BM_RowArray_Decode/"int32"                                                                           201338 ns       201351 ns         3476 rows/sec=325.477M/s
   BM_RowArray_Decode/"int64"                                                                           207006 ns       207010 ns         3406 rows/sec=316.579M/s
   BM_RowArray_DecodeFixedSizeBinary/fixed_size:3                                                       329304 ns       329258 ns         2137 rows/sec=199.038M/s
   BM_RowArray_DecodeFixedSizeBinary/fixed_size:5                                                       328043 ns       327986 ns         2116 rows/sec=199.811M/s
   BM_RowArray_DecodeFixedSizeBinary/fixed_size:6                                                       327691 ns       327650 ns         2137 rows/sec=200.015M/s
   BM_RowArray_DecodeFixedSizeBinary/fixed_size:7                                                       329935 ns       329892 ns         2133 rows/sec=198.656M/s
   BM_RowArray_DecodeFixedSizeBinary/fixed_size:9                                                       337341 ns       337283 ns         2085 rows/sec=194.302M/s
   BM_RowArray_DecodeFixedSizeBinary/fixed_size:16                                                      335654 ns       335592 ns         2066 rows/sec=195.282M/s
   BM_RowArray_DecodeFixedSizeBinary/fixed_size:42                                                      412375 ns       412278 ns         1698 rows/sec=158.958M/s
   BM_RowArray_DecodeBinary/max_length:32                                                               859282 ns       858982 ns          815 rows/sec=76.2938M/s
   BM_RowArray_DecodeBinary/max_length:64                                                              1126945 ns      1126548 ns          620 rows/sec=58.1733M/s
   BM_RowArray_DecodeBinary/max_length:128                                                             1346772 ns      1346336 ns          521 rows/sec=48.6766M/s
   BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:0     225688 ns       225646 ns         3105 rows/sec=290.433M/s
   BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:1     222248 ns       222233 ns         3148 rows/sec=294.894M/s
   BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:2     448432 ns       448380 ns         1564 rows/sec=146.159M/s
   BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:0                   289385 ns       289347 ns         2420 rows/sec=226.493M/s
   BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:1                   289905 ns       289839 ns         2413 rows/sec=226.109M/s
   BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:2                   874143 ns       873785 ns          801 rows/sec=75.0013M/s
   BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:3                  1037058 ns      1036678 ns          674 rows/sec=63.2164M/s
   ``` 
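
To make the comparison across the three runs easier to read, here is a small sketch that turns a handful of the `rows/sec` figures quoted in this comment into speedups relative to the scalar run. The numbers are copied from the outputs above; the selection of benchmarks and the helper name `speedups` are my own choices for illustration.

```python
# rows/sec in millions, copied from the three runs in this comment:
# (scalar, AVX2 with GDS mitigation, AVX2 without GDS mitigation)
RESULTS = {
    'Decode/"boolean"': (189.538, 249.808, 320.059),
    'Decode/"int32"': (268.92, 254.486, 325.477),
    'DecodeFixedSizeBinary/fixed_size:3': (207.081, 198.135, 199.038),
    'DecodeBinary/max_length:32': (51.9515, 66.5202, 76.2938),
    'DecodeOneOfColumns/var_length_row/column:1': (181.75, 147.006, 226.109),
}


def speedups(results):
    """Return {name: (mitigated / scalar, unmitigated / scalar)} throughput ratios."""
    out = {}
    for name, (scalar, avx2_mitigated, avx2_unmitigated) in results.items():
        out[name] = (avx2_mitigated / scalar, avx2_unmitigated / scalar)
    return out


if __name__ == "__main__":
    for name, (mitigated, unmitigated) in speedups(RESULTS).items():
        print(f"{name:45s} mitigated: {mitigated:.2f}x  unmitigated: {unmitigated:.2f}x")
```

A ratio below 1 means the AVX2 path is slower than scalar. With these figures, `int32` and `var_length_row` column 1 fall below scalar with the mitigation on and exceed it with the mitigation off, while `fixed_size:3` stays slightly below scalar in both AVX2 runs.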
   
   [1] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/gather-data-sampling.html
   [2] https://access.redhat.com/solutions/7027704

