zanmato1984 commented on PR #43832:
URL: https://github.com/apache/arrow/pull/43832#issuecomment-2320767085

   Before formalizing this PR, I could use some help from @pitrou @cyb70289 @wgtmac @mapleFU.
   
   I'm benchmarking `BM_RowArray_Decode*` (see the PR) on my 2019 Intel MBP (Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz, Coffee Lake), using my other branch https://github.com/zanmato1984/arrow/tree/swiss-join-avx2-for-maple. That branch is based on this PR's branch, with some code paths commented out so that I can compare solely the performance of `DecodeFixedLength + Visit` (the scalar version) against `DecodeFixedLength_avx2 + Visit_avx2` (the AVX2 version).
   
   The results surprise me. The scalar version:
   ```
   ARROW_USER_SIMD_LEVEL=NONE ./arrow-acero-hash-join-benchmark --benchmark_filter="BM_RowArray"
   The number of inputs is very large. BM_HashJoinBasic_PayloadSize will be repeated at least 125 times.
   2024-08-30T16:44:41+08:00
   Running ./arrow-acero-hash-join-benchmark
   Run on (16 X 2400 MHz CPU s)
   CPU Caches:
     L1 Data 32 KiB
     L1 Instruction 32 KiB
     L2 Unified 256 KiB (x8)
     L3 Unified 16384 KiB
   Load Average: 2.24, 2.35, 2.37
   
   ------------------------------------------------------------------------------------------------------------------------
   Benchmark                                                Time             CPU   Iterations UserCounters...
   ------------------------------------------------------------------------------------------------------------------------
   BM_RowArray_Decode/"boolean"                           335756 ns       335771 ns         2088 rows/sec=195.178M/s
   BM_RowArray_Decode/"int8"                              222946 ns       222964 ns         3218 rows/sec=293.926M/s
   BM_RowArray_Decode/"int16"                             206334 ns       206310 ns         3450 rows/sec=317.653M/s
   BM_RowArray_Decode/"int32"                             212043 ns       211908 ns         3243 rows/sec=309.262M/s
   BM_RowArray_Decode/"int64"                             214047 ns       213696 ns         3204 rows/sec=306.674M/s
   BM_RowArray_DecodeFixedSizeBinary/fixed_size:3         325797 ns       325116 ns         2116 rows/sec=201.574M/s
   BM_RowArray_DecodeFixedSizeBinary/fixed_size:5         332284 ns       331637 ns         2106 rows/sec=197.611M/s
   BM_RowArray_DecodeFixedSizeBinary/fixed_size:6         332328 ns       331591 ns         2140 rows/sec=197.638M/s
   BM_RowArray_DecodeFixedSizeBinary/fixed_size:7         331975 ns       331820 ns         2150 rows/sec=197.501M/s
   BM_RowArray_DecodeFixedSizeBinary/fixed_size:9         367805 ns       367605 ns         1915 rows/sec=178.275M/s
   BM_RowArray_DecodeFixedSizeBinary/fixed_size:16        368041 ns       367993 ns         1905 rows/sec=178.088M/s
   BM_RowArray_DecodeFixedSizeBinary/fixed_size:42        527913 ns       527701 ns         1342 rows/sec=124.19M/s
   BM_RowArray_DecodeBinary/max_length:32                1645646 ns      1498623 ns          469 rows/sec=43.7302M/s
   BM_RowArray_DecodeBinary/max_length:64                1751193 ns      1750436 ns          408 rows/sec=37.4392M/s
   BM_RowArray_DecodeBinary/max_length:128               2076064 ns      2073780 ns          350 rows/sec=31.6017M/s
   BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:0    353249 ns       353196 ns         1985 rows/sec=185.549M/s
   BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:1    226431 ns       226484 ns         3076 rows/sec=289.359M/s
   BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:2    583054 ns       583129 ns         1233 rows/sec=112.385M/s
   BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:0                  455519 ns       455603 ns         1507 rows/sec=143.842M/s
   BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:1                  288495 ns       288417 ns         2450 rows/sec=227.223M/s
   BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:2                 1335445 ns      1335260 ns          530 rows/sec=49.0803M/s
   BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:3                 1403463 ns      1403317 ns          502 rows/sec=46.7001M/s
   ```
   The AVX2 version:
   ```
   ./arrow-acero-hash-join-benchmark --benchmark_filter="BM_RowArray"
   The number of inputs is very large. BM_HashJoinBasic_PayloadSize will be repeated at least 125 times.
   2024-08-30T16:45:10+08:00
   Running ./arrow-acero-hash-join-benchmark
   Run on (16 X 2400 MHz CPU s)
   CPU Caches:
     L1 Data 32 KiB
     L1 Instruction 32 KiB
     L2 Unified 256 KiB (x8)
     L3 Unified 16384 KiB
   Load Average: 3.41, 2.64, 2.47
   
   ------------------------------------------------------------------------------------------------------------------------
   Benchmark                                                Time             CPU   Iterations UserCounters...
   ------------------------------------------------------------------------------------------------------------------------
   BM_RowArray_Decode/"boolean"                           275226 ns       275132 ns         2561 rows/sec=238.195M/s
   BM_RowArray_Decode/"int8"                              268566 ns       268506 ns         2654 rows/sec=244.072M/s
   BM_RowArray_Decode/"int16"                             265198 ns       265165 ns         2557 rows/sec=247.148M/s
   BM_RowArray_Decode/"int32"                             262050 ns       261924 ns         2709 rows/sec=250.207M/s
   BM_RowArray_Decode/"int64"                             260563 ns       260506 ns         2699 rows/sec=251.568M/s
   BM_RowArray_DecodeFixedSizeBinary/fixed_size:3         355187 ns       348711 ns         2019 rows/sec=187.935M/s
   BM_RowArray_DecodeFixedSizeBinary/fixed_size:5         350424 ns       350330 ns         2005 rows/sec=187.066M/s
   BM_RowArray_DecodeFixedSizeBinary/fixed_size:6         350453 ns       350352 ns         2024 rows/sec=187.055M/s
   BM_RowArray_DecodeFixedSizeBinary/fixed_size:7         349830 ns       349612 ns         1992 rows/sec=187.451M/s
   BM_RowArray_DecodeFixedSizeBinary/fixed_size:9         349569 ns       349349 ns         2002 rows/sec=187.592M/s
   BM_RowArray_DecodeFixedSizeBinary/fixed_size:16        351250 ns       351061 ns         1996 rows/sec=186.677M/s
   BM_RowArray_DecodeFixedSizeBinary/fixed_size:42        421615 ns       421506 ns         1633 rows/sec=155.478M/s
   BM_RowArray_DecodeBinary/max_length:32                1029484 ns      1029140 ns          685 rows/sec=63.6794M/s
   BM_RowArray_DecodeBinary/max_length:64                1406671 ns      1406103 ns          495 rows/sec=46.6075M/s
   BM_RowArray_DecodeBinary/max_length:128               1657173 ns      1655898 ns          432 rows/sec=39.5767M/s
   BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:0    276695 ns       276781 ns         2546 rows/sec=236.775M/s
   BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:1    263849 ns       263910 ns         2543 rows/sec=248.323M/s
   BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:2    473701 ns       473962 ns         1532 rows/sec=138.271M/s
   BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:0                  466305 ns       466162 ns         1447 rows/sec=140.584M/s
   BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:1                  471840 ns       471721 ns         1496 rows/sec=138.927M/s
   BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:2                 1065689 ns      1065185 ns          661 rows/sec=61.5245M/s
   BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:3                 1214506 ns      1213996 ns          562 rows/sec=53.9829M/s
   ```
   
   Take `int16` for example: the scalar version is about 30% faster than the vectorized version, and the same holds for the other fixed-length types narrower than 8 bytes. So, on the same machine, I benchmarked just the memory access pattern (gather several integers from scattered addresses, then store them together) in scalar vs. AVX2 form, using a much more compact benchmark (still local, not published on any branch). I put my conclusion, which I'm not at all confident about, in a code comment, quoted here:
   ```
     // Benchmarking shows that when the data element width is <= 8 bytes, the scalar code
     // almost always outperforms the vectorized code - about 2X~3X faster when the whole
     // data set fits in L1~LLC, and the ratio drops to about 1:1 as the data size grows and
     // most accesses hit main memory. This is possibly because decoding is mostly copying
     // scattered pieces of data, and there is not enough computation to pay off the cost of
     // the heavy gather instructions.
     // For fixed length 0 (the boolean column), the vectorized code wins by batching 8 bits
     // into a single byte instead of modifying the same byte 8 times as the scalar code does.
   ```
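   To make the hypothesis concrete, here is a minimal sketch of the two access patterns my compact benchmark compares. This is not the actual Arrow code; the names are made up, and it assumes 4-byte elements and rows at least 4 bytes wide:
   ```
   #include <immintrin.h>
   #include <cstddef>
   #include <cstdint>
   #include <cstring>

   // Scalar path: one small load + one store per selected row. The CPU can keep
   // many of these independent loads in flight at once.
   void GatherScalar(const uint8_t* rows, size_t row_width, const uint32_t* row_ids,
                     int num_rows, uint32_t* out) {
     for (int i = 0; i < num_rows; ++i) {
       std::memcpy(&out[i], rows + static_cast<size_t>(row_ids[i]) * row_width,
                   sizeof(uint32_t));
     }
   }

   // AVX2 path: 8 row ids per iteration, one vpgatherdd per 8 elements. The gather
   // is itself split into 8 loads by the hardware, so when the loop does nothing
   // but copy, there is little computation left to amortize its extra cost.
   void GatherAvx2(const uint8_t* rows, size_t row_width, const uint32_t* row_ids,
                   int num_rows, uint32_t* out) {
     const __m256i width = _mm256_set1_epi32(static_cast<int>(row_width));
     int i = 0;
     for (; i + 8 <= num_rows; i += 8) {
       __m256i ids = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(row_ids + i));
       __m256i byte_offsets = _mm256_mullo_epi32(ids, width);
       __m256i values = _mm256_i32gather_epi32(reinterpret_cast<const int*>(rows),
                                               byte_offsets, /*scale=*/1);
       _mm256_storeu_si256(reinterpret_cast<__m256i*>(out + i), values);
     }
     for (; i < num_rows; ++i) {  // scalar tail
       std::memcpy(&out[i], rows + static_cast<size_t>(row_ids[i]) * row_width,
                   sizeof(uint32_t));
     }
   }
   ```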
   
   What I'd like help with: Is my assumption reasonable? Or is this just the case on my particular hardware (I temporarily have difficulty accessing other Intel machines)? Or can the AVX2 code I wrote be improved further? Or is it simply a problem with the benchmark itself?
   
   BTW, when benchmarking with this PR as-is (where the slow AVX2 paths are intentionally avoided), the results are positive: about a 50% improvement for the AVX2 version.
   
   Thanks in advance!

