zanmato1984 commented on PR #43832:
URL: https://github.com/apache/arrow/pull/43832#issuecomment-2320767085
Before formalizing this PR, I think I could use some help from @pitrou @cyb70289 @wgtmac @mapleFU.

I'm benchmarking `BM_RowArray_Decode*` (see the PR) on my 2019 Intel MBP (Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz, Coffee Lake), using my other branch https://github.com/zanmato1984/arrow/tree/swiss-join-avx2-for-maple. That branch is based on the PR branch, with some code paths commented out so that it compares only `DecodeFixedLength + Visit` (the scalar version) against `DecodeFixedLength_avx2 + Visit_avx2` (the AVX2 version). The results surprised me.

The scalar version:

```
ARROW_USER_SIMD_LEVEL=NONE ./arrow-acero-hash-join-benchmark --benchmark_filter="BM_RowArray"
The number of inputs is very large. BM_HashJoinBasic_PayloadSize will be repeated at least 125 times.
2024-08-30T16:44:41+08:00
Running ./arrow-acero-hash-join-benchmark
Run on (16 X 2400 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB
  L1 Instruction 32 KiB
  L2 Unified 256 KiB (x8)
  L3 Unified 16384 KiB
Load Average: 2.24, 2.35, 2.37
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                             Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------------------------------------------
BM_RowArray_Decode/"boolean"                                                                        335756 ns       335771 ns         2088 rows/sec=195.178M/s
BM_RowArray_Decode/"int8"                                                                           222946 ns       222964 ns         3218 rows/sec=293.926M/s
BM_RowArray_Decode/"int16"                                                                          206334 ns       206310 ns         3450 rows/sec=317.653M/s
BM_RowArray_Decode/"int32"                                                                          212043 ns       211908 ns         3243 rows/sec=309.262M/s
BM_RowArray_Decode/"int64"                                                                          214047 ns       213696 ns         3204 rows/sec=306.674M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:3                                                      325797 ns       325116 ns         2116 rows/sec=201.574M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:5                                                      332284 ns       331637 ns         2106 rows/sec=197.611M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:6                                                      332328 ns       331591 ns         2140 rows/sec=197.638M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:7                                                      331975 ns       331820 ns         2150 rows/sec=197.501M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:9                                                      367805 ns       367605 ns         1915 rows/sec=178.275M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:16                                                     368041 ns       367993 ns         1905 rows/sec=178.088M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:42                                                     527913 ns       527701 ns         1342 rows/sec=124.19M/s
BM_RowArray_DecodeBinary/max_length:32                                                             1645646 ns      1498623 ns          469 rows/sec=43.7302M/s
BM_RowArray_DecodeBinary/max_length:64                                                             1751193 ns      1750436 ns          408 rows/sec=37.4392M/s
BM_RowArray_DecodeBinary/max_length:128                                                            2076064 ns      2073780 ns          350 rows/sec=31.6017M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:0    353249 ns       353196 ns         1985 rows/sec=185.549M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:1    226431 ns       226484 ns         3076 rows/sec=289.359M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:2    583054 ns       583129 ns         1233 rows/sec=112.385M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:0                  455519 ns       455603 ns         1507 rows/sec=143.842M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:1                  288495 ns       288417 ns         2450 rows/sec=227.223M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:2                 1335445 ns      1335260 ns          530 rows/sec=49.0803M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:3                 1403463 ns      1403317 ns          502 rows/sec=46.7001M/s
```

The AVX2 version:

```
./arrow-acero-hash-join-benchmark --benchmark_filter="BM_RowArray"
The number of inputs is very large. BM_HashJoinBasic_PayloadSize will be repeated at least 125 times.
2024-08-30T16:45:10+08:00
Running ./arrow-acero-hash-join-benchmark
Run on (16 X 2400 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB
  L1 Instruction 32 KiB
  L2 Unified 256 KiB (x8)
  L3 Unified 16384 KiB
Load Average: 3.41, 2.64, 2.47
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                             Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------------------------------------------
BM_RowArray_Decode/"boolean"                                                                        275226 ns       275132 ns         2561 rows/sec=238.195M/s
BM_RowArray_Decode/"int8"                                                                           268566 ns       268506 ns         2654 rows/sec=244.072M/s
BM_RowArray_Decode/"int16"                                                                          265198 ns       265165 ns         2557 rows/sec=247.148M/s
BM_RowArray_Decode/"int32"                                                                          262050 ns       261924 ns         2709 rows/sec=250.207M/s
BM_RowArray_Decode/"int64"                                                                          260563 ns       260506 ns         2699 rows/sec=251.568M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:3                                                      355187 ns       348711 ns         2019 rows/sec=187.935M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:5                                                      350424 ns       350330 ns         2005 rows/sec=187.066M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:6                                                      350453 ns       350352 ns         2024 rows/sec=187.055M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:7                                                      349830 ns       349612 ns         1992 rows/sec=187.451M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:9                                                      349569 ns       349349 ns         2002 rows/sec=187.592M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:16                                                     351250 ns       351061 ns         1996 rows/sec=186.677M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:42                                                     421615 ns       421506 ns         1633 rows/sec=155.478M/s
BM_RowArray_DecodeBinary/max_length:32                                                             1029484 ns      1029140 ns          685 rows/sec=63.6794M/s
BM_RowArray_DecodeBinary/max_length:64                                                             1406671 ns      1406103 ns          495 rows/sec=46.6075M/s
BM_RowArray_DecodeBinary/max_length:128                                                            1657173 ns      1655898 ns          432 rows/sec=39.5767M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:0    276695 ns       276781 ns         2546 rows/sec=236.775M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:1    263849 ns       263910 ns         2543 rows/sec=248.323M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:2    473701 ns       473962 ns         1532 rows/sec=138.271M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:0                  466305 ns       466162 ns         1447 rows/sec=140.584M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:1                  471840 ns       471721 ns         1496 rows/sec=138.927M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:2                 1065689 ns      1065185 ns          661 rows/sec=61.5245M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:3                 1214506 ns      1213996 ns          562 rows/sec=53.9829M/s
```

Take `int16` for example: the scalar version is about 30% faster than the vectorized version, and the same holds for the other fixed-length types narrower than 8 bytes. So, on the same machine, I benchmarked just the memory access pattern (gather several integers from scattered addresses, then store them together) in scalar and AVX2 form, using a much more compact benchmark (it's still local and not published in any branch).
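The access pattern that compact benchmark compares is roughly the following (a simplified sketch, not the actual benchmark code; `kRowWidth`, the function names, and the row layout are made up for illustration, and `num_rows` is assumed to be a multiple of 8):

```cpp
#include <immintrin.h>

#include <cstdint>
#include <cstring>

constexpr int kRowWidth = 32;  // hypothetical bytes per row

// Scalar: one independent 4-byte load + store per row.
void GatherScalar(const uint8_t* rows, const uint32_t* row_ids, int num_rows,
                  uint32_t* out) {
  for (int i = 0; i < num_rows; ++i) {
    std::memcpy(&out[i], rows + row_ids[i] * kRowWidth, sizeof(uint32_t));
  }
}

// AVX2: one vpgatherdd per 8 rows, then a single contiguous 32-byte store.
void GatherAvx2(const uint8_t* rows, const uint32_t* row_ids, int num_rows,
                uint32_t* out) {
  for (int i = 0; i < num_rows; i += 8) {
    __m256i ids =
        _mm256_loadu_si256(reinterpret_cast<const __m256i*>(row_ids + i));
    // Byte offset of each gathered element: row_id * kRowWidth.
    __m256i offsets = _mm256_mullo_epi32(ids, _mm256_set1_epi32(kRowWidth));
    __m256i vals = _mm256_i32gather_epi32(reinterpret_cast<const int*>(rows),
                                          offsets, /*scale=*/1);
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out + i), vals);
  }
}
```

In this kind of micro-benchmark on my machine, the scalar loop wins as long as the data fits in cache, which is what the comment below tries to capture.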
I put my conclusion, which I'm not at all confident about, into a code comment, quoted here:

```
// Benchmarking shows that when the data element width is <= 8, the scalar code almost
// always outperforms the vectorized code - about 2X~3X faster when the whole data set
// falls into L1~LLC, and the ratio goes down to about 1:1 as the data size increases
// when most of the accesses hit the main memory. This is possibly because decoding
// is mostly copying scattered pieces of data and there is not enough computation to pay
// off the cost of the heavy gather instructions.
// For fixed length 0 (boolean column), the vectorized code wins by batching 8 bits into
// a single byte instead of modifying the same byte 8 times as the scalar code does.
```

What I need help with is: Is my assumption reasonable? Or is this just the case on my particular hardware (I temporarily have difficulty accessing other Intel machines)? Or can the AVX2 code I wrote be improved further? Or is it simply a problem with the benchmark itself?

BTW, if I benchmark this PR as-is (where the slow AVX2 paths are intentionally avoided), the results are positive: about a 50% improvement for the AVX2 version.

Thanks in advance!
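PS: to make the boolean point above concrete, here is a rough sketch of the bit-packing idea (my own illustration, not the code in the PR; `num_bits` is assumed to be a multiple of 32, and the output is LSB-first bits on a little-endian machine):

```cpp
#include <immintrin.h>

#include <cstdint>
#include <cstring>

// Scalar: read-modify-writes the same output byte up to 8 times.
void PackBitsScalar(const uint8_t* bytes, int num_bits, uint8_t* out) {
  for (int i = 0; i < num_bits; ++i) {
    uint8_t bit = uint8_t{1} << (i % 8);
    if (bytes[i] != 0) {
      out[i / 8] |= bit;
    } else {
      out[i / 8] &= static_cast<uint8_t>(~bit);
    }
  }
}

// AVX2: packs 32 input bytes into 32 bits with a single movemask, so each
// output byte is written exactly once.
void PackBitsAvx2(const uint8_t* bytes, int num_bits, uint8_t* out) {
  for (int i = 0; i < num_bits; i += 32) {
    __m256i v =
        _mm256_loadu_si256(reinterpret_cast<const __m256i*>(bytes + i));
    // A byte is "true" iff it is non-zero; movemask collects the MSB of each
    // comparison result, with bit j of the mask corresponding to byte j.
    __m256i is_zero = _mm256_cmpeq_epi8(v, _mm256_setzero_si256());
    uint32_t mask = ~static_cast<uint32_t>(_mm256_movemask_epi8(is_zero));
    std::memcpy(out + i / 8, &mask, sizeof(mask));
  }
}
```

The AVX2 loop issues one 4-byte store per 32 input bytes, whereas the scalar loop touches each output byte 8 times, which is where the vectorized win on boolean columns comes from.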
