zanmato1984 commented on PR #43832:
URL: https://github.com/apache/arrow/pull/43832#issuecomment-2326646353
> This is on my other desktop (Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz, Coffee Lake), with a similar symptom (possibly because it is Coffee Lake, like my MBP).
>
> The scalar version:
>
> ```
> ARROW_USER_SIMD_LEVEL=NONE ./arrow-acero-hash-join-benchmark --benchmark_filter="BM_RowArray"
> 2024-09-01T00:32:49+08:00
> Running ./arrow-acero-hash-join-benchmark
> Run on (8 X 4900 MHz CPU s)
> CPU Caches:
> L1 Data 32 KiB (x8)
> L1 Instruction 32 KiB (x8)
> L2 Unified 256 KiB (x8)
> L3 Unified 12288 KiB (x1)
> Load Average: 0.46, 3.08, 2.34
> ***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
> -----------------------------------------------------------------------------------------------------------------------------------------------------------
> Benchmark  Time  CPU  Iterations  UserCounters...
> -----------------------------------------------------------------------------------------------------------------------------------------------------------
> BM_RowArray_Decode/"boolean"  345809 ns  345761 ns  1896  rows/sec=189.538M/s
> BM_RowArray_Decode/"int8"  267577 ns  267553 ns  2678  rows/sec=244.942M/s
> BM_RowArray_Decode/"int16"  237106 ns  237094 ns  2872  rows/sec=276.409M/s
> BM_RowArray_Decode/"int32"  243701 ns  243697 ns  2874  rows/sec=268.92M/s
> BM_RowArray_Decode/"int64"  239891 ns  239886 ns  2709  rows/sec=273.192M/s
> BM_RowArray_DecodeFixedSizeBinary/fixed_size:3  316511 ns  316471 ns  2260  rows/sec=207.081M/s
> BM_RowArray_DecodeFixedSizeBinary/fixed_size:5  310797 ns  310759 ns  2165  rows/sec=210.887M/s
> BM_RowArray_DecodeFixedSizeBinary/fixed_size:6  324059 ns  324020 ns  2251  rows/sec=202.256M/s
> BM_RowArray_DecodeFixedSizeBinary/fixed_size:7  311799 ns  311753 ns  2244  rows/sec=210.214M/s
> BM_RowArray_DecodeFixedSizeBinary/fixed_size:9  364401 ns  364346 ns  2016  rows/sec=179.87M/s
> BM_RowArray_DecodeFixedSizeBinary/fixed_size:16  349918 ns  349868 ns  1997  rows/sec=187.313M/s
> BM_RowArray_DecodeFixedSizeBinary/fixed_size:42  507058 ns  506962 ns  1427  rows/sec=129.27M/s
> BM_RowArray_DecodeBinary/max_length:32  1261872 ns  1261465 ns  554  rows/sec=51.9515M/s
> BM_RowArray_DecodeBinary/max_length:64  1585243 ns  1584698 ns  462  rows/sec=41.3549M/s
> BM_RowArray_DecodeBinary/max_length:128  1822727 ns  1822343 ns  384  rows/sec=35.962M/s
> BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:0  379210 ns  379150 ns  1843  rows/sec=172.847M/s
> BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:1  275680 ns  275657 ns  2693  rows/sec=237.741M/s
> BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:2  599291 ns  599291 ns  1257  rows/sec=109.354M/s
> BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:0  506824 ns  506710 ns  1376  rows/sec=129.334M/s
> BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:1  360611 ns  360579 ns  2123  rows/sec=181.75M/s
> BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:2  1182248 ns  1181939 ns  603  rows/sec=55.447M/s
> BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:3  1395220 ns  1394817 ns  529  rows/sec=46.9847M/s
> ```
>
> The AVX2 version:
>
> ```
> ./arrow-acero-hash-join-benchmark --benchmark_filter="BM_RowArray"
> 2024-09-01T00:33:14+08:00
> Running ./arrow-acero-hash-join-benchmark
> Run on (8 X 4900 MHz CPU s)
> CPU Caches:
> L1 Data 32 KiB (x8)
> L1 Instruction 32 KiB (x8)
> L2 Unified 256 KiB (x8)
> L3 Unified 12288 KiB (x1)
> Load Average: 0.64, 2.91, 2.31
> ***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
> -----------------------------------------------------------------------------------------------------------------------------------------------------------
> Benchmark  Time  CPU  Iterations  UserCounters...
> -----------------------------------------------------------------------------------------------------------------------------------------------------------
> BM_RowArray_Decode/"boolean"  262395 ns  262341 ns  2665  rows/sec=249.808M/s
> BM_RowArray_Decode/"int8"  263405 ns  263397 ns  2716  rows/sec=248.807M/s
> BM_RowArray_Decode/"int16"  248155 ns  248106 ns  2821  rows/sec=264.141M/s
> BM_RowArray_Decode/"int32"  257523 ns  257519 ns  2825  rows/sec=254.486M/s
> BM_RowArray_Decode/"int64"  245070 ns  245020 ns  2824  rows/sec=267.468M/s
> BM_RowArray_DecodeFixedSizeBinary/fixed_size:3  330801 ns  330759 ns  1980  rows/sec=198.135M/s
> BM_RowArray_DecodeFixedSizeBinary/fixed_size:5  327874 ns  327839 ns  2134  rows/sec=199.9M/s
> BM_RowArray_DecodeFixedSizeBinary/fixed_size:6  331278 ns  331242 ns  1947  rows/sec=197.846M/s
> BM_RowArray_DecodeFixedSizeBinary/fixed_size:7  328647 ns  328611 ns  2112  rows/sec=199.43M/s
> BM_RowArray_DecodeFixedSizeBinary/fixed_size:9  335129 ns  335101 ns  1937  rows/sec=195.568M/s
> BM_RowArray_DecodeFixedSizeBinary/fixed_size:16  347641 ns  347601 ns  2097  rows/sec=188.535M/s
> BM_RowArray_DecodeFixedSizeBinary/fixed_size:42  408356 ns  408265 ns  1731  rows/sec=160.521M/s
> BM_RowArray_DecodeBinary/max_length:32  985453 ns  985190 ns  716  rows/sec=66.5202M/s
> BM_RowArray_DecodeBinary/max_length:64  1250078 ns  1249727 ns  560  rows/sec=52.4394M/s
> BM_RowArray_DecodeBinary/max_length:128  1467264 ns  1466902 ns  474  rows/sec=44.6758M/s
> BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:0  266468 ns  266456 ns  2365  rows/sec=245.95M/s
> BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:1  246552 ns  246557 ns  2803  rows/sec=265.8M/s
> BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:2  437251 ns  437236 ns  1504  rows/sec=149.885M/s
> BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:0  455065 ns  455005 ns  1603  rows/sec=144.031M/s
> BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:1  445927 ns  445798 ns  1560  rows/sec=147.006M/s
> BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:2  1033287 ns  1032913 ns  702  rows/sec=63.4468M/s
> BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:3  1193991 ns  1193373 ns  544  rows/sec=54.9158M/s
> ```
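To make the symptom in the two quoted runs easier to see, the `rows/sec` counters can be turned into per-benchmark ratios. This is only a rough sketch: the values are copied from a handful of rows in the quoted output, while the shortened benchmark labels and the helper names (`avx2_speedup`, `summarize`) are made up for illustration and are not part of Arrow or Google Benchmark.

```python
# rows/sec in M/s, copied from the two quoted runs: (scalar, AVX2).
# The keys are shortened labels chosen for this sketch, not the full benchmark names.
QUOTED_ROWS_PER_SEC = {
    'Decode/"boolean"': (189.538, 249.808),
    'Decode/"int32"': (268.92, 254.486),
    "DecodeBinary/max_length:32": (51.9515, 66.5202),
    "DecodeOneOfColumns/var_length_row/column:1": (181.75, 147.006),
}


def avx2_speedup(scalar_rows_per_sec, avx2_rows_per_sec):
    """Return the AVX2/scalar throughput ratio; above 1 means AVX2 was faster."""
    return avx2_rows_per_sec / scalar_rows_per_sec


def summarize(results):
    """Map each benchmark label to its AVX2/scalar ratio, rounded to 3 decimals."""
    return {
        name: round(avx2_speedup(scalar, avx2), 3)
        for name, (scalar, avx2) in results.items()
    }


if __name__ == "__main__":
    for name, ratio in summarize(QUOTED_ROWS_PER_SEC).items():
        verdict = "AVX2 faster" if ratio > 1 else "AVX2 slower"
        print(f"{name:45s} {ratio:.3f}x  ({verdict})")
```

A ratio below 1 marks a case where the AVX2 path lost to the scalar one in these runs, which is the regression being discussed.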
OK, got something new.
The bad AVX2 gather performance seems strongly related to the mitigation [2] for the "Gather Data Sampling" vulnerability [1] (CVE-2022-40982, aka "Downfall"). The CPU in my quote is apparently on the list of affected models, for which the mitigation updates the microcode and causes a significant performance drop. Luckily, this mitigation can be easily disabled. With the mitigation disabled, the benchmark shows much better gather performance, and the AVX2 version beats the scalar version almost always:
```
./arrow-acero-hash-join-benchmark --benchmark_filter="BM_RowArray"
2024-09-03T21:52:08+08:00
Running ./arrow-acero-hash-join-benchmark
Run on (8 X 4900 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x8)
L1 Instruction 32 KiB (x8)
L2 Unified 256 KiB (x8)
L3 Unified 12288 KiB (x1)
Load Average: 0.56, 0.22, 0.08
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark  Time  CPU  Iterations  UserCounters...
-----------------------------------------------------------------------------------------------------------------------------------------------------------
BM_RowArray_Decode/"boolean"  204759 ns  204759 ns  3398  rows/sec=320.059M/s
BM_RowArray_Decode/"int8"  198094 ns  198094 ns  3476  rows/sec=330.827M/s
BM_RowArray_Decode/"int16"  199424 ns  199445 ns  3490  rows/sec=328.587M/s
BM_RowArray_Decode/"int32"  201338 ns  201351 ns  3476  rows/sec=325.477M/s
BM_RowArray_Decode/"int64"  207006 ns  207010 ns  3406  rows/sec=316.579M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:3  329304 ns  329258 ns  2137  rows/sec=199.038M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:5  328043 ns  327986 ns  2116  rows/sec=199.811M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:6  327691 ns  327650 ns  2137  rows/sec=200.015M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:7  329935 ns  329892 ns  2133  rows/sec=198.656M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:9  337341 ns  337283 ns  2085  rows/sec=194.302M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:16  335654 ns  335592 ns  2066  rows/sec=195.282M/s
BM_RowArray_DecodeFixedSizeBinary/fixed_size:42  412375 ns  412278 ns  1698  rows/sec=158.958M/s
BM_RowArray_DecodeBinary/max_length:32  859282 ns  858982 ns  815  rows/sec=76.2938M/s
BM_RowArray_DecodeBinary/max_length:64  1126945 ns  1126548 ns  620  rows/sec=58.1733M/s
BM_RowArray_DecodeBinary/max_length:128  1346772 ns  1346336 ns  521  rows/sec=48.6766M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:0  225688 ns  225646 ns  3105  rows/sec=290.433M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:1  222248 ns  222233 ns  3148  rows/sec=294.894M/s
BM_RowArray_DecodeOneOfColumns/"fixed_length_row:{boolean,int32,fixed_size_binary(64)}"/column:2  448432 ns  448380 ns  1564  rows/sec=146.159M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:0  289385 ns  289347 ns  2420  rows/sec=226.493M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:1  289905 ns  289839 ns  2413  rows/sec=226.109M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:2  874143 ns  873785 ns  801  rows/sec=75.0013M/s
BM_RowArray_DecodeOneOfColumns/"var_length_row:{boolean,int32,utf8,utf8}"/column:3  1037058 ns  1036678 ns  674  rows/sec=63.2164M/s
```
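For anyone reproducing this on Linux, kernels that carry the GDS patches report the mitigation state under `/sys/devices/system/cpu/vulnerabilities/gather_data_sampling`. The sketch below reads that file, assuming the path is present. The helper names (`read_gds_status`, `gather_mitigation_active`) are hypothetical, and treating any status that starts with "Mitigation" as active is a heuristic of mine, since the exact status strings can vary by kernel version.

```python
from pathlib import Path
from typing import Optional

# sysfs file exposed by Linux kernels that include the GDS patches
# (assumption: it may be absent on older kernels or on other operating systems).
GDS_STATUS_PATH = Path("/sys/devices/system/cpu/vulnerabilities/gather_data_sampling")


def read_gds_status(path: Path = GDS_STATUS_PATH) -> Optional[str]:
    """Return the kernel's GDS status line, or None if it cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def gather_mitigation_active(status: Optional[str]) -> Optional[bool]:
    """Heuristic: True if the status reports a mitigation, None if unknown."""
    if status is None:
        return None
    return status.startswith("Mitigation")


if __name__ == "__main__":
    status = read_gds_status()
    print("GDS status:", status if status is not None else "not reported by this kernel")
    print("mitigation active (heuristic):", gather_mitigation_active(status))
```

Recording this status next to each benchmark run would make it clear which gather behavior a given set of numbers reflects.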
[1] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/gather-data-sampling.html
[2] https://access.redhat.com/solutions/7027704