pitrou commented on PR #47573: URL: https://github.com/apache/arrow/pull/47573#issuecomment-3401219805
I'm testing performance of this PR locally on an AMD Zen 2 CPU on Ubuntu, and I'm seeing large regressions in unpack32 performance.

1) scalar unpack32 regresses on all input sizes:

* before:
```
BM_UnpackUint32/ScalarUnaligned/1/64          23.3 ns     23.3 ns   30099581 items_per_second=2.74685G/s
BM_UnpackUint32/ScalarUnaligned/2/64          15.1 ns     15.1 ns   45528078 items_per_second=4.22707G/s
BM_UnpackUint32/ScalarUnaligned/8/64          16.2 ns     16.1 ns   43418304 items_per_second=3.96297G/s
BM_UnpackUint32/ScalarUnaligned/20/64         22.4 ns     22.4 ns   31217249 items_per_second=2.85244G/s
BM_UnpackUint32/ScalarUnaligned/1/1024         310 ns      310 ns    2251073 items_per_second=3.29834G/s
BM_UnpackUint32/ScalarUnaligned/2/1024         213 ns      213 ns    3288295 items_per_second=4.80898G/s
BM_UnpackUint32/ScalarUnaligned/8/1024         357 ns      357 ns    1957044 items_per_second=2.86556G/s
BM_UnpackUint32/ScalarUnaligned/20/1024        339 ns      339 ns    2066262 items_per_second=3.01942G/s
BM_UnpackUint32/ScalarUnaligned/1/32768       9829 ns     9828 ns      71098 items_per_second=3.33403G/s
BM_UnpackUint32/ScalarUnaligned/2/32768       6543 ns     6543 ns     107368 items_per_second=5.00848G/s
BM_UnpackUint32/ScalarUnaligned/8/32768      11410 ns    11410 ns      60956 items_per_second=2.87199G/s
BM_UnpackUint32/ScalarUnaligned/20/32768     10737 ns    10736 ns      65168 items_per_second=3.05219G/s
```

* after:
```
BM_UnpackUint32/ScalarUnaligned/1/64          26.3 ns     26.3 ns   26597461 items_per_second=2.43124G/s
BM_UnpackUint32/ScalarUnaligned/2/64          25.9 ns     25.9 ns   27099769 items_per_second=2.47115G/s
BM_UnpackUint32/ScalarUnaligned/8/64          23.0 ns     23.0 ns   30377366 items_per_second=2.78236G/s
BM_UnpackUint32/ScalarUnaligned/20/64         41.1 ns     41.1 ns   17029900 items_per_second=1.55726G/s
BM_UnpackUint32/ScalarUnaligned/1/1024         368 ns      368 ns    1903394 items_per_second=2.78384G/s
BM_UnpackUint32/ScalarUnaligned/2/1024         342 ns      342 ns    2047462 items_per_second=2.99373G/s
BM_UnpackUint32/ScalarUnaligned/8/1024         358 ns      358 ns    1956688 items_per_second=2.86192G/s
BM_UnpackUint32/ScalarUnaligned/20/1024        538 ns      538 ns    1306908 items_per_second=1.90488G/s
BM_UnpackUint32/ScalarUnaligned/1/32768      11647 ns    11646 ns      59931 items_per_second=2.81379G/s
BM_UnpackUint32/ScalarUnaligned/2/32768      10844 ns    10843 ns      64537 items_per_second=3.02197G/s
BM_UnpackUint32/ScalarUnaligned/8/32768      11303 ns    11301 ns      61943 items_per_second=2.89948G/s
BM_UnpackUint32/ScalarUnaligned/20/32768     16915 ns    16915 ns      41375 items_per_second=1.93724G/s
```

2) SIMD unpack32 regresses, but only at small input sizes:

* before:
```
BM_UnpackUint32/Avx2Unaligned/1/64            6.10 ns     6.10 ns  115307711 items_per_second=10.4919G/s
BM_UnpackUint32/Avx2Unaligned/2/64            5.80 ns     5.80 ns  119616924 items_per_second=11.0319G/s
BM_UnpackUint32/Avx2Unaligned/8/64            7.84 ns     7.83 ns   89310891 items_per_second=8.16917G/s
BM_UnpackUint32/Avx2Unaligned/20/64           30.1 ns     30.1 ns   23168069 items_per_second=2.12844G/s
BM_UnpackUint32/Avx2Unaligned/1/1024          61.3 ns     61.3 ns   11549890 items_per_second=16.7084G/s
BM_UnpackUint32/Avx2Unaligned/2/1024          60.4 ns     60.4 ns   11462262 items_per_second=16.9551G/s
BM_UnpackUint32/Avx2Unaligned/8/1024           118 ns      118 ns    5936723 items_per_second=8.68313G/s
BM_UnpackUint32/Avx2Unaligned/20/1024          426 ns      426 ns    1628900 items_per_second=2.40438G/s
BM_UnpackUint32/Avx2Unaligned/1/32768         1916 ns     1916 ns     364744 items_per_second=17.1002G/s
BM_UnpackUint32/Avx2Unaligned/2/32768         2071 ns     2071 ns     340164 items_per_second=15.8259G/s
BM_UnpackUint32/Avx2Unaligned/8/32768         3814 ns     3814 ns     183339 items_per_second=8.59169G/s
BM_UnpackUint32/Avx2Unaligned/20/32768       13674 ns    13672 ns      51167 items_per_second=2.39671G/s
```

* after:
```
BM_UnpackUint32/Avx2Unaligned/1/64            11.7 ns     11.7 ns   60043243 items_per_second=5.47447G/s
BM_UnpackUint32/Avx2Unaligned/2/64            11.7 ns     11.7 ns   59912039 items_per_second=5.47921G/s
BM_UnpackUint32/Avx2Unaligned/8/64            13.4 ns     13.3 ns   52435550 items_per_second=4.79412G/s
BM_UnpackUint32/Avx2Unaligned/20/64           37.6 ns     37.6 ns   18551367 items_per_second=1.69999G/s
BM_UnpackUint32/Avx2Unaligned/1/1024          68.2 ns     68.2 ns   10297857 items_per_second=15.0158G/s
BM_UnpackUint32/Avx2Unaligned/2/1024          68.2 ns     68.2 ns   10265410 items_per_second=15.0135G/s
BM_UnpackUint32/Avx2Unaligned/8/1024           122 ns      122 ns    5723819 items_per_second=8.37532G/s
BM_UnpackUint32/Avx2Unaligned/20/1024          460 ns      460 ns    1521877 items_per_second=2.2277G/s
BM_UnpackUint32/Avx2Unaligned/1/32768         1958 ns     1957 ns     357945 items_per_second=16.7407G/s
BM_UnpackUint32/Avx2Unaligned/2/32768         1939 ns     1939 ns     361274 items_per_second=16.9008G/s
BM_UnpackUint32/Avx2Unaligned/8/32768         3831 ns     3830 ns     182683 items_per_second=8.55518G/s
BM_UnpackUint32/Avx2Unaligned/20/32768       14430 ns    14429 ns      48513 items_per_second=2.27106G/s
```
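
To put numbers on the before/after comparison, the two runs can be reduced to a per-benchmark slowdown ratio. The following is only a sketch and not part of Arrow's tooling: the helper names `parse_times` and `slowdowns` are hypothetical, the parser assumes the Google Benchmark console row layout seen in the output above (name, wall time, CPU time, iterations), and only three of the reported rows are embedded as sample input.

```python
import re

# One Google Benchmark console row: name, wall time (ns), CPU time (ns), iterations.
ROW = re.compile(r"^(\S+)\s+([\d.]+) ns\s+([\d.]+) ns\s+(\d+)")


def parse_times(text):
    """Return {benchmark name: CPU time in ns} for every row found in `text`."""
    times = {}
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            times[m.group(1)] = float(m.group(3))
    return times


def slowdowns(before, after):
    """Return {name: after/before CPU-time ratio}; a ratio above 1.0 is a regression."""
    b, a = parse_times(before), parse_times(after)
    return {name: a[name] / b[name] for name in b if name in a}


# Sample rows copied from the runs reported above.
BEFORE = """\
BM_UnpackUint32/ScalarUnaligned/2/64   15.1 ns  15.1 ns  45528078 items_per_second=4.22707G/s
BM_UnpackUint32/Avx2Unaligned/1/64     6.10 ns  6.10 ns  115307711 items_per_second=10.4919G/s
BM_UnpackUint32/Avx2Unaligned/1/32768  1916 ns  1916 ns  364744 items_per_second=17.1002G/s
"""
AFTER = """\
BM_UnpackUint32/ScalarUnaligned/2/64   25.9 ns  25.9 ns  27099769 items_per_second=2.47115G/s
BM_UnpackUint32/Avx2Unaligned/1/64     11.7 ns  11.7 ns  60043243 items_per_second=5.47447G/s
BM_UnpackUint32/Avx2Unaligned/1/32768  1958 ns  1957 ns  357945 items_per_second=16.7407G/s
"""

if __name__ == "__main__":
    for name, ratio in slowdowns(BEFORE, AFTER).items():
        print(f"{name:40s} {ratio:.2f}x")
```

On these three rows the ratios come out at roughly 1.72x, 1.92x and 1.02x, which matches the pattern described: the scalar path slows down markedly, while the AVX2 path is hit at 64 values and close to unchanged at 32768.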
