pitrou commented on PR #13334: URL: https://github.com/apache/arrow/pull/13334#issuecomment-1191713393
Ok, so I implemented the algorithm which was originally envisioned in https://issues.apache.org/jira/browse/ARROW-14314?focusedCommentId=17445377&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17445377 The performance numbers are indeed quite better now: * on integers: ``` ArraySortIndicesInt64Narrow/32768/10000 12096 ns 12093 ns 58332 bytes_per_second=2.52347G/s items_per_second=338.694M/s null_percent=0.01 size=32.768k ArraySortIndicesInt64Narrow/32768/100 12100 ns 12098 ns 57680 bytes_per_second=2.5226G/s items_per_second=338.578M/s null_percent=1 size=32.768k ArraySortIndicesInt64Narrow/32768/10 13387 ns 13384 ns 50920 bytes_per_second=2.28012G/s items_per_second=306.032M/s null_percent=10 size=32.768k ArraySortIndicesInt64Narrow/32768/2 23478 ns 23473 ns 28378 bytes_per_second=1.3001G/s items_per_second=174.496M/s null_percent=50 size=32.768k ArraySortIndicesInt64Narrow/32768/1 6422 ns 6420 ns 109333 bytes_per_second=4.75339G/s items_per_second=637.989M/s null_percent=100 size=32.768k ArraySortIndicesInt64Narrow/32768/0 12042 ns 12040 ns 57830 bytes_per_second=2.53471G/s items_per_second=340.203M/s null_percent=0 size=32.768k ArraySortIndicesInt64Narrow/1048576/100 400335 ns 400263 ns 1721 bytes_per_second=2.4398G/s items_per_second=327.465M/s null_percent=1 size=1048.58k ArraySortIndicesInt64Narrow/8388608/100 5230098 ns 5226130 ns 134 bytes_per_second=1.49489G/s items_per_second=200.641M/s null_percent=1 size=8.38861M ArraySortIndicesInt64Wide/32768/10000 196031 ns 195992 ns 3553 bytes_per_second=159.446M/s items_per_second=20.8989M/s null_percent=0.01 size=32.768k ArraySortIndicesInt64Wide/32768/100 199523 ns 199488 ns 3493 bytes_per_second=156.651M/s items_per_second=20.5325M/s null_percent=1 size=32.768k ArraySortIndicesInt64Wide/32768/10 189211 ns 189180 ns 3692 bytes_per_second=165.187M/s items_per_second=21.6513M/s null_percent=10 size=32.768k ArraySortIndicesInt64Wide/32768/2 117105 ns 117085 ns 5907 bytes_per_second=266.899M/s items_per_second=34.983M/s null_percent=50 size=32.768k ArraySortIndicesInt64Wide/32768/1 6399 ns 6398 ns 109441 bytes_per_second=4.76962G/s items_per_second=640.168M/s null_percent=100 size=32.768k ArraySortIndicesInt64Wide/32768/0 195661 ns 195629 ns 3571 bytes_per_second=159.741M/s items_per_second=20.9376M/s null_percent=0 size=32.768k ArraySortIndicesInt64Wide/1048576/100 9867426 ns 9865950 ns 71 bytes_per_second=101.359M/s items_per_second=13.2853M/s null_percent=1 size=1048.58k ArraySortIndicesInt64Wide/8388608/100 104777802 ns 104736080 ns 7 bytes_per_second=76.3825M/s items_per_second=10.0116M/s null_percent=1 size=8.38861M ArraySortIndicesInt64Dict/32768/10000 21406 ns 21401 ns 32998 bytes_per_second=1.42602G/s items_per_second=191.397M/s null_percent=0.01 size=32.768k ArraySortIndicesInt64Dict/32768/100 23328 ns 23320 ns 29746 bytes_per_second=1.30864G/s items_per_second=175.643M/s null_percent=1 size=32.768k ArraySortIndicesInt64Dict/32768/10 32835 ns 32828 ns 20994 bytes_per_second=951.943M/s items_per_second=124.773M/s null_percent=10 size=32.768k ArraySortIndicesInt64Dict/32768/2 63866 ns 63851 ns 10790 bytes_per_second=489.417M/s items_per_second=64.1489M/s null_percent=50 size=32.768k ArraySortIndicesInt64Dict/32768/1 51498 ns 51488 ns 13377 bytes_per_second=606.943M/s items_per_second=79.5532M/s null_percent=100 size=32.768k ArraySortIndicesInt64Dict/32768/0 20983 ns 20978 ns 33428 bytes_per_second=1.45471G/s items_per_second=195.248M/s null_percent=0 size=32.768k ArraySortIndicesInt64Dict/1048576/100 618425 ns 618258 ns 1150 bytes_per_second=1.57954G/s items_per_second=212.002M/s null_percent=1 size=1048.58k ArraySortIndicesInt64Dict/8388608/100 9134358 ns 9127689 ns 77 bytes_per_second=876.454M/s items_per_second=114.879M/s null_percent=1 size=8.38861M ``` * on strings: ``` ArraySortIndicesStrings/32768/10000 380932 ns 380866 ns 1833 items_per_second=10.7544M/s null_percent=0.01 size=4.096k ArraySortIndicesStrings/32768/100 381913 ns 381849 ns 1831 items_per_second=10.7267M/s null_percent=1 size=4.096k ArraySortIndicesStrings/32768/10 353224 ns 353159 ns 1978 items_per_second=11.5982M/s null_percent=10 size=4.096k ArraySortIndicesStrings/32768/2 198191 ns 198155 ns 3485 items_per_second=20.6707M/s null_percent=50 size=4.096k ArraySortIndicesStrings/32768/1 6516 ns 6514 ns 107935 items_per_second=628.767M/s null_percent=100 size=4.096k ArraySortIndicesStrings/32768/0 376185 ns 376122 ns 1866 items_per_second=10.8901M/s null_percent=0 size=4.096k ArraySortIndicesStrings/1048576/100 17554424 ns 17551747 ns 40 items_per_second=7.46775M/s null_percent=1 size=131.072k ArraySortIndicesStrings/8388608/100 218960374 ns 218848644 ns 3 items_per_second=4.79133M/s null_percent=1 size=1048.58k ArraySortIndicesStringsDict/32768/10000 21847 ns 21842 ns 32054 items_per_second=187.525M/s null_percent=0.01 size=4.096k ArraySortIndicesStringsDict/32768/100 23700 ns 23695 ns 29360 items_per_second=172.865M/s null_percent=1 size=4.096k ArraySortIndicesStringsDict/32768/10 33740 ns 33732 ns 20298 items_per_second=121.428M/s null_percent=10 size=4.096k ArraySortIndicesStringsDict/32768/2 64616 ns 64602 ns 10581 items_per_second=63.4031M/s null_percent=50 size=4.096k ArraySortIndicesStringsDict/32768/1 53657 ns 53645 ns 12715 items_per_second=76.3539M/s null_percent=100 size=4.096k ArraySortIndicesStringsDict/32768/0 21505 ns 21500 ns 32619 items_per_second=190.512M/s null_percent=0 size=4.096k ArraySortIndicesStringsDict/1048576/100 629995 ns 629868 ns 1113 items_per_second=208.094M/s null_percent=1 size=131.072k ArraySortIndicesStringsDict/8388608/100 9055816 ns 9045824 ns 77 items_per_second=115.918M/s null_percent=1 size=1048.58k ``` We see that sorting a dictionary array is now much faster (potentially 10x faster) than sorting an equivalent dense array, if the dictionary size is small (which is a very common case). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
