ZhangHuiGui commented on PR #41234: URL: https://github.com/apache/arrow/pull/41234#issuecomment-2121606281
> I have a question about the necessity of this fix.
>
> IIUC, though `RowTableImpl` supports both usages of columns "in encoding order" and "not in encoding order", the user (e.g. `Grouper` or `SwissJoin`) is free to choose either, in other words, the user is not mandatory to support both. For example, the current `SwissJoin` is using it the way that all columns are assumed "not in encoding order" and it is perfectly fine because there isn't a case that requires `SwissJoin` to do it the other way. The same goes to `Grouper` as well.
>
> Is there a reason that `Grouper` must assume the columns are not in encoding order, or that `Grouper` can benefit in terms of performance/complexity from treating the columns not in encoding order?
>
> Thanks.

This PR tries to support both "sorted" and "not-sorted" column modes for the grouper.
- "sorted" mode sorts the columns for the RowTable and compares with "are_cols_in_encoding_order=true".
- "not-sorted" mode does the opposite: it does not sort the columns for the RowTable and compares with "are_cols_in_encoding_order=false".

In some scenarios tested by the benchmark (https://github.com/apache/arrow/pull/41036), the grouper performs better in "not-sorted" mode than in "sorted" mode.
> The columns in this scenario are already sorted. Applying sorted mode to them again makes performance worse.

**not-sorted**
```shell
GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/10000 31779 us 31771 us 22 items_per_second=1.03138M/s null_percent=0.01 num_groups=32.063k size=32.768k uniqueness=0.0305777
GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/100 32104 us 32095 us 22 items_per_second=1.02096M/s null_percent=1 num_groups=32.044k size=32.768k uniqueness=0.0305595
GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/10 34213 us 34205 us 21 items_per_second=957.999k/s null_percent=10 num_groups=31.015k size=32.768k uniqueness=0.0295782
GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/2 40191 us 40179 us 18 items_per_second=815.542k/s null_percent=50 num_groups=21.011k size=32.768k uniqueness=0.0200377
GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/1 31498 us 31492 us 22 items_per_second=1.04052M/s null_percent=100 num_groups=1 size=32.768k uniqueness=953.674n
GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/0 28087 us 28081 us 25 items_per_second=1.16689M/s null_percent=0 num_groups=32.063k size=32.768k uniqueness=0.0305777
GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/10000 14359 us 14356 us 47 items_per_second=2.28251M/s null_percent=0.01 num_groups=32.768k size=32.768k uniqueness=0.03125
GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/100 14625 us 14622 us 48 items_per_second=2.24103M/s null_percent=1 num_groups=32.768k size=32.768k uniqueness=0.03125
GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/10 15353 us 15349 us 46 items_per_second=2.13491M/s null_percent=10 num_groups=32.765k size=32.768k uniqueness=0.0312471
GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/2 18351 us 18347 us 38 items_per_second=1.786M/s null_percent=50 num_groups=30.796k size=32.768k uniqueness=0.0293694
GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/1 16070 us 16067 us 44 items_per_second=2.03942M/s null_percent=100 num_groups=1 size=32.768k uniqueness=953.674n
GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/0 11155 us 11153 us 63 items_per_second=2.93807M/s null_percent=0 num_groups=32.768k size=32.768k uniqueness=0.03125
```

**sorted**
```shell
GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/10000 32762 us 32749 us 22 items_per_second=1.00057M/s null_percent=0.01 num_groups=32.063k size=32.768k uniqueness=0.0305777
GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/100 33081 us 33068 us 21 items_per_second=990.941k/s null_percent=1 num_groups=32.044k size=32.768k uniqueness=0.0305595
GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/10 35627 us 35612 us 19 items_per_second=920.136k/s null_percent=10 num_groups=31.015k size=32.768k uniqueness=0.0295782
GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/2 41915 us 41892 us 17 items_per_second=782.197k/s null_percent=50 num_groups=21.011k size=32.768k uniqueness=0.0200377
GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/1 31307 us 31300 us 22 items_per_second=1.04692M/s null_percent=100 num_groups=1 size=32.768k uniqueness=953.674n
GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/0 28466 us 28460 us 25 items_per_second=1.15138M/s null_percent=0 num_groups=32.063k size=32.768k uniqueness=0.0305777
GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/10000 14997 us 14993 us 47 items_per_second=2.1856M/s null_percent=0.01 num_groups=32.768k size=32.768k uniqueness=0.03125
GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/100 14937 us 14933 us 46 items_per_second=2.19428M/s null_percent=1 num_groups=32.768k size=32.768k uniqueness=0.03125
GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/10 15596 us 15592 us 45 items_per_second=2.10155M/s null_percent=10 num_groups=32.765k size=32.768k uniqueness=0.0312471
GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/2 18663 us 18658 us 37 items_per_second=1.75624M/s null_percent=50 num_groups=30.796k size=32.768k uniqueness=0.0293694
GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/1 16221 us 16217 us 43 items_per_second=2.0206M/s null_percent=100 num_groups=1 size=32.768k uniqueness=953.674n
GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/0 11065 us 11062 us 63 items_per_second=2.96212M/s null_percent=0 num_groups=32.768k size=32.768k uniqueness=0.03125
```

Other scenarios require column sorting, and the grouper performs better after sorting.

**not-sorted**
```shell
GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), boolean}"/32768/10000 29802 us 29793 us 23 items_per_second=1.09987M/s null_percent=0.01 num_groups=32.768k size=32.768k uniqueness=0.0625
GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), boolean}"/32768/100 30674 us 30659 us 24 items_per_second=1.06878M/s null_percent=1 num_groups=32.768k size=32.768k uniqueness=0.0625
GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), boolean}"/32768/10 31777 us 31765 us 22 items_per_second=1.03159M/s null_percent=10 num_groups=32.764k size=32.768k uniqueness=0.0624924
GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), boolean}"/32768/2 34246 us 34232 us 21 items_per_second=957.24k/s null_percent=50 num_groups=30.393k size=32.768k uniqueness=0.05797
GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), boolean}"/32768/1 17872 us 17867 us 39 items_per_second=1.83396M/s null_percent=100 num_groups=1 size=32.768k uniqueness=1.90735u
GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), boolean}"/32768/0 28033 us 28017 us 26 items_per_second=1.16959M/s null_percent=0 num_groups=32.768k size=32.768k uniqueness=0.0625
```

**sorted**
```shell
GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), boolean}"/32768/10000 24494 us 24485 us 28 items_per_second=1.33829M/s null_percent=0.01 num_groups=32.768k size=32.768k uniqueness=0.0625
GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), boolean}"/32768/100 25044 us 25036 us 29 items_per_second=1.30883M/s null_percent=1 num_groups=32.768k size=32.768k uniqueness=0.0625
GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), boolean}"/32768/10 25805 us 25795 us 27 items_per_second=1.27032M/s null_percent=10 num_groups=32.764k size=32.768k uniqueness=0.0624924
GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), boolean}"/32768/2 29097 us 29084 us 24 items_per_second=1.12666M/s null_percent=50 num_groups=30.393k size=32.768k uniqueness=0.05797
GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), boolean}"/32768/1 17850 us 17845 us 39 items_per_second=1.83627M/s null_percent=100 num_groups=1 size=32.768k uniqueness=1.90735u
GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), boolean}"/32768/0 22352 us 22341 us 31 items_per_second=1.46673M/s null_percent=0 num_groups=32.768k size=32.768k uniqueness=0.0625
```

In general, in scenarios where sorted mode is better, grouper performance improves significantly after sorting (about 15%); in scenarios where not-sorted mode is better, the improvement is 2%-4%.

Do you think we should support not-sorted mode in the grouper for users?
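
As a rough cross-check of the percentages quoted in the summary, the relative time difference between the two modes can be recomputed from the benchmark rows. The sketch below is in Python and uses the first `us` column of each row for two of the schemas; the helper name `time_reduction_percent` and the dictionary layout are my own illustration and are not part of Arrow or the benchmark.

```python
def time_reduction_percent(baseline_us: float, candidate_us: float) -> float:
    """Percent by which candidate_us is lower than baseline_us (negative if it is slower)."""
    return (baseline_us - candidate_us) / baseline_us * 100.0


# null_percent -> (not_sorted_us, sorted_us), copied from the first "us" column above.
# Schema: {utf8, int32, int64, fixed_size_binary(128), boolean}
MIXED = {
    0.01: (29802, 24494),
    1: (30674, 25044),
    10: (31777, 25805),
    50: (34246, 29097),
    100: (17872, 17850),
    0: (28033, 22352),
}

# Schema: {boolean, boolean, utf8, utf8}
BOOL_UTF8 = {
    0.01: (31779, 32762),
    1: (32104, 33081),
    10: (34213, 35627),
    50: (40191, 41915),
    100: (31498, 31307),
    0: (28087, 28466),
}

# Gain of sorted mode over not-sorted mode for the mixed schema.
sorted_gain = {
    null_pct: time_reduction_percent(not_sorted, sorted_)
    for null_pct, (not_sorted, sorted_) in MIXED.items()
}

# Gain of not-sorted mode over sorted mode for the boolean/utf8 schema.
not_sorted_gain = {
    null_pct: time_reduction_percent(sorted_, not_sorted)
    for null_pct, (not_sorted, sorted_) in BOOL_UTF8.items()
}

for label, gains in (("sorted gain", sorted_gain), ("not-sorted gain", not_sorted_gain)):
    for null_pct, gain in gains.items():
        print(f"{label} null_percent={null_pct}: {gain:+.1f}%")
```

On these numbers, sorted mode cuts the time by roughly 15% to 20% for the mixed schema in every case except `null_percent=100`, which is essentially unchanged. For the boolean/utf8 schema, not-sorted mode is ahead by roughly 1% to 4%, again except at `null_percent=100`, where sorted mode is marginally faster.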
