ZhangHuiGui commented on PR #41234:
URL: https://github.com/apache/arrow/pull/41234#issuecomment-2121606281

   > I have a question about the necessity of this fix.
   > 
   > IIUC, though `RowTableImpl` supports both usages of columns "in encoding 
order" and "not in encoding order", the user (e.g. `Grouper` or `SwissJoin`) is 
free to choose either; in other words, the user is not required to support 
both. For example, the current `SwissJoin` uses it in such a way that all columns 
are assumed "not in encoding order", and that is perfectly fine because there 
isn't a case that requires `SwissJoin` to do it the other way. The same goes for 
`Grouper` as well.
   > 
   > Is there a reason that `Grouper` must assume the columns are not in 
encoding order, or that `Grouper` can benefit in terms of 
performance/complexity from treating the columns not in encoding order?
   > 
   > Thanks.
   
   This PR tries to support both "sorted" and "not-sorted" column modes in the 
grouper.
   - "sorted" mode sorts the columns of the RowTable and compares with 
"are_cols_in_encoding_order=true".
   - "not-sorted" mode does the opposite: it does not sort the columns of the 
RowTable and compares with "are_cols_in_encoding_order=false".
   
   In some scenarios tested with the benchmark from 
https://github.com/apache/arrow/pull/41036, the grouper performs better in 
"not-sorted" mode than in "sorted" mode.
   > The columns in these scenarios are already sorted, so sorting them again 
in "sorted" mode makes performance worse.
   
   **not-sorted**
   ```shell
   GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/10000      
31779 us        31771 us           22 items_per_second=1.03138M/s 
null_percent=0.01 num_groups=32.063k size=32.768k uniqueness=0.0305777
   GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/100        
32104 us        32095 us           22 items_per_second=1.02096M/s 
null_percent=1 num_groups=32.044k size=32.768k uniqueness=0.0305595
   GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/10         
34213 us        34205 us           21 items_per_second=957.999k/s 
null_percent=10 num_groups=31.015k size=32.768k uniqueness=0.0295782
   GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/2          
40191 us        40179 us           18 items_per_second=815.542k/s 
null_percent=50 num_groups=21.011k size=32.768k uniqueness=0.0200377
   GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/1          
31498 us        31492 us           22 items_per_second=1.04052M/s 
null_percent=100 num_groups=1 size=32.768k uniqueness=953.674n
   GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/0          
28087 us        28081 us           25 items_per_second=1.16689M/s 
null_percent=0 num_groups=32.063k size=32.768k uniqueness=0.0305777
   GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/10000        
14359 us        14356 us           47 items_per_second=2.28251M/s 
null_percent=0.01 num_groups=32.768k size=32.768k uniqueness=0.03125
   GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/100          
14625 us        14622 us           48 items_per_second=2.24103M/s 
null_percent=1 num_groups=32.768k size=32.768k uniqueness=0.03125
   GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/10           
15353 us        15349 us           46 items_per_second=2.13491M/s 
null_percent=10 num_groups=32.765k size=32.768k uniqueness=0.0312471
   GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/2            
18351 us        18347 us           38 items_per_second=1.786M/s null_percent=50 
num_groups=30.796k size=32.768k uniqueness=0.0293694
   GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/1            
16070 us        16067 us           44 items_per_second=2.03942M/s 
null_percent=100 num_groups=1 size=32.768k uniqueness=953.674n
   GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/0            
11155 us        11153 us           63 items_per_second=2.93807M/s 
null_percent=0 num_groups=32.768k size=32.768k uniqueness=0.03125
   ```
   **sorted**
   ```shell
   GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/10000      
32762 us        32749 us           22 items_per_second=1.00057M/s 
null_percent=0.01 num_groups=32.063k size=32.768k uniqueness=0.0305777
   GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/100        
33081 us        33068 us           21 items_per_second=990.941k/s 
null_percent=1 num_groups=32.044k size=32.768k uniqueness=0.0305595
   GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/10         
35627 us        35612 us           19 items_per_second=920.136k/s 
null_percent=10 num_groups=31.015k size=32.768k uniqueness=0.0295782
   GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/2          
41915 us        41892 us           17 items_per_second=782.197k/s 
null_percent=50 num_groups=21.011k size=32.768k uniqueness=0.0200377
   GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/1          
31307 us        31300 us           22 items_per_second=1.04692M/s 
null_percent=100 num_groups=1 size=32.768k uniqueness=953.674n
   GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/0          
28466 us        28460 us           25 items_per_second=1.15138M/s 
null_percent=0 num_groups=32.063k size=32.768k uniqueness=0.0305777
   GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/10000        
14997 us        14993 us           47 items_per_second=2.1856M/s 
null_percent=0.01 num_groups=32.768k size=32.768k uniqueness=0.03125
   GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/100          
14937 us        14933 us           46 items_per_second=2.19428M/s 
null_percent=1 num_groups=32.768k size=32.768k uniqueness=0.03125
   GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/10           
15596 us        15592 us           45 items_per_second=2.10155M/s 
null_percent=10 num_groups=32.765k size=32.768k uniqueness=0.0312471
   GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/2            
18663 us        18658 us           37 items_per_second=1.75624M/s 
null_percent=50 num_groups=30.796k size=32.768k uniqueness=0.0293694
   GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/1            
16221 us        16217 us           43 items_per_second=2.0206M/s 
null_percent=100 num_groups=1 size=32.768k uniqueness=953.674n
   GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/0            
11065 us        11062 us           63 items_per_second=2.96212M/s 
null_percent=0 num_groups=32.768k size=32.768k uniqueness=0.03125
   ```
   
   In other scenarios the columns do need to be sorted, and the grouper 
performs better after sorting.
   **not-sorted**
   ```shell
   GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), 
boolean}"/32768/10000      29802 us        29793 us           23 
items_per_second=1.09987M/s null_percent=0.01 num_groups=32.768k size=32.768k 
uniqueness=0.0625
   GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), 
boolean}"/32768/100        30674 us        30659 us           24 
items_per_second=1.06878M/s null_percent=1 num_groups=32.768k size=32.768k 
uniqueness=0.0625
   GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), 
boolean}"/32768/10         31777 us        31765 us           22 
items_per_second=1.03159M/s null_percent=10 num_groups=32.764k size=32.768k 
uniqueness=0.0624924
   GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), 
boolean}"/32768/2          34246 us        34232 us           21 
items_per_second=957.24k/s null_percent=50 num_groups=30.393k size=32.768k 
uniqueness=0.05797
   GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), 
boolean}"/32768/1          17872 us        17867 us           39 
items_per_second=1.83396M/s null_percent=100 num_groups=1 size=32.768k 
uniqueness=1.90735u
   GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), 
boolean}"/32768/0          28033 us        28017 us           26 
items_per_second=1.16959M/s null_percent=0 num_groups=32.768k size=32.768k 
uniqueness=0.0625
   ```
   
   **sorted**
   ```shell
   GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), 
boolean}"/32768/10000      24494 us        24485 us           28 
items_per_second=1.33829M/s null_percent=0.01 num_groups=32.768k size=32.768k 
uniqueness=0.0625
   GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), 
boolean}"/32768/100        25044 us        25036 us           29 
items_per_second=1.30883M/s null_percent=1 num_groups=32.768k size=32.768k 
uniqueness=0.0625
   GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), 
boolean}"/32768/10         25805 us        25795 us           27 
items_per_second=1.27032M/s null_percent=10 num_groups=32.764k size=32.768k 
uniqueness=0.0624924
   GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), 
boolean}"/32768/2          29097 us        29084 us           24 
items_per_second=1.12666M/s null_percent=50 num_groups=30.393k size=32.768k 
uniqueness=0.05797
   GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), 
boolean}"/32768/1          17850 us        17845 us           39 
items_per_second=1.83627M/s null_percent=100 num_groups=1 size=32.768k 
uniqueness=1.90735u
   GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), 
boolean}"/32768/0          22352 us        22341 us           31 
items_per_second=1.46673M/s null_percent=0 num_groups=32.768k size=32.768k 
uniqueness=0.0625
   ```
   
   In general, in scenarios where sorted mode is better, sorting improves 
grouper performance significantly (around 15%); in scenarios where not-sorted 
mode is better, skipping the sort improves grouper performance by 2%-4%.
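
For anyone who wants to check these percentages, here is a rough sketch that recomputes them from the wall-clock times (first numeric column) in the benchmark output above. It only covers the `{boolean, boolean, utf8, utf8}` and `{utf8, int32, int64, fixed_size_binary(128), boolean}` families, and the helper name `time_reduction_pct` is only for illustration, not something from the PR.

```python
# Wall-clock times in microseconds, copied from the benchmark output above.
# Each list follows the order of the last benchmark argument:
# 10000, 100, 10, 2, 1, 0.
bool_utf8_not_sorted = [31779, 32104, 34213, 40191, 31498, 28087]
bool_utf8_sorted = [32762, 33081, 35627, 41915, 31307, 28466]

mixed_not_sorted = [29802, 30674, 31777, 34246, 17872, 28033]
mixed_sorted = [24494, 25044, 25805, 29097, 17850, 22352]


def time_reduction_pct(baseline_us, candidate_us):
    """Percent of wall time saved by candidate relative to baseline.

    Negative values mean the candidate is slower than the baseline.
    """
    return (baseline_us - candidate_us) / baseline_us * 100.0


# Gain from skipping the sort when the columns are already in order.
not_sorted_gain = [
    time_reduction_pct(s, n)
    for s, n in zip(bool_utf8_sorted, bool_utf8_not_sorted)
]

# Gain from sorting when the columns are not in order.
sorted_gain = [
    time_reduction_pct(n, s)
    for n, s in zip(mixed_not_sorted, mixed_sorted)
]

for label, gains in (("not-sorted gain", not_sorted_gain),
                     ("sorted gain", sorted_gain)):
    print(label, ", ".join(f"{g:.1f}%" for g in gains))
```

These are single runs, so small differences (for example the `null_percent=100` rows, where there is only one group) are probably within noise.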
   
   Do you think we should support not-sorted mode in the grouper for users?
   
   

