marin-ma commented on issue #8833:
URL: https://github.com/apache/incubator-gluten/issues/8833#issuecomment-2690158829

   @boneanxs Thanks for the update. Sort-based shuffle is a new feature added 
in a recent Gluten release, and it is disabled by default unless users 
explicitly enable it. 
   
   We benchmarked hash-based vs. sort-based shuffle on evenly partitioned 
data and observed that sort-based shuffle only improves performance when the 
number of columns is greater than 8 or the number of partitions exceeds 8K. 
However, the case you provided (only 2 columns and 100 partitions) appears to 
involve data skew, so our benchmark values do not apply to it. Perhaps we need 
to run some tests on skewed data and update the documentation with suggested 
values. cc: @FelixYBW 
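   
   As a rough illustration of the benchmark rule of thumb above, here is a 
minimal sketch. The function and constant names are hypothetical and not part 
of Gluten, "8K" is assumed to mean 8,000, and the rule only reflects evenly 
partitioned data, so it says nothing about skewed inputs.

```python
# Hypothetical helper illustrating the benchmark thresholds quoted above.
# Not a Gluten API; the names and the reading of "8K" as 8,000 are assumptions.
COLUMN_THRESHOLD = 8
PARTITION_THRESHOLD = 8000  # assumption: "8K" read as 8,000


def sort_shuffle_may_help(num_columns: int, num_partitions: int) -> bool:
    """Return True if the evenly-partitioned benchmarks suggest a benefit.

    Sort-based shuffle only improved performance when the column count was
    greater than 8 or the partition count exceeded 8K.
    """
    return num_columns > COLUMN_THRESHOLD or num_partitions > PARTITION_THRESHOLD
```

   For the reported case, `sort_shuffle_may_help(2, 100)` evaluates to 
`False`, which is why the benchmark values alone would not have suggested 
sort-based shuffle there.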
   
   Note that the configuration enables sort-based shuffle across the entire 
job. Currently, it cannot be controlled for individual shuffles.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


