marin-ma commented on issue #8833: URL: https://github.com/apache/incubator-gluten/issues/8833#issuecomment-2690158829
@boneanxs Thanks for the update. Sort-based shuffle is a new feature added in a recent Gluten release, and it stays disabled unless users explicitly configure it.

We benchmarked hash vs. sort shuffle on evenly partitioned data and observed that sort-based shuffle only improves performance when the number of columns is greater than 8 or the number of partitions exceeds 8K. However, the case you provided appears to involve data skew, so our benchmark values do not apply to it (only 2 columns and 100 partitions). We probably need to run some tests on skewed data and update the documentation with suggested values. cc: @FelixYBW

Note that the configuration enables sort-based shuffle across the entire job. Currently, it cannot be controlled for individual shuffles.
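For illustration only, the rule of thumb from the benchmark could be sketched as below. `suggest_shuffle` is a hypothetical helper, not a Gluten API or configuration key, and reading "8K" as 8000 is an assumption. The thresholds come from evenly partitioned data, so they may not hold for skewed inputs.

```python
# Thresholds quoted in the comment above; measured on evenly partitioned data.
COLUMN_THRESHOLD = 8
PARTITION_THRESHOLD = 8000  # assumption: "8K" is read as 8000


def suggest_shuffle(num_columns: int, num_partitions: int) -> str:
    """Hypothetical helper: return "sort" when either benchmark threshold is exceeded."""
    if num_columns > COLUMN_THRESHOLD or num_partitions > PARTITION_THRESHOLD:
        return "sort"
    return "hash"


# The case from this issue: 2 columns and 100 partitions.
print(suggest_shuffle(2, 100))  # hash
```

Because the setting applies to the whole job, a check like this would have to consider the most demanding shuffle in the job rather than each shuffle separately.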
