NEUpanning commented on issue #10104: URL: https://github.com/apache/incubator-gluten/issues/10104#issuecomment-3027467974
@marin-ma

> Could you share other metrics for the ColumnarShuffleExchange operator?

```
ColumnarExchange
shuffle records written: 39,407,858,231
shuffle write time total (min, med, max (stageId: taskId))
17.86 h (0 ms, 27.4 s, 17.1 m (stage 0.0: task 499))
time to compress total (min, med, max (stageId: taskId))
31.52 h (0 ms, 29.6 s, 20.3 m (stage 0.0: task 1191))
time to split total (min, med, max (stageId: taskId))
1478.09 h (0 ms, 45.5 m, 1.71 h (stage 0.0: task 86))
shuffle wall time total (min, med, max (stageId: taskId))
1527.48 h (0 ms, 46.5 m, 1.92 h (stage 0.0: task 86))
number of input rows: 39,407,858,231
time to spill total (min, med, max (stageId: taskId))
15.77 h (0 ms, 25.2 s, 16.3 m (stage 0.0: task 499))
shuffle bytes spilled total (min, med, max (stageId: taskId))
3.0 TiB (0.0 B, 1516.5 MiB, 1944.8 MiB (stage 0.0: task 909))
number of input batches: 9,788,389
Native.mergeSpillsTime total (min, med, max (stageId: taskId))
34.92 h (0 ms, 33.8 s, 23.1 m (stage 0.0: task 1191))
data size total (min, med, max (stageId: taskId))
75.6 TiB (0.0 B, 38.8 GiB, 38.9 GiB (stage 0.0: task 1))
peak bytes allocated total (min, med, max (stageId: taskId))
7.5 TiB (0.0 B, 5.1 GiB, 5.6 GiB (stage 0.0: task 1418))
number of partitions: 2,000
shuffle bytes written total (min, med, max (stageId: taskId))
3.7 TiB (0.0 B, 1944.5 MiB, 1953.5 MiB (stage 0.0: task 1))
```

> You may also try enabling sort-based shuffle and see if it can get better performance.

The number of shuffle partitions is only 2000, which is not greater than `spark.gluten.sql.columnar.shuffle.sort.partitions.threshold`. I'll try to force enable it.

And I adjusted merge buffer size from 4096 to 2048, resulting in a ~30-fold reduction in split time.
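For a rough sense of where the time goes, here is a back-of-the-envelope check on the totals reported above. It is only a sketch: the variable names are illustrative, the figures are copied from the ColumnarExchange metrics, and the ratios are plain arithmetic on those totals rather than anything Gluten reports itself.

```python
# Totals copied from the ColumnarExchange metrics in this comment.
# Variable names are illustrative, not Gluten or Spark identifiers.
split_h = 1478.09             # "time to split", total hours
wall_h = 1527.48              # "shuffle wall time", total hours
input_rows = 39_407_858_231   # "number of input rows"
input_batches = 9_788_389     # "number of input batches"
data_size_tib = 75.6          # "data size", total TiB
bytes_written_tib = 3.7       # "shuffle bytes written", total TiB

# Share of the shuffle wall time spent in split.
split_share = split_h / wall_h

# Average number of rows per input batch.
rows_per_batch = input_rows / input_batches

# Ratio of reported data size to bytes actually written.
size_ratio = data_size_tib / bytes_written_tib

print(f"split share of shuffle wall time: {split_share:.1%}")
print(f"average rows per input batch: {rows_per_batch:.0f}")
print(f"data size / shuffle bytes written: {size_ratio:.1f}x")
```

On these totals, split accounts for nearly all of the shuffle wall time, which is why the split-time reduction matters far more here than compress, write, or spill time.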
