NEUpanning commented on issue #10104:
URL: 
https://github.com/apache/incubator-gluten/issues/10104#issuecomment-3027467974

   @marin-ma 
   
   > Could you share other metrics for the ColumnarShuffleExchange operator?
   
   ```
   ColumnarExchange
   
   shuffle records written: 39,407,858,231
   shuffle write time total (min, med, max (stageId: taskId))
   17.86 h (0 ms, 27.4 s, 17.1 m (stage 0.0: task 499))
   time to compress total (min, med, max (stageId: taskId))
   31.52 h (0 ms, 29.6 s, 20.3 m (stage 0.0: task 1191))
   time to split total (min, med, max (stageId: taskId))
   1478.09 h (0 ms, 45.5 m, 1.71 h (stage 0.0: task 86))
   shuffle wall time total (min, med, max (stageId: taskId))
   1527.48 h (0 ms, 46.5 m, 1.92 h (stage 0.0: task 86))
   number of input rows: 39,407,858,231
   time to spill total (min, med, max (stageId: taskId))
   15.77 h (0 ms, 25.2 s, 16.3 m (stage 0.0: task 499))
   shuffle bytes spilled total (min, med, max (stageId: taskId))
   3.0 TiB (0.0 B, 1516.5 MiB, 1944.8 MiB (stage 0.0: task 909))
   number of input batches: 9,788,389
   Native.mergeSpillsTime total (min, med, max (stageId: taskId))
   34.92 h (0 ms, 33.8 s, 23.1 m (stage 0.0: task 1191))
   data size total (min, med, max (stageId: taskId))
   75.6 TiB (0.0 B, 38.8 GiB, 38.9 GiB (stage 0.0: task 1))
   peak bytes allocated total (min, med, max (stageId: taskId))
   7.5 TiB (0.0 B, 5.1 GiB, 5.6 GiB (stage 0.0: task 1418))
   number of partitions: 2,000
   shuffle bytes written total (min, med, max (stageId: taskId))
   3.7 TiB (0.0 B, 1944.5 MiB, 1953.5 MiB (stage 0.0: task 1))
   ```
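A few ratios derived from the numbers above make the picture easier to read. This is only a rough sketch: the constants are copied by hand from the metrics, and the variable names are made up for illustration. It shows that split time accounts for roughly 97% of shuffle wall time, and that the input averages a little over 4,000 rows per batch.

```python
# Figures copied by hand from the ColumnarExchange metrics above.
split_hours = 1478.09          # "time to split" total
wall_hours = 1527.48           # "shuffle wall time" total
input_rows = 39_407_858_231    # "number of input rows"
input_batches = 9_788_389      # "number of input batches"
data_size_tib = 75.6           # "data size" total
bytes_written_tib = 3.7        # "shuffle bytes written" total

# Fraction of the shuffle wall time that is spent in split.
split_share = split_hours / wall_hours

# Average number of rows carried by each input batch.
rows_per_batch = input_rows / input_batches

# Ratio of reported data size to bytes actually written (not a pure
# compression ratio, just the two reported totals divided).
size_ratio = data_size_tib / bytes_written_tib

print(f"split share of shuffle wall time: {split_share:.1%}")
print(f"average rows per input batch: {rows_per_batch:.0f}")
print(f"data size / shuffle bytes written: {size_ratio:.1f}x")
```
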
   
   > You may also try enabling sort-based shuffle and see if it can get better 
performance.
   
   The number of shuffle partitions is only 2000, which does not exceed 
`spark.gluten.sql.columnar.shuffle.sort.partitions.threshold`, so sort-based 
shuffle was not selected automatically. I'll try to force-enable it.
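
The rule as I understand it from this thread can be sketched as below. This is an assumption, not Gluten's actual code: `uses_sort_shuffle` is a hypothetical helper, and `assumed_threshold` is a placeholder value rather than a confirmed default. If the rule really is a plain "greater than" comparison, setting the threshold below the partition count might be one way to force sort-based shuffle.

```python
def uses_sort_shuffle(num_partitions: int, threshold: int) -> bool:
    """Hypothetical helper mirroring the rule described in this thread:
    sort-based shuffle is chosen only when the partition count is
    strictly greater than the configured threshold."""
    return num_partitions > threshold


num_partitions = 2000      # from the "number of partitions" metric
assumed_threshold = 4000   # placeholder, NOT a confirmed default value

# With a threshold above the partition count, the sort path is not taken.
print(uses_sort_shuffle(num_partitions, assumed_threshold))

# Lowering the threshold below the partition count flips the decision.
print(uses_sort_shuffle(num_partitions, num_partitions - 1))
```
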
   
   I also reduced the merge buffer size from 4096 to 2048, which cut the split 
time roughly 30-fold.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

