Hi angers.zhu, Thanks for pointing me towards that PR. I think the main issue there is that the coalesce operation requires an additional computation, which in this case is undesirable. It also approximates the row size rather than using the partition sizes directly, so it has the potential to produce executor OOMs or suboptimal partitionings.
I'm thinking of something along the lines of what CoalesceShufflePartitions <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/CoalesceShufflePartitions.scala> does during adaptive query execution. That rule just uses the `mapStats` from the shuffle stages to find out the partition sizes, and I'm wondering whether we could do something similar "on demand", i.e. a user could request an adaptively computed coalesce by calling something like `dataset.adaptiveCoalesce()`. There is a rough sketch of the idea below.

Does that make sense? I'm hoping one of the Spark core contributors can weigh in on the feasibility of this, and on whether this is how they would think about solving the issue described in my first post.
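To make the idea concrete, here is a simplified sketch of the kind of grouping logic I have in mind, loosely modelled on what CoalesceShufflePartitions does with the map output sizes. It is plain Scala with made-up names (`mergePartitionSpecs`, `targetSize`) and does not touch any real Spark internals; it only illustrates packing shuffle partitions by their byte sizes:

```scala
object AdaptiveCoalesceSketch {
  // Given the byte size of each shuffle partition (as mapStats would report),
  // greedily pack adjacent partitions into groups whose total size stays
  // close to targetSize. Each returned pair is an inclusive index range
  // (start, end) of the original partitions that one coalesced partition
  // would read.
  def mergePartitionSpecs(partitionSizes: Array[Long], targetSize: Long): Seq[(Int, Int)] = {
    val specs = scala.collection.mutable.ArrayBuffer.empty[(Int, Int)]
    var start = 0
    var currentBytes = 0L
    partitionSizes.indices.foreach { i =>
      // Close the current group if adding this partition would exceed the target.
      if (i > start && currentBytes + partitionSizes(i) > targetSize) {
        specs += ((start, i - 1))
        start = i
        currentBytes = 0L
      }
      currentBytes += partitionSizes(i)
    }
    if (partitionSizes.nonEmpty) specs += ((start, partitionSizes.length - 1))
    specs.toSeq
  }

  def main(args: Array[String]): Unit = {
    // Example: 6 shuffle partitions with skewed sizes, aiming for ~64 MB each.
    val sizes = Array(10L, 30L, 70L, 5L, 5L, 40L).map(_ * 1024 * 1024)
    val target = 64L * 1024 * 1024
    println(mergePartitionSpecs(sizes, target))
    // Vector((0,1), (2,2), (3,5)) -> 3 coalesced partitions
  }
}
```

A real implementation would of course have to obtain the per-partition byte sizes from the shuffle's MapOutputStatistics and turn each range into the appropriate shuffle-read spec; the sketch only shows the grouping step, which is the part that avoids guessing row sizes.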