Hi angers.zhu,

Thanks for pointing me towards that PR. I think the main issue there is that
the coalesce operation requires an additional computation, which is
undesirable in this case. It also approximates the row size rather than using
the partition sizes directly, so it has the potential to produce executor
OOMs or a suboptimal partitioning.

I'm thinking of something along the lines of what CoalesceShufflePartitions
<https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/CoalesceShufflePartitions.scala>
does during adaptive query execution. That class uses the `mapStats` from the
shuffle stages to find out the partition sizes, and I'm wondering if we can do
something similar "on demand", i.e. a user could request an adaptively sized
coalesce by calling `dataset.adaptiveCoalesce()` or similar.
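
To make this concrete, here is a rough sketch of the kind of bookkeeping I
have in mind (hypothetical names, not an existing Spark API): given the
per-partition byte sizes that `mapStats` exposes for a shuffle stage, pick a
coalesced partition count by greedily packing contiguous partitions up to a
target size, which is roughly the arithmetic CoalesceShufflePartitions
performs:

```scala
// Hypothetical sketch only; AdaptiveCoalesceSketch and targetNumPartitions
// are made-up names, not part of Spark.
object AdaptiveCoalesceSketch {

  // Given the byte size of each map output partition (as reported by the
  // shuffle's MapOutputStatistics) and a desired post-coalesce partition
  // size, greedily pack contiguous partitions until each group reaches
  // targetBytes and return the number of groups.
  def targetNumPartitions(partitionBytes: Array[Long], targetBytes: Long): Int = {
    var groups = 0
    var currentBytes = 0L
    partitionBytes.foreach { size =>
      if (currentBytes > 0 && currentBytes + size > targetBytes) {
        groups += 1          // close the current group
        currentBytes = 0L
      }
      currentBytes += size
    }
    if (currentBytes > 0) groups += 1  // close the last group
    math.max(groups, 1)
  }
}
```

A `dataset.adaptiveCoalesce()` could then call `coalesce(n)` with the `n`
computed from the actual shuffle statistics instead of making the user guess
a partition count up front.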

Does that make sense? I'm hoping one of the Spark core contributors can weigh
in on the feasibility of this and on whether this is how they would think
about solving the issue described in my first post.


