cloud-fan commented on code in PR #37451:
URL: https://github.com/apache/spark/pull/37451#discussion_r948649576
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/CoalesceShufflePartitions.scala:
##########
@@ -163,6 +163,16 @@ case class CoalesceShufflePartitions(session: SparkSession) extends AQEShuffleRe
}.getOrElse(plan)
case other => other.mapChildren(updateShuffleReads(_, specsMap))
}
+
+ // Adjust advisory partition size in bytes based on following operators.
Review Comment:
I agree that this rule is a bit too simple: it assumes the final data size
equals the sum of all leaf shuffle data sizes, which is not true in many cases,
e.g. with Join, Expand, or Aggregate.
I think we should fix this assumption instead of adjusting the advisory
size. We should transform the query plan and understand the lineage between the
leaf shuffles and the final result. The goal is to make the result partition
size 64 MB, so we need an algorithm that determines the target shuffle
partition size for each leaf shuffle node, taking into account the special
nodes in the query plan (Join, Expand, Aggregate, etc.)
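To illustrate the idea (this is a hypothetical sketch, not Spark's actual plan API): propagate the 64 MB target from the root of the plan down to each leaf shuffle, dividing by each operator's estimated output/input size ratio. The `Node`, `LeafShuffle`, and `Op` types below are toy stand-ins, and the `ratio` values are assumed estimates (e.g. >1 for Expand, <1 for Aggregate).

```scala
// Toy plan tree; `ratio` is the operator's estimated output-size / input-size
// factor (not a real Spark API, just a model of the lineage idea above).
sealed trait Node
case class LeafShuffle(name: String) extends Node
case class Op(ratio: Double, children: Seq[Node]) extends Node

def targetSizes(node: Node, target: Long): Map[String, Long] = node match {
  case LeafShuffle(name) => Map(name -> target)
  case Op(ratio, children) =>
    // If this operator grows the data by `ratio`, each upstream shuffle
    // should aim for a proportionally smaller partition size so that the
    // final partitions still land near `target`, and vice versa.
    val upstream = (target / ratio).toLong
    children.flatMap(c => targetSizes(c, upstream)).toMap
}

val plan = Op(1.0, Seq(             // size-preserving final projection
  Op(4.0, Seq(LeafShuffle("s1"))),  // an Expand that quadruples the data
  Op(0.5, Seq(LeafShuffle("s2")))   // an Aggregate that halves the data
))
val targets = targetSizes(plan, 64L * 1024 * 1024)
// s1 targets 16 MB, s2 targets 128 MB
```

A real implementation would of course derive the ratios from statistics and handle multi-child nodes like joins, where the contribution of each side to the output size differs.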
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]