cloud-fan commented on code in PR #45234:
URL: https://github.com/apache/spark/pull/45234#discussion_r1575677485
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala:
##########
@@ -51,13 +51,30 @@ abstract class QueryStageExec extends LeafExecNode {
*/
val plan: SparkPlan
+ /**
+ * Name of this query stage which is unique in the entire query plan.
+ */
+ val name: String = s"${this.getClass.getSimpleName}-$id"
+
+ /**
+ * This flag aims to detect if the stage materialization is started. This helps
+ * to avoid unnecessary stage materialization when the stage is canceled.
+ */
+ private val materializationStarted = new AtomicBoolean()
Review Comment:
Sorry for the last-minute proposal, but I'm wondering if it's more efficient
to push this cancelation optimization into the shuffle and broadcast nodes.
It looks a bit fragile to operate on the `shuffleFuture` directly in
`ShuffleQueryStageExec.cancel`. I think we should let `ShuffleExchangeLike`
provide clear APIs to do it. Today it provides `submitShuffleJob`, and it
should also provide `cancelShuffleJob`.
Within `ShuffleExchangeLike`, we can do more optimizations. For example, even
if the shuffle stage is canceled after it has been submitted, we can still
avoid submitting the shuffle job, because the shuffle node might still be doing
other preparation work: generating the shuffle dependency, waiting for
subqueries to finish, etc. It's more efficient to check the `isCanceled` flag
at the last minute, right before submitting the shuffle job.
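
A minimal sketch of the idea, not the actual Spark code: `SketchShuffleNode`, `prepare`, `runJob`, and `wasJobSubmitted` are hypothetical names used only for illustration, and the real `ShuffleExchangeLike` would return a `Future[MapOutputStatistics]` and integrate with the scheduler. It only shows the cancel flag being checked right before the job is submitted, after the preparation work has run.

```scala
import scala.concurrent.Future

// Hypothetical stand-in for a shuffle node; NOT Spark's ShuffleExchangeLike.
// `prepare` models the pre-submission work (building the shuffle dependency,
// waiting for subqueries); `runJob` models the actual job submission.
class SketchShuffleNode(prepare: () => Unit, runJob: () => Future[String]) {
  private var canceled = false
  private var jobSubmitted = false

  def submitShuffleJob(): Future[String] = {
    // Preparation runs outside the lock, so a cancel can arrive meanwhile.
    prepare()
    synchronized {
      // Last-minute check: skip the job if the stage was canceled by now.
      if (canceled) {
        Future.failed(new InterruptedException("canceled before job submission"))
      } else {
        jobSubmitted = true
        runJob()
      }
    }
  }

  // Only sets the flag in this sketch; a real implementation would also
  // have to cancel a job that is already running.
  def cancelShuffleJob(): Unit = synchronized { canceled = true }

  def wasJobSubmitted: Boolean = synchronized { jobSubmitted }
}
```

With this shape, `ShuffleQueryStageExec.cancel` would only need to call `cancelShuffleJob()` and would not have to touch the future directly.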