cloud-fan commented on code in PR #44321:
URL: https://github.com/apache/spark/pull/44321#discussion_r1429775174
##########
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala:
##########
@@ -1086,7 +1133,7 @@ private[spark] class TaskSetManager(
addPendingTask(index)
// Tell the DAGScheduler that this task was resubmitted so that it doesn't think our
// stage finishes when a total of tasks.size tasks finish.
- sched.dagScheduler.taskEnded(
+ emptyTaskInfoAccumulablesAndNotifyDagScheduler(tid,
    tasks(index), Resubmitted, null, Seq.empty, Array.empty, info)
Review Comment:
TBH, I think this API is not well-designed and we don't have a clear
definition of it anyway. @JoshRosen's comment explains the details: if a task is
still running, listeners won't get the `Resubmitted` event, only the task failure
event. The `Resubmitted` event is issued only when the task has already completed
and its map output is lost. In that case, I think the current behavior of sending
the task info of the previously completed task is a bit weird. It's likely not by
design but just a coincidence. Why would people care about the previously completed
task's metrics when Spark resubmits it? Even if they do, they can get those
metrics from the previous task completion event.
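To make it concrete, here is a rough sketch (not code from this PR) of the kind of
listener that would even notice the change: one that reads the accumulables off the
`TaskInfo` attached to the `Resubmitted` task-end event. The class name is made up;
only the `SparkListener` / `SparkListenerTaskEnd` / `Resubmitted` APIs are real.

```scala
import org.apache.spark.Resubmitted
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Sketch of a listener that inspects the accumulables of the TaskInfo carried
// by the Resubmitted task-end event, which today is the info of the previously
// completed task attempt.
class ResubmitAwareListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    taskEnd.reason match {
      case Resubmitted =>
        // With this PR, these accumulables would be emptied (unless the legacy
        // flag is enabled), which is exactly what the migration-guide note
        // would have to call out.
        val accums = taskEnd.taskInfo.accumulables
        println(s"Resubmitted task ${taskEnd.taskInfo.taskId} had ${accums.size} accumulables")
      case _ =>
        // Success or ordinary failure: full task info is still available.
    }
  }
}
```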
To me, the memory issue matters more to most users, and listeners that would be
broken by this change should be very rare. How about we document it in the
migration guide and ask people to set the legacy flag to restore the old
behavior?
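For the migration-guide route, the note could include something like the snippet
below; the config key here is purely a placeholder, since the actual legacy flag
name is defined by this PR rather than spelled out in this comment.

```scala
import org.apache.spark.SparkConf

// Placeholder key only: substitute the real legacy flag introduced by this PR.
val conf = new SparkConf()
  .setAppName("keep-old-resubmitted-task-info")
  .set("spark.scheduler.legacyResubmittedTaskInfo.enabled", "true")
```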