Github user markhamstra commented on the issue:
https://github.com/apache/spark/pull/16620
The way that I am thinking about this right now is that @kayousterhout is
on the right track with the early return at
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1141
, but that her proposed `stage...attemptId != task.stageAttemptId` is broader
than it needs to be. My idea is that we want to be throwing away task results
from earlier attempts that were run on executors that failed (on the
presumption that one fetch failure means that other fetches from there are also
going to fail), but that if the executor didn't fail, then the outputs from
earlier attempts of tasks that complete late but successfully on still-good
executors should still be valid and available, so we should accept them as
though they were successful task completions for the current attempt.
What you end up with is that if-statement now looking like:
```scala
val stageHasBeenCancelled = !stageIdToStage.contains(task.stageId)
val shuffleMapTaskIsFromFailedExecutor = task match {
case smt: ShuffleMapTask =>
val status = event.result.asInstanceOf[MapStatus]
val execId = status.location.executorId
failedEpoch.contains(execId) && smt.epoch <= failedEpoch(execId)
case _ => false
}
if (stageHasBeenCancelled || shuffleMapTaskIsFromFailedExecutor) {
return
}
```
...and then the `failedEpoch.contains(execId) && smt.epoch <=
failedEpoch(execId)` check can be removed from `case smt: ShuffleMapTask =>`.
If we can do it cleanly, I think we should be avoiding re-running Tasks
that complete successfully and should still be available. This is a bit
different from the intent of SPARK-14649, which I am reading as an effort not
to ignore the results of long-running tasks that start and eventually complete
on an executor on which some other tasks actually run into fetch failures. I'm
really only trying to preserve the results of successful tasks run on executors
that haven't failed.
Unfortunately, the DAGSchedulerSuite doesn't agree with my intentions,
because the above change actually leads to multiple test failures.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]