Github user kayousterhout commented on the issue:
https://github.com/apache/spark/pull/16620
@squito and @jinxing64 You're right -- with the existing code, if a task
from an old attempt succeeded *and* didn't run on an executor where things
already failed, the DAGScheduler will count the result (just realizing this
based on [this
if-statement](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1189)).
That being said, I think this behavior is broken, because it leads to
inconsistent state between the DAGScheduler (which thinks the stage is done and
submits the next ones) and the TaskSetManager for the most recent version of
the stage (which is still waiting on the more recent versions of the tasks to
complete). When the TaskSetManager for the most recent version of the stage
finishes all of its tasks, it will tell the DAGScheduler -- again -- that the
stage has finished, causing the DAGScheduler to update the finish time for the
stage and send another (duplicate) SparkListenerStageCompleted message to the
listeners (I think this will result in stages in the UI that appear to be
finished yet still have running tasks), and re-update the outputs for the map
stage. None of these things are obviously buggy (from a cursory look) but they
violate a bunch of invariants in the scheduler, and I wouldn't be surprised if
there were bugs lurking in this code path. Given the amount of debugging
and reviewer time that gets dedicated to these subtle bugs, I'm in favor of the
simpler solution that maintains consistent state between the DAGScheduler and
TaskSetManager.
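To make the trade-off concrete, here is a minimal, hypothetical sketch of the "only count results from the latest attempt" approach I'm describing -- this is not the actual DAGScheduler code, and `StageState`, `TaskSuccess`, and `handleTaskCompletion` are made-up names used purely for illustration:

```scala
// Sketch: the stage-level bookkeeping only accepts successes from the latest
// attempt, so it can never declare the stage finished while the newest
// TaskSetManager still has tasks outstanding.
case class TaskSuccess(stageId: Int, stageAttemptId: Int, partition: Int)

class StageState(val stageId: Int, numPartitions: Int) {
  var latestAttemptId: Int = 0
  private val finishedPartitions = scala.collection.mutable.BitSet()

  // Called when a FetchFailure triggers a resubmit; a new TaskSetManager
  // owns the new attempt, and only its results are counted below.
  def newAttempt(): Unit = {
    latestAttemptId += 1
  }

  // Returns true iff the stage is complete according to the latest attempt.
  def handleTaskCompletion(event: TaskSuccess): Boolean = {
    if (event.stageAttemptId != latestAttemptId) {
      // Ignore successes from superseded attempts: accepting them is what
      // lets the stage look "finished" while the newest TaskSetManager is
      // still running, producing duplicate stage-completed events.
      false
    } else {
      finishedPartitions += event.partition
      finishedPartitions.size == numPartitions
    }
  }
}

object Demo extends App {
  val stage = new StageState(stageId = 1, numPartitions = 2)
  stage.newAttempt()                                        // resubmitted after a FetchFailure
  println(stage.handleTaskCompletion(TaskSuccess(1, 0, 0))) // false: stale attempt-0 result ignored
  println(stage.handleTaskCompletion(TaskSuccess(1, 1, 0))) // false: latest attempt, 1 of 2 done
  println(stage.handleTaskCompletion(TaskSuccess(1, 1, 1))) // true: latest attempt finished the stage
}
```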
@squito where has this behavior been argued against in the past? My
understanding is that a bunch of the scheduler code is based on an assumption
that once some tasks in a stage fail with a FetchFailure, we ignore future
successes from that stage because it makes the code much simpler (it's also
hard, in some cases, to know whether the successes are "real", or delayed
messages from machines that later failed). There was a bigger effort to fix
that issue in [SPARK-14649](https://issues.apache.org/jira/browse/SPARK-14649),
but there were a bunch of subtleties in getting that right, so for now effort
on that has stopped. If someone wants to re-start the effort on that, it seems
useful, but I think it should be de-coupled from fixing this bug.