[
https://issues.apache.org/jira/browse/SPARK-14649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-14649:
---------------------------------
Labels: bulk-closed (was: )
> DagScheduler re-starts all running tasks on fetch failure
> ---------------------------------------------------------
>
> Key: SPARK-14649
> URL: https://issues.apache.org/jira/browse/SPARK-14649
> Project: Spark
> Issue Type: Bug
> Components: Scheduler
> Reporter: Sital Kedia
> Priority: Major
> Labels: bulk-closed
>
> When a fetch failure occurs, the DAGScheduler re-launches the previous stage
> (to re-generate output that was missing), and then re-launches all tasks in
> the stage with the fetch failure that hadn't *completed* when the fetch
> failure occurred (the DAGScheduler re-launches all of the tasks whose output
> data is not available -- which is equivalent to the set of tasks that hadn't
> yet completed).
> The assumption when this code was originally written was that when a fetch
> failure occurred, the output from at least one of the tasks in the previous
> stage was no longer available, so all of the tasks in the current stage would
> eventually fail due to not being able to access that output. This assumption
> does not hold for some large-scale, long-running workloads. E.g., there's
> one use case where a job has ~100k tasks that each run for about 1 hour, and
> only the first 5-10 minutes are spent fetching data. Because of the large
> number of tasks, it's very common to see a few tasks fail in the fetch phase,
> and it's wasteful to re-run other tasks that had already finished fetching
> data and so aren't affected by the fetch failure (and may be most of the way
> through their hour-long execution). The DAGScheduler should not re-start
> these tasks.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)