[
https://issues.apache.org/jira/browse/SPARK-14649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-14649:
---------------------------------
Labels: bulk-closed (was: )
> DagScheduler re-starts all running tasks on fetch failure
> ---------------------------------------------------------
>
> Key: SPARK-14649
> URL: https://issues.apache.org/jira/browse/SPARK-14649
> Project: Spark
> Issue Type: Bug
> Components: Scheduler
> Reporter: Sital Kedia
> Priority: Major
> Labels: bulk-closed
>
> When a fetch failure occurs, the DAGScheduler re-launches the previous stage
> (to re-generate output that was missing), and then re-launches all tasks in
> the stage with the fetch failure that hadn't *completed* when the fetch
> failure occurred (the DAGScheduler re-launches all of the tasks whose output
> data is not available -- which is equivalent to the set of tasks that hadn't
> yet completed).
> The assumption when this code was originally written was that when a fetch
> failure occurred, the output from at least one of the tasks in the previous
> stage was no longer available, so all of the tasks in the current stage would
> eventually fail due to not being able to access that output. This assumption
> does not hold for some large-scale, long-running workloads. E.g., there's
> one use case where a job has ~100k tasks that each run for about 1 hour, and
> only the first 5-10 minutes are spent fetching data. Because of the large
> number of tasks, it's very common to see a few tasks fail in the fetch phase,
> and it's wasteful to re-run other tasks that had already finished fetching
> data and so aren't affected by the fetch failure (and may be most of the way
> through their hour-long execution). The DAGScheduler should not re-start
> these tasks.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)