[ 
https://issues.apache.org/jira/browse/SPARK-20178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16020166#comment-16020166
 ] 

Josh Rosen edited comment on SPARK-20178 at 5/22/17 8:54 PM:
-------------------------------------------------------------

Sure, let me clarify:

When a FetchFailure occurs, the DAGScheduler receives a fetch failure message 
of the form {{FetchFailed(bmAddress, shuffleId, mapId, reduceId, 
failureMessage)}}. As of today's Spark master branch, the DAGScheduler handles 
this failure by marking that individual output as unavailable and by marking 
all outputs on that executor as unavailable (see 
https://github.com/apache/spark/blob/9b09101938399a3490c3c9bde9e5f07031140fdf/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1339).
 

As a shorthand, let's call the current behavior {{remove(shuffleId, mapId)}} 
followed by {{remove(blockManagerId)}}. My proposal was to replace this with 
{{remove(shuffleId, blockManagerId)}}, removing all outputs from the 
fetch-failed shuffle on that block manager.
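To make the shorthand concrete, here is a toy Scala sketch of the three removal 
granularities. The data model and names are illustrative only, not the actual 
DAGScheduler/MapOutputTracker code:

```scala
// Toy model: a map output is identified by (shuffleId, mapId) and lives on a
// block manager. All names here are illustrative, not real Spark internals.
case class MapOutput(shuffleId: Int, mapId: Int, blockManagerId: String)

// Current behavior, step 1: remove the single output that failed to fetch.
def removeOutput(os: Set[MapOutput], shuffleId: Int, mapId: Int): Set[MapOutput] =
  os.filterNot(o => o.shuffleId == shuffleId && o.mapId == mapId)

// Current behavior, step 2: remove every output on the failed block manager,
// across all shuffles.
def removeBlockManager(os: Set[MapOutput], bm: String): Set[MapOutput] =
  os.filterNot(_.blockManagerId == bm)

// Proposed behavior: remove only the fetch-failed shuffle's outputs on that
// block manager, leaving other shuffles' outputs there intact.
def removeShuffleOnBlockManager(os: Set[MapOutput], shuffleId: Int,
    bm: String): Set[MapOutput] =
  os.filterNot(o => o.shuffleId == shuffleId && o.blockManagerId == bm)
```

Under the proposed behavior, a fetch failure in shuffle 0 on block manager 
{{bm-A}} would no longer invalidate shuffle 1's outputs on {{bm-A}}.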

{quote}
I think this is basically what you are proposing except waiting for a 
configurable amount of failures rather then doing it immediately. Thoughts?
{quote}

My understanding of today's code is that a single FetchFailed task will trigger 
a stage failure and a retry of the parent stage, and that the task which 
experienced the fetch failure will not be re-attempted within the same task set 
that scheduled it. 
I'm basing this off the comment at 
https://github.com/apache/spark/blob/9b09101938399a3490c3c9bde9e5f07031140fdf/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L77
 and the code at 
https://github.com/apache/spark/blob/9b09101938399a3490c3c9bde9e5f07031140fdf/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L770
 where the TSM prevents re-attempts of FetchFailed tasks.
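A minimal sketch of that TSM behavior (illustrative only, not actual Spark 
code; the reason types here are simplified stand-ins):

```scala
// Illustrative sketch: a task that ends with FetchFailed is not re-queued
// within its task set; recovery instead happens via a stage re-run driven by
// the DAGScheduler, while ordinary failures are retried at the task level.
sealed trait TaskEndReason
case object Success extends TaskEndReason
case class FetchFailed(bmAddress: String) extends TaskEndReason
case class ExceptionFailure(msg: String) extends TaskEndReason

def retryWithinTaskSet(reason: TaskEndReason): Boolean = reason match {
  case Success             => false // nothing to retry
  case FetchFailed(_)      => false // stage-level retry, not task-level retry
  case ExceptionFailure(_) => true  // retried up to spark.task.maxFailures
}
```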



> Improve Scheduler fetch failures
> --------------------------------
>
>                 Key: SPARK-20178
>                 URL: https://issues.apache.org/jira/browse/SPARK-20178
>             Project: Spark
>          Issue Type: Epic
>          Components: Scheduler
>    Affects Versions: 2.1.0
>            Reporter: Thomas Graves
>
> We have been having a lot of discussions around improving the handling of 
> fetch failures. There are 4 JIRAs currently related to this: 
> SPARK-20163, SPARK-20091, SPARK-14649, and SPARK-19753.
> We should try to get a list of things we want to improve and come up with one 
> cohesive design.
> I will put my initial thoughts in a follow-on comment.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
