[ https://issues.apache.org/jira/browse/SPARK-13369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15930063#comment-15930063 ]

Imran Rashid commented on SPARK-13369:
--------------------------------------

Thanks for fixing this [~sitalke...@gmail.com].

I just noticed that earlier in this ticket there was a discussion about the 
need to set this config for streaming.  I don't believe that is true; the way 
this works, it should actually be fine for occasional fetch failures in a 
long-lived streaming job.  The maximum number of fetch failures is tracked per 
stage, and the count is reset when the stage runs successfully.  Can you 
explain why you'd need to modify this config for a streaming job?
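
To make that counting behavior concrete, here is a minimal sketch (the names 
are illustrative only, not the actual DAGScheduler code):

    // Illustrative sketch only -- the real logic lives in Spark's DAGScheduler.
    class StageFailureTracker(maxConsecutiveAttempts: Int = 4) {
      private var consecutiveFailures = 0

      // Called when a stage attempt fails with a fetch failure.
      def onFetchFailure(stageId: Int): Unit = {
        consecutiveFailures += 1
        if (consecutiveFailures >= maxConsecutiveAttempts) {
          throw new IllegalStateException(
            s"Aborting job: stage $stageId failed $consecutiveFailures consecutive times")
        }
      }

      // Called when the stage runs successfully: the counter resets, which is
      // why occasional failures in a long-lived streaming job never add up to
      // an abort unless they are consecutive.
      def onStageSuccess(): Unit = {
        consecutiveFailures = 0
      }
    }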

(The large cluster case at Facebook makes sense to me, as we discussed on the 
PR, and I updated the JIRA description accordingly.)

>  Number of consecutive fetch failures for a stage before the job is aborted 
> should be configurable
> --------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-13369
>                 URL: https://issues.apache.org/jira/browse/SPARK-13369
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.6.0
>            Reporter: Sital Kedia
>            Assignee: Sital Kedia
>            Priority: Minor
>             Fix For: 2.2.0
>
>
> The previously hardcoded maximum of 4 retries per stage is not suitable for 
> all cluster configurations. Since Spark retries a stage at the first sign of 
> a fetch failure, you can easily end up needing many stage retries before all 
> the failures are discovered. In particular, two scenarios in which this 
> value should change are: (1) when there are more than 4 executors per node, 
> since it may take more than 4 retries to discover the problems with every 
> executor on a bad node, and (2) during cluster maintenance on large 
> clusters, where multiple machines are serviced at once but total cluster 
> downtime cannot be afforded. Making this value configurable lets cluster 
> operators tune it to something appropriate for their cluster configuration.
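
To illustrate the description above: assuming the configuration key added by 
this change is spark.stage.maxConsecutiveAttempts (please verify the name 
against the merged patch), an operator on a large cluster could raise the 
limit like this:

    // Sketch: the config key name here is my assumption; check the merged patch.
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("large-cluster-job")
      // Allow up to 8 consecutive failed attempts of a stage before aborting
      // the job, e.g. for nodes running more than 4 executors.
      .set("spark.stage.maxConsecutiveAttempts", "8")

The same setting could also be passed on the command line, e.g. 
--conf spark.stage.maxConsecutiveAttempts=8 with spark-submit.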


