[ 
https://issues.apache.org/jira/browse/SPARK-13369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-13369:
---------------------------------
    Description: The previously hardcoded maximum of 4 retries per stage is not 
suitable for all cluster configurations. Since Spark retries a stage at the 
first sign of a fetch failure, you can easily end up needing many stage retries 
to discover all the failures. In particular, two scenarios in which this value 
should change are: (1) when there are more than 4 executors per node, since 
discovering the problem with each executor on the node may take more retries 
than the limit of 4 allows, and (2) during cluster maintenance on large 
clusters, where multiple machines are serviced at once but total cluster 
downtime cannot be afforded. By making this value configurable, cluster 
operators can tune it to something appropriate for their cluster 
configuration.  (was: Currently it is hardcoded in the code. We need to make it 
configurable because for long-running jobs the chance of fetch failures due to 
machine reboots is high, and we need a configuration parameter to bump up that 
number.)
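
As a sketch of how an operator might use this once configurable: assuming the 
key ultimately added by this fix is spark.stage.maxConsecutiveAttempts (the 
name documented in Spark 2.2.0, default 4), the limit could be raised per job 
at submit time. The job class and jar below are placeholders.

```shell
# Sketch: raise the consecutive-fetch-failure limit for a single job.
# Assumes the configuration key is spark.stage.maxConsecutiveAttempts
# (default 4); check your Spark version's configuration documentation.
# com.example.MyJob and myjob.jar are hypothetical.
spark-submit \
  --conf spark.stage.maxConsecutiveAttempts=10 \
  --class com.example.MyJob \
  myjob.jar
```

The same key can also be set programmatically via SparkConf before the 
SparkContext is created.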

>  Number of consecutive fetch failures for a stage before the job is aborted 
> should be configurable
> --------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-13369
>                 URL: https://issues.apache.org/jira/browse/SPARK-13369
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.6.0
>            Reporter: Sital Kedia
>            Assignee: Sital Kedia
>            Priority: Minor
>             Fix For: 2.2.0
>
>
> The previously hardcoded maximum of 4 retries per stage is not suitable for 
> all cluster configurations. Since Spark retries a stage at the first sign of 
> a fetch failure, you can easily end up needing many stage retries to discover 
> all the failures. In particular, two scenarios in which this value should 
> change are: (1) when there are more than 4 executors per node, since 
> discovering the problem with each executor on the node may take more retries 
> than the limit of 4 allows, and (2) during cluster maintenance on large 
> clusters, where multiple machines are serviced at once but total cluster 
> downtime cannot be afforded. By making this value configurable, cluster 
> operators can tune it to something appropriate for their cluster 
> configuration.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
