[
https://issues.apache.org/jira/browse/SPARK-13369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Imran Rashid updated SPARK-13369:
---------------------------------
Description: The previously hardcoded maximum of 4 retries per stage is not
suitable for all cluster configurations. Since Spark retries a stage at the
first sign of a fetch failure, you can easily need many stage retries to
discover all the failures. In particular, two scenarios in which this value
should change are (1) when there are more than 4 executors per node, in which
case it may take more than 4 retries to discover the problem with each executor
on the node, and (2) during cluster maintenance on large clusters, where
multiple machines are serviced at once but total cluster downtime cannot be
afforded. By making this value configurable, cluster operators can tune it to
something more appropriate to their cluster configuration. (was: Currently it
is hardcoded inside the code. We need to make it configurable because for
long-running jobs the chance of fetch failures due to machine reboots is high,
and we need a configuration parameter to bump up that number.)
> Number of consecutive fetch failures for a stage before the job is aborted
> should be configurable
> --------------------------------------------------------------------------------------------------
>
> Key: SPARK-13369
> URL: https://issues.apache.org/jira/browse/SPARK-13369
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 1.6.0
> Reporter: Sital Kedia
> Assignee: Sital Kedia
> Priority: Minor
> Fix For: 2.2.0
>
>
> The previously hardcoded maximum of 4 retries per stage is not suitable for
> all cluster configurations. Since Spark retries a stage at the first sign of
> a fetch failure, you can easily need many stage retries to discover all the
> failures. In particular, two scenarios in which this value should change are
> (1) when there are more than 4 executors per node, in which case it may take
> more than 4 retries to discover the problem with each executor on the node,
> and (2) during cluster maintenance on large clusters, where multiple machines
> are serviced at once but total cluster downtime cannot be afforded. By making
> this value configurable, cluster operators can tune it to something more
> appropriate to their cluster configuration.
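A minimal sketch of how an operator might raise the limit once it is
configurable. The property name `spark.stage.maxConsecutiveAttempts` is an
assumption based on the fix targeted at 2.2.0; verify it against the
configuration reference for your Spark release before relying on it.

```
# spark-defaults.conf -- assumed property name (spark.stage.maxConsecutiveAttempts);
# raises the per-stage fetch-failure retry limit from the old hardcoded 4 to 10,
# e.g. for nodes running more than 4 executors or rolling cluster maintenance.
spark.stage.maxConsecutiveAttempts   10
```

The same value could be passed per job with `spark-submit --conf`, which is
useful when only long-running jobs need the higher tolerance.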
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]