[ https://issues.apache.org/jira/browse/SPARK-12411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Or updated SPARK-12411:
------------------------------
Fix Version/s: 1.6.1
> Reconsider executor heartbeats rpc timeout
> ------------------------------------------
>
> Key: SPARK-12411
> URL: https://issues.apache.org/jira/browse/SPARK-12411
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Nong Li
> Assignee: Nong Li
> Fix For: 1.6.1, 2.0.0
>
>
> Currently, the timeout for detecting that an executor has failed is the same
> as the sender's RPC timeout ("spark.network.timeout"), which defaults to 120s.
> This means that if there is a network issue, the executor gets only one try to
> heartbeat, which probably makes failure detection flaky.
> The executor has a config that controls how often it heartbeats
> ("spark.executor.heartbeatInterval"), which defaults to 10s. This combination
> of configs doesn't seem to make sense. The heartbeat RPC timeout should
> probably be less than or equal to the heartbeatInterval.
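The arithmetic behind the "only one try" observation can be sketched as follows. This is a simplified model, not Spark's actual heartbeat code: the function name heartbeat_attempts is hypothetical, and it assumes the worst case in which every heartbeat RPC blocks until its timeout expires and the next attempt starts no sooner than one heartbeat interval after the previous one.

```python
def heartbeat_attempts(detection_timeout_s: int,
                       rpc_timeout_s: int,
                       interval_s: int) -> int:
    """Worst-case number of heartbeat attempts that fit in the detection window.

    Hypothetical model: each attempt occupies max(rpc_timeout, interval)
    seconds, because a hung RPC blocks until its timeout and attempts are
    spaced at least one interval apart.
    """
    per_attempt_s = max(rpc_timeout_s, interval_s)
    return detection_timeout_s // per_attempt_s


# Defaults described in the issue: the failure-detection timeout and the
# heartbeat RPC timeout are both spark.network.timeout (120s), and
# spark.executor.heartbeatInterval is 10s.
current = heartbeat_attempts(detection_timeout_s=120, rpc_timeout_s=120, interval_s=10)

# Suggested direction: heartbeat RPC timeout <= heartbeatInterval (10s here).
proposed = heartbeat_attempts(detection_timeout_s=120, rpc_timeout_s=10, interval_s=10)

print(current, proposed)
```

Under these assumptions, a single hung RPC consumes the whole detection window with the current defaults, while capping the RPC timeout at the heartbeat interval leaves room for several retries before the executor is declared failed.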