GitHub user aarondav opened a pull request:
https://github.com/apache/spark/pull/2784
[SPARK-3923] Decrease Akka heartbeat interval below heartbeat pause
Something about the Akka 2.3.4 upgrade seems to have triggered an issue
where all the services disconnect from each other after exactly 1000 seconds
(which is the heartbeat interval). [This
post](https://groups.google.com/forum/#!topic/akka-user/X3xzpTCbEFs) suggests
that the heartbeat interval should be less than the heartbeat pause, and
decreasing the interval from 1000s to below the 600s of the heartbeat pause
seems to have rectified the issue. My current cluster has now exceeded 1400s
of uptime without failure!
I do not know why this fixed it, because the threshold we have set for the
failure detector is the exponent of a timeout, and 300 is extremely large.
Perhaps the default failure detector changed in 2.3.4 and now ignores threshold.
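As a rough illustration of the relationship described above, here is a minimal Python sketch that checks whether a heartbeat interval is below the acceptable heartbeat pause and renders the corresponding Akka settings. The function names are hypothetical and not part of Spark or Akka. The 1000s interval, 600s pause, and threshold of 300 come from the description; the 100s interval is only a placeholder for "below 600s", not necessarily the value this patch uses. The `akka.remote.transport-failure-detector.*` keys are the Akka remoting settings as I understand them.

```python
def heartbeat_settings_ok(interval_s: int, pause_s: int) -> bool:
    """Return True when heartbeats are sent more often than the tolerated pause."""
    return interval_s < pause_s


def render_failure_detector_conf(interval_s: int, pause_s: int, threshold: float) -> str:
    """Render the three transport failure detector settings as HOCON lines."""
    prefix = "akka.remote.transport-failure-detector"
    return "\n".join([
        f"{prefix}.heartbeat-interval = {interval_s} s",
        f"{prefix}.acceptable-heartbeat-pause = {pause_s} s",
        f"{prefix}.threshold = {threshold}",
    ])


if __name__ == "__main__":
    pause_s = 600        # heartbeat pause from the description
    threshold = 300.0    # failure detector threshold from the description

    # Old interval (1000s) versus an illustrative placeholder below the pause (100s).
    for interval_s in (1000, 100):
        status = "ok" if heartbeat_settings_ok(interval_s, pause_s) else "interval >= pause"
        print(f"interval={interval_s}s pause={pause_s}s: {status}")

    print(render_failure_detector_conf(100, pause_s, threshold))
```

A check like this only encodes the rule of thumb from the linked post; it does not explain why the threshold appears to be ignored.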
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/aarondav/spark fix-timeout
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/2784.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2784
----
commit 9cb03722d689de4da6f46e45609b1e1c6d40d130
Author: Aaron Davidson <[email protected]>
Date: 2014-10-13T18:14:03Z
[SPARK-3923] Decrease Akka heartbeat interval below heartbeat pause
Something about the Akka 2.3.4 upgrade seems to have triggered an issue
where all the services disconnect from each other after exactly 1000 seconds
(which is the heartbeat interval). [This
post](https://groups.google.com/forum/#!topic/akka-user/X3xzpTCbEFs) suggests
that the heartbeat interval should be less than the heartbeat pause, and
decreasing the interval from 1000s to below the 600s of the heartbeat pause
seems to have rectified the issue. My current cluster has now exceeded 1400s
of uptime without failure!
I do not know why this fixed it, because the threshold we have set for the
failure detector is the exponent of a timeout, and 300 is extremely large.
Perhaps the default failure detector changed in 2.3.4 and now ignores threshold.
----