GitHub user aarondav opened a pull request:

    https://github.com/apache/spark/pull/2784

    [SPARK-3923] Decrease Akka heartbeat interval below heartbeat pause

    Something about the Akka 2.3.4 upgrade seems to have triggered an issue
    where all the services disconnect from each other after exactly 1000
    seconds (which is the heartbeat interval).
    [This post](https://groups.google.com/forum/#!topic/akka-user/X3xzpTCbEFs)
    suggests that the heartbeat interval should be less than the heartbeat
    pause, and decreasing the interval from 1000s to below the 600s of the
    heartbeat pause seems to have rectified the issue. My current cluster has
    now exceeded 1400s of uptime without failure!
    
    I do not know why this fixed it, because the threshold we have set for the
    failure detector is the exponent of a timeout, and 300 is extremely large.
    Perhaps the default failure detector changed in 2.3.4 and now ignores
    threshold.
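
A minimal sketch of the relationship described above: the heartbeat interval should stay strictly below the heartbeat pause. It assumes the Spark settings are named `spark.akka.heartbeat.interval` and `spark.akka.heartbeat.pauses` (names not stated in this description). The helper `heartbeat_settings_ok` and the value 100 are hypothetical; only the 1000s and 600s defaults come from the text.

```python
# Hypothetical sanity check. The key names below are assumptions and are not
# taken from this pull request.
INTERVAL_KEY = "spark.akka.heartbeat.interval"
PAUSE_KEY = "spark.akka.heartbeat.pauses"


def heartbeat_settings_ok(conf, default_interval_s=1000, default_pause_s=600):
    """Return True when the heartbeat interval is strictly below the pause."""
    interval_s = int(conf.get(INTERVAL_KEY, default_interval_s))
    pause_s = int(conf.get(PAUSE_KEY, default_pause_s))
    return interval_s < pause_s


# Old behaviour: the 1000s interval exceeds the 600s pause.
old_conf = {}
# Illustrative fix: any interval below 600s satisfies the check.
new_conf = {INTERVAL_KEY: "100"}

print(heartbeat_settings_ok(old_conf))
print(heartbeat_settings_ok(new_conf))
```

A check like this only encodes the ordering constraint. It says nothing about why the failure detector appears to ignore the threshold.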

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/aarondav/spark fix-timeout

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2784.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2784
    
----
commit 9cb03722d689de4da6f46e45609b1e1c6d40d130
Author: Aaron Davidson <[email protected]>
Date:   2014-10-13T18:14:03Z

    [SPARK-3923] Decrease Akka heartbeat interval below heartbeat pause
    
    Something about the Akka 2.3.4 upgrade seems to have triggered an issue
    where all the services disconnect from each other after exactly 1000
    seconds (which is the heartbeat interval).
    [This post](https://groups.google.com/forum/#!topic/akka-user/X3xzpTCbEFs)
    suggests that the heartbeat interval should be less than the heartbeat
    pause, and decreasing the interval from 1000s to below the 600s of the
    heartbeat pause seems to have rectified the issue. My current cluster has
    now exceeded 1400s of uptime without failure!
    
    I do not know why this fixed it, because the threshold we have set for the
    failure detector is the exponent of a timeout, and 300 is extremely large.
    Perhaps the default failure detector changed in 2.3.4 and now ignores
    threshold.

----


