Repository: spark
Updated Branches:
  refs/heads/master 2fe0ba956 -> 7f7b50ed9


[SPARK-3923] Increase Akka heartbeat pause above heartbeat interval

Something in the Akka 2.3.4 upgrade appears to have exposed an issue in which 
all the services disconnect from each other after exactly 1000 seconds (which 
is the heartbeat interval). [This 
post](https://groups.google.com/forum/#!topic/akka-user/X3xzpTCbEFs) suggests 
that the heartbeat pause should be greater than the heartbeat interval, and 
increasing the pause from 600s to 6000s appears to have fixed the issue. My 
current cluster has now exceeded 1400s of uptime without failure!
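
The relationship described above can be sketched as a small check. This is only an illustration, not part of the patch: the object and method names are hypothetical, the property keys and defaults (in seconds) are the ones that appear in the diff to AkkaUtils.scala, and a plain `Map` stands in for the Spark configuration so the snippet has no Spark dependency.

```scala
// Hypothetical helper, not part of Spark: checks that the acceptable
// heartbeat pause is larger than the heartbeat interval.
object HeartbeatSettingsCheck {
  // Defaults after this change, in seconds (taken from the diff).
  val DefaultPausesSecs = 6000
  val DefaultIntervalSecs = 1000

  /** Returns true when the pause is strictly greater than the interval. */
  def pauseExceedsInterval(settings: Map[String, String]): Boolean = {
    val pauses = settings.get("spark.akka.heartbeat.pauses")
      .map(_.toInt).getOrElse(DefaultPausesSecs)
    val interval = settings.get("spark.akka.heartbeat.interval")
      .map(_.toInt).getOrElse(DefaultIntervalSecs)
    pauses > interval
  }
}
```

With no overrides the new defaults pass the check, while overriding only the pause back to the old value of 600 fails it, which matches the configuration that showed the disconnects.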

I do not know why this fixed it: the threshold we have set for the failure 
detector acts as the exponent of a timeout, and 300 is extremely large, so the 
detector should effectively never fire. Perhaps the default failure detector 
changed in 2.3.4 and now ignores the threshold.
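
To illustrate why a threshold of 300 would be expected to be unreachable, here is a rough sketch of a phi accrual style computation, where phi is the negative base-10 logarithm of the probability that a heartbeat is merely late. It assumes a logistic approximation to the normal distribution with commonly used constants; the object name and parameters are hypothetical, and it makes no claim about which detector the Akka version in question actually applies.

```scala
// Hypothetical sketch of a phi accrual style suspicion level.
// Not Akka's or Spark's code; constants are an assumed approximation.
object PhiSketch {
  /**
   * elapsedMs: time since the last heartbeat was seen
   * meanMs, stdDevMs: estimated mean and standard deviation of
   * heartbeat inter-arrival times
   */
  def phi(elapsedMs: Double, meanMs: Double, stdDevMs: Double): Double = {
    val y = (elapsedMs - meanMs) / stdDevMs
    // Logistic approximation of the normal tail probability.
    val e = math.exp(-y * (1.5976 + 0.070566 * y * y))
    if (elapsedMs > meanMs) -math.log10(e / (1.0 + e))
    else -math.log10(1.0 - 1.0 / (1.0 + e))
  }
}
```

Under this model a peer is suspected once phi exceeds the configured threshold, so a threshold of 300 corresponds to a probability on the order of 10^-300, which is why the observed disconnects are surprising.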

Author: Aaron Davidson <[email protected]>

Closes #2784 from aarondav/fix-timeout and squashes the following commits:

bd1151a [Aaron Davidson] Increase pause, don't decrease interval
9cb0372 [Aaron Davidson] [SPARK-3923] Decrease Akka heartbeat interval below heartbeat pause


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7f7b50ed
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7f7b50ed
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7f7b50ed

Branch: refs/heads/master
Commit: 7f7b50ed9d4ffdd6b23e0faa56b068a049da67f7
Parents: 2fe0ba9
Author: Aaron Davidson <[email protected]>
Authored: Thu Oct 16 18:58:18 2014 -0700
Committer: Andrew Or <[email protected]>
Committed: Thu Oct 16 18:58:18 2014 -0700

----------------------------------------------------------------------
 core/src/main/scala/org/apache/spark/util/AkkaUtils.scala | 2 +-
 docs/configuration.md                                     | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/7f7b50ed/core/src/main/scala/org/apache/spark/util/AkkaUtils.scala
----------------------------------------------------------------------
diff --git a/core/src/main/scala/org/apache/spark/util/AkkaUtils.scala b/core/src/main/scala/org/apache/spark/util/AkkaUtils.scala
index e2d32c8..f41c8d0 100644
--- a/core/src/main/scala/org/apache/spark/util/AkkaUtils.scala
+++ b/core/src/main/scala/org/apache/spark/util/AkkaUtils.scala
@@ -77,7 +77,7 @@ private[spark] object AkkaUtils extends Logging {
 
     val logAkkaConfig = if (conf.getBoolean("spark.akka.logAkkaConfig", false)) "on" else "off"
 
-    val akkaHeartBeatPauses = conf.getInt("spark.akka.heartbeat.pauses", 600)
+    val akkaHeartBeatPauses = conf.getInt("spark.akka.heartbeat.pauses", 6000)
     val akkaFailureDetector =
       conf.getDouble("spark.akka.failure-detector.threshold", 300.0)
     val akkaHeartBeatInterval = conf.getInt("spark.akka.heartbeat.interval", 1000)

http://git-wip-us.apache.org/repos/asf/spark/blob/7f7b50ed/docs/configuration.md
----------------------------------------------------------------------
diff --git a/docs/configuration.md b/docs/configuration.md
index f311f0d..8515ee0 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -725,7 +725,7 @@ Apart from these, the following properties are also available, and may be useful
 </tr>
 <tr>
   <td><code>spark.akka.heartbeat.pauses</code></td>
-  <td>600</td>
+  <td>6000</td>
   <td>
      This is set to a larger value to disable failure detector that comes inbuilt akka. It can be
      enabled again, if you plan to use this feature (Not recommended). Acceptable heart beat pause

