Hello, I am struggling with a problem I have spent days trying to resolve. 
I was hoping someone here may have some input that could help me look in 
the right direction.

I am running a small cluster of 3 nodes spread across two machines. The 
cluster is formed by two applications; call them Web and DataDig. DataDig 
and one instance of Web run on Machine1, and a second instance of Web runs 
on Machine2.

Both use Akka 2.4.4; Web pulls it in transitively through Play 2.5.4.

My problem is that after some time running without issue, the nodes start 
having trouble communicating with each other. Within 24 hours of bringing 
the cluster members online, the logs start to display the following:

[WARN] [12/16/2016 21:07:32.645] [a.r.ReliableDeliverySupervisor] 
[akka.tcp://application@host1:2552/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fapplication%40host2%3A2552-588]
 
Association with remote system [akka.tcp://host2:2552] has failed, address 
is now gated for [5000] ms. Reason: [Disassociated]

A service monitors cluster health, and at this point it starts to report 
that cluster members are unreachable from each other.

This disrupts cluster behavior and also produces messages stating that the 
leader currently cannot perform its duties:

[INFO] [12/16/2016 21:05:48.440] [a.c.Cluster(akka://application)] 
[akka.cluster.Cluster(akka://application)] Cluster Node 
[akka.tcp://application@host1:2552] - Leader can currently not perform its 
duties, reachability status: [akka.tcp://application@host1:2552 -> 
akka.tcp://application@host2:2552: Unreachable [Unreachable] (328), 
akka.tcp://application@host1:37770 -> akka.tcp://application@host2:2552: 
Unreachable [Unreachable] (1), akka.tcp://application@host2:2552 -> 
akka.tcp://application@host1:2552: Reachable [Reachable] (616), 
akka.tcp://application@host2:2552 -> akka.tcp://application@host1:37770: 
Unreachable [Unreachable] (617)], member status: 
[akka.tcp://application@host1:2552 Up seen=true, 
akka.tcp://application@host1:37770 Up seen=false, 
akka.tcp://application@host2:2552 Leaving seen=false]
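
In case it helps anyone reading the reachability status above, here is a 
minimal Python sketch that splits it into one record per observer/subject 
pair. It is only a log-reading aid and not part of either application; the 
name parse_reachability and the regular expression are my own assumptions, 
based solely on the message format shown in this log.

```python
import re

# Matches one "observer -> subject: Status [Status] (version)" record.
# The pattern is an assumption derived from the log excerpt in this post.
RECORD = re.compile(
    r"(akka\.tcp://\S+?)\s*->\s*(akka\.tcp://\S+?):\s*(\w+)\s*\[\w+\]\s*\((\d+)\)"
)

# Reachability portion of the "Leader can currently not perform its duties" message.
SAMPLE = (
    "reachability status: [akka.tcp://application@host1:2552 -> "
    "akka.tcp://application@host2:2552: Unreachable [Unreachable] (328), "
    "akka.tcp://application@host1:37770 -> akka.tcp://application@host2:2552: "
    "Unreachable [Unreachable] (1), akka.tcp://application@host2:2552 -> "
    "akka.tcp://application@host1:2552: Reachable [Reachable] (616), "
    "akka.tcp://application@host2:2552 -> akka.tcp://application@host1:37770: "
    "Unreachable [Unreachable] (617)]"
)


def parse_reachability(message):
    """Return (observer, subject, status, version) tuples found in a log message."""
    # The log is line-wrapped in the email, so collapse all whitespace first.
    flat = " ".join(message.split())
    return [(obs, subj, status, int(ver))
            for obs, subj, status, ver in RECORD.findall(flat)]


if __name__ == "__main__":
    for obs, subj, status, ver in parse_reachability(SAMPLE):
        print(f"{obs} sees {subj} as {status} (version {ver})")
```

Run against the message above, it yields four records, three of them 
Unreachable; the only Reachable entry is host2:2552 observing host1:2552.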


I have turned on Akka debug logging, but the only additional messages I see 
around the time of the disassociation are:

[DEBUG] [12/16/2016 21:42:58.893] [application-akka.actor.default-dispatcher-24] 
[akka.tcp://application@host1:37770/system/cluster/core/daemon] 
Cluster Node [akka.tcp://application@host1:37770] - Receiving gossip from 
[UniqueAddress(akka.tcp://application@host2:2552,921200398)]

and

[DEBUG] [12/16/2016 21:06:03.310] [a.r.EndpointWriter] 
[akka.tcp://application@host1:2552/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fapplication%40host2%3A2552-588/endpointWriter]
 
Drained buffer with maxWriteCount: 50, fullBackoffCount: 1, 
smallBackoffCount: 0, noBackoffCount: 0 , adaptiveBackoff: 1000



Here is the configuration being used for Web:

akka {
  actor {
    provider = "akka.cluster.ClusterActorRefProvider"
  }

  remote {
    secure-cookie = "9C7BBB890AB2C39691FC7B2A34F616C1D87FCC5B"
    require-cookie = on
    netty.tcp {
      hostname = "localhost"
      hostname = ${?HOSTNAME}
      port = 2552
    }
    log-remote-lifecycle-events = off
  }

  cluster {
    failure-detector.threshold = 10
    pub-sub {
      name = distributedPubSubMediator
      routing-logic = round-robin
      gossip-interval = 1s
      removed-time-to-live = 60s
      max-delta-elements = 3000
    }

    roles = ["Web"]

    seed-nodes = "akka.tcp://application@host1:2552"

  }

  loglevel = "DEBUG"
  log-dead-letters-during-shutdown = off
  log-dead-letters = off

  extensions = ["akka.cluster.pubsub.DistributedPubSub"]
}

with JAVA_OPTS="
...

-XX:HeapDumpPath=$HOME/log/ \
-XX:+UseG1GC \
-XX:MaxGCPauseMillis=300 \
-XX:G1HeapWastePercent=20 \
-XX:InitiatingHeapOccupancyPercent=75 \
-XX:ConcGCThreads=32 \
-XX:ParallelGCThreads=48 \
-XX:NewRatio=1 \
-verbose:gc \
-XX:+UseGCLogFileRotation \
-XX:NumberOfGCLogFiles=1 \
-XX:GCLogFileSize=512M \
-XX:+PrintGCDetails \
-XX:+PrintGCTimeStamps \
-Xloggc:$HOME/log/services_web_gc.log
"

DataDig uses a very similar configuration.

These hosts are very powerful machines that are not running any other 
resource-heavy processes (in fact they're barely running anything else at 
all). There are a few GC pauses that are longer than I would expect.
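
Since the GC log is already being written (see -Xloggc above), a small 
script can list the pauses that exceed the 300 ms MaxGCPauseMillis target, 
so they can be lined up against the Disassociated warnings. This is a 
minimal sketch: the function name long_pauses is hypothetical, and the line 
format in the regular expression is an assumption about what a JDK 8 G1 
collector prints with -XX:+PrintGCDetails and -XX:+PrintGCTimeStamps. It 
only looks at "GC pause" and "Full GC" lines.

```python
import re

# Assumed line shapes (JDK 8, G1, PrintGCDetails + PrintGCTimeStamps):
#   "1234.567: [GC pause (G1 Evacuation Pause) (young), 0.3512345 secs]"
#   "2345.678: [Full GC (Allocation Failure)  1024M->512M(2048M), 2.5000000 secs]"
PAUSE = re.compile(
    r"^(\d+\.\d+): \[((?:GC pause|Full GC)[^,]*), (\d+\.\d+) secs\]"
)


def long_pauses(lines, threshold_secs=0.3):
    """Yield (uptime_secs, description, pause_secs) for pauses >= threshold."""
    for line in lines:
        match = PAUSE.match(line.strip())
        if match is None:
            continue  # detail lines and concurrent phases are ignored
        pause = float(match.group(3))
        if pause >= threshold_secs:
            yield float(match.group(1)), match.group(2).strip(), pause


if __name__ == "__main__":
    import sys

    # Usage: python long_pauses.py path/to/services_web_gc.log
    with open(sys.argv[1]) as gc_log:
        for uptime, description, pause in long_pauses(gc_log):
            print(f"{uptime:12.3f}s  {pause:8.3f}s  {description}")
```

The first column is JVM uptime in seconds, not wall-clock time, so the JVM 
start time is needed to match an entry to a timestamp in the Akka log.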


Any help is appreciated, and I can provide any further context/information.
