How long are your network partitions? You have increased acceptable-heartbeat-pause of the cluster failure detector, which is good, and you also use auto-down-unreachable-after. Together, those timeouts mean that the cluster should be able to survive a network partition of 70 seconds (60 s acceptable-heartbeat-pause + 10 s auto-down-unreachable-after) before removing and quarantining the unreachable members. Do you see quarantine earlier than that?
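To make that arithmetic concrete, here is a rough Java sketch that simply adds the two timeouts from the quoted config. It is not Akka code; the class `PartitionBudget` and method `partitionBudget` are hypothetical names. It is an approximation that ignores heartbeat-interval, the phi threshold and gossip propagation delay, so the real time until an unreachable member is downed can differ somewhat.

```java
import java.time.Duration;

// Hypothetical helper, not part of Akka: rough estimate of how long a
// partition can last before an unreachable member is auto-downed.
// Approximation only: acceptable-heartbeat-pause + auto-down-unreachable-after.
// It ignores heartbeat-interval, the phi threshold and gossip propagation.
public class PartitionBudget {

    static Duration partitionBudget(Duration acceptableHeartbeatPause,
                                    Duration autoDownUnreachableAfter) {
        return acceptableHeartbeatPause.plus(autoDownUnreachableAfter);
    }

    public static void main(String[] args) {
        // Values taken from akka.cluster.failure-detector and akka.cluster in the quoted config
        Duration pause = Duration.ofSeconds(60);
        Duration autoDown = Duration.ofSeconds(10);
        Duration budget = partitionBudget(pause, autoDown);
        // prints: Approximate partition tolerance: 70 s
        System.out.println("Approximate partition tolerance: " + budget.getSeconds() + " s");
    }
}
```

If partitions regularly last longer than this budget, nodes will be downed and quarantined even though they would eventually have become reachable again.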
Regards,
Patrik

On Sat, Oct 29, 2016 at 1:59 AM, Eugene Dzhurinsky <[email protected]> wrote:

> I have a not very stable network across different geo locations
> (different hemispheres, actually), and from time to time my actor cluster
> falls apart.
> I wrote a bunch of event interceptors for quarantine events, and the system
> is more or less operable (less than 10% of nodes are off-cluster at any
> given moment; they detect the failures and quarantine events and restart
> themselves), and task recovery works really well. But recently I
> started to observe a lot of restarts of the systems. Is there any way to
> make the cluster more stable? Below is my config:
>
> akka {
>
>   actor {
>     provider = "akka.cluster.ClusterActorRefProvider"
>   }
>
>   remote {
>     log-remote-lifecycle-events = on
>     log-sent-messages = off
>     log-received-messages = off
>
>     netty.tcp {
>       hostname = "127.0.0.1"
>       port = 0
>     }
>
>     watch-failure-detector {
>       threshold = 12
>       heartbeat-interval = 10 s
>       acceptable-heartbeat-pause = 60 s
>     }
>
>   }
>
>   cluster {
>     seed-nodes = ["akka.tcp://[email protected]:12551"]
>     auto-down-unreachable-after = 10s
>
>     failure-detector {
>       threshold = 12
>       acceptable-heartbeat-pause = 60 s
>       heartbeat-interval = 10 s
>     }
>
>     use-dispatcher = cluster-dispatcher
>
>   }
>
>   contrib.cluster.pub-sub {
>
>     name = distributedPubSubMediator
>     role = ""
>     routing-logic = broadcast
>     gossip-interval = 1s
>     removed-time-to-live = 120s
>     max-delta-elements = 3000
>
>   }
>
>   loggers = ["akka.event.slf4j.Slf4jLogger"]
>   loglevel = "DEBUG"
>
> }
>
> singleton-dispatcher {
>   fork-join-executor.parallelism-min = 1
>   fork-join-executor.parallelism-max = 1
> }
>
> cluster-dispatcher {
>   type = "Dispatcher"
>   executor = "fork-join-executor"
>   fork-join-executor {
>     parallelism-min = 2
>     parallelism-max = 4
>   }
> }
>
> Thanks!
>
> --
> >>>>>>>>>> Read the docs: http://akka.io/docs/
> >>>>>>>>>> Check the FAQ: http://doc.akka.io/docs/akka/current/additional/faq.html
> >>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
> ---
> You received this message because you are subscribed to the Google Groups
> "Akka User List" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/akka-user.
> For more options, visit https://groups.google.com/d/optout.

--
Patrik Nordwall
Akka Tech Lead
Lightbend <http://www.lightbend.com/> - Reactive apps on the JVM
Twitter: @patriknw
