Hi group,

I am working on an Akka cluster with four nodes. Each node has different
functionality, but all four are peers in the cluster: their configuration
is nearly identical (apart from the host:port settings), and all of them
are seed nodes. My configuration file looks something like this:

cluster-conf.akka {
  log-dead-letters-during-shutdown = false
  actor.provider = "akka.cluster.ClusterActorRefProvider"
  remote {
    netty.tcp {
      hostname = ${rep.ep.httpd-game-1.int_ip}
    }
    watch-failure-detector.acceptable-heartbeat-pause = 15 s
  }
  cluster {
    seed-nodes = [
      "akka.tcp://"${cluster-conf.clu.name}"@"${rep.ep.httpd-game-1.int_ip}":"${cluster-conf.akka.remote.netty.tcp.port},
      "akka.tcp://"${cluster-conf.clu.name}"@"${rep.ep.httpd-game-2.int_ip}":"${cluster-conf.akka.remote.netty.tcp.port},
      "akka.tcp://"${cluster-conf.clu.name}"@"${rep.ep.httpd-sso.int_ip}":"${cluster-conf.akka.remote.netty.tcp.port},
      "akka.tcp://"${cluster-conf.clu.name}"@"${rep.ep.misc.int_ip}":"${cluster-conf.akka.remote.netty.tcp.port},
    ]
    auto-down-unreachable-after = 10s
    metrics.native-library-extract-folder = ${user.dir}/target/native
  }
}

Under normal conditions this works fine. My goal now is to maximize fault
tolerance. As far as I can tell, that means a node should be able to
rejoin the cluster automatically when:
1. it crashes and is restarted automatically by a daemon program,
2. the network fails for a short time and then recovers (for example, the
NIC used for clustering with the other nodes fails).

For 1, my setup works when a node crashes and rejoins the cluster after a
while. But if it crashes and restarts too quickly (before
auto-down-unreachable-after elapses), it somehow causes the cluster to
scatter, i.e. all nodes are removed from the cluster and become isolated.
Am I doing something wrong? How can I fix this other than by adding a delay
to my daemon program?

For 2, if the network recovers only after auto-down-unreachable-after has
elapsed, the node can no longer rejoin without manual intervention. Can
someone shed some light on how to make the cluster automatically down the
node when it tries to rejoin in such a situation?

Any other suggestions regarding fault tolerance in a cluster setup are
welcome.

Thanks in advance.