[akka-user] Cluster failure tolerance

Kai Yu Sun, 26 Apr 2015 03:21:30 -0700

Hi group,

I am working on an Akka cluster with four nodes. In my setup, each of the 
four nodes has different functionality, but they are all peers in the 
cluster, with very similar configuration (apart from the host:port 
settings), and all of them are seed nodes. My configuration file looks 
something like this:


cluster-conf.akka {
    log-dead-letters-during-shutdown = false

    actor.provider = "akka.cluster.ClusterActorRefProvider"

    remote {
        netty.tcp {
            hostname = ${rep.ep.httpd-game-1.int_ip}
        }

        watch-failure-detector.acceptable-heartbeat-pause = 15 s
    }

    cluster {
        seed-nodes = [
            "akka.tcp://"${cluster-conf.clu.name}"@"${rep.ep.httpd-game-1.int_ip}":"${cluster-conf.akka.remote.netty.tcp.port},
            "akka.tcp://"${cluster-conf.clu.name}"@"${rep.ep.httpd-game-2.int_ip}":"${cluster-conf.akka.remote.netty.tcp.port},
            "akka.tcp://"${cluster-conf.clu.name}"@"${rep.ep.httpd-sso.int_ip}":"${cluster-conf.akka.remote.netty.tcp.port},
            "akka.tcp://"${cluster-conf.clu.name}"@"${rep.ep.misc.int_ip}":"${cluster-conf.akka.remote.netty.tcp.port},
        ]

        auto-down-unreachable-after = 10s

        metrics.native-library-extract-folder=${user.dir}/target/native
    }
}


Under normal conditions, it works fine. Now my goal is to achieve maximum 
failure tolerance. In particular, I want to ensure that a node can 
(automatically) rejoin the cluster when

 1. it crashes and is brought back up automatically by a daemon program,
 2. the network fails for a short time and then recovers (for example, the 
NIC used for clustering with the other nodes fails).

For 1, my setup works when a node crashes and rejoins the cluster after a 
while. But if it crashes and restarts too quickly (before 
auto-down-unreachable-after elapses), it somehow causes the cluster to 
scatter, i.e. all nodes are removed from the cluster and become isolated. 
Am I doing something wrong? How can I fix that other than by adding a delay 
to my daemon program?
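
To make the delay workaround concrete, here is a rough sketch of what the 
daemon would have to do. It is written in Python purely for illustration and 
is not the actual daemon. The names remaining_restart_delay and 
restart_when_safe, the start_node callback and the 5-second margin are all 
hypothetical; only the 10-second value comes from the configuration above. 
It also assumes that waiting out auto-down-unreachable-after plus a margin 
is enough, which I have not verified.

```python
import time

# Mirrors auto-down-unreachable-after = 10s from the configuration above.
AUTO_DOWN_AFTER_S = 10.0
# Hypothetical safety margin for failure-detection time (an assumption, not measured).
MARGIN_S = 5.0


def remaining_restart_delay(crashed_at, now,
                            auto_down_after_s=AUTO_DOWN_AFTER_S,
                            margin_s=MARGIN_S):
    """Seconds the daemon should still wait before restarting the crashed node.

    crashed_at and now must come from the same clock.
    """
    earliest_restart = crashed_at + auto_down_after_s + margin_s
    return max(0.0, earliest_restart - now)


def restart_when_safe(crashed_at, start_node,
                      clock=time.monotonic, sleep=time.sleep):
    """Wait out the remaining delay, then call start_node() (hypothetical callback)."""
    delay = remaining_restart_delay(crashed_at, clock())
    if delay > 0:
        sleep(delay)
    return start_node()
```

For a crash noticed 3 seconds ago, remaining_restart_delay(100.0, 103.0) 
gives 12.0, so the node would stay down for 15 seconds in total. This is 
exactly the kind of fixed delay I would like to avoid.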

For 2, if the network recovers after auto-down-unreachable-after 
has elapsed, the node is no longer able to rejoin without manual 
intervention. Can someone shed some light on how to make the 
cluster automatically down the node when it tries to rejoin in such a 
situation?
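
To summarise the behaviour I see in both cases, here is a small Python 
sketch. It only encodes my observations as described in this post, not how 
Akka works internally. The function name observed_outcome and the event 
labels "crash" and "network" are made up for illustration, and comparing 
directly against auto-down-unreachable-after is a simplification that 
ignores failure-detection time.

```python
def observed_outcome(event, away_s, auto_down_after_s=10.0):
    """Return the behaviour reported in this post for a node away for away_s seconds.

    event is "crash" (process restarted by the daemon) or
    "network" (link dropped and later recovered); both labels are hypothetical.
    """
    back_before_auto_down = away_s < auto_down_after_s
    if event == "crash":
        if back_before_auto_down:
            return "cluster scatters: all nodes end up isolated"
        return "node rejoins the cluster"
    if event == "network":
        if back_before_auto_down:
            # This combination is not covered by the observations in the post.
            return "not described in the post"
        return "node cannot rejoin without manual intervention"
    raise ValueError("unknown event: %r" % (event,))
```

The two problematic cases are observed_outcome("crash", 3) and 
observed_outcome("network", 30).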

Any other suggestions regarding fault tolerance in a cluster setup are 
welcome. Thanks in advance.
