If I'm understanding you correctly, that's not really any better than the broken auto-down system built into Akka. You really need something smarter for production.
*Conceptually*, there is a very straightforward strategy: when a node sees nodes become unreachable, it checks whether it can see more than half of the nodes it expects. If it can, it assumes that things are otherwise okay, and marks the unreachable node as down; if not, then it assumes that it is on the "losing" side of a network partition, and self-destructs.

That's easy to explain, but implementing all the details properly is non-trivial. (My own implementation <https://github.com/jducoeur/Querki/blob/master/querki/scalajvm/app/querki/cluster/QuerkiNodeManager.scala> has been in production for a while, but I suspect still has problems with some edge cases, and I plan to replace it with something more AWS-aware.) You want some delay before doing this, so that brief network hiccups don't cause your nodes to self-destruct, and figuring out the correct definition of "half" can be complex if the network isn't fixed-size.

But something along those lines is pretty necessary if you want a decently stable cluster for production...

On Wed, Sep 13, 2017 at 2:54 PM, Sebastian Oliveri <[email protected]> wrote:

> Hi,
>
> I have a cluster with a few nodes running clustered sharding persistent
> actors that I am close to deploy in prod
> I tested that once a node is unreachable all the persistent actors inside
> it are unreachable as well until human intervention takes places to Down
> that unreachable node for the cluster to restore those actors in other
> nodes.
> I am not considering the commercial split brain strategies so I have no
> other option than doing it manual.
> The worst case would be for the human intervention to take long and so for
> clients to be unable to interact with the actors in a considerable time
> window.
> I was thinking about having an actor in each node just like the
> SimpleClusterListener written in the akka cluster docs and somehow to
> mark as Down the unreachable node when handling the command:
> UnreachableMember using the akka management API through HTTP (no matter
> if it was network partition or crash)
> I don't know if this is possible because I didn't read as a solution so
> far. The worst case would be that if it was actually a network partition
> that will be eventually restore the unreachable node will be Down and
> removed from the cluster and all its persistent actors (maybe thousands,
> millions) would be restored again in other nodes, am I right?
>
> def receive = {
>   case UnreachableMember(member) =>
>     akkaManagement.markAsDown(member.address) // something like this
> }
>
> Is this OK as a solution to avoid human intervention?
>
> Thanks,
> Sebastian
>
> --
> >>>>>>>>>> Read the docs: http://akka.io/docs/
> >>>>>>>>>> Check the FAQ: http://doc.akka.io/docs/akka/current/additional/faq.html
> >>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
> ---
> You received this message because you are subscribed to the Google Groups
> "Akka User List" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/akka-user.
> For more options, visit https://groups.google.com/d/optout.
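
For illustration only, the majority check described in the reply at the top of this message might be sketched as a pure function like the one below. This is a minimal sketch under stated assumptions, not the logic in QuerkiNodeManager and not an Akka API: every name here (`Decision`, `QuorumDowning`, `decide`, and its parameters) is hypothetical, the cluster is assumed to have a fixed expected size, and wiring the result to real cluster events and to the actual down/shutdown calls is left out.

```scala
// Hypothetical sketch of a quorum-based downing decision.
// None of these names come from Akka or from the Querki code base.
sealed trait Decision
case object NoAction extends Decision   // nothing is unreachable
case object Wait extends Decision       // unreachability has not lasted long enough yet
final case class DownUnreachable(nodes: Set[String]) extends Decision
case object DownSelf extends Decision   // this node is on the "losing" side

object QuorumDowning {

  /** Decide what a node should do about unreachable peers.
    *
    * @param expectedSize         number of nodes the cluster is expected to have
    *                             (assumed fixed in this sketch)
    * @param reachable            nodes this node can currently see, including itself
    * @param unreachable          nodes currently flagged as unreachable
    * @param unreachableForMillis how long the current unreachability has lasted
    * @param stableAfterMillis    grace period, so brief network hiccups are ignored
    */
  def decide(
      expectedSize: Int,
      reachable: Set[String],
      unreachable: Set[String],
      unreachableForMillis: Long,
      stableAfterMillis: Long
  ): Decision =
    if (unreachable.isEmpty) NoAction
    else if (unreachableForMillis < stableAfterMillis) Wait
    // "More than half" is a strict majority of the expected size.
    else if (reachable.size * 2 > expectedSize) DownUnreachable(unreachable)
    else DownSelf
}
```

With an expected size of 5, a node that still sees three members (itself included) after the grace period gets `DownUnreachable` for the other two, while a node that sees only two gets `DownSelf`. Because the majority test is strict, an exact half-and-half split of an even-sized cluster makes both sides shut down in this sketch; that is safe but costs availability, and it is one of the details a real implementation would need to handle.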
