Re: [akka-user] How to recover from network partition quarantine

Morten Kjetland Wed, 05 Aug 2015 04:58:41 -0700

Hi,

We have solved these issues like this:


We have a ClusterListener on each node that "pings" the database: as long
as the node is "online" and a happy member of the cluster, it updates a
timestamp in the database.
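
For illustration only (this is not the code described above), a minimal
Python sketch of such a heartbeat table, using an in-memory SQLite database.
The table name, column names, node address and the 30-second freshness
window are all assumptions made for the example.

```python
import sqlite3


def open_store():
    """Create an in-memory store with one row per node (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE heartbeat (node TEXT PRIMARY KEY, last_seen REAL)"
    )
    return conn


def record_heartbeat(conn, node_address, now):
    """Insert or overwrite this node's 'last seen' timestamp (epoch seconds)."""
    conn.execute(
        "INSERT OR REPLACE INTO heartbeat (node, last_seen) VALUES (?, ?)",
        (node_address, now),
    )
    conn.commit()


def alive_nodes(conn, now, max_age_seconds=30.0):
    """Return the nodes whose heartbeat is no older than max_age_seconds."""
    rows = conn.execute(
        "SELECT node FROM heartbeat WHERE last_seen >= ? ORDER BY node",
        (now - max_age_seconds,),
    )
    return [row[0] for row in rows]


# Usage: a node that is a cluster member would call this periodically.
store = open_store()
record_heartbeat(store, "akka.tcp://app@10.0.0.1:2552", now=1000.0)
```

A node counts as "alive" here only while it keeps refreshing its row, so a
crashed or stalled node ages out of the result without any explicit cleanup.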

To detect split-brain scenarios, we do this:
The ClusterListener on each node keeps track of all members of the cluster
in memory.
Periodically we check if there are more alive nodes in the database than we
know to be members of our cluster.
If we see more alive nodes than we have in the cluster, we know we have a
split-brain scenario.
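
The check itself is a simple comparison. A minimal sketch in Python, assuming
the alive nodes from the database and the in-memory cluster members are both
available as collections of node addresses; the function names are
hypothetical, and unseen_nodes is an extra diagnostic helper that is not part
of the description above.

```python
def is_split_brain(alive_in_db, cluster_members):
    """True if the database knows more alive nodes than our cluster view."""
    return len(set(alive_in_db)) > len(set(cluster_members))


def unseen_nodes(alive_in_db, cluster_members):
    """Nodes that are alive in the database but missing from our cluster view."""
    return sorted(set(alive_in_db) - set(cluster_members))


# Usage: three nodes report themselves alive, but we only see two members.
alive = ["node-a", "node-b", "node-c"]
members = ["node-a", "node-b"]
if is_split_brain(alive, members):
    print("possible split brain, not in our cluster:", unseen_nodes(alive, members))
```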

To recover from it, the node waits a random number of seconds, then triggers
its own restart (we spawn a process that executes "./application.sh
restart").
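
A sketch of the delayed restart in Python, under stated assumptions: the
30-second upper bound is invented for the example, and the sleep, spawn and
rng parameters exist only so the function can be exercised without really
sleeping or restarting anything. The random delay presumably keeps the nodes
from all restarting at the same moment.

```python
import random
import subprocess
import time


def restart_after_random_delay(
    command,
    max_delay_seconds=30.0,
    sleep=time.sleep,
    spawn=subprocess.Popen,
    rng=random.uniform,
):
    """Wait a random delay, then spawn the restart command as a new process.

    Returns the delay that was used, in seconds.
    """
    delay = rng(0.0, max_delay_seconds)
    sleep(delay)
    spawn(command)
    return delay


# Usage (not executed here, since it would restart the application):
# restart_after_random_delay(["./application.sh", "restart"])
```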

When the node is starting up (again) we use the same "alive" mechanism in
the database to find seed-nodes - so we actually join the existing cluster.
If no one is alive, we know we are the first one starting up, so we're
going to be our own seed node.
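
The seed-node choice at startup can be sketched as follows, again in Python
and only as an illustration. The function name is hypothetical, and ignoring
our own address in the alive list (in case a heartbeat from before the
restart is still fresh) is an assumption of the example, not something stated
above.

```python
def choose_seed_nodes(alive_in_db, self_address):
    """Pick seed nodes from the alive list, or ourselves if nobody else is alive."""
    others = sorted(
        address for address in set(alive_in_db) if address != self_address
    )
    # No other alive node: we are the first one up and seed our own cluster.
    return others if others else [self_address]


# Usage: two other nodes are alive, so we join them instead of seeding.
seeds = choose_seed_nodes(["node-b", "node-c"], self_address="node-a")
```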

If the node decides to join a cluster but fails to do so, it starts over
with another restart.

This has, at least for us, turned out to be a robust solution that
supports:

* staged or instant startup of multiple nodes.
* auto-restarting multiple nodes when deploying a new version.
* auto-healing when something odd happens in our data center (like
network glitches or something causing the CPU to stall for too long).

We're planning to open-source this code soon.

I hope this info was helpful.

Regards,
Morten

On Wed, Aug 5, 2015 at 1:17 PM Patrik Nordwall <[email protected]>
wrote:

> Hi Tom,
>
> On Fri, Jul 24, 2015 at 4:24 AM, Tom Pantelis <[email protected]>
> wrote:
>
>> During a network partition, the partitioned node is removed from the
>> cluster after auto-down occurs and quarantined such that it must be restarted
>> in order to rejoin the cluster once the partition heals. A manual restart
>> due to a temporary network outage is problematic when one is developing a
>> commercial product with end users who will expect automatic recovery (and
>> rightly so).
>>
>> One option is to disable auto-down, but that introduces another issue. In
>> lieu of that,
>>
>> 1) is there any way to disable the quarantine behavior?
>>
>
> No, because when it has been decided that it is not part of the cluster
> any more we don't want it to show up again. This is important for correct
> semantics of watch. We don't allow zombies.
>
>
>>
>> 2) is there any way for code on the node to know or get notified that it has
>> been quarantined and must be restarted so it can be handled automatically?
>>
>
> Subscribe to cluster event MemberRemoved, but then the problem is that the
> auto-down has downed the nodes on the other side of the partition and you
> end up with two separate clusters. auto-down can handle crashed nodes, but
> it doesn't handle network partitions well. That is why we don't have it
> turned on by default and recommend against it when using cluster singleton
> and persistence.
>
> It's possible to implement smarter downing strategies, but it is rather
> difficult to implement it correctly. We are working on something for
> improving this. Stay tuned.
>
> Regards,
> Patrik
>
>
>>
>> Thanks,
>> Tom
>>
>> --
>> >>>>>>>>>> Read the docs: http://akka.io/docs/
>> >>>>>>>>>> Check the FAQ:
>> http://doc.akka.io/docs/akka/current/additional/faq.html
>> >>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "Akka User List" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at http://groups.google.com/group/akka-user.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
>
> --
>
> Patrik Nordwall
> Typesafe <http://typesafe.com/> -  Reactive apps on the JVM
> Twitter: @patriknw
>

