Re: [akka-user] How to recover from network partition quarantine

Patrik Nordwall Tue, 22 Sep 2015 02:24:08 -0700

For the archives (and I promised to get back):

Akka Split Brain Resolver (Akka SBR) is a new commercial feature available
exclusively to Typesafe PSS subscribers.


It's part of the Typesafe Reactive Platform and implements a number of
strategies for downing nodes more safely than with plain timeouts
(auto-downing). Examples of the strategies are "static quorum" and "keep
majority". Each of them has specific trade-offs, i.e. scenarios where
it works well, and failure scenarios where the strategy would make a
decision consistent with how it's designed, but maybe not the one you need.
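
To make the trade-offs a bit more concrete, here is a minimal Python sketch
of the decision the two named strategies make on each side of a partition.
This is not the SBR implementation: the function names, the string node
addresses and the lowest-address tie-break are assumptions made for
illustration only.

```python
from typing import Set


def static_quorum_survives(reachable: Set[str], quorum_size: int) -> bool:
    """This side stays up only if it can still see at least quorum_size nodes.

    `reachable` is assumed to include the deciding node itself.
    """
    return len(reachable) >= quorum_size


def keep_majority_survives(reachable: Set[str], unreachable: Set[str]) -> bool:
    """This side stays up if it holds the majority of the known members.

    On an even split, the side holding the lowest address is kept
    (tie-break assumed here for illustration).
    """
    if len(reachable) != len(unreachable):
        return len(reachable) > len(unreachable)
    # `reachable` includes the deciding node, so the union is never empty.
    return min(reachable | unreachable) in reachable


# 5-node cluster split 3/2, static quorum configured as 3.
side_a = {"n1", "n2", "n3"}
side_b = {"n4", "n5"}
print(static_quorum_survives(side_a, 3), static_quorum_survives(side_b, 3))
print(keep_majority_survives(side_a, side_b), keep_majority_survives(side_b, side_a))
```

In this sketch only the 3-node side stays up under both strategies. If the
same cluster grew to 7 nodes while the quorum size stayed at 3, a 4/3 split
would let both sides pass the static quorum check, which is an example of a
decision that is consistent with the strategy but not what you need.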

The docs are available here:
http://doc.akka.io/docs/akka/rp-15v09p01/scala/split-brain-resolver.html and
go pretty in-depth about how it all works.

Konrad did a webinar about new features in Akka 2.4 and Reactive Platform
and it also covered the Split Brain Resolver a bit:
https://youtu.be/D3mPl8OUrjs?t=9m11s (9 minute mark is about SBR).

In order to use this in production you'll need to obtain a Reactive
Platform subscription, more details here:
http://www.typesafe.com/products/typesafe-reactive-platform (it also
explains on the bottom how you can try it out).

On Thu, Aug 6, 2015 at 9:38 AM, Morten Kjetland <[email protected]> wrote:

> …
> On Wed, Aug 5, 2015 at 9:45 PM <[email protected]> wrote:
>
>> Hi Morten,
>>
>> We'd like to implement a solution similar to yours, can you elaborate on
>> some details of your solution?
>>
>
> First let me say that our solution is not optimal, but it works.
> It was improved over time to cope with what we experienced in production,
> and has ended up being robust enough - at least for now.
> …
>
>
>> On Wednesday, August 5, 2015 at 2:58:19 PM UTC+3, Morten Kjetland wrote:
>>>
>>> Hi,
>>>
>>> We have solved these issues like this:
>>>
>>> We have a ClusterListener on each node that "pings" the database - As
>>> long as it is "online" and a happy member of the cluster it updates a
>>> timestamp in the database.
>>>
>>> To detect split-brain scenarios, we do this:
>>> The ClusterListener on each node keeps track of all members of the
>>> cluster in memory.
>>> Periodically we check if there are more alive nodes in the database than
>>> we know are members of our cluster.
>>> If we see more alive nodes than we have in the cluster, we know we have
>>> a split brain scenario.
>>>
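
A minimal Python sketch of the check described above, assuming the database
table is stood in for by a plain dict mapping node address to last heartbeat
timestamp. The names alive_nodes, split_brain_suspected and max_age_s are
hypothetical and not taken from Morten's code.

```python
from typing import Dict, Set


def alive_nodes(heartbeats: Dict[str, float], now: float, max_age_s: float) -> Set[str]:
    """Nodes whose last heartbeat is recent enough to count as alive."""
    return {node for node, ts in heartbeats.items() if now - ts <= max_age_s}


def split_brain_suspected(
    heartbeats: Dict[str, float],
    known_members: Set[str],
    now: float,
    max_age_s: float,
) -> bool:
    """True if the database shows more alive nodes than this node's cluster view."""
    return len(alive_nodes(heartbeats, now, max_age_s)) > len(known_members)


# Hypothetical data: "c" stopped updating its timestamp a while ago.
heartbeats = {"a": 100.0, "b": 99.0, "c": 40.0}
print(split_brain_suspected(heartbeats, {"a"}, now=100.0, max_age_s=30.0))
print(split_brain_suspected(heartbeats, {"a", "b"}, now=100.0, max_age_s=30.0))
```

Because the comparison is a plain count, a node that has just joined and
written its first heartbeat before the others have learned about it would
also trip this check.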
>>
>> I think there might be a possible race condition here: what if one or
>> more new nodes join the cluster and update the DB before all other nodes
>> have learned about the new nodes? In this case other nodes might think they
>> are in a split brain situation and restart themselves, right? How do you
>> prevent this?
>>
>
> I think you have a point here.
> A node writes that it is alive when it itself knows that it has joined the
> cluster.
> But you are right that a different node might see this alive message
> before it is itself aware of the other node having joined the cluster.
>
> I'm adding a todo to improve it. Thanks :)
> …
>
>
>>
>>
>>>
>>> To recover from it, the node waits a random number of seconds, then
>>> triggers its own restart (we spawn a process that executes
>>> "./application.sh restart")
>>>
>>
>> While waiting a random number of seconds, is the node still part of the
>> cluster or has it left the cluster already?
>>
>
> As we know, we do not want these error situations to happen, but when they
> do happen, we would like to recover from them, and we do it by restarting
> our app.
> But to prevent a theoretical problem where all of our cluster nodes
> restart at the same time over and over again, I have introduced a random
> delay, to make sure not everything happens at the same time (*if* multiple
> nodes detect an error at the same time).
> I guess this could be improved by trying to leave the cluster right away,
> then waiting some time before restarting.
> …
>
>
>>
>>
>>> When the node is starting up (again) we use the same "alive" mechanism
>>> in the database to find seed-nodes - so we actually join the existing
>>> cluster. If no one is alive, we know we are the first one starting up, so
>>> we're going to be our own seed node.
>>>
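
The startup rule described above could be sketched in Python as follows,
assuming the set of alive nodes has already been read from the database.
The function name choose_seed_nodes and the choice to exclude the node's own
(possibly stale) entry are assumptions for illustration, not Morten's code.

```python
from typing import List, Set


def choose_seed_nodes(alive: Set[str], self_address: str) -> List[str]:
    """Join the nodes currently marked alive, or seed a new cluster alone."""
    # Ignore our own entry: it may be a leftover from before the restart.
    others = sorted(alive - {self_address})
    return others if others else [self_address]


print(choose_seed_nodes({"node-a", "node-b"}, "node-c"))
print(choose_seed_nodes(set(), "node-c"))
```

Note that reading the alive set and then deciding is not atomic in this
sketch: two nodes starting at the same moment could both see an empty set
and each become its own seed node.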
>>
>> "If no one is alive, we know we are the first one starting up" - have you
>> implemented this with some atomic operation, like "check and set", to
>> prevent starting 2 clusters? What DB do you use for this?
>>
>>
> It is not an atomic operation at this time, but should the situation you
> describe happen, then the error detection would detect and fix it.
>
> I guess this could also be improved.
>
> We use Oracle (Company decision)
> …
>
>
>>
>>> If it decided to join a cluster but failed to do so, it starts over
>>> again with a new restart.
>>>
>>> This solution has, at least for us, turned out to be a robust solution
>>> which supports
>>>
>>> * staged or instant startup of multiple nodes.
>>> * auto-restarting multiple nodes when deploying new version.
>>> * auto-healing when something odd happens in our data-center (like
>>> network-glitches or something causing the cpu to stall for too long)
>>>
>>
>> Do you use akka persistence/cluster sharding? I'm asking because we do
>> use both and have found them to be sensitive to split brain.
>>
>>
>
> Yes, we have multiple micro-service applications all using akka
> persistence with sharding.
> We also experienced that it was sensitive to these problems, so what I have
> described is our way of getting around these problems.
> …
>
> --
> >>>>>>>>>> Read the docs: http://akka.io/docs/
> >>>>>>>>>> Check the FAQ:
> http://doc.akka.io/docs/akka/current/additional/faq.html
> >>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
> ---
> You received this message because you are subscribed to the Google Groups
> "Akka User List" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/akka-user.
> For more options, visit https://groups.google.com/d/optout.
>



-- 

Patrik Nordwall
Typesafe <http://typesafe.com/> -  Reactive apps on the JVM
Twitter: @patriknw
