…
On Wed, Aug 5, 2015 at 9:45 PM <[email protected]> wrote:

> Hi Morten,
>
> We'd like to implement a solution similar to yours, can you elaborate on
> some details of your solution?
>

First let me say that our solution is not optimal, but it works.
It was improved over time based on what we experienced in production,
and has ended up being robust enough - at least for now.
…


> On Wednesday, August 5, 2015 at 2:58:19 PM UTC+3, Morten Kjetland wrote:
>>
>> Hi,
>>
>> We have solved these issues like this:
>>
>> We have a ClusterListener on each node that "pings" the database - As
>> long as it is "online" and a happy member of the cluster it updates a
>> timestamp in the database.
>>
>> To detect split-brain scenarios, we do this:
>> The ClusterListener on each node keeps track of all members of the
>> cluster in memory.
>> Periodically we check if there are more alive nodes in the database than
>> we know are members of our cluster.
>> If we see more alive nodes than we have in the cluster, we know we have a
>> split brain scenario.
>>
>
> I think there might be a possible race condition here, what if one or more
> new node join the cluster and update the DB before all other nodes learned
> about the new nodes? In this case other nodes might think they are in a
> split brain situation and restart themselves, right? How do you prevent
> this?
>

I think you have a point here.
A node writes that it is alive once it itself knows that it has joined the
cluster.
But you are right that a different node might see this alive entry before
it is aware of the other node having joined the cluster.

I'm adding a todo to improve it. Thanks :)
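For illustration, the detection check could be sketched roughly like this (plain Java, all names hypothetical - this is not Morten's actual code). One way to soften the race discussed above is a grace period: ignore DB entries that appeared very recently, so a freshly joined node is not mistaken for a split-brain symptom before the join has propagated:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch of the split-brain check: compare the set of nodes
// that wrote an "alive" timestamp to the DB against the members this node
// currently sees in its cluster. Nodes whose first DB entry is very recent
// are skipped (grace period) so a freshly joined node does not trigger a
// false positive before all nodes have learned about the join.
public class SplitBrainCheck {

    private final Duration gracePeriod;

    public SplitBrainCheck(Duration gracePeriod) {
        this.gracePeriod = gracePeriod;
    }

    /**
     * @param dbAlive      node -> timestamp of its first "alive" row in the DB
     * @param knownMembers cluster members as seen by this node
     * @param now          current time
     * @return true if some node past the grace period is alive in the DB
     *         but unknown to this node's view of the cluster
     */
    public boolean splitBrainSuspected(Map<String, Instant> dbAlive,
                                       Set<String> knownMembers,
                                       Instant now) {
        Set<String> settledAlive = dbAlive.entrySet().stream()
                .filter(e -> Duration.between(e.getValue(), now)
                                     .compareTo(gracePeriod) >= 0)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
        settledAlive.removeAll(knownMembers);
        return !settledAlive.isEmpty();
    }
}
```

The grace period only needs to be longer than the typical time for a join to propagate through gossip, so a few tens of seconds would likely do.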
…


>
>
>>
>> To recover from it, the node waits a random amount of seconds, then
>> trigger itself to restart (we spawn a process that executes
>> "./application.sh restart")
>>
>
>  While waiting a random amount of seconds, is the node still part of the
> cluster or has it left the cluster already?
>

As we know, we do not want these error situations to happen, but when one
does happen, we would like to recover from it, and we do that by restarting
our app.
But to prevent a theoretical problem where all of our cluster nodes
restart at the same time over and over again, I have introduced a random
delay, to make sure not everything happens at the same time (*if* multiple
nodes detect an error at the same time).
I guess this could be improved by trying to leave the cluster right away,
then waiting some time before the restart.
…


>
>
>> When the node is starting up (again) we use the same "alive" mechanism in
>> the database to find seed-nodes - so we actually join the existing cluster.
>> If no one is alive, we know we are the first one starting up, so we're
>> going to be our own seed node.
>>
>
> "If no one is alive, we know we are the first one starting up" - have you
> implemented this with some atomic operation, like "check and set", to
> prevent starting 2 clusters? What DB do you use for this?
>
>
It is not an atomic operation at this time, but should the situation you
describe happen, the error-detection would detect and fix it.

I guess this could also be improved.

We use Oracle (company decision).
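One way such an improvement could look (not what the described setup does today): make the "am I the first node?" decision a single check-and-set so at most one node wins. Against Oracle this could be a conditional insert such as `INSERT INTO cluster_seed (cluster_name, node) SELECT ?, ? FROM dual WHERE NOT EXISTS (SELECT 1 FROM cluster_seed WHERE cluster_name = ?)` - table and column names hypothetical. The sketch below simulates that with an in-memory compare-and-set standing in for the database:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: an atomic "claim the seed role" operation.
// putIfAbsent stands in for a conditional INSERT in the database;
// exactly one caller gets back null (i.e. wins), every other caller
// sees the winner's entry and should join that node instead.
public class SeedClaim {

    private final ConcurrentMap<String, String> seedTable = new ConcurrentHashMap<>();

    /** Returns true only for the single node that wins the claim. */
    public boolean tryClaimSeed(String clusterName, String node) {
        return seedTable.putIfAbsent(clusterName, node) == null;
    }
}
```

With this in place, two nodes starting at the same instant could no longer both conclude "no one is alive, I'll be my own seed" and form two separate clusters.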
…


>
>> If it decided to join a cluster but failed to do so, it starts over again
>> with a new restart.
>>
>> This solution has, at least for us, turned out to be a robust solution
>> which supports
>>
>> * staged or instant startup of multiple nodes.
>> * auto-restarting multiple nodes when deploying new version.
>> * auto-healing when something odd happens in our data-center (like
>> network-glitches or something causing the cpu to stall for too long)
>>
>
> Do you use akka persistence/cluster sharding? I'm asking because we do use
> both and have found them to be sensitive to split brain.
>
>

Yes, we have multiple micro-service applications all using akka persistence
with sharding.
We also experienced that it was sensitive to these problems, so what I have
described is our way of working around them.
…

-- 
>>>>>>>>>>      Read the docs: http://akka.io/docs/
>>>>>>>>>>      Check the FAQ: 
>>>>>>>>>> http://doc.akka.io/docs/akka/current/additional/faq.html
>>>>>>>>>>      Search the archives: https://groups.google.com/group/akka-user
--- 
You received this message because you are subscribed to the Google Groups "Akka 
User List" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/akka-user.
For more options, visit https://groups.google.com/d/optout.
