Re: [akka-user] How to recover from network partition quarantine

leonidb Wed, 05 Aug 2015 13:29:51 -0700

Hi Morten,

We'd like to implement a solution similar to yours. Could you elaborate on 
some of the details?


On Wednesday, August 5, 2015 at 2:58:19 PM UTC+3, Morten Kjetland wrote:
>
> Hi,
>
> We have solved these issues like this:
>
> We have a ClusterListener on each node that "pings" the database - As long 
> as it is "online" and a happy member of the cluster it updates a timestamp 
> in the database.
>
> To detect split-brain scenarios, we do this:
> The ClusterListener on each node keeps track of all members of the cluster 
> in memory.
> Periodically we check whether there are more alive nodes in the database than 
> we know to be members of our cluster.
> If we see more alive nodes than we have in the cluster, we know we have a 
> split-brain scenario.
>

I think there might be a race condition here. What if one or more new 
nodes join the cluster and update the DB before all the other nodes have 
learned about them? In that case the other nodes might think they are in a 
split-brain situation and restart themselves, right? How do you prevent 
this?
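
To make the race concrete, here is a minimal sketch of the check as I 
understand it from your description. The names (`ALIVE_WINDOW`, 
`alive_nodes`, `split_brain_suspected`) and the 30-second window are my own 
assumptions, not your code:

```python
import time

# Assumption: a node counts as alive if its heartbeat is newer than this window.
ALIVE_WINDOW = 30.0  # seconds, hypothetical value


def alive_nodes(heartbeats, now):
    """Return the node ids whose last heartbeat falls within ALIVE_WINDOW."""
    return {node for node, ts in heartbeats.items() if now - ts <= ALIVE_WINDOW}


def split_brain_suspected(heartbeats, known_members, now):
    """True if the database shows more alive nodes than this node knows as members."""
    return len(alive_nodes(heartbeats, now)) > len(known_members)


now = time.time()
# Nodes a and b have formed a cluster; c has just started and written its
# first heartbeat, but a has not yet learned that c joined.
heartbeats = {"a": now - 1, "b": now - 2, "c": now}
view_of_a = {"a", "b"}

print(split_brain_suspected(heartbeats, view_of_a, now))  # prints True
```

If the check really is this simple, node a would restart itself here even 
though the cluster is healthy. Do you, for example, require the check to 
fail several times in a row before acting on it?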
 

>
> To recover from it, the node waits a random number of seconds, then 
> triggers its own restart (we spawn a process that executes 
> "./application.sh restart").
>

While waiting the random number of seconds, is the node still part of the 
cluster, or has it already left?


> When the node is starting up (again) we use the same "alive" mechanism in 
> the database to find seed-nodes - so we actually join the existing cluster. 
> If no one is alive, we know we are the first one starting up, so we're 
> going to be our own seed node.
>

"If no one is alive, we know we are the first one starting up" - have you 
implemented this with some atomic operation, like "check and set", to 
prevent starting 2 clusters? What DB do you use for this?
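
By "check and set" I mean something along these lines. This is only a 
sketch: SQLite stands in for whatever database you actually use, and the 
`seed_claim` table and `try_claim_first_seed` function are hypothetical 
names of mine. It also ignores how a claim would be released or expired 
when its holder dies:

```python
import sqlite3


def try_claim_first_seed(conn, node_id):
    """Atomically claim the single 'first seed' slot; True only for the winner."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO seed_claim (slot, node_id) VALUES (1, ?)", (node_id,)
            )
        return True
    except sqlite3.IntegrityError:
        # Another node already holds the slot: join it instead of self-seeding.
        return False


conn = sqlite3.connect(":memory:")
# The primary key on `slot` is what makes the claim atomic: only one row
# with slot = 1 can ever exist.
conn.execute(
    "CREATE TABLE seed_claim (slot INTEGER PRIMARY KEY, node_id TEXT NOT NULL)"
)

print(try_claim_first_seed(conn, "node-a"))  # prints True
print(try_claim_first_seed(conn, "node-b"))  # prints False
```

The point is that the "is anyone alive?" check and the "I am the seed" 
write happen in one step, so two nodes starting at the same moment cannot 
both conclude they are first.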


> If it decided to join a cluster but failed to do so, it starts over again 
> with a new restart.
>
> This has, at least for us, turned out to be a robust solution 
> that supports:
>
> * staged or instant startup of multiple nodes.
> * auto-restarting multiple nodes when deploying a new version.
> * auto-healing when something odd happens in our data center (like 
> network glitches or something causing the CPU to stall for too long)
>

Do you use Akka Persistence or Cluster Sharding? I'm asking because we use 
both and have found them to be sensitive to split brain.
 

>
> We're planning to open-source this code soon.
>
> I hope this info was helpful.
>
> Regards,
> Morten
>

Thanks,
Leonid 

