Hi
I accidentally started a discussion around SUBJECT on issue SOLR-3721. So as not to mix things up too much, I suggest we continue the discussion here. It is an important issue (at least for my organization), and I believe the current solution in Solr 4.x is not solid enough (I have seen it break in practice on high-load/high-concurrency setups). I will start by quoting the essential parts around SUBJECT from SOLR-3721. I hope you Solr devs (and other interested folks) will join the discussion.
Regards Per Steffensen
------------------------ quotes from SOLR-3721 ----------------------------
*Per Steffensen:*
What if two Solrs, respectively running the leader and the replica for the same slice (only one replica), lose their ZK connection at about the same time? Then there will be no active shard that either of them can recover from. This scenario shouldn't end in a situation where the slice is just dead. The two shards in the same slice ought to find out which one has the newest version of the shard data (probably the one that was leader last), make that shard the leader (without recovering) and let the other shard recover from it. Is this scenario handled (in the way I suggest or in another way) already in Solr 4.0 (beta - tip of branch), or is that a future thing (e.g. 4.1 or 5.0)?
*Mark Miller:*
*1)* No recovery will be started if a node cannot talk to ZooKeeper. So nothing would happen until one or both of the nodes reconnected to ZooKeeper. That would trigger a leader election; that leader node would attempt to sync up with all the other nodes for that shard, and any recoveries would proceed against him.
A little more detail on the "leader attempts to sync up":
*2)* When a new node is elected as leader by ZooKeeper, it first tries to do a peer sync against every other live node. So let's say the first node in your two-node situation comes back and he is behind the other node, but he comes back first and is elected leader. The second node has the latest updates, but is second in line to be leader and a few updates ahead. The potential leader will try to peer sync with the other node and get those missing updates if there are fewer than 100 of them, or fail because the other node is ahead by too much.
If the peer sync fails, the potential leader will give up his leader role, realizing that there seems to be a better candidate. The other node, being next in line to be leader, will now try to peer sync with the other nodes in the shard. In this case, that will succeed, since he is ahead of the first node. He will then ask the other nodes to peer sync to him. If they are fewer than 100 docs behind, it will succeed. If any sync-back attempts fail, the leader asks those nodes to recover, and they will replicate. Only after this sync process is completed does the leader advertise in the cloud state that he is now the leader.
That is the current process - I'm sure we will continually be hardening and improving it.
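To make the flow above concrete, here is a rough sketch of the decision a leader candidate makes, as I read it. The Replica interface and all the names are made up for illustration - this is not the actual Solr election/sync code - and the 100-update window mirrors the peer-sync limit mentioned above:

    // Illustrative sketch only - not the actual Solr code.
    import java.util.List;

    interface Replica {
        long latestVersion();                  // newest update version in its tlog
        boolean peerSyncFrom(Replica source);  // pull missing updates; false if too far behind
        void recoverFrom(Replica leader);      // full index replication
    }

    class LeaderCandidate {
        static final int PEER_SYNC_WINDOW = 100;

        /** Returns true if 'self' may publish itself as leader in the cloud state. */
        static boolean tryToBecomeLeader(Replica self, List<Replica> liveReplicas) {
            // Step 1: peer sync against every live replica; give up if any is too far ahead.
            for (Replica other : liveReplicas) {
                long gap = other.latestVersion() - self.latestVersion();
                if (gap >= PEER_SYNC_WINDOW || (gap > 0 && !self.peerSyncFrom(other))) {
                    return false;              // a better candidate exists; yield the leader role
                }
            }
            // Step 2: ask the others to sync back; anyone who cannot must recover (replicate).
            for (Replica other : liveReplicas) {
                if (!other.peerSyncFrom(self)) {
                    other.recoverFrom(self);
                }
            }
            return true;                       // only now advertise leadership in the cloud state
        }
    }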
*Per Steffensen:*
*Ad 1)* Well, I knew that. I meant that the two Solrs were disconnected from ZK at the same time, but of course both got their connection reestablished - after a session timeout (I believe, and kinda hope, that a session timeout has to have happened before Solr needs to go into recovery after a ZK connection loss).
*Ad 2)* When the "behind" node has reconnected and become leader, and the one with the latest updates does not come back live right away, isn't the new leader (which is behind) allowed to start handling update requests? If yes, then it will be possible that both shards have documents/updates that the other one doesn't, and it is possible to come up with scenarios where there is no good algorithm for generating the "correct" merged union of the data in both shards. So what to do when the other shard (which used to have a later version than the current leader) comes back live?
*3)* I believe there is nothing solid to do!
How to avoid that? I was thinking about keeping the latest version for every slice in ZK, so that a "behind" shard will know whether it has the latest version of a slice, and therefore whether it is allowed to take on the leader role. Of course, the writing of this "latest version" to ZK and the writing of the corresponding update to the leader's transaction log would have to be atomic (like the A in ACID), as much as possible. And it would be nice if writing the update to the replica's transaction log were also atomic with the leader write and the ZK write, in order to increase the chance that a replica is actually allowed to take over the leader role if the leader dies (or both die, the replica comes back first, and the "old" leader comes back minutes later). But all that is just an idea off the top of my head.
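Just to illustrate the idea (nothing more than a sketch - the znode path, data format and class are made up, only the ZooKeeper client calls are real), the leader could bump a per-slice "latest version" znode with a conditional write before acknowledging each update, and a candidate would only take the leader role if its own transaction log has caught up to that version:

    // Sketch of the "latest version per slice in ZK" idea - not existing Solr code.
    import java.nio.charset.StandardCharsets;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    class SliceVersionInZk {
        private final ZooKeeper zk;
        private final String path;   // e.g. /collections/mycoll/slice1/latest_version (made up)

        SliceVersionInZk(ZooKeeper zk, String path) {
            this.zk = zk;
            this.path = path;
        }

        /** Leader records the version of the update it is about to write to its tlog. */
        void recordLatestVersion(long updateVersion) throws KeeperException, InterruptedException {
            Stat stat = new Stat();
            zk.getData(path, false, stat);
            byte[] data = Long.toString(updateVersion).getBytes(StandardCharsets.UTF_8);
            // Conditional write (znode version as optimistic lock); if another leader
            // sneaked in, this throws BadVersionException and the update must be rejected.
            zk.setData(path, data, stat.getVersion());
        }

        /** A candidate may only take the leader role if its tlog has caught up to this. */
        boolean mayBecomeLeader(long myLatestTlogVersion) throws KeeperException, InterruptedException {
            byte[] data = zk.getData(path, false, null);
            long latestAcked = Long.parseLong(new String(data, StandardCharsets.UTF_8));
            return myLatestTlogVersion >= latestAcked;
        }
    }

Of course this does not make the ZK write and the tlog write truly atomic - that is exactly the "as much as possible" caveat above.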
Do you already have a solution implemented, or a solution on the drawing board, or how do you/we otherwise prevent such a problem? As far as I understand "the drill" during leader election/recovery (whether it's peer sync or file-copy replication), from the little code reading I have done and from what you explain, there is no current solution. But I might be wrong?
*Mark Miller:*
*Ad 3)* Well, we can do some practical things, right? I don't think we need to support a node coming back from the dead a year later with some updates the cluster doesn't have. A node coming up 2 minutes later is something we want to worry about, though.
So basically we either need something timing-based or admin-command-based that lets you start a cold shard (slice :-) ): each node waits around for X amount of time or until command X is received, and then leader election begins.
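Such a gate could look roughly like this (just a sketch; the class, the admin hook and the polling are made up for illustration):

    // Sketch of a "wait before cold-start leader election" gate - illustrative only.
    import java.util.Set;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.function.Supplier;

    class ColdStartGate {
        private final AtomicBoolean forceStart = new AtomicBoolean(false); // set by an admin command

        /** Blocks until all expected replicas are live, the timeout passes, or an admin forces it. */
        void awaitLeaderElection(Set<String> expectedReplicas,
                                 Supplier<Set<String>> liveReplicasInZk,
                                 long timeout, TimeUnit unit) throws InterruptedException {
            long deadline = System.nanoTime() + unit.toNanos(timeout);
            while (System.nanoTime() < deadline && !forceStart.get()) {
                if (liveReplicasInZk.get().containsAll(expectedReplicas)) {
                    return;   // everyone is back - elect the replica with the newest data
                }
                TimeUnit.MILLISECONDS.sleep(250);
            }
            // Timed out or forced: elect among whoever is live, accepting the risk that
            // a node with newer data shows up later.
        }

        /** Hook for an admin "start with what we have" command. */
        void forceStart() {
            forceStart.set(true);
        }
    }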
*Jan Høydahl:*
ElasticSearch has some settings to control when recovery starts after a cluster restart; see the Guide. This approach looks reasonable. If we know that we expect N nodes in our cluster, we can start recovery when we see N nodes up. If fewer than N nodes are up, we wait for X time (running on local data, not accepting new updates) before recovery and leader election start.
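For reference, if I remember correctly the ElasticSearch gateway settings Jan refers to look roughly like this (please double-check the guide before relying on the exact names):

    gateway.recover_after_nodes: 3    # start recovery once at least 3 nodes are up
    gateway.expected_nodes: 5         # ...or as soon as all 5 expected nodes are up
    gateway.recover_after_time: 5m    # otherwise wait up to 5 minutes before starting

Something equivalent in Solr would map naturally onto the timing-based approach Mark describes above.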