Per Steffensen wrote:
*Mark Miller:*
*Ad 3)* Well, we can do some practical things, right? I don't think we need to support a node coming back from the dead a year later with some updates the cluster doesn't have. A node coming up 2 minutes later is something we want to worry about, though.
A year, no, but 2 minutes is way too low a limit.

There can be many reasons for running with replicas - e.g.
a) High Availability wrt updating docs and wrt search
b) Handling of a higher search volume
c) Having a "live" backup so that you don't (as easily) lose data

I don't know the Solr design criteria behind the new "4.0 kind of" replication, but if there is a c)-criterion hidden in there somewhere, 2 minutes is not enough.

A valid scenario is that you run with 1 replica (2 shards per slice) and expect not to lose data as long as no more than one disk crashes (in overlapping periods). So let's say the disk on the machine running the leader of this slice crashes. If a "behind" replica is allowed to take over as leader shortly after, and therefore to accept new updates to the slice, the data that only existed on the old leader is lost forever. Depending on the trade-off between "accepting data loss" and "accepting downtime (where new updates cannot take place)" (basically between a) and c) above), an admin of such a system might expect a fair chance of getting the disk working again, or at least of digging the data out of it, putting it onto a new machine and configuring that machine to participate in the Solr cluster. In such a case 2 minutes is way too little.

Another valid scenario: same setup as the first scenario, but this time the Solr JVM running the leader just crashes, or the motherboard/CPU burns, or something. You are now in a position where you still have the newest data, it is just not "online". Again an admin, depending on preferences, might not want a "behind" replica to take over leadership and allow updates. As soon as you allow updates to the new leader (old replica), and there was data on the old leader that had not yet been replicated to the replica, you are dead - you haven't necessarily lost data, but you have put yourself in a position where you can never reconstruct the correct dataset (and that's basically the same as losing data).
So basically we either need something timing-based or admin-command-based that lets you start a cold shard (slice :-) ) and each node waits around for X amount of time or until command X is received, and then leader election begins.
I like this approach, where you, depending on preferences, can set up the system so that a replica is allowed to take over leadership after X minutes even though it knows that it is "behind" (a replica is always allowed to take over leadership immediately if it knows it is not "behind"), but where you can also set up your system so that it requires an admin "acceptance" for this to happen.
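To make the idea concrete, here is a rough sketch of that gating logic (the class, policy names and methods are made up for illustration, this is not actual Solr code): an up-to-date replica may proceed immediately, while a behind replica either waits out a configured period or blocks until an operator explicitly accepts the takeover.

    import java.util.concurrent.CountDownLatch;

    // Hypothetical sketch: gate leadership takeover for a replica that
    // knows it is behind. Policy and names are assumptions, not SolrCloud code.
    public class LeadershipGate {

        public enum Policy { TIME_BASED, ADMIN_ACCEPTANCE }

        private final Policy policy;
        private final long maxWaitMs;
        private final CountDownLatch adminAccepted = new CountDownLatch(1);

        public LeadershipGate(Policy policy, long maxWaitMs) {
            this.policy = policy;
            this.maxWaitMs = maxWaitMs;
        }

        // Called from some admin command/API when the operator accepts possible data loss.
        public void adminAccept() {
            adminAccepted.countDown();
        }

        // Returns when the replica is allowed to enter leader election.
        public void awaitPermissionToLead(boolean behind) throws InterruptedException {
            if (!behind) {
                return; // an up-to-date replica never has to wait
            }
            if (policy == Policy.TIME_BASED) {
                // Give the old leader maxWaitMs to come back before accepting the loss.
                // A real implementation would re-check during the wait instead of sleeping.
                Thread.sleep(maxWaitMs);
            } else {
                adminAccepted.await(); // block until an operator signs off
            }
        }
    }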

Some systems (potentially including the one I'm currently working on :-) ) might not prefer HA over "not losing data".

I think the following should be done:
- Increase the likelihood of a replica not being behind. With the current implementation, in case of a sudden crash, the likelihood of a replica being behind is way too big. Some kind of atomicity between leader and replica writing to the transaction log, or at least "committing" the changes to the transaction log, is needed.
- Establish common knowledge among the shards in the same slice about the "current newest version of the slice". E.g. the leader writes the "newest version" to ZK every time it writes to (or commits to) its transaction log. The writing/committing to the leader's transaction log and the writing to ZK also need to be as atomic as possible (see the sketch after this list).
- With the two steps above
-- a replica will be able to know if it is behind, and therefore whether it should wait (for a period of time or for an admin "acceptance") before taking over leadership
-- and the chance of ending up in a situation where a replica is actually behind has been minimized.
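A minimal sketch of the second step, assuming a per-slice znode and a long "version" counter (the path, the class and the exact notion of "version" are my own assumptions, not the actual SolrCloud layout): the leader publishes the newest version it has written to its transaction log, and a replica compares its own newest local version against that value to decide whether it is behind.

    import java.nio.charset.StandardCharsets;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    // Hypothetical sketch of tracking the "current newest version of the slice" in ZK.
    public class SliceVersionTracker {

        private final ZooKeeper zk;
        private final String path; // e.g. /collections/coll1/slice1/newestVersion (made up)

        public SliceVersionTracker(ZooKeeper zk, String slicePath) {
            this.zk = zk;
            this.path = slicePath;
        }

        // Leader side: called right after a tlog write/commit.
        public void publishNewestVersion(long version)
                throws KeeperException, InterruptedException {
            byte[] data = Long.toString(version).getBytes(StandardCharsets.UTF_8);
            if (zk.exists(path, false) == null) {
                // Simplistic: a real implementation would handle the create/setData race.
                zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            } else {
                zk.setData(path, data, -1); // -1 means "any znode version"
            }
        }

        // Replica side: am I behind the last version the leader acknowledged?
        public boolean isBehind(long myNewestLocalVersion)
                throws KeeperException, InterruptedException {
            if (zk.exists(path, false) == null) {
                return false; // nothing published yet, assume we are current
            }
            long sliceNewest = Long.parseLong(
                    new String(zk.getData(path, false, null), StandardCharsets.UTF_8));
            return myNewestLocalVersion < sliceNewest;
        }
    }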

*Jan Høydahl:*
ElasticSearch has some settings to control when recovery starts after cluster restart, see Guide. This approach looks reasonable. If we know that we expect N nodes in our cluster, we can start recovery when we see N nodes up. If fewer than N nodes are up, we wait for X time (running on local data, not accepting new updates) before recovery and leader election start.
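Just to spell out the gating Jan describes, a minimal sketch (in SolrCloud the live node count would presumably come from the ZK live_nodes; the class and parameter names here are made up): start recovery immediately if all expected nodes are seen, otherwise serve local data without accepting updates until the deadline expires.

    import java.util.concurrent.TimeUnit;
    import java.util.function.IntSupplier;

    // Hypothetical sketch: delay recovery/leader election until either all
    // expected nodes are up or a maximum wait has passed.
    public class RecoveryGate {

        public static void awaitRecoveryStart(IntSupplier liveNodeCount,
                                              int expectedNodes,
                                              long maxWait,
                                              TimeUnit unit)
                throws InterruptedException {
            long deadline = System.nanoTime() + unit.toNanos(maxWait);
            while (liveNodeCount.getAsInt() < expectedNodes
                    && System.nanoTime() < deadline) {
                // Not all expected nodes are up yet: keep serving local data,
                // reject new updates, and poll until the deadline expires.
                Thread.sleep(1000);
            }
            // Either everyone showed up or we hit the deadline: proceed with
            // recovery and leader election using the nodes we have.
        }
    }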

As you might know, we used to use ElasticSearch in my current project. For political reasons we decided to stop using ES and move to Solr. I was very much against that decision, not because of Solr (I didn't know much about it at that point in time), but because ES was actually very cool and functioning very well (for a "before v1.0 piece of software"). But that particular feature/approach that Jan mentions was not one of the "cool things" about ES (along with its automatic shard relocation - uhhhh, still having nightmares).
