Per Steffensen wrote:
*Mark Miller:*
*Ad 3)* Well, we can do some practical things, right? I don't think we need to support a node coming back from the dead a year later with some updates the cluster doesn't have. A node coming up 2 minutes later is something we want to worry about, though.
A year, no, but 2 minutes is way too low a limit.

There can be many reasons for running with replicas - e.g.
a) High Availability wrt updating docs and wrt search
b) Handling of a higher search volume
c) Having a "live" backup so that you don't (as easily) lose data

I don't know the Solr design criteria behind the new "4.0 kind of" replication, but if there is a c)-criterion hidden in there somewhere, 2 minutes is not enough.

A valid scenario is that you run with 1 replica (2 shards per slice) and expect not to lose data as long as no more than one disk crashes (in overlapping periods). So let's say the disk on the machine running the leader of this slice crashes. If a "behind" replica is allowed to take over as leader shortly after, and therefore to accept new updates to the slice, the data that only existed on the old leader is lost forever. Depending on the trade-off between "accepting data loss" and "accepting downtime (where new updates cannot take place)" (basically between a) and c) above), an admin of such a system might expect a fair chance of getting the disk working again, or at least of digging the data out of it, putting it onto a new machine and configuring that machine to participate in the Solr cluster. In such a case 2 minutes is way too little.

Another valid scenario: same setup as the first scenario, but this time the Solr JVM running the leader just crashes, or the motherboard/CPU burns, or something. You are now in a position where you still have the newest data, it is just not "online". Again an admin, depending on preferences, might not want a "behind" replica to take over leadership and allow updates. As soon as you allow updates to the new leader (old replica), and there was data on the old leader that had not yet been replicated to the replica, you are dead - you haven't necessarily lost data, but you have put yourself in a position where you can never reconstruct the correct dataset (and that's basically the same as losing data).
So basically we either need something timing-based or admin-command-based that lets you start a cold shard (slice :-) ) and each node waits around for X amount of time or until command X is received, and then leader election begins.
I like this approach, where you, depending on preferences, can set up the system so that a replica is allowed to take over leadership after X minutes even though it knows that it is "behind" (a replica is always allowed to take over leadership immediately if it knows it is not "behind"), but where you can also set up your system so that it requires an admin "acceptance" for this to happen.
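To make the idea concrete, here is a rough sketch of that gating logic (the class, policy names and methods are made up for illustration, this is not actual Solr code): an up-to-date replica may proceed immediately, while a behind replica either waits out a configured period or blocks until an operator explicitly accepts the takeover.

    import java.util.concurrent.CountDownLatch;

    // Hypothetical sketch: gate leadership takeover for a replica that
    // knows it is behind. Policy and names are assumptions, not SolrCloud code.
    public class LeadershipGate {

        public enum Policy { TIME_BASED, ADMIN_ACCEPTANCE }

        private final Policy policy;
        private final long maxWaitMs;
        private final CountDownLatch adminAccepted = new CountDownLatch(1);

        public LeadershipGate(Policy policy, long maxWaitMs) {
            this.policy = policy;
            this.maxWaitMs = maxWaitMs;
        }

        // Called from some admin command/API when the operator accepts possible data loss.
        public void adminAccept() {
            adminAccepted.countDown();
        }

        // Returns when the replica is allowed to enter leader election.
        public void awaitPermissionToLead(boolean behind) throws InterruptedException {
            if (!behind) {
                return; // an up-to-date replica never has to wait
            }
            if (policy == Policy.TIME_BASED) {
                // Give the old leader maxWaitMs to come back before accepting the loss.
                // A real implementation would re-check during the wait instead of sleeping.
                Thread.sleep(maxWaitMs);
            } else {
                adminAccepted.await(); // block until an operator signs off
            }
        }
    }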

Some systems (potentially including the one I'm currently working on :-) ) might not prefer HA over "not losing data".

I think the following should be done:
- Increase the likelihood of a replica not being behind. With the current implementation, in case of a sudden crash, the likelihood of a replica being behind is way too big. Some kind of atomicity between leader and replica writing to the transaction log, or at least "committing" the changes to the transaction log, is needed.
- Establish common knowledge among the shards in the same slice about the "current newest version of the slice". E.g. the leader writes the "newest version" to ZK every time it writes to (or commits to) its transaction log. The writing/committing to the leader's transaction log and the writing to ZK also need to be as atomic as possible (see the sketch after this list).
- With the two steps above
-- a replica will be able to know if it is behind, and therefore whether it should wait (for a period of time or for an admin "acceptance") before taking over leadership
-- and the chance of ending up in a situation where a replica is actually behind has been minimized.
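A minimal sketch of the second step, assuming a per-slice znode and a long "version" counter (the path, the class and the exact notion of "version" are my own assumptions, not the actual SolrCloud layout): the leader publishes the newest version it has written to its transaction log, and a replica compares its own newest local version against that value to decide whether it is behind.

    import java.nio.charset.StandardCharsets;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    // Hypothetical sketch of tracking the "current newest version of the slice" in ZK.
    public class SliceVersionTracker {

        private final ZooKeeper zk;
        private final String path; // e.g. /collections/coll1/slice1/newestVersion (made up)

        public SliceVersionTracker(ZooKeeper zk, String slicePath) {
            this.zk = zk;
            this.path = slicePath;
        }

        // Leader side: called right after a tlog write/commit.
        public void publishNewestVersion(long version)
                throws KeeperException, InterruptedException {
            byte[] data = Long.toString(version).getBytes(StandardCharsets.UTF_8);
            if (zk.exists(path, false) == null) {
                // Simplistic: a real implementation would handle the create/setData race.
                zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            } else {
                zk.setData(path, data, -1); // -1 means "any znode version"
            }
        }

        // Replica side: am I behind the last version the leader acknowledged?
        public boolean isBehind(long myNewestLocalVersion)
                throws KeeperException, InterruptedException {
            if (zk.exists(path, false) == null) {
                return false; // nothing published yet, assume we are current
            }
            long sliceNewest = Long.parseLong(
                    new String(zk.getData(path, false, null), StandardCharsets.UTF_8));
            return myNewestLocalVersion < sliceNewest;
        }
    }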

*Jan Høydahl:*
ElasticSearch has some settings to control when recovery starts after cluster restart, see Guide. This approach looks reasonable. If we know that we expect N nodes in our cluster, we can start recovery when we see N nodes up. If fewer than N nodes are up, we wait for X time (running on local data, not accepting new updates) before recovery and leader election start.
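Just to spell out the gating Jan describes, a minimal sketch (in SolrCloud the live node count would presumably come from the ZK live_nodes; the class and parameter names here are made up): start recovery immediately if all expected nodes are seen, otherwise serve local data without accepting updates until the deadline expires.

    import java.util.concurrent.TimeUnit;
    import java.util.function.IntSupplier;

    // Hypothetical sketch: delay recovery/leader election until either all
    // expected nodes are up or a maximum wait has passed.
    public class RecoveryGate {

        public static void awaitRecoveryStart(IntSupplier liveNodeCount,
                                              int expectedNodes,
                                              long maxWait,
                                              TimeUnit unit)
                throws InterruptedException {
            long deadline = System.nanoTime() + unit.toNanos(maxWait);
            while (liveNodeCount.getAsInt() < expectedNodes
                    && System.nanoTime() < deadline) {
                // Not all expected nodes are up yet: keep serving local data,
                // reject new updates, and poll until the deadline expires.
                Thread.sleep(1000);
            }
            // Either everyone showed up or we hit the deadline: proceed with
            // recovery and leader election using the nodes we have.
        }
    }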

As you might know, we used to use ElasticSearch in my current project. For political reasons we decided to stop using ES and move to Solr. I was very much against that decision, not because of Solr (I didn't know much about it at that point in time), but because ES was actually very cool and functioning very well (for a "before v1.0 piece of software"). But that particular feature/approach that Jan mentions was not one of the "cool things" about ES (along with its automatic shard relocation - uhhhh, still having nightmares).
