What do you think of giving some extra leeway for shard-server failover
cases?

For example: whenever a shard-server process gets killed, the
controller-node does not immediately update the layout, but instead
marks the shard-server as a suspect.

When we have a read-only backup of the shard, searches can continue
unhindered. Indexing during this time can be diverted to a queue, which
stores the operations and retries them when the shard-server comes
online again.
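
A rough sketch of what such a queue could look like, in Python. The
class and method names (PendingIndexQueue, divert, replay) are made up
for illustration and are not taken from any existing code; the sketch
also assumes operations must be replayed in arrival order.

```python
from collections import deque


class PendingIndexQueue:
    """Hypothetical holding queue for indexing ops aimed at a suspect shard-server."""

    def __init__(self):
        self._ops = deque()

    def divert(self, op):
        """Store an indexing op instead of sending it to the suspect shard-server."""
        self._ops.append(op)

    def replay(self, apply):
        """Retry queued ops in arrival order once the shard-server is back.

        `apply` is a caller-supplied callable returning True on success.
        Replay stops at the first failure so ordering is preserved, and
        the failed op stays at the head of the queue for the next retry.
        Returns the number of ops applied.
        """
        applied = 0
        while self._ops:
            if not apply(self._ops[0]):
                break
            self._ops.popleft()
            applied += 1
        return applied

    def __len__(self):
        return len(self._ops)
```

An in-memory deque is only for illustration; a real queue would probably
need to be durable so that queued operations survive a restart.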

If the shard-server does not come back up within a configured number of
attempts or amount of time, then one controller-server can
authoritatively mark it as down and update the layout.
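
The suspect handling could be sketched roughly as below, in Python.
Everything here is an assumption for illustration: the names
(ShardState, SuspectTracker, record_probe), the default limits, and the
idea that the controller learns about liveness through periodic probes.

```python
import time
from enum import Enum


class ShardState(Enum):
    UP = "up"
    SUSPECT = "suspect"
    DOWN = "down"


class SuspectTracker:
    """Hypothetical per-shard-server tracker kept by the controller."""

    def __init__(self, max_attempts=3, grace_seconds=30.0, clock=time.monotonic):
        self.max_attempts = max_attempts    # assumed config: failed probes allowed
        self.grace_seconds = grace_seconds  # assumed config: time allowed as suspect
        self._clock = clock
        self.state = ShardState.UP
        self._failed = 0
        self._suspect_since = None

    def record_probe(self, alive):
        """Feed one liveness probe result and return the resulting state."""
        if self.state is ShardState.DOWN:
            # Layout has already been updated; rejoining is a separate step.
            return self.state
        if alive:
            # Shard-server came back within the leeway: no layout change.
            self.state = ShardState.UP
            self._failed = 0
            self._suspect_since = None
            return self.state
        now = self._clock()
        if self.state is ShardState.UP:
            # First failure: mark as suspect instead of updating the layout.
            self.state = ShardState.SUSPECT
            self._suspect_since = now
        self._failed += 1
        out_of_attempts = self._failed >= self.max_attempts
        out_of_time = now - self._suspect_since >= self.grace_seconds
        if out_of_attempts or out_of_time:
            # Leeway exhausted: this is where the layout update would happen.
            self.state = ShardState.DOWN
        return self.state
```

With these defaults, two failed probes followed by a successful one
leave the shard-server UP and the layout untouched; a third consecutive
failure, or 30 seconds spent as a suspect, marks it DOWN.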

--
Ravi
