I just came across the zk.session.timeout variable while reading more about this problem.
With it, the dead-node notification is only triggered once the configured timeout has elapsed. Setting it to 3-4 minutes should be fine for OOMs and rolling restarts.

The only extra thing I am looking for is a way to divert search calls to a read-only shard instance during those 3-4 minutes, to avoid mini-outages.

--
Ravi

On Thu, Mar 6, 2014 at 3:34 PM, Ravikumar Govindarajan <[email protected]> wrote:

> What do you think of giving an extra leeway for shard-server failover
> cases?
>
> Ex: Whenever a shard-server process gets killed, the controller-node does
> not immediately update-layout, but rather mark it as a suspect.
>
> When we have a read-only back-up of shard, searches can continue
> unhindered. Indexing during this time can be diverted to a queue, which
> will store and retry-ops, when shard-server comes online again.
>
> Over configured number of attempts/time, if the shard-server does not come
> up, then one controller-server can authoritatively mark it as down and
> update the layout.
>
> --
> Ravi
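
To make the suspect-then-down idea concrete, here is a rough sketch of the state handling I have in mind. Everything in it is hypothetical: the class name `ShardFailover`, the `grace_seconds` parameter and the routing labels are made up for illustration and do not correspond to any existing API. Time is passed in explicitly so the logic is easy to follow.

```python
from collections import deque
from enum import Enum


class State(Enum):
    UP = "up"
    SUSPECT = "suspect"
    DOWN = "down"


class ShardFailover:
    """Hypothetical controller-side tracker for a single shard server."""

    def __init__(self, grace_seconds):
        self.grace_seconds = grace_seconds  # leeway before marking the server down
        self.state = State.UP
        self.suspect_since = None
        self.pending = deque()  # indexing ops held while the server is suspect

    def on_disconnect(self, now):
        # Do not update the layout yet; only start the grace period.
        if self.state is State.UP:
            self.state = State.SUSPECT
            self.suspect_since = now

    def drain_pending(self):
        # Hand back the queued indexing ops, in arrival order, and clear the queue.
        ops = list(self.pending)
        self.pending.clear()
        return ops

    def on_reconnect(self):
        # Server came back: mark it up and return the queued ops for replay.
        self.state = State.UP
        self.suspect_since = None
        return self.drain_pending()

    def tick(self, now):
        # Called periodically. Returns True exactly once, at the moment the
        # grace period runs out and the layout should be updated.
        if (self.state is State.SUSPECT
                and now - self.suspect_since >= self.grace_seconds):
            self.state = State.DOWN
            return True
        return False

    def route_search(self):
        # Searches go to the read-only copy while the server is a suspect.
        return {
            State.UP: "primary",
            State.SUSPECT: "read-only-replica",
            State.DOWN: "new-layout",
        }[self.state]

    def submit_index(self, op):
        # Indexing is queued during the grace period instead of failing.
        if self.state is State.SUSPECT:
            self.pending.append(op)
            return "queued"
        return "applied" if self.state is State.UP else "rerouted"
```

With `grace_seconds=180`, a server that disconnects and comes back after a minute never causes a layout change, and the ops queued in between are returned by `on_reconnect()` for replay. If it stays away past the grace period, `tick()` returns True once and the caller can update the layout and pass `drain_pending()` to whichever server takes over.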
