Update:
Whereas my previous attempts to speed up recovery failed miserably, the 
"gateway.recover_after_nodes" setting in elasticsearch.yml worked... to a 
point.

I noticed the following:
- No ES node was responsive at all after the nodes were brought online until 
the quorum was met.
- It can take a long time for the ES cluster to reach that quorum: on my tiny 
5-node cluster, it took approximately 10 minutes after the nodes were brought 
online before any of them started responding to es-head. I poked all the nodes 
up to that moment, so it does seem that the cluster starts up all at once.
- But, at least in this early case, shard re-allocation and thrashing are not 
avoided. Before shutting down I didn't carefully record the shard mapping 
across nodes, but I did notice that once indexing settled down, most indexes 
had, as expected, 10 shards evenly distributed across the nodes (5 primaries 
plus one replica of each, so 2 per node). On restart, I observed high 
concentrations of shards on certain nodes and fewer on others, not an even 
distribution.
- For approximately 9 GB of indexed metadata (800 MB of raw data), it has taken 
a little over 40 minutes for the cluster to recover to "green" state.
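The "2 per node" expectation is simple arithmetic: 5 primaries, each with one replica, gives 10 shard copies spread over 5 nodes. The Python sketch below (not an Elasticsearch tool; the row tuples, index name and node names are made-up assumptions for illustration) counts copies per node so that an even layout can be told apart from a skewed one after a restart. In practice the rows could be gathered from the `_cat/shards` API, but that step is left out here.

```python
from collections import Counter

def expected_per_node(primaries, replicas, node_count):
    """Shard copies per node if allocation were perfectly even."""
    # Each primary has `replicas` extra copies, so total copies = p * (1 + r).
    return primaries * (1 + replicas) / node_count

def shards_per_node(rows):
    """Count shard copies (primaries and replicas) held by each node.

    Each row is a hypothetical (index, shard_number, prirep, node) tuple.
    """
    return Counter(node for _index, _shard, _prirep, node in rows)

def is_even(counts, nodes):
    """True if no node holds more than one copy more than any other node."""
    per_node = [counts.get(n, 0) for n in nodes]
    return max(per_node) - min(per_node) <= 1

if __name__ == "__main__":
    nodes = [f"node{i}" for i in range(1, 6)]
    # Made-up even layout: primary of shard s on node s, its replica on the next node.
    even_rows = [("idx", s, "p", nodes[s]) for s in range(5)] + \
                [("idx", s, "r", nodes[(s + 1) % 5]) for s in range(5)]
    # Made-up skewed layout: most copies piled onto two nodes.
    skewed_rows = [("idx", s, "p", nodes[0]) for s in range(5)] + \
                  [("idx", s, "r", nodes[1]) for s in range(4)] + \
                  [("idx", 4, "r", nodes[2])]
    print(expected_per_node(5, 1, 5))                    # 2.0
    print(is_even(shards_per_node(even_rows), nodes))    # True
    print(is_even(shards_per_node(skewed_rows), nodes))  # False
```

Tolerating a difference of one copy per node allows for shard counts that don't divide evenly by the number of nodes.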

So, mixed and somewhat disappointing results. Since shard re-allocation still 
seems to happen (although perhaps less) when the gateway.recover_after_nodes 
setting is enabled and configured, I'm still hoping for something that 
decreases recovery time further.

Perhaps recovery isn't being done as efficiently as it could be.
1. My impression is that shard content is being evaluated in its full form. If 
it is, I imagine shard content and its integrity could be evaluated far faster 
and more reliably by comparing hashes.
2. If hashes are used, I would suggest saving them as part of the "flush" 
command or of a separate "flush, snapshot and shut down ES" command. When a 
cluster restarts, the saved hash table could perhaps be used to quickly 
"snapshot" the existing node and local-data-on-disk layout before recovery 
starts moving shards around.
3. Speaking of which, it could be useful at some point to document what ES is 
doing on startup and/or recovery so that we can tinker more intelligently.
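Points 1 and 2 can be sketched in Python as an illustration of the suggestion only, not of how Elasticsearch actually recovers shards. Assumptions: a shard is treated as a plain directory of files, SHA-256 is an arbitrary choice of hash, and the JSON manifest format and the function names (build_manifest, save_manifest, files_to_copy) are hypothetical. A restarted node would compare its local manifest with the manifest of the authoritative copy and transfer only files that are missing or differ.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of one file, read in chunks so large files need little memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(shard_dir):
    """Map each file under a (hypothetical) shard directory to its digest."""
    root = Path(shard_dir)
    return {str(p.relative_to(root)): file_digest(p)
            for p in sorted(root.rglob("*")) if p.is_file()}

def save_manifest(shard_dir, manifest_path):
    """Hypothetical 'save hashes at flush time' step: write the manifest as JSON.

    manifest_path should live outside shard_dir, or it would be hashed too.
    """
    Path(manifest_path).write_text(json.dumps(build_manifest(shard_dir), indent=2))

def files_to_copy(reference, local):
    """Files missing locally or whose digest differs; all others can be reused."""
    return sorted(name for name, digest in reference.items()
                  if local.get(name) != digest)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        shard = Path(tmp) / "shard0"  # made-up shard directory
        shard.mkdir()
        (shard / "seg_1").write_bytes(b"segment one")
        (shard / "seg_2").write_bytes(b"segment two")
        reference = build_manifest(shard)           # e.g. taken at flush time
        (shard / "seg_2").write_bytes(b"changed")   # simulate a diverged file
        print(files_to_copy(reference, build_manifest(shard)))  # ['seg_2']
```

Computing a hash still reads every byte once, so the saving in this sketch is mainly in what has to be copied between nodes rather than in local disk reads.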

Thx,
Tony

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/3302e002-cd0d-433a-9f7f-9f6d92c095a6%40googlegroups.com.