Update: Whereas my previous attempts to optimize recovery failed miserably, the "gateway.recover_after_nodes" setting in elasticsearch.yml worked... to a point.
I noticed:

- No ES node was responsive at all after the nodes were brought online until the quorum was met.
- It can take a long time for the ES cluster to agree on a quorum. On my tiny 5 node cluster, it took approx 10 minutes after the nodes were brought online until one started responding to es-head. I kept poking all the nodes up to that moment, so it does seem like the cluster starts up all at once.
- But, at least in this early case, shard re-allocation and thrashing are not avoided. Before shutting down I didn't carefully record the shard mapping across nodes, but I did notice that once indexing settled down, most indexes had, as expected, 10 shards evenly distributed across the nodes (2/node, because for every primary shard there is a replica). On restart, I observed high concentrations of shards on certain nodes and fewer on others, not an even distribution.
- For approx 9 GB of indexed metadata (800 MB raw data), it has taken a little over 40 minutes for the cluster to recover to "green" state.

So, mixed and somewhat disappointing results.

Since shard re-allocation still seems to happen (although perhaps less) when the "gateway.recover_after_nodes" setting is enabled and configured, I'm still hoping for something to decrease recovery time further. Perhaps recovery isn't being done as efficiently as it might be.

1. My impression is that shard content is being evaluated in its full form. If it is, I imagine shard content and its integrity could be evaluated far faster and better by hash.
2. If hashes are used, I would suggest that they be saved as part of the "flush" command or a separate "flush, snapshot and shutdown ES" command. When a cluster restarts, perhaps the hash table could be used to quickly "snapshot" the existing node and "local data on disk" layout before commencing recovery and moving shards around.
3. Speaking of which, maybe at some point it would be useful to document what ES is doing on startup and/or recovery so that we can tinker more intelligently.
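To make suggestions 1 and 2 a bit more concrete, here is a rough Python sketch of what I mean by comparing shard copies by hash. It is only an illustration of the idea: I don't know how ES actually compares shard copies during recovery, and the function names (file_digest, shard_manifest, files_to_copy) as well as the notion of a shard copy being a plain directory of files are my own assumptions, not anything from ES.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def shard_manifest(shard_dir):
    """Map each file's path (relative to shard_dir) to its digest.

    Hypothetical: this is the "hash table" that could be saved at
    flush/shutdown time and reloaded on restart.
    """
    manifest = {}
    for root, _dirs, files in os.walk(shard_dir):
        for name in files:
            full = os.path.join(root, name)
            manifest[os.path.relpath(full, shard_dir)] = file_digest(full)
    return manifest


def files_to_copy(source_manifest, target_manifest):
    """List files the target copy lacks or holds with different content."""
    return sorted(
        path
        for path, digest in source_manifest.items()
        if target_manifest.get(path) != digest
    )
```

With a saved manifest per shard copy, a restarting node would only need to exchange the small manifests and then copy the files that files_to_copy() reports, instead of examining the full shard content.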
Thx,
Tony
