Re: Trying to optimize configuration for better cluster restart/recovery

Tony Su Tue, 11 Feb 2014 09:16:42 -0800

Update:
Whereas my previous attempts to optimize for recovery failed miserably, the
"gateway.recover_after_nodes" setting in elasticsearch.yml worked, up to a
point.

I noticed:
- No ES node was responsive at all after the nodes were brought online until
the quorum was met.
- It can take a long time for the ES cluster to agree on a quorum. On my tiny
5-node cluster, it took approx 10 minutes after the nodes were brought online
until one started responding to es-head. I poked all the nodes up to that
moment, so it does seem like the cluster starts up all at once.
- But, at least in this early case, shard re-allocation and thrashing were not
avoided. Before shutting down I didn't carefully record the shard mapping
across nodes, but I did notice that once indexing settled down, most indexes
had, as expected, 10 shards evenly distributed across the nodes (2/node,
because for every primary shard there is a replica). On restart, I observed
high concentrations of shards on certain nodes and fewer on others, not an even
distribution.
- For approx 9 GB of indexed metadata (800 MB raw data), it took a little
over 40 minutes for the cluster to recover to "green" state.
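
To put a number on recovery times like the ones above, one option is to poll
the cluster health endpoint until it reports "green". This is a minimal
sketch, not something from my setup: the URL http://localhost:9200 is an
assumption, and fetch_health and wait_for_green are hypothetical helper names.

```python
import json
import time
import urllib.request


def fetch_health(base_url="http://localhost:9200"):
    """Return the parsed JSON from GET /_cluster/health (base_url is assumed)."""
    with urllib.request.urlopen(base_url + "/_cluster/health", timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for_green(fetch=fetch_health, poll_seconds=10.0, max_polls=360,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until status is "green"; return elapsed seconds, or None on give-up.

    Connection and HTTP errors raised by `fetch` are treated as "node not
    responding yet", which is what the nodes do before the cluster has formed.
    """
    start = clock()
    for _ in range(max_polls):
        try:
            health = fetch()
        except OSError:  # urllib's URLError and HTTPError derive from OSError
            health = None
        if health is not None and health.get("status") == "green":
            return clock() - start
        sleep(poll_seconds)
    return None
```

Calling wait_for_green() right after starting the nodes returns the number of
seconds until the cluster reached green, or None if it gave up first.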

So, mixed and somewhat disappointing results. Since shard re-allocation still
seems to happen (although perhaps less) when the gateway.recover_after_nodes
setting is enabled and configured, I'm still hoping for something to decrease
recovery time further.

Perhaps recovery isn't being done as efficiently as it might.
1. My impression is that shard content is being evaluated in its full form. If
it is, I imagine shard content and its integrity can be evaluated far faster
and better by hash.
2. If hashes are used, I would suggest that they be saved as part of the
"flush" command or a separate "flush, snapshot and shutdown ES" command. When a
cluster restarts, perhaps the hash table can be used to quickly "snapshot" the
existing node and "local data on disk" layout before commencing recovery and
moving around shards.
3. Speaking of which, it would be useful to have a detailed description of what
ES does on startup and/or recovery so that we can tinker more intelligently.
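
The idea in points 1 and 2 can be illustrated outside of ES: record a digest
per file when data is flushed, then compare the small manifests instead of the
file contents. This is only a sketch of the suggestion, not a description of
how ES actually performs recovery. The names build_manifest and files_to_copy
are hypothetical, and SHA-256 is an arbitrary choice of hash.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(shard_dir):
    """Map each file's path (relative to shard_dir) to its digest."""
    manifest = {}
    for root, _dirs, files in os.walk(shard_dir):
        for name in files:
            full = os.path.join(root, name)
            manifest[os.path.relpath(full, shard_dir)] = file_digest(full)
    return manifest


def files_to_copy(source_manifest, target_manifest):
    """List files that are missing from, or differ on, the target copy."""
    return sorted(rel for rel, digest in source_manifest.items()
                  if target_manifest.get(rel) != digest)
```

If the manifests were saved at flush time, a restart would only need to
exchange and compare them, and then copy the files that files_to_copy reports.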

Thx,
Tony
