I've read everything I can get my hands on about cluster restarts and optimizing replica recovery [1] on Elasticsearch 1.3.2, and I still think there's a big problem here.

Scenario:

- 2 node cluster running on Azure (2 large nodes: 7GB RAM, 4 cores, 400 Mbps network)
- 600GB disk on each side; the current cluster is using about 25GB of that on each node
- Cluster must keep accepting index requests at all times! This is a core requirement that we can't get around.

What works pretty reliably in this scenario is to restart one node, then wait for the full sync (GREEN status) before upgrading the other node. Anything else will cause data loss.

The trouble with this approach is that we are dealing with a network capacity of 400 Mbps, and while 25GB can be transferred in a reasonable time, 600GB definitely can't. On Azure, you have a max of *15 minutes* in OnStart to make sure the instance is healthy again before the fabric controller recycles it for you. And I think that limit is reasonable. Sure, we could go up one size to Extra Large nodes that offer an 800 Mbps network, but that still isn't going to get us to the point where we can maintain 100% availability of a 600GB cluster on 2 nodes.

There is a bit of light at the end of the tunnel:

> We plan on assigning a sequence number to operations that occur on primary
> shards. This is a really interesting feature that will lay the groundwork
> for many future features in Elasticsearch. The most obvious one is speeding
> up the replica recovery process when a node is restarted. Currently we have
> to copy every segment which is different which, over time, means every
> segment! Sequence numbers will allow us to copy over only the data that has
> really changed.

It couldn't come soon enough for us. The current replica recovery process is really untenable in cloud environments that carry heavy INDEX workloads.

Any thoughts on when this can be expected, or whether there are viable workarounds in the meantime, would be greatly appreciated.

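
For reference, here is the back-of-the-envelope arithmetic behind the transfer-time concern above, as a small Python sketch. It assumes decimal gigabytes, a link running at its full nominal rate, and no protocol overhead or recovery throttling, so real recoveries would be slower. The function name and constant are just illustrative.

```python
# Rough best-case time to copy a replica over the network.
# Assumptions (not measured): decimal GB, full nominal line rate,
# no protocol overhead and no recovery throttling.

ONSTART_LIMIT_MIN = 15  # Azure OnStart window mentioned above


def transfer_minutes(data_gb: float, link_mbps: float) -> float:
    """Minutes needed to move data_gb gigabytes over a link_mbps link."""
    bits = data_gb * 1e9 * 8              # gigabytes -> bits
    seconds = bits / (link_mbps * 1e6)    # bits / (bits per second)
    return seconds / 60


if __name__ == "__main__":
    for data_gb, link_mbps in [(25, 400), (600, 400), (600, 800)]:
        minutes = transfer_minutes(data_gb, link_mbps)
        verdict = "fits in" if minutes <= ONSTART_LIMIT_MIN else "exceeds"
        print(f"{data_gb}GB at {link_mbps} Mbps: {minutes:.1f} min "
              f"({verdict} the {ONSTART_LIMIT_MIN} min window)")
```

Under those assumptions, 25GB at 400 Mbps takes a bit over 8 minutes, while 600GB takes about 200 minutes at 400 Mbps and about 100 minutes at 800 Mbps, both far beyond the 15 minute window.
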
[1] - Research:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-upgrade.html
http://stackoverflow.com/questions/24696020/elasticsearch-node-recovery-cluster-restart-correct-settings
https://groups.google.com/forum/#!searchin/elasticsearch/cluster$20restart$20disable$20allocation/elasticsearch/csziKQPBauU/9PKWbkhJ50IJ
http://elasticsearch-users.115913.n3.nabble.com/Restarting-an-active-node-without-needing-to-recover-all-data-remotely-td4039346.html
http://stackoverflow.com/questions/17268495/how-to-remove-node-from-elasticsearch-cluster-on-runtime-without-down-time/23905040#23905040
https://groups.google.com/forum/#!searchin/elasticsearch/cluster$20restart/elasticsearch/lN6copl0Bzk/RK5ESX8nu-8J
https://groups.google.com/forum/#!searchin/elasticsearch/cluster$20restart/elasticsearch/plTGgtE_YCU/fCyd2elxkAsJ
