Hello,

I have a question about cluster recovery after the cluster goes into an 
unhealthy state.
Let's assume the following.

We have a cluster with 9 nodes. 
3 master nodes (esmX) (master=true, data=false)
4 data nodes (esdX) (master=false, data=true)
2 client nodes (escX) (master=false, data=false)
minimum_master_nodes is set to 2.

The cluster is deployed across multiple racks.
rack 1
esm1, esm2, esd1, esd2 and esc1

rack 2
esm3, esd3, esd4 and esc2

With this configuration I can lose rack 2 and the cluster still fulfills 
the requirements to form a proper cluster.
If I lost rack 1 permanently or for a long time, I would manually spin up a 
second master node in rack 2 to satisfy minimum_master_nodes = 2.
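
The quorum arithmetic behind this layout can be sketched in a few lines of 
Python. This is only an illustration: the helper names are made up, and 
master eligibility is inferred from the esm prefix used in the node names 
above.

```python
# Illustrative sketch only: checks which rack could still elect a master
# on its own, given the node layout described above.
MINIMUM_MASTER_NODES = 2

racks = {
    "rack1": ["esm1", "esm2", "esd1", "esd2", "esc1"],
    "rack2": ["esm3", "esd3", "esd4", "esc2"],
}

def master_eligible(nodes):
    # Assumption: only nodes named esmX are master-eligible.
    return [name for name in nodes if name.startswith("esm")]

def can_elect_master(nodes, minimum=MINIMUM_MASTER_NODES):
    # A partition can elect a master only if it sees enough master-eligible nodes.
    return len(master_eligible(nodes)) >= minimum

partition_result = {rack: can_elect_master(nodes) for rack, nodes in racks.items()}
print(partition_result)  # {'rack1': True, 'rack2': False}
```

Adding a hypothetical esm4 to rack 2 makes can_elect_master return True for 
that rack, which is the manual recovery step described above.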

If the network connection between the two racks now fails, the cluster goes 
into an unhealthy state.
After a while rack 1 is back online and everything works again.
I noticed that this can take many minutes. Even after tuning the timeout 
settings for failure detection, it takes a relatively long time until the 
cluster decides that the other nodes are gone and returns to normal.
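
For reference, a rough way to estimate how long failure detection alone can 
take is sketched below. It assumes the zen fault detection settings 
discovery.zen.fd.ping_interval, ping_timeout and ping_retries, and it assumes 
their defaults are 1s, 30s and 3; please check both against the version in 
use. The formula is a simplification, not the exact internal behaviour, and 
it does not include the time for master election or shard recovery.

```python
def worst_case_detection_seconds(ping_interval, ping_timeout, ping_retries):
    """Rough upper bound: wait one interval, then every retry runs into its timeout."""
    return ping_interval + ping_timeout * ping_retries

# Assumed default values (seconds / count); verify for your version.
defaults = {"ping_interval": 1, "ping_timeout": 30, "ping_retries": 3}
# Hypothetical tuned values, chosen only for illustration.
tuned = {"ping_interval": 1, "ping_timeout": 5, "ping_retries": 3}

default_estimate = worst_case_detection_seconds(**defaults)
tuned_estimate = worst_case_detection_seconds(**tuned)
print(default_estimate, tuned_estimate)  # 91 16
```

Under these assumptions, detection alone accounts for roughly a minute and a 
half before any recovery work starts.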

My question is: is that normal? Do I have to live with a few minutes of 
downtime if part of the cluster becomes unreachable?
Or are there any options I could still try to tune?


Thanks
Marco


