Here is some of the test cluster's configuration (not all of it necessarily relevant):
*Physical configuration:*
Nodes deployed virtually on a single machine
All nodes are master- and data-eligible; the master role sometimes moves around, and 
data is distributed across all of them.
Node1 - Apps - 4GB RAM - 20GB storage
Node2 - ES1 - 1GB RAM - 20GB storage
Node3 - ES2 - 1GB RAM - 20GB storage
Node4 - ES3 - 1GB RAM - 20GB storage
Node5 - ES4 - 1GB RAM - 20GB storage
 
*Description - Nodes are dropped from a cluster too easily?*

For most of the past 48 hours I've been studying an issue in my lab 
cluster that seems to pop up regularly and get posted to this forum, with a 
number of threads left unanswered.

On my test cluster, I found that once data size reached a particular 
threshold, the ability of nodes to remain in the cluster became unstable. 
After loading the data 3 times into the cluster, I've been able to 
replicate the threshold consistently, so it's a real and replicable 
phenomenon. The actual threshold I reached is likely specific to my setup, 
but I can see how it could be experienced regularly on other hardware setups as well.

Pushed a bit further, I ran into another often-reported issue: a 
shard became "orphaned," i.e. neither the primary nor the replica would 
allocate, leaving the entire cluster in "red" status with no explanation 
(the only log entries said "Failed to execute..."). In the end, this 
specific issue was resolved only by deleting the entire index and 
re-loading the data. I can see that if there were no archived source from 
which the data could be re-input into the cluster, the data might have been 
lost altogether, although the new Snapshot/Restore feature might also be a solution.
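For reference, the new 1.0 Snapshot/Restore flow would look roughly like the 
sketch below (the repository name and filesystem path are placeholders of mine, 
and the location must be accessible to every node):

```
# Register a shared-filesystem snapshot repository
PUT /_snapshot/my_backup
{ "type": "fs", "settings": { "location": "/mnt/es_backups" } }

# Snapshot the index while the cluster is still healthy
PUT /_snapshot/my_backup/snapshot_1?wait_for_completion=true

# After deleting the broken index, restore from the snapshot
POST /_snapshot/my_backup/snapshot_1/_restore
```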

*Attempted configs with no effect*
*1. network.node.http_keep-alive*. NOTE that no commented-out line for this setting 
exists in a 1.0 elasticsearch.yml, although it is described in the 1.0 documentation, 
so I created it manually. If it really is supposed to work, then in my case the 
problem might not have been a networking issue.
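Concretely, the manual addition looked like this (the key name is taken from my 
reading of the 1.0 docs, so treat it as unverified):

```yaml
# elasticsearch.yml - added by hand; no commented-out template line for this
# exists in the shipped 1.0 file. Key name per my reading of the docs (unverified).
network.node.http_keep-alive: true
```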

*2*. *Periodic ICMP Ping*. I found this often seemed to help when joining a 
node to the cluster, but it seems to have no effect on a node leaving a 
cluster.

*Theory:*
I have already observed maxed-out disk utilization on the host, and I'm 
speculating that moments of peak disk activity are causing guests to become 
temporarily unresponsive. There is probably a timeout for responses that is 
being exceeded.
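If the timeout theory is right, the zen fault-detection knobs would be the first 
thing to loosen. A sketch of what I mean (setting names from the zen discovery 
docs; the values are guesses for my disk-starved setup, not recommendations, and 
I haven't yet verified that they help):

```yaml
# elasticsearch.yml - zen fault-detection tuning sketch (values are guesses)
discovery.zen.fd.ping_interval: 5s   # how often each node is pinged (default 1s)
discovery.zen.fd.ping_timeout: 60s   # how long to wait for each ping (default 30s)
discovery.zen.fd.ping_retries: 6     # failed pings before a node is dropped (default 3)
```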
*Looking for a Solution:*
I believe that once a node has become part of a cluster, it should not be 
so easy for the node to leave.
Individual actions might fail because a response has not been received, but 
IMO the node should not be excluded from the cluster so easily, or even 
automatically at all, so that intervention is possible without the cost of 
re-joining the cluster (re-building node participation metadata, shard 
thrashing).
 
*Process I believe would be preferable*
When an API call is made, there seems to be a certain amount of time within 
which a response is expected before the call is considered failed. The same 
call should then be attempted against any replicas on record. It seems to me 
that not only is the call re-routed to a replica, but the node that failed to 
reply is also dropped immediately. IMO the node should not be dropped 
immediately but marked for dropping, and the actual drop should be subject to 
specified configuration (or ultimately require manual action).
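As a sketch of the process I'm proposing (pure pseudocode in Python; the class, 
grace period, and manual-drop flag are all my inventions, and none of this is 
actual Elasticsearch behavior):

```python
import time


class NodeTracker:
    """Sketch of the proposed 'mark before drop' behavior (NOT how ES works)."""

    def __init__(self, grace_period_s=300, require_manual_drop=False):
        self.grace_period_s = grace_period_s
        self.require_manual_drop = require_manual_drop
        self.marked = {}  # node_id -> time the node was first marked unresponsive

    def on_call_failed(self, node_id, now=None):
        """A call to node_id timed out: re-route to a replica, but only *mark* the node."""
        now = time.time() if now is None else now
        self.marked.setdefault(node_id, now)

    def on_call_succeeded(self, node_id):
        """The node answered again: clear the mark instead of forcing a costly re-join."""
        self.marked.pop(node_id, None)

    def should_drop(self, node_id, now=None):
        """Drop only after the grace period, and never automatically if manual action is required."""
        now = time.time() if now is None else now
        if self.require_manual_drop or node_id not in self.marked:
            return False
        return now - self.marked[node_id] >= self.grace_period_s
```

The point of the sketch is that a slow response changes the node's state but never, 
by itself, removes the node from the cluster.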
 
*Reason why it should not be so easy to drop a node (and its data)*
The node may hold the only full, valid copy of one or more shards, so 
dropping the node should not be so easy even if the node is completely 
unresponsive. The node should remain a fully recognized member of the 
cluster until a decision is made to definitively drop it.
 
Note also that, although not the fundamental cause, this "easy drop node" 
behavior is a major contributing factor in the theoretical scenario posed by 
Brad Lhotsky in the thread "Data Loss":
https://groups.google.com/forum/#!topic/elasticsearch/kl60-C63cXY
 
Counter-opinions and solutions?
 
IMO and Thx,
Tony