I have had some issues recently as I've expanded my ES cluster, where a 
single node failure causes basically all other index/search operations to 
timeout and fail.

I am currently running elasticsearch v1.2.1 and primarily interface with 
the indices using the elasticsearch python module.

My cluster is 20 nodes, each an m1.large EC2 instance. I currently have ~18 
indices, each with 5 shards and 3 replicas. The average index is ~20 GB and 
~10 million documents; the smallest is ~100K documents (300 MB) and the 
largest ~40 million (35 GB).
I run each node with ES_MAX_SIZE=4g and ES_MIN_SIZE=512m. There are no 
other services running on the elasticsearch nodes, except ssh. I use zen 
unicast discovery with a set list of nodes. I have tried to enable 
'bootstrap.mlockall', but the ulimit settings do not seem to be working and 
I keep getting 'Unable to lock JVM memory (ENOMEM)' errors when starting a 
node (note: I didn't see this log message when running 0.90.7).
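For reference, here is roughly what I have configured (hostnames are 
placeholders):

```
# /etc/security/limits.conf -- my attempt to allow memory locking
elasticsearch  soft  memlock  unlimited
elasticsearch  hard  memlock  unlimited

# elasticsearch.yml (ES 1.2.1)
bootstrap.mlockall: true
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["node-1", "node-2", "node-3"]
```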

I have a fairly constant stream of new or updated documents being ingested 
(I don't actually update; I reindex whenever a new document arrives with the 
same id), and a number of users who query the data on a regular basis - most 
queries are set queries through the Python API.
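To be concrete about the ingest pattern, this is a rough sketch of how I 
build bulk actions with the elasticsearch-py helpers (the index name, type 
name, and document shape here are made up for illustration):

```python
def make_bulk_actions(docs, index="documents"):
    """Build bulk 'index' actions for elasticsearch-py's helpers.bulk.

    Indexing with an explicit _id simply overwrites (reindexes) any
    existing document with the same id, rather than updating in place.
    """
    return [
        {"_op_type": "index",
         "_index": index,
         "_type": "doc",          # ES 1.x still uses mapping types
         "_id": doc["id"],
         "_source": doc}
        for doc in docs
    ]

# Usage (requires a running cluster):
#   from elasticsearch import Elasticsearch, helpers
#   es = Elasticsearch(["node-1:9200"])
#   helpers.bulk(es, make_bulk_actions(batch))
```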

The issue I have now is that while data is being ingested/indexed, I hit 
Java heap out-of-memory errors. I think this is related to garbage 
collection, as that seems to be the last activity in the logs nearly every 
time this occurs. I have tried raising the heap max to 6g, which seems to 
help, but I am not sure it solves the issue. On top of that, when the 
out-of-memory error occurs it seems to cause the other nodes to stop working 
effectively, with timeout errors in both indexing and searching.

My question is: what is the best way to survive a node failing for this 
reason? I would obviously like to solve the underlying problem as well, but 
I would also like the cluster to tolerate a node crashing for any reason 
(whether because of me or because EC2 took it away). Shouldn't replica 
failover cover the missing node? I understand the cluster state would be 
yellow at that point, but I should still be able to index and search data on 
the remaining nodes, correct?
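From what I've read, settings like the following might make the cluster 
cope better with a lost node - the values below are my guesses for a 
20-node cluster, so please correct me if they're wrong:

```
# elasticsearch.yml
# quorum of master-eligible nodes: (20 / 2) + 1 = 11, to avoid split-brain
discovery.zen.minimum_master_nodes: 11
# delay recovery until most nodes are back, to avoid shard shuffling
gateway.recover_after_nodes: 16
gateway.expected_nodes: 20
gateway.recover_after_time: 5m
```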

Are there configuration changes I can make to better support the cluster 
and to identify or solve the underlying issue?

Any help is appreciated. I understand I have a lot to learn about 
Elasticsearch, but I am hoping I can add some stability/resiliency to my 
cluster.

Thanks in advance,
-Kevin
