I have had some issues recently as I've expanded my ES cluster, where a 
single node failure causes basically all other index/search operations to 
timeout and fail.

I am currently running elasticsearch v1.2.1 and primarily interface with 
the indices using the elasticsearch python module.

My cluster is 20 nodes, each an m1.large EC2 instance. I currently have ~18 
indices, each with 5 shards and 3 replicas. The average index is ~20 GB and 
~10 million documents; the smallest is ~100K documents (300 MB) and the 
largest ~40 million (35 GB).
I run each node with ES_MAX_SIZE=4g and ES_MIN_SIZE=512m. There are no 
other services running on the elasticsearch nodes, except ssh. I use zen 
unicast discovery with a set list of nodes. I have tried to enable 
'bootstrap.mlockall', but the ulimit settings do not seem to be working and 
I keep getting 'Unable to lock JVM memory (ENOMEM)' errors when starting a 
node (note: I didn't see this log message when running 0.90.7).
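For reference, here is roughly what I have configured (hostnames are 
placeholders):

```
# /etc/security/limits.conf -- my attempt to allow memory locking
elasticsearch  soft  memlock  unlimited
elasticsearch  hard  memlock  unlimited

# elasticsearch.yml (ES 1.2.1)
bootstrap.mlockall: true
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["node-1", "node-2", "node-3"]
```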

I have a fairly constant stream of new or updated documents being ingested 
(I don't actually update; I reindex whenever a new document arrives with the 
same id), and a number of users who query the data on a regular basis - most 
queries are set queries through the Python API.
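To be concrete about the ingest pattern, this is a rough sketch of how I 
build bulk actions with the elasticsearch-py helpers (the index name, type 
name, and document shape here are made up for illustration):

```python
def make_bulk_actions(docs, index="documents"):
    """Build bulk 'index' actions for elasticsearch-py's helpers.bulk.

    Indexing with an explicit _id simply overwrites (reindexes) any
    existing document with the same id, rather than updating in place.
    """
    return [
        {"_op_type": "index",
         "_index": index,
         "_type": "doc",          # ES 1.x still uses mapping types
         "_id": doc["id"],
         "_source": doc}
        for doc in docs
    ]

# Usage (requires a running cluster):
#   from elasticsearch import Elasticsearch, helpers
#   es = Elasticsearch(["node-1:9200"])
#   helpers.bulk(es, make_bulk_actions(batch))
```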

The issue I have now is that while data is being ingested/indexed, I hit 
Java heap out-of-memory errors. I think this is related to garbage 
collection, as that seems to be the last activity in the logs nearly every 
time this occurs. I have tried raising the heap max to 6g, which seems to 
help, but I am not sure it solves the issue. On top of that, when the 
out-of-memory error occurs it seems to cause the other nodes to stop working 
effectively, with timeout errors in both indexing and searching.

My question is: what is the best way to survive a node failing for this 
reason? I would obviously like to solve the underlying problem as well, but 
I would also like the cluster to tolerate a node crashing for any reason 
(whether because of me or because EC2 took it away). Shouldn't replica 
failover cover the missing node? I understand the cluster state would be 
yellow at that point, but I should still be able to index and search data on 
the remaining nodes, correct?
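From what I've read, settings like the following might make the cluster 
cope better with a lost node - the values below are my guesses for a 
20-node cluster, so please correct me if they're wrong:

```
# elasticsearch.yml
# quorum of master-eligible nodes: (20 / 2) + 1 = 11, to avoid split-brain
discovery.zen.minimum_master_nodes: 11
# delay recovery until most nodes are back, to avoid shard shuffling
gateway.recover_after_nodes: 16
gateway.expected_nodes: 20
gateway.recover_after_time: 5m
```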

Are there configuration changes I can make to better support the cluster 
and to identify or solve the underlying issue?

Any help is appreciated. I understand I have a lot to learn about 
Elasticsearch, but I am hoping I can add some stability/resiliency to my 
cluster.

Thanks in advance,
-Kevin
