> Why is Elasticsearch allowed to get into this state? Is it poor configuration on our part or a bug in the software?

This is the JVM in a low-memory condition. No Java code can execute when the stack and heap are full and free memory is down to a few bytes. ES works hard to overcome these situations.

> Why was prd-elastic-y reporting its state as 'green' even though the cluster had seemingly failed?

There is a delay before cluster operations time out; there is no good method to detect faults earlier. The timeouts can be configured.
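For example, the zen fault detection settings in elasticsearch.yml control how quickly a dead node is noticed. The values below are the defaults, and they match the "tried [3] times, each with maximum [30s] timeout" message in your logs; this is only a sketch, and lowering them will make long GC pauses look like node failures:

    # elasticsearch.yml (ES 1.x) -- zen fault detection, default values shown
    discovery.zen.fd.ping_interval: 1s   # how often a node is pinged
    discovery.zen.fd.ping_timeout: 30s   # how long to wait for each ping
    discovery.zen.fd.ping_retries: 3     # failed pings before the node is dropped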
> Why did prd-elastic-y report as having no indices when it was the only node in the cluster?

This is due to minimum_master_nodes, I assume. The node waits for the other masters before it recovers the cluster state (the "expecting ... nodes" message).
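For a three node cluster, something like the following in elasticsearch.yml is the usual starting point. This is a sketch with assumed values, not taken from your configuration; the gateway settings are what produce the "delaying initial state recovery for [5m]. expecting [3] nodes" line further down in your log:

    # elasticsearch.yml (ES 1.x) -- example values for a 3-node cluster
    discovery.zen.minimum_master_nodes: 2   # quorum of master-eligible nodes: (3 / 2) + 1
    gateway.recover_after_nodes: 2          # do not start recovery with fewer nodes present
    gateway.expected_nodes: 3               # recover immediately once all 3 have joined
    gateway.recover_after_time: 5m          # otherwise wait up to 5 minutes (the delay in your log)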
> How did the heap space exception end up causing networking problems (failed to ping)? Or is that a separate problem entirely?

The network library (Netty) can only work when there is enough heap, so it is not a separate problem. The whole JVM stalls under a low-memory condition.

> Is there anything we can do to prevent this happening again, other than throw more hardware at the problem?

You have assigned more than 50% of the RAM to the heap. Why? Have you sized your configuration by driving enough tests? You should care about that: in a heavy load test you will encounter OOMs, and then you know exactly whether you need to add more nodes or not.
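Concretely, that means starting the JVM with a heap of roughly half of the 17.1 GB, leaving the rest to the operating system's file cache that Lucene relies on. A sketch based on your current command line, with an assumed 8g value you should verify under load:

    # same flags as today, only the heap reduced to ~50% of RAM (8g is an assumed value)
    java -server -Djava.net.preferIPv4Stack=true \
         -Des.config=/usr/local/etc/elasticsearch/elasticsearch.yml \
         -Xms8g -Xmx8g -Xss256k

If the OOMs come from fielddata built by the high-cardinality terms aggregations, a cap such as `indices.fielddata.cache.size: 40%` in elasticsearch.yml (the 40% is an assumption, tune it by testing) trades cache evictions for heap headroom.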
Jörg

On Fri, Sep 5, 2014 at 11:44 AM, Ollie <[email protected]> wrote:

> *Background*
> We have a three node cluster comprised of *prd-elastic-x*, *prd-elastic-y* and *prd-elastic-z*. Each box is an EC2 m2.xlarge, with 17.1 GB of RAM.
>
> Elasticsearch is run with the following java memory configuration:
> java -server -Djava.net.preferIPv4Stack=true -Des.config=/usr/local/etc/elasticsearch/elasticsearch.yml -Xms10247m -Xmx10247m -Xss256k
>
> We run some very memory intensive term aggregations on fields with very high cardinality (millions of documents, tens of thousands of values). We have had issues with running out of memory before, including issues with the OS oomkiller, but yesterday we had a particularly bad outage.
>
> There are no signs that the oomkiller took any action in the syslog.
>
> *Timeline*
>
> *12:35*
> prd-elastic-x:
>
> [2014-09-04 11:35:53,002][WARN ][monitor.jvm ] [prd-elastic-x] [gc][old][1812644][175] duration [26.3s], collections [2]/[27.5s], total [26.3s]/[1.3m], memory [7.6gb]->[4.7gb]/[9.9gb], all_pools {[young] [129.9mb]->[37.5mb]/[133.1mb]}{[survivor] [16.6mb]->[0b]/[16.6mb]}{[old] [7.5gb]->[4.7gb]/[9.8gb]}
>
> *12:36*
> prd-elastic-z:
>
> [2014-09-04 11:36:02,809][WARN ][monitor.jvm ] [prd-elastic-z] [gc][old][3019662][378] duration [34.9s], collections [2]/[36.1s], total [34.9s]/[2.8m], memory [8.8gb]->[6.3gb]/[9.9gb], all_pools {[young] [116.9mb]->[12.6mb]/[133.1mb]}{[survivor] [16.6mb]->[0b]/[16.6mb]}{[old] [8.6gb]->[6.3gb]/[9.8gb]}
>
> *12:38*
> Got first application error - request has timed out.
>
> We start investigating. prd-elastic-y is reporting the cluster state as green, with all three nodes in the cluster still in. However, attempts to load debug information from endpoints such as /_cat/recovery hang, and we continue to receive errors at the application level.
>
> We stop all non-critical application processes to try and reduce the load on Elasticsearch, in the hope that it will recover.
>
> *12:41*
> A lot of errors start appearing in the logs for prd-elastic-z, including but in no way limited to:
>
> [2014-09-04 11:40:14,284][WARN ][netty.channel.socket.nio.AbstractNioSelector] Unexpected exception in the selector loop.
> java.lang.OutOfMemoryError: Java heap space
>
> *12:43*
> We start seeing errors on prd-elastic-x as well, including:
>
> [2014-09-04 11:43:51,440][WARN ][netty.channel.DefaultChannelPipeline] An exception was thrown by a user handler while handling an exception event ([id: 0xe2f53491, /10.78.198.78:55953 => /10.79.75.106:9300] EXCEPTION: java.lang.OutOfMemoryError: Java heap space)
> java.lang.OutOfMemoryError: Java heap space
>
> *12:46*
> prd-elastic-x reports that it is unable to ping prd-elastic-z
>
> [2014-09-04 11:46:23,016][INFO ][discovery.ec2 ] [prd-elastic-x] master_left [[prd-elastic-z][nbnLdQjgS4KVe7rGx8eIWw][ip-10-76-41-241.ec2.internal][inet[/10.76.41.241:9300]]{aws_availability_zone=us-east-1d, max_local_storage_nodes=1}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
>
> *12:48*
> We decide to restart prd-elastic-z (gracefully, using service elasticsearch restart), based upon the aforementioned ping log message
>
> prd-elastic-z:
>
> [2014-09-04 11:48:54,186][INFO ][node ] [prd-elastic-z] stopping ...
>
> *13:03*
> prd-elastic-z has still not stopped (still appearing in ps -aux with a long uptime), so we take the decision to forcefully kill it (kill -9)
>
> *13:06*
> The cluster is still not behaving itself. We take the decision to restart the other two nodes, starting with prd-elastic-y.
>
> *13:08*
> prd-elastic-y has restarted without any further intervention required, but is reporting as the only node in the cluster, in the 'red' state, and with no indices or shards.
>
> *13:09*
> We try to restart prd-elastic-x gracefully.
>
> prd-elastic-x:
>
> [2014-09-04 12:09:47,480][INFO ][node ] [prd-elastic-x] stopping ...
>
> *13:22*
> prd-elastic-x has still not stopped, so again we are forced to `kill -9` it. We then restart it manually using `sudo service elasticsearch start`
>
> *prd-elastic-y:*
>
> [2014-09-04 12:23:14,108][INFO ][cluster.service ] [prd-elastic-y] added {[prd-elastic-x][OF5WaLTzRVG92z1Y7zMW2g][ip-10-79-75-106.ec2.internal][inet[/10.79.75.106:9300]]{aws_availability_zone=us-east-1d, max_local_storage_nodes=1},}, reason: zen-disco-receive(join from node[[prd-elastic-x][OF5WaLTzRVG92z1Y7zMW2g][ip-10-79-75-106.ec2.internal][inet[/10.79.75.106:9300]]{aws_availability_zone=us-east-1d, max_local_storage_nodes=1}])
> [2014-09-04 12:23:14,119][INFO ][gateway ] [prd-elastic-y] delaying initial state recovery for [5m]. expecting [3] nodes, but only have [2]
>
> *13:24*
> We decide to restart prd-elastic-z. We were originally going to leave it out of the cluster and bring in a completely new instance, as we had to kill it and thus were not sure about its data integrity, but given it fared no worse than prd-elastic-x we figured it didn't make a difference at this point.
>
> *13:27*
> All three nodes are in the cluster, and recovery starts. The cluster reports yellow almost immediately.
>
> *13:37*
> The cluster is green and 'healthy' again (or so we believe)
>
> ---
>
> Having gone through the logs we believe that the root cause is the "java.lang.OutOfMemoryError: Java heap space" exceptions we see shortly after the garbage collection on prd-elastic-x and prd-elastic-y. prd-elastic-y seemed 'stable' throughout, but was reporting the cluster state incorrectly and then appeared to lose all of its data (or cluster state, not sure?)
>
> The logs for all nodes can be found here:
> https://www.dropbox.com/sh/9pr2nv2nu0auk3m/AABzi9TPuPJ1g_npxosGvNuKa?dl=0
>
> *Questions*
>
> - Why is Elasticsearch allowed to get into this state? Is it poor configuration on our part or a bug in the software?
> - Why was prd-elastic-y reporting its state as 'green' even though the cluster had seemingly failed?
> - Why did prd-elastic-y report as having no indices when it was the only node in the cluster?
> - How did the heap space exception end up causing networking problems (failed to ping)? Or is that a separate problem entirely?
> - Is there anything we can do to prevent this happening again, other than throw more hardware at the problem?
