> Why is Elasticsearch allowed to get into this state? Is it poor
> configuration on our part or a bug in the software?

It is the JVM under a low-memory condition. No Java code can execute
reliably when the heap is exhausted and free memory is down to a few bytes.
ES works hard to survive these situations, but once the JVM itself is
starved there is little it can do.
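
If it happens again, it helps to let the JVM record the event. These are
standard HotSpot flags you could add to your java command line (the paths
are only placeholders, adjust them to your setup):

-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/log/elasticsearch/
-verbose:gc -Xloggc:/var/log/elasticsearch/gc.log

A heap dump taken at the moment of the OOM shows exactly what filled the heap.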

> Why was prd-elastic-y reporting its state as 'green' even though the
> cluster had seemingly failed?

There is a delay before cluster operations time out, and there is no good
way to detect such a fault earlier. The timeouts can be configured.
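
For example, the values behind the "failed to ping, tried [3] times, each
with maximum [30s] timeout" message in your log come from the zen fault
detection settings. As a sketch, the 1.x defaults in elasticsearch.yml are:

discovery.zen.fd.ping_interval: 1s
discovery.zen.fd.ping_timeout: 30s
discovery.zen.fd.ping_retries: 3

Tune them with care: shorter timeouts detect a dead node faster, but they
also produce more false alarms during long GC pauses like the ones in your
logs.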

> Why did prd-elastic-y report as having no indices when it was the only
> node in the cluster?

This is due to the recovery and minimum_master_nodes settings, I assume.
The node holds back its indices while it waits for the other nodes to join
(the "expecting [3] nodes, but only have [2]" gateway message in your log).
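
For a three-node cluster the relevant settings look roughly like this
(elasticsearch.yml; the gateway values are only a guess based on the
"expecting [3] nodes" and "[5m]" figures in your own log):

discovery.zen.minimum_master_nodes: 2
gateway.recover_after_nodes: 2
gateway.expected_nodes: 3
gateway.recover_after_time: 5m

With settings like these a lone restarted node stays red with no indices on
purpose until enough nodes have joined, rather than recovering a partial
cluster.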

> How did the heap space exception end up causing networking problems
> (failed to ping)? Or is that a separate problem entirely?

The network library (Netty) can only work when there is enough heap, so it
is not a separate problem: the whole JVM stalls under a low-memory
condition, and ping handling stalls along with everything else.

> Is there anything we can do to prevent this happening again, other than
> throw more hardware at the problem?

You have assigned more than 50% of the RAM to the heap (10247m out of 17.1
GB). Why? The usual advice is to keep the heap at or below half of the
machine's RAM so the rest stays available to the operating system for the
filesystem cache.
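
As a sketch only, that would mean something like:

java -server -Djava.net.preferIPv4Stack=true -Des.config=/usr/local/etc/elasticsearch/elasticsearch.yml -Xms8g -Xmx8g -Xss256k

the same flags you already use, just with a smaller heap.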

Have you sized your configuration by running enough load tests? You should
care about that. In a heavy load test you will hit the OOMs in a controlled
way, and then you know exactly whether you need to change the configuration
or add more nodes.
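
While such a test runs, watch heap and fielddata usage on every node, for
example (1.x APIs, adjust host and port to your setup):

curl -s 'localhost:9200/_nodes/stats/jvm?pretty'
curl -s 'localhost:9200/_cat/fielddata?v'

If fielddata from the high-cardinality aggregations dominates the heap, you
know where the memory goes before the next OOM hits.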

Jörg


On Fri, Sep 5, 2014 at 11:44 AM, Ollie <[email protected]>
wrote:

> *Background*
> We have a three node cluster comprised of *prd-elastic-x*, *prd-elastic-y*
> and *prd-elastic-z*. Each box is an EC2 m2.xlarge, with 17.1 GB of RAM.
>
> Elasticsearch is run with the following java memory configuration:
> java -server -Djava.net.preferIPv4Stack=true
> -Des.config=/usr/local/etc/elasticsearch/elasticsearch.yml -Xms10247m
> -Xmx10247m -Xss256k
>
> We run some very memory intensive term aggregations on fields with very
> high cardinality (millions of documents, tens of thousands of values). We
> have had issues with running out of memory before, including issues with
> the OS oomkiller, but yesterday we had a particularly bad outage.
>
> There are no signs that the oomkiller took any action in the syslog.
>
> *Timeline*
>
> *12:35*
> prd-elastic-x:
>
> [2014-09-04 11:35:53,002][WARN ][monitor.jvm              ]
> [prd-elastic-x] [gc][old][1812644][175] duration [26.3s], collections
> [2]/[27.5s], total [26.3s]/[1.3m], memory [7.6gb]->[4.7gb]/[9.9gb],
> all_pools {[young] [129.9mb]->[37.5mb]/[133.1mb]}{[survivor]
> [16.6mb]->[0b]/[16.6mb]}{[old] [7.5gb]->[4.7gb]/[9.8gb]}
>
> *12:36*
> prd-elastic-z:
>
> [2014-09-04 11:36:02,809][WARN ][monitor.jvm              ]
> [prd-elastic-z] [gc][old][3019662][378] duration [34.9s], collections
> [2]/[36.1s], total [34.9s]/[2.8m], memory [8.8gb]->[6.3gb]/[9.9gb],
> all_pools {[young] [116.9mb]->[12.6mb]/[133.1mb]}{[survivor]
> [16.6mb]->[0b]/[16.6mb]}{[old] [8.6gb]->[6.3gb]/[9.8gb]}
>
> *12:38*
> Got first application error - request has timed out.
>
> We start investigating. prd-elastic-y is reporting the cluster state as
> green, with all three nodes in the cluster still in. However, attempts to
> load debug information from endpoints such as /_cat/recovery hang, and we
> continue to receive errors at the application level.
>
> We stop all non-critical application processes to try and reduce the load
> on Elasticsearch, in the hope that it will recover.
>
> *12:41*
> A lot of errors start appearing in the logs for prd-elastic-z, including
> but in no way limited to:
>
> [2014-09-04 11:40:14,284][WARN
> ][netty.channel.socket.nio.AbstractNioSelector] Unexpected exception in the
> selector loop.
> java.lang.OutOfMemoryError: Java heap space
>
> *12:43*
> We start seeing errors on prd-elastic-x as well, including:
>
> [2014-09-04 11:43:51,440][WARN ][netty.channel.DefaultChannelPipeline] An
> exception was thrown by a user handler while handling an exception event
> ([id: 0xe2f53491, /10.78.198.78:55953 => /10.79.75.106:9300] EXCEPTION:
> java.lang.OutOfMemoryError: Java heap space)
> java.lang.OutOfMemoryError: Java heap space
>
> *12:46*
> prd-elastic-x reports that it is unable to ping prd-elastic-z
>
> [2014-09-04 11:46:23,016][INFO ][discovery.ec2            ]
> [prd-elastic-x] master_left
> [[prd-elastic-z][nbnLdQjgS4KVe7rGx8eIWw][ip-10-76-41-241.ec2.internal][inet[/10.76.41.241:9300]]{aws_availability_zone=us-east-1d,
> max_local_storage_nodes=1}], reason [failed to ping, tried [3] times, each
> with  maximum [30s] timeout]
>
> *12:48*
> We decide to restart prd-elastic-z (gracefully, using service
> elasticsearch restart), based upon the aforementioned ping log message.
>
> prd-elastic-z:
>
> [2014-09-04 11:48:54,186][INFO ][node                     ]
> [prd-elastic-z] stopping ...
>
> *13:03*
> prd-elastic-z has still not stopped (still appearing in ps -aux with a
> long uptime), so we take the decision to forcefully kill it (kill -9)
>
> *13:06*
> The cluster is still not behaving itself. We take the decision to restart
> the other two nodes, starting with prd-elastic-y.
>
> *13:08*
> prd-elastic-y has restarted without any further intervention required, but
> is reporting as the only node in the cluster, in the 'red' state, and with
> no indices or shards.
>
> *13:09*
> We try to restart prd-elastic-x gracefully.
>
> prd-elastic-x:
>
> [2014-09-04 12:09:47,480][INFO ][node                     ]
> [prd-elastic-x] stopping ...
>
> *13:22*
> prd-elastic-x has still not stopped, so again we are forced to `kill -9`
> it. We then restart it manually using `sudo service elasticsearch start`
>
> *prd-elastic-y:*
>
> [2014-09-04 12:23:14,108][INFO ][cluster.service          ]
> [prd-elastic-y] added
> {[prd-elastic-x][OF5WaLTzRVG92z1Y7zMW2g][ip-10-79-75-106.ec2.internal][inet[/10.79.75.106:9300]]{aws_availability_zone=us-east-1d,
> max_local_storage_nodes=1},}, reason: zen-disco-receive(join from
> node[[prd-elastic-x][OF5WaLTzRVG92z1Y7zMW2g][ip-10-79-75-106.ec2.internal][inet[/10.79.75.106:9300]]{aws_availability_zone=us-east-1d,
> max_local_storage_nodes=1}])
> [2014-09-04 12:23:14,119][INFO ][gateway                  ]
> [prd-elastic-y] delaying initial state recovery for [5m]. expecting [3]
> nodes, but only have [2]
>
> *13:24*
> We decide to restart prd-elastic-z. We were originally going to leave it
> out of the cluster and bring in a completely new instance, as we had to
> kill it and thus were not sure about its data integrity, but given it fared
> no worse than prd-elastic-x we figured it didn't make a difference at this
> point.
>
> *13:27*
> All three nodes are in the cluster, and recovery starts. The cluster
> reports yellow almost immediately.
>
> *13:37*
> The cluster is green and 'healthy' again (or so we believe).
>
> ---
>
> Having gone through the logs we believe that the root cause is the
> "java.lang.OutOfMemoryError: Java heap space" exceptions we see shortly
> after the garbage collection on prd-elastic-x and prd-elastic-z.
> prd-elastic-y seemed 'stable' throughout, but was reporting the cluster
> state incorrectly and then appeared to lose all of its data (or its
> cluster state, we're not sure which).
>
> The logs for all nodes can be found here:
> https://www.dropbox.com/sh/9pr2nv2nu0auk3m/AABzi9TPuPJ1g_npxosGvNuKa?dl=0
>
> *Questions*
>
> - Why is Elasticsearch allowed to get into this state? Is it poor
> configuration on our part or a bug in the software?
> - Why was prd-elastic-y reporting its state as 'green' even though the
> cluster had seemingly failed?
> - Why did prd-elastic-y report as having no indices when it was the only
> node in the cluster?
> - How did the heap space exception end up causing networking problems
> (failed to ping)? Or is that a separate problem entirely?
> - Is there anything we can do to prevent this happening again, other than
> throw more hardware at the problem?
>

