You really need some kind of monitoring, such as Marvel, around this to give you an idea of what was happening before the OOM. A node becoming unresponsive is generally down to GC, so take a look at the GC timings first.
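If you don't have history to look back on, the nodes stats API will at least show you cumulative GC activity, so you can sample it on each node and watch how quickly old-gen collection time grows. A rough sketch (the host is just an example, use any of your nodes):

  curl -s 'http://10.0.0.12:9200/_nodes/stats/jvm?pretty'

In the output, jvm.gc.collectors.old.collection_count and collection_time_in_millis are the ones to watch; if old-gen time is climbing by whole seconds between samples, the heap is under real pressure. You can also enable GC logging on the JVM (e.g. -verbose:gc -XX:+PrintGCDetails -Xloggc:/var/log/elasticsearch/gc.log) to capture the individual pause times.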
On 5 March 2015 at 02:32, Chris Neal <[email protected]> wrote:

> Hi all,
>
> I'm hoping someone can help me piece together the below log entries/stack
> traces/Exceptions. I have a 3-node cluster in Development in EC2, and two
> of them had issues. I'm running ES 1.4.4 on dedicated servers with 32GB
> RAM and 16GB heaps. My index rate averages about 10k/sec. There were no
> searches going on at the time of the incident.
>
> It appears to me that node 10.0.0.12 began timing out requests to
> 10.0.0.45, indicating that 10.0.0.45 was having issues.
> Then at 4:36, 10.0.0.12 logs the ERROR about "Uncaught exception:
> IndexWriter already closed", caused by an OOME.
> Then at 4:43, 10.0.0.45 hits the "Create failed" WARN, and logs an OOME.
> Then things are basically down and unresponsive.
>
> What is weird to me is that if 10.0.0.45 was the node having issues, why
> did 10.0.0.12 log an exception 7 minutes before that? Did both nodes run
> out of memory? Or is one of the Exceptions actually saying, "I see that
> this other node hit an OOME, and I'm telling you about it"?
>
> I have a few values tweaked in the elasticsearch.yml file to try to keep
> this from happening (configured from Puppet):
>
>   'indices.breaker.fielddata.limit' => '20%',
>   'indices.breaker.total.limit' => '25%',
>   'indices.breaker.request.limit' => '10%',
>   'index.merge.scheduler.type' => 'concurrent',
>   'index.merge.scheduler.max_thread_count' => '1',
>   'index.merge.policy.type' => 'tiered',
>   'index.merge.policy.max_merged_segment' => '1gb',
>   'index.merge.policy.segments_per_tier' => '4',
>   'index.merge.policy.max_merge_at_once' => '4',
>   'index.merge.policy.max_merge_at_once_explicit' => '4',
>   'indices.memory.index_buffer_size' => '10%',
>   'indices.store.throttle.type' => 'none',
>   'index.translog.flush_threshold_size' => '1GB',
>
> I have done a fair bit of reading on this, and have tried about everything
> I can think of. :(
>
> Can anyone tell me what caused this scenario, and what can be done to
> avoid it?
> Thank you so much for taking the time to read this.
> Chris
>
> =====
> *On server 10.0.0.12:*
>
> [2015-03-04 03:56:12,548][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [20456ms] ago, timed out [5392ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70061596]
> [2015-03-04 04:06:02,407][INFO ][index.engine.internal ] [elasticsearch-ip-10-0-0-12] [derbysoft-ihg-20150304][2] now throttling indexing: numMergesInFlight=4, maxNumMerges=3
> [2015-03-04 04:06:04,141][INFO ][index.engine.internal ] [elasticsearch-ip-10-0-0-12] [derbysoft-ihg-20150304][2] stop throttling indexing: numMergesInFlight=2, maxNumMerges=3
> [2015-03-04 04:12:26,194][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [15709ms] ago, timed out [708ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70098828]
> [2015-03-04 04:23:40,778][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [21030ms] ago, timed out [6030ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70124234]
> [2015-03-04 04:24:47,023][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [27275ms] ago, timed out [12275ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70126273]
> [2015-03-04 04:25:39,180][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [19431ms] ago, timed out [4431ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70127835]
> [2015-03-04 04:26:40,775][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [19241ms] ago, timed out [4241ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70129981]
> [2015-03-04 04:27:14,329][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [22676ms] ago, timed out [6688ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70130668]
> [2015-03-04 04:28:15,695][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [24042ms] ago, timed out [9041ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70132644]
> [2015-03-04 04:29:38,102][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [16448ms] ago, timed out [1448ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70135333]
> [2015-03-04 04:33:42,393][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [20738ms] ago, timed out [5737ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70142427]
> [2015-03-04 04:36:08,788][ERROR][marvel.agent ] [elasticsearch-ip-10-0-0-12] Background thread had an uncaught exception:
> org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
>         at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:698)
>         at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:712)
>         at org.apache.lucene.index.IndexWriter.ramBytesUsed(IndexWriter.java:462)
>         at org.elasticsearch.index.engine.internal.InternalEngine.segmentsStats(InternalEngine.java:1224)
>         at org.elasticsearch.index.shard.service.InternalIndexShard.segmentStats(InternalIndexShard.java:555)
>         at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:170)
>         at org.elasticsearch.action.admin.indices.stats.ShardStats.<init>(ShardStats.java:49)
>         at org.elasticsearch.indices.InternalIndicesService.stats(InternalIndicesService.java:212)
>         at org.elasticsearch.indices.InternalIndicesService.stats(InternalIndicesService.java:172)
>         at org.elasticsearch.node.service.NodeService.stats(NodeService.java:138)
>         at org.elasticsearch.marvel.agent.AgentService$ExportingWorker.exportNodeStats(AgentService.java:300)
>         at org.elasticsearch.marvel.agent.AgentService$ExportingWorker.run(AgentService.java:225)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.OutOfMemoryError: Java heap space
>
> =====
> *On server 10.0.0.45:*
>
> [2015-03-04 04:43:27,245][WARN ][index.engine.internal ] [elasticsearch-ip-10-0-0-45] [myindex-20150304][1] failed engine [indices:data/write/bulk[s] failed on replica]
> org.elasticsearch.index.engine.CreateFailedEngineException: [myindex-20150304][1] Create failed for [my_type#AUvjGHoiku-fZf277h_4]
>         at org.elasticsearch.index.engine.internal.InternalEngine.create(InternalEngine.java:421)
>         at org.elasticsearch.index.shard.service.InternalIndexShard.create(InternalIndexShard.java:403)
>         at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:595)
>         at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$ReplicaOperationTransportHandler.messageReceived(TransportShardReplicationOperationAction.java:246)
>         at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$ReplicaOperationTransportHandler.messageReceived(TransportShardReplicationOperationAction.java:225)
>         at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
>         at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:698)
>         at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:712)
>         at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1507)
>         at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1246)
>         at org.elasticsearch.index.engine.internal.InternalEngine.innerCreateNoLock(InternalEngine.java:502)
>         at org.elasticsearch.index.engine.internal.InternalEngine.innerCreate(InternalEngine.java:444)
>         at org.elasticsearch.index.engine.internal.InternalEngine.create(InternalEngine.java:413)
>         ... 8 more
> Caused by: java.lang.OutOfMemoryError: Java heap space
>
> =====
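One more note on the settings you quoted: in 1.4 the indices.breaker.* limits only guard fielddata loading and request-time big arrays; they don't cap the memory used by indexing buffers or in-flight merges, so on their own they won't stop an OOM driven by heavy bulk indexing. They are dynamic cluster settings though, so you can confirm or adjust them on the running cluster without a restart, something like this (the values are just the ones from your config):

  curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
    "persistent": {
      "indices.breaker.total.limit": "25%",
      "indices.breaker.fielddata.limit": "20%"
    }
  }'

With a 16GB heap taking ~10k docs/sec, the GC timings should tell you whether you are simply outrunning what the heap can reclaim.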
