Re: Please help to understand these Exceptions

Chris Neal Thu, 12 Mar 2015 08:10:40 -0700

Thank you Mark.

May I ask what about my answers caused you to say "definitely"? :)  I want
to better understand capacity related items for ES for sure.


Many thanks!
Chris
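
For anyone reading along, here is a rough back-of-the-envelope sketch of the capacity figures quoted further down in this thread. It is only arithmetic on the numbers I posted (157 indices, 928 shards, 3 shards and 1 replica per index, 2.27TB, 3 nodes with 16GB heaps). It assumes shards and data are spread evenly across the nodes, and it is not a statement of how Mark reached his conclusion.

```python
# Figures as posted in this thread; an even spread across nodes is an assumption.
indices = 157
primaries_per_index = 3
replicas_per_primary = 1
reported_shards = 928        # as reported by the cluster
nodes = 3
heap_gb_per_node = 16
data_tb = 2.27
new_indices_per_day = 7

copies = 1 + replicas_per_primary

# What the index count alone would predict; the reported 928 differs slightly.
estimated_shards = indices * primaries_per_index * copies

shards_per_node = reported_shards / nodes
data_gb_per_node = data_tb * 1024 / nodes
heap_mb_per_shard = heap_gb_per_node * 1024 / shards_per_node
new_shards_per_day = new_indices_per_day * primaries_per_index * copies

print(f"estimated shards:   {estimated_shards}")
print(f"shards per node:    {shards_per_node:.0f}")
print(f"data per node:      {data_gb_per_node:.0f} GB")
print(f"heap per shard:     {heap_mb_per_shard:.0f} MB")
print(f"new shards per day: {new_shards_per_day}")
```

On these numbers each node carries roughly 300 shards, and the daily indices add about 42 more shards across the cluster every day unless old indices are closed or deleted.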

On Wed, Mar 11, 2015 at 2:13 PM, Mark Walkom <[email protected]> wrote:

> Then you're definitely going to be seeing node pressure. I'd add another
> one or two and see how things look after that.
>
> On 11 March 2015 at 07:21, Chris Neal <[email protected]> wrote:
>
>> Again Mark, thank you for your time :)
>>
>> 157 Indices
>> 928 Shards
>> Daily indexing that adds 7 indexes per day
>> Each index has 3 shards and 1 replica
>> 2.27TB of data in the cluster
>> Index rate averages about 1500/sec
>> IOps on the servers is ~40
>>
>> Chris
>>
>> On Tue, Mar 10, 2015 at 7:57 PM, Mark Walkom <[email protected]>
>> wrote:
>>
>>> It looks like heap pressure.
>>> How many indices, how many shards, how much data do you have in the
>>> cluster?
>>>
>>> On 8 March 2015 at 19:24, Chris Neal <[email protected]> wrote:
>>>
>>>> Thank you Mark for your reply.
>>>>
>>>> I do have Marvel running, on a separate cluster even, so I do have that
>>>> data from the time of the problem.  I've attached 4 screenshots for
>>>> reference.
>>>>
>>>> It appears that node 10.0.0.12 (the green line on the charts) had
>>>> issues.  The heap usage drops from 80% to 0%.  I'm guessing that is some
>>>> sort of crash, because the heap should not empty itself.  Also its load
>>>> goes to 0.
>>>>
>>>> I also see a lot of Old GC duration on 10.0.0.45 (blue line).  Lots of
>>>> excessive Old GC Counts, so it does appear that the problem was memory
>>>> pressure on this node.  That's what I was thinking, but was hoping for
>>>> validation on that.
>>>>
>>>> If it was, I'm hoping to get some suggestions on what to do about it.
>>>> As I mentioned in the original post, I've tweaked what I think needs tweaking
>>>> based on the system, and it still happens.
>>>>
>>>> Maybe it's just that I'm pushing the cluster too much for the resources
>>>> I'm giving it, and it "just won't work".
>>>>
>>>> The index rate was only about 2500/sec, and the search request rate had
>>>> one small spike that went to 3.0.  But 3 searches in one timeslice is
>>>> nothing.
>>>>
>>>> Thanks again for the help and reading all this stuff.  It is
>>>> appreciated.  Hopefully I can get a solution to keep the cluster stable.
>>>>
>>>> Chris
>>>>
>>>> On Fri, Mar 6, 2015 at 3:01 PM, Mark Walkom <[email protected]>
>>>> wrote:
>>>>
>>>>> You really need some kind of monitoring, like Marvel, around this to
>>>>> give you an idea of what was happening prior to the OOM.
>>>>> Generally a node becoming unresponsive will be due to GC, so take a
>>>>> look at the timings there.
>>>>>
>>>>> On 5 March 2015 at 02:32, Chris Neal <[email protected]> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I'm hoping someone can help me piece together the below log
>>>>>> entries/stack traces/Exceptions.  I have a 3 node cluster in Development
>>>>>> in EC2, and two of them had issues.  I'm running ES 1.4.4, 32GB RAM, 16GB
>>>>>> heaps, dedicated servers to ES.  My index rate averages about 10k/sec.
>>>>>> There were no searches going on at the time of the incident.
>>>>>>
>>>>>> It appears to me that node 10.0.0.12 began timing out requests to
>>>>>> 10.0.0.45, indicating that 10.0.0.45 was having issues.
>>>>>> Then at 4:36, 10.0.0.12 logs the ERROR about "Uncaught exception:
>>>>>>  IndexWriter already closed", caused by an OOME.
>>>>>> Then at 4:43, 10.0.0.45 hits the "Create failed" WARN, and logs an
>>>>>> OOME.
>>>>>> Then things are basically down and unresponsive.
>>>>>>
>>>>>> What is weird to me is that if 10.0.0.45 was the node having issues,
>>>>>> why did 10.0.0.12 log an exception 7 minutes before that?  Did both nodes
>>>>>> run out of memory?  Or is one of the Exceptions actually saying, "I see
>>>>>> that this other node hit an OOME, and I'm telling you about it."
>>>>>>
>>>>>> I have a few values tweaked in the elasticsearch.yml file to try and
>>>>>> keep this from happening (configured from Puppet):
>>>>>>
>>>>>>     'indices.breaker.fielddata.limit' => '20%',
>>>>>>     'indices.breaker.total.limit' => '25%',
>>>>>>     'indices.breaker.request.limit' => '10%',
>>>>>>     'index.merge.scheduler.type' => 'concurrent',
>>>>>>     'index.merge.scheduler.max_thread_count' => '1',
>>>>>>     'index.merge.policy.type' => 'tiered',
>>>>>>     'index.merge.policy.max_merged_segment' => '1gb',
>>>>>>     'index.merge.policy.segments_per_tier' => '4',
>>>>>>     'index.merge.policy.max_merge_at_once' => '4',
>>>>>>     'index.merge.policy.max_merge_at_once_explicit' => '4',
>>>>>>     'indices.memory.index_buffer_size' => '10%',
>>>>>>     'indices.store.throttle.type' => 'none',
>>>>>>     'index.translog.flush_threshold_size' => '1GB',
>>>>>>
>>>>>> I have done a fair bit of reading on this, and have tried about
>>>>>> everything I can think of. :(
>>>>>>
>>>>>> Can anyone tell me what caused this scenario, and what can be done to
>>>>>> avoid it?
>>>>>> Thank you so much for taking the time to read this.
>>>>>> Chris
>>>>>>
>>>>>> =====
>>>>>> *On server 10.0.0.12:*
>>>>>>
>>>>>> [2015-03-04 03:56:12,548][WARN ][transport                ]
>>>>>> [elasticsearch-ip-10-0-0-12] Received response for a request that has
>>>>>> timed out, sent [20456ms] ago, timed out [5392ms] ago, action
>>>>>> [cluster:monitor/nodes/stats[n]], node
>>>>>> [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}],
>>>>>> id [70061596]
>>>>>> [2015-03-04 04:06:02,407][INFO ][index.engine.internal    ]
>>>>>> [elasticsearch-ip-10-0-0-12] [derbysoft-ihg-20150304][2] now throttling
>>>>>> indexing: numMergesInFlight=4, maxNumMerges=3
>>>>>> [2015-03-04 04:06:04,141][INFO ][index.engine.internal    ]
>>>>>> [elasticsearch-ip-10-0-0-12] [derbysoft-ihg-20150304][2] stop throttling
>>>>>> indexing: numMergesInFlight=2, maxNumMerges=3
>>>>>> [2015-03-04 04:12:26,194][WARN ][transport                ]
>>>>>> [elasticsearch-ip-10-0-0-12] Received response for a request that has
>>>>>> timed out, sent [15709ms] ago, timed out [708ms] ago, action
>>>>>> [cluster:monitor/nodes/stats[n]], node
>>>>>> [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}],
>>>>>> id [70098828]
>>>>>> [2015-03-04 04:23:40,778][WARN ][transport                ]
>>>>>> [elasticsearch-ip-10-0-0-12] Received response for a request that has
>>>>>> timed out, sent [21030ms] ago, timed out [6030ms] ago, action
>>>>>> [cluster:monitor/nodes/stats[n]], node
>>>>>> [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}],
>>>>>> id [70124234]
>>>>>> [2015-03-04 04:24:47,023][WARN ][transport                ]
>>>>>> [elasticsearch-ip-10-0-0-12] Received response for a request that has
>>>>>> timed out, sent [27275ms] ago, timed out [12275ms] ago, action
>>>>>> [cluster:monitor/nodes/stats[n]], node
>>>>>> [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}],
>>>>>> id [70126273]
>>>>>> [2015-03-04 04:25:39,180][WARN ][transport                ]
>>>>>> [elasticsearch-ip-10-0-0-12] Received response for a request that has
>>>>>> timed out, sent [19431ms] ago, timed out [4431ms] ago, action
>>>>>> [cluster:monitor/nodes/stats[n]], node
>>>>>> [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}],
>>>>>> id [70127835]
>>>>>> [2015-03-04 04:26:40,775][WARN ][transport                ]
>>>>>> [elasticsearch-ip-10-0-0-12] Received response for a request that has
>>>>>> timed out, sent [19241ms] ago, timed out [4241ms] ago, action
>>>>>> [cluster:monitor/nodes/stats[n]], node
>>>>>> [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}],
>>>>>> id [70129981]
>>>>>> [2015-03-04 04:27:14,329][WARN ][transport                ]
>>>>>> [elasticsearch-ip-10-0-0-12] Received response for a request that has
>>>>>> timed out, sent [22676ms] ago, timed out [6688ms] ago, action
>>>>>> [cluster:monitor/nodes/stats[n]], node
>>>>>> [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}],
>>>>>> id [70130668]
>>>>>> [2015-03-04 04:28:15,695][WARN ][transport                ]
>>>>>> [elasticsearch-ip-10-0-0-12] Received response for a request that has
>>>>>> timed out, sent [24042ms] ago, timed out [9041ms] ago, action
>>>>>> [cluster:monitor/nodes/stats[n]], node
>>>>>> [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}],
>>>>>> id [70132644]
>>>>>> [2015-03-04 04:29:38,102][WARN ][transport                ]
>>>>>> [elasticsearch-ip-10-0-0-12] Received response for a request that has
>>>>>> timed out, sent [16448ms] ago, timed out [1448ms] ago, action
>>>>>> [cluster:monitor/nodes/stats[n]], node
>>>>>> [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}],
>>>>>> id [70135333]
>>>>>> [2015-03-04 04:33:42,393][WARN ][transport                ]
>>>>>> [elasticsearch-ip-10-0-0-12] Received response for a request that has
>>>>>> timed out, sent [20738ms] ago, timed out [5737ms] ago, action
>>>>>> [cluster:monitor/nodes/stats[n]], node
>>>>>> [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}],
>>>>>> id [70142427]
>>>>>> [2015-03-04 04:36:08,788][ERROR][marvel.agent             ]
>>>>>> [elasticsearch-ip-10-0-0-12] Background thread had an uncaught exception:
>>>>>> org.apache.lucene.store.AlreadyClosedException: this IndexWriter is
>>>>>> closed
>>>>>>         at
>>>>>> org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:698)
>>>>>>         at
>>>>>> org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:712)
>>>>>>         at
>>>>>> org.apache.lucene.index.IndexWriter.ramBytesUsed(IndexWriter.java:462)
>>>>>>         at
>>>>>> org.elasticsearch.index.engine.internal.InternalEngine.segmentsStats(InternalEngine.java:1224)
>>>>>>         at
>>>>>> org.elasticsearch.index.shard.service.InternalIndexShard.segmentStats(InternalIndexShard.java:555)
>>>>>>         at
>>>>>> org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:170)
>>>>>>         at
>>>>>> org.elasticsearch.action.admin.indices.stats.ShardStats.<init>(ShardStats.java:49)
>>>>>>         at
>>>>>> org.elasticsearch.indices.InternalIndicesService.stats(InternalIndicesService.java:212)
>>>>>>         at
>>>>>> org.elasticsearch.indices.InternalIndicesService.stats(InternalIndicesService.java:172)
>>>>>>         at
>>>>>> org.elasticsearch.node.service.NodeService.stats(NodeService.java:138)
>>>>>>         at
>>>>>> org.elasticsearch.marvel.agent.AgentService$ExportingWorker.exportNodeStats(AgentService.java:300)
>>>>>>         at
>>>>>> org.elasticsearch.marvel.agent.AgentService$ExportingWorker.run(AgentService.java:225)
>>>>>>         at java.lang.Thread.run(Thread.java:745)
>>>>>> Caused by: java.lang.OutOfMemoryError: Java heap space
>>>>>>
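
One thing that can be read directly off the transport warnings above: each one reports how long ago the request was sent and how long ago it timed out, so the difference is the timeout that was in effect. The sketch below extracts both numbers from the ten warnings and computes that difference. The fragments are copied from the log lines above; the regular expression and the function name `implied_timeout_ms` are made up for this example.

```python
import re

# "sent ... timed out ..." fragments copied from the ten WARN entries above.
fragments = [
    "sent [20456ms] ago, timed out [5392ms] ago",
    "sent [15709ms] ago, timed out [708ms] ago",
    "sent [21030ms] ago, timed out [6030ms] ago",
    "sent [27275ms] ago, timed out [12275ms] ago",
    "sent [19431ms] ago, timed out [4431ms] ago",
    "sent [19241ms] ago, timed out [4241ms] ago",
    "sent [22676ms] ago, timed out [6688ms] ago",
    "sent [24042ms] ago, timed out [9041ms] ago",
    "sent [16448ms] ago, timed out [1448ms] ago",
    "sent [20738ms] ago, timed out [5737ms] ago",
]

PATTERN = re.compile(r"sent \[(\d+)ms\] ago, timed out \[(\d+)ms\] ago")


def implied_timeout_ms(fragment: str) -> int:
    """Return (time since sent) minus (time since timeout) in milliseconds."""
    match = PATTERN.search(fragment)
    if match is None:
        raise ValueError(f"unrecognised fragment: {fragment!r}")
    sent_ms, timed_out_ms = (int(group) for group in match.groups())
    return sent_ms - timed_out_ms


timeouts = [implied_timeout_ms(fragment) for fragment in fragments]
response_times = [int(PATTERN.search(f).group(1)) for f in fragments]

print("implied timeouts (ms):", timeouts)
print("slowest stats response (ms):", max(response_times))
```

The implied timeout comes out at about 15 seconds in every case, which means 10.0.0.45 was taking between roughly 16 and 27 seconds to answer node stats requests. That fits long pauses on 10.0.0.45, but on its own it does not show what caused them.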
>>>>>> =====
>>>>>> *On server 10.0.0.45:*
>>>>>>
>>>>>> [2015-03-04 04:43:27,245][WARN ][index.engine.internal    ]
>>>>>> [elasticsearch-ip-10-0-0-45] [myindex-20150304][1] failed engine
>>>>>> [indices:data/write/bulk[s] failed on replica]
>>>>>> org.elasticsearch.index.engine.CreateFailedEngineException:
>>>>>> [myindex-20150304][1] Create failed for [my_type#AUvjGHoiku-fZf277h_4]
>>>>>>         at
>>>>>> org.elasticsearch.index.engine.internal.InternalEngine.create(InternalEngine.java:421)
>>>>>>         at
>>>>>> org.elasticsearch.index.shard.service.InternalIndexShard.create(InternalIndexShard.java:403)
>>>>>>         at
>>>>>> org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:595)
>>>>>>         at
>>>>>> org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$ReplicaOperationTransportHandler.messageReceived(TransportShardReplicationOperationAction.java:246)
>>>>>>         at
>>>>>> org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$ReplicaOperationTransportHandler.messageReceived(TransportShardReplicationOperationAction.java:225)
>>>>>>         at
>>>>>> org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
>>>>>>         at
>>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>         at
>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>         at java.lang.Thread.run(Thread.java:745)
>>>>>> Caused by: org.apache.lucene.store.AlreadyClosedException: this
>>>>>> IndexWriter is closed
>>>>>>         at
>>>>>> org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:698)
>>>>>>         at
>>>>>> org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:712)
>>>>>>         at
>>>>>> org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1507)
>>>>>>         at
>>>>>> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1246)
>>>>>>         at
>>>>>> org.elasticsearch.index.engine.internal.InternalEngine.innerCreateNoLock(InternalEngine.java:502)
>>>>>>         at
>>>>>> org.elasticsearch.index.engine.internal.InternalEngine.innerCreate(InternalEngine.java:444)
>>>>>>         at
>>>>>> org.elasticsearch.index.engine.internal.InternalEngine.create(InternalEngine.java:413)
>>>>>>         ... 8 more
>>>>>> Caused by: java.lang.OutOfMemoryError: Java heap space
>>>>>>
>>>>>> =====
>>>>>>
>>>>>>
>>>>>>
>>>>>>  --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "elasticsearch" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to [email protected].
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/elasticsearch/CAND3DphzaT3Np5TBW%2B-h_aOo9BScPu_5QO9qCqnYLp__JCjOPA%40mail.gmail.com
>>>>>> <https://groups.google.com/d/msgid/elasticsearch/CAND3DphzaT3Np5TBW%2B-h_aOo9BScPu_5QO9qCqnYLp__JCjOPA%40mail.gmail.com?utm_medium=email&utm_source=footer>
>>>>>> .
>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>
>>>>>
>>>>
>>>
>>
>

