A node's limit is hard to pin down because use cases vary so much, but from what I have seen, 3TB on 3 nodes is pretty dense.
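
To put rough numbers on that, here is a minimal sketch using the figures from your earlier message (928 shards, 2.27TB, 3 nodes, 16GB heaps). It assumes shards and data are spread evenly across nodes and that the shard and data totals already include replicas, which I have not confirmed. The function name is only for illustration, and the ratios are a back-of-the-envelope view, not an official sizing rule.

```python
def per_node_density(total_shards, total_data_tb, nodes, heap_gb):
    """Return (shards per node, TB per node, GB of data per GB of heap).

    Assumes an even spread of shards and data across all nodes.
    """
    shards_per_node = total_shards / nodes
    data_per_node_tb = total_data_tb / nodes
    data_gb_per_heap_gb = (data_per_node_tb * 1024) / heap_gb
    return shards_per_node, data_per_node_tb, data_gb_per_heap_gb


# Compare the current 3 nodes with adding one or two more.
for nodes in (3, 4, 5):
    shards, tb, ratio = per_node_density(928, 2.27, nodes, 16)
    print(f"{nodes} nodes: {shards:.0f} shards/node, {tb:.2f}TB/node, "
          f"{ratio:.0f}GB of data per GB of heap")
```

Every shard carries some fixed heap overhead, so the shards-per-node figure matters as much as the raw data volume when you compare the three cases.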
On 12 March 2015 at 08:09, Chris Neal <[email protected]> wrote:
> Thank you Mark.
>
> May I ask what about my answers caused you to say "definitely"? :) I want
> to better understand capacity-related items for ES for sure.
>
> Many thanks!
> Chris
>
> On Wed, Mar 11, 2015 at 2:13 PM, Mark Walkom <[email protected]> wrote:
>
>> Then you're definitely going to be seeing node pressure. I'd add another
>> one or two and see how things look after that.
>>
>> On 11 March 2015 at 07:21, Chris Neal <[email protected]> wrote:
>>
>>> Again Mark, thank you for your time :)
>>>
>>> 157 Indices
>>> 928 Shards
>>> Daily indexing that adds 7 indexes per day
>>> Each index has 3 shards and 1 replica
>>> 2.27TB of data in the cluster
>>> Index rate averages about 1500/sec
>>> IOps on the servers is ~40
>>>
>>> Chris
>>>
>>> On Tue, Mar 10, 2015 at 7:57 PM, Mark Walkom <[email protected]> wrote:
>>>
>>>> It looks like heap pressure.
>>>> How many indices, how many shards, how much data do you have in the
>>>> cluster?
>>>>
>>>> On 8 March 2015 at 19:24, Chris Neal <[email protected]> wrote:
>>>>
>>>>> Thank you Mark for your reply.
>>>>>
>>>>> I do have Marvel running, on a separate cluster even, so I do have
>>>>> that data from the time of the problem. I've attached 4 screenshots for
>>>>> reference.
>>>>>
>>>>> It appears that node 10.0.0.12 (the green line on the charts) had
>>>>> issues. The heap usage drops from 80% to 0%. I'm guessing that is some
>>>>> sort of crash, because the heap should not empty itself. Also its load
>>>>> goes to 0.
>>>>>
>>>>> I also see a lot of Old GC duration on 10.0.0.45 (blue line). Lots of
>>>>> excessive Old GC Counts, so it does appear that the problem was memory
>>>>> pressure on this node. That's what I was thinking, but was hoping for
>>>>> validation on that.
>>>>>
>>>>> If it was, I'm hoping to get some suggestions on what to do about it.
>>>>> As I mentioned in the original post, I've tweaked what I think needs
>>>>> tweaking based on the system, and it still happens.
>>>>>
>>>>> Maybe it's just that I'm pushing the cluster too much for the
>>>>> resources I'm giving it, and it "just won't work".
>>>>>
>>>>> The index rate was only about 2500/sec, and the search request rate
>>>>> had one small spike that went to 3.0. But 3 searches in one timeslice is
>>>>> nothing.
>>>>>
>>>>> Thanks again for the help and reading all this stuff. It is
>>>>> appreciated. Hopefully I can get a solution to keep the cluster stable.
>>>>>
>>>>> Chris
>>>>>
>>>>> On Fri, Mar 6, 2015 at 3:01 PM, Mark Walkom <[email protected]> wrote:
>>>>>
>>>>>> You really need some kind of monitoring, like Marvel, around this to
>>>>>> give you an idea of what was happening prior to the OOM.
>>>>>> Generally a node becoming unresponsive will be due to GC, so take a
>>>>>> look at the timings there.
>>>>>>
>>>>>> On 5 March 2015 at 02:32, Chris Neal <[email protected]> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I'm hoping someone can help me piece together the below log
>>>>>>> entries/stack traces/Exceptions. I have a 3 node cluster in
>>>>>>> Development in EC2, and two of them had issues. I'm running ES 1.4.4,
>>>>>>> 32GB RAM, 16GB heaps, dedicated servers to ES. My index rate averages
>>>>>>> about 10k/sec. There were no searches going on at the time of the
>>>>>>> incident.
>>>>>>>
>>>>>>> It appears to me that node 10.0.0.12 began timing out requests to
>>>>>>> 10.0.0.45, indicating that 10.0.0.45 was having issues.
>>>>>>> Then at 4:36, 10.0.0.12 logs the ERROR about "Uncaught exception:
>>>>>>> IndexWriter already closed", caused by an OOME.
>>>>>>> Then at 4:43, 10.0.0.45 hits the "Create failed" WARN, and logs an
>>>>>>> OOME.
>>>>>>> Then things are basically down and unresponsive.
>>>>>>>
>>>>>>> What is weird to me is that if 10.0.0.45 was the node having issues,
>>>>>>> why did 10.0.0.12 log an exception 7 minutes before that? Did both
>>>>>>> nodes run out of memory? Or is one of the Exceptions actually saying,
>>>>>>> "I see that this other node hit an OOME, and I'm telling you about it."
>>>>>>>
>>>>>>> I have a few values tweaked in the elasticsearch.yml file to try and
>>>>>>> keep this from happening (configured from Puppet):
>>>>>>>
>>>>>>> 'indices.breaker.fielddata.limit' => '20%',
>>>>>>> 'indices.breaker.total.limit' => '25%',
>>>>>>> 'indices.breaker.request.limit' => '10%',
>>>>>>> 'index.merge.scheduler.type' => 'concurrent',
>>>>>>> 'index.merge.scheduler.max_thread_count' => '1',
>>>>>>> 'index.merge.policy.type' => 'tiered',
>>>>>>> 'index.merge.policy.max_merged_segment' => '1gb',
>>>>>>> 'index.merge.policy.segments_per_tier' => '4',
>>>>>>> 'index.merge.policy.max_merge_at_once' => '4',
>>>>>>> 'index.merge.policy.max_merge_at_once_explicit' => '4',
>>>>>>> 'indices.memory.index_buffer_size' => '10%',
>>>>>>> 'indices.store.throttle.type' => 'none',
>>>>>>> 'index.translog.flush_threshold_size' => '1GB',
>>>>>>>
>>>>>>> I have done a fair bit of reading on this, and have tried about
>>>>>>> everything I can think of. :(
>>>>>>>
>>>>>>> Can anyone tell me what caused this scenario, and what can be done
>>>>>>> to avoid it?
>>>>>>> Thank you so much for taking the time to read this.
>>>>>>> Chris
>>>>>>>
>>>>>>> =====
>>>>>>> On server 10.0.0.12:
>>>>>>>
>>>>>>> [2015-03-04 03:56:12,548][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [20456ms] ago, timed out [5392ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70061596]
>>>>>>> [2015-03-04 04:06:02,407][INFO ][index.engine.internal ] [elasticsearch-ip-10-0-0-12] [derbysoft-ihg-20150304][2] now throttling indexing: numMergesInFlight=4, maxNumMerges=3
>>>>>>> [2015-03-04 04:06:04,141][INFO ][index.engine.internal ] [elasticsearch-ip-10-0-0-12] [derbysoft-ihg-20150304][2] stop throttling indexing: numMergesInFlight=2, maxNumMerges=3
>>>>>>> [2015-03-04 04:12:26,194][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [15709ms] ago, timed out [708ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70098828]
>>>>>>> [2015-03-04 04:23:40,778][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [21030ms] ago, timed out [6030ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70124234]
>>>>>>> [2015-03-04 04:24:47,023][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [27275ms] ago, timed out [12275ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70126273]
>>>>>>> [2015-03-04 04:25:39,180][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [19431ms] ago, timed out [4431ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70127835]
>>>>>>> [2015-03-04 04:26:40,775][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [19241ms] ago, timed out [4241ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70129981]
>>>>>>> [2015-03-04 04:27:14,329][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [22676ms] ago, timed out [6688ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70130668]
>>>>>>> [2015-03-04 04:28:15,695][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [24042ms] ago, timed out [9041ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70132644]
>>>>>>> [2015-03-04 04:29:38,102][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [16448ms] ago, timed out [1448ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70135333]
>>>>>>> [2015-03-04 04:33:42,393][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [20738ms] ago, timed out [5737ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70142427]
>>>>>>> [2015-03-04 04:36:08,788][ERROR][marvel.agent ] [elasticsearch-ip-10-0-0-12] Background thread had an uncaught exception:
>>>>>>> org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
>>>>>>>     at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:698)
>>>>>>>     at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:712)
>>>>>>>     at org.apache.lucene.index.IndexWriter.ramBytesUsed(IndexWriter.java:462)
>>>>>>>     at org.elasticsearch.index.engine.internal.InternalEngine.segmentsStats(InternalEngine.java:1224)
>>>>>>>     at org.elasticsearch.index.shard.service.InternalIndexShard.segmentStats(InternalIndexShard.java:555)
>>>>>>>     at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:170)
>>>>>>>     at org.elasticsearch.action.admin.indices.stats.ShardStats.<init>(ShardStats.java:49)
>>>>>>>     at org.elasticsearch.indices.InternalIndicesService.stats(InternalIndicesService.java:212)
>>>>>>>     at org.elasticsearch.indices.InternalIndicesService.stats(InternalIndicesService.java:172)
>>>>>>>     at org.elasticsearch.node.service.NodeService.stats(NodeService.java:138)
>>>>>>>     at org.elasticsearch.marvel.agent.AgentService$ExportingWorker.exportNodeStats(AgentService.java:300)
>>>>>>>     at org.elasticsearch.marvel.agent.AgentService$ExportingWorker.run(AgentService.java:225)
>>>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>>>> Caused by: java.lang.OutOfMemoryError: Java heap space
>>>>>>>
>>>>>>> =====
>>>>>>> On server 10.0.0.45:
>>>>>>>
>>>>>>> [2015-03-04 04:43:27,245][WARN ][index.engine.internal ] [elasticsearch-ip-10-0-0-45] [myindex-20150304][1] failed engine [indices:data/write/bulk[s] failed on replica]
>>>>>>> org.elasticsearch.index.engine.CreateFailedEngineException: [myindex-20150304][1] Create failed for [my_type#AUvjGHoiku-fZf277h_4]
>>>>>>>     at org.elasticsearch.index.engine.internal.InternalEngine.create(InternalEngine.java:421)
>>>>>>>     at org.elasticsearch.index.shard.service.InternalIndexShard.create(InternalIndexShard.java:403)
>>>>>>>     at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:595)
>>>>>>>     at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$ReplicaOperationTransportHandler.messageReceived(TransportShardReplicationOperationAction.java:246)
>>>>>>>     at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$ReplicaOperationTransportHandler.messageReceived(TransportShardReplicationOperationAction.java:225)
>>>>>>>     at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
>>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>>>> Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
>>>>>>>     at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:698)
>>>>>>>     at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:712)
>>>>>>>     at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1507)
>>>>>>>     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1246)
>>>>>>>     at org.elasticsearch.index.engine.internal.InternalEngine.innerCreateNoLock(InternalEngine.java:502)
>>>>>>>     at org.elasticsearch.index.engine.internal.InternalEngine.innerCreate(InternalEngine.java:444)
>>>>>>>     at org.elasticsearch.index.engine.internal.InternalEngine.create(InternalEngine.java:413)
>>>>>>>     ... 8 more
>>>>>>> Caused by: java.lang.OutOfMemoryError: Java heap space
>>>>>>>
>>>>>>> =====
>>>>>>>
>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "elasticsearch" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to [email protected].
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/elasticsearch/CAND3DphzaT3Np5TBW%2B-h_aOo9BScPu_5QO9qCqnYLp__JCjOPA%40mail.gmail.com
>>>>>>> <https://groups.google.com/d/msgid/elasticsearch/CAND3DphzaT3Np5TBW%2B-h_aOo9BScPu_5QO9qCqnYLp__JCjOPA%40mail.gmail.com?utm_medium=email&utm_source=footer>.
>>>>>>> For more options, visit https://groups.google.com/d/optout.
