A followup as promised: 5 days of uptime and no OOME yet. So far so good! One more update to come after a few more days...
Again, thank you Jörg! :)

On Wed, Sep 17, 2014 at 11:54 AM, Chris Neal <[email protected]> wrote:

Thank you so very much for the reply!

That makes sense. I will look at the gist as well, and make some changes to test.

Again, thank you for your time. I will report back with some results!
Chris

On Wed, Sep 17, 2014 at 11:36 AM, [email protected] <[email protected]> wrote:

I have a very similar cluster setup here (ES 1.3.2, 64G RAM, 3 nodes, Java 8, G1GC, ~100 shards, ~500g indexes on disk).

This is the culprit:

    max_merged_segment: 15gb

I recommend:

    max_merged_segment: 1gb

See also https://gist.github.com/jprante/10666960 (which also holds for ES 1.2 and ES 1.3; these versions have better out-of-the-box defaults for merging).

With this I can use an 8g heap for my workload.

Rule of thumb: at any time, your heap must be able to cope with an extra allocation of max_merged_segment (this is NOT what happens behind the scenes, it is just a rough estimate).

With 15g in your settings, the risk of overallocating the heap is high when your index gets large.

Jörg
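To put rough numbers on that rule of thumb: the node described below runs a 16gb heap that idles around 13.4gb used, leaving only ~2.6gb of headroom, while max_merged_segment permits a single 15gb merged segment. A 1gb limit keeps merges comfortably inside that headroom. A minimal sketch of applying the change to the live index (the index name is taken from the logs further down; this assumes the default HTTP port, and that merge policy settings are dynamically updatable on this ES version, which is worth verifying first):

    # Lower the merge ceiling on the existing daily index:
    curl -XPUT 'http://localhost:9200/derbysoft-20140730/_settings' -d '
    {
      "index.merge.policy.max_merged_segment": "1gb"
    }'

    # And make it the default for future daily indexes in elasticsearch.yml:
    #   index:
    #     merge:
    #       policy:
    #         max_merged_segment: 1gb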
On Wed, Sep 17, 2014 at 5:43 PM, Chris Neal <[email protected]> wrote:

Sorry to bump my own thread, but it's been a while and I was hoping to get some more eyes on this. I've since added a third node to the cluster to see if that helps, but it did not. I still see these OOMEs on merges on any of the three nodes in the cluster.

I have also increased the shard count to 3 to match the number of nodes in the cluster.

The error happens on an index that is 44GB in size.

The process in top looks like this:
================
top - 15:39:59 up 63 days, 18:34, 2 users, load average: 0.77, 0.66, 0.71
Tasks: 343 total, 1 running, 342 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.1%us, 0.1%sy, 0.0%ni, 98.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 30688804k total, 27996932k used, 2691872k free, 62760k buffers
Swap: 10485752k total, 5832k used, 10479920k free, 9434584k cached

  PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
29539 elastics 20   0  207g  17g 1.1g S 20.2 60.9 2874:16  java
================

The process using the 5MB of swap is not Elasticsearch, just FYI.

If there is any more information I can provide, please let me know. I'm getting a bit desperate to get this one resolved!
Thank you so much for your time.
Chris

On Thu, Jul 31, 2014 at 10:06 AM, Chris Neal <[email protected]> wrote:

Oops, sorry. That was a copy/paste error. It is using 16GB. Here are the correct process arguments:

/usr/bin/java -Xms16g -Xmx16g -Xss256k -Djava.awt.headless=true -server -XX:+UseCompressedOops -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -Delasticsearch -Des.pidfile=/var/run/elasticsearch/elasticsearch.pid -Des.path.home=/usr/share/elasticsearch [snip CP]

Thanks!
Chris

On Thu, Jul 31, 2014 at 2:43 AM, David Pilato <[email protected]> wrote:

Why do you start with an 8gb heap? Can't you give it 16gb or so?

/usr/bin/java -Xms8g -Xmx8g

--
David ;-)
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

On 30 Jul 2014, at 19:47, Chris Neal <[email protected]> wrote:

Hi everyone,

First off, apologies for the thread. I know OOME discussions are somewhat overdone in this group, but I need to reach out for some help on this one.

I have a 2-node development cluster in EC2 on c3.4xlarge instances. That means 16 vCPUs, 30GB RAM, and a 1Gb network, and I have two 500GB EBS volumes for Elasticsearch data on each instance.

I'm running Java 1.7.0_55 and using the G1 collector. The Java args are:

/usr/bin/java -Xms8g -Xmx8g -Xss256k -Djava.awt.headless=true -server -XX:+UseCompressedOops -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError

The index has 2 shards, each with 1 replica.

I have a daily index being filled with application log data. The index, on average, gets to be about:
486M documents
53.1GB (primary size)
106.2GB (total size)

Other than indexing, there really is nothing going on in the cluster. No searches or percolators, just collecting data.

I have:
- Tweaked the index.merge.policy
- Tweaked the indices.fielddata.breaker.limit and cache.size
- Changed the index refresh_interval from 1s to 60s
- Created a default template for the index such that _all is disabled and all fields in the mapping are set to "not_analyzed"

Here is my complete elasticsearch.yml:

action:
  disable_delete_all_indices: true
cluster:
  name: elasticsearch-dev
discovery:
  zen:
    minimum_master_nodes: 2
    ping:
      multicast:
        enabled: false
      unicast:
        hosts: 10.0.0.45,10.0.0.41
gateway:
  recover_after_nodes: 2
index:
  merge:
    policy:
      max_merge_at_once: 5
      max_merged_segment: 15gb
  number_of_replicas: 1
  number_of_shards: 2
  refresh_interval: 60s
indices:
  fielddata:
    breaker:
      limit: 50%
    cache:
      size: 30%
node:
  name: elasticsearch-ip-10-0-0-45
path:
  data:
    - /usr/local/ebs01/elasticsearch
    - /usr/local/ebs02/elasticsearch
threadpool:
  bulk:
    queue_size: 500
    size: 75
    type: fixed
  get:
    queue_size: 200
    size: 100
    type: fixed
  index:
    queue_size: 1000
    size: 100
    type: fixed
  search:
    queue_size: 200
    size: 100
    type: fixed
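While chasing this, it may help to watch heap use and per-shard segment sizes as the daily index fills. Two read-only calls against the stock monitoring APIs (a sketch assuming the default HTTP port; the index name matches the daily pattern in the logs below):

    # Per-node JVM stats: watch jvm.mem.heap_used_in_bytes against heap_max_in_bytes
    curl 'http://localhost:9200/_nodes/stats/jvm?pretty'

    # Per-shard segment listing: segments approaching max_merged_segment are
    # the ones a large merge will try to produce or grow further
    curl 'http://localhost:9200/derbysoft-20140730/_segments?pretty'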
The heap sits at about 13GB used. I had been battling OOME exceptions for a while and thought I had it licked, but one just popped up again. My cluster has been up and running fine for 14 days, and I just got this OOME:

=====
[2014-07-30 11:52:28,394][INFO ][monitor.jvm] [elasticsearch-ip-10-0-0-41] [gc][young][1158834][109906] duration [770ms], collections [1]/[1s], total [770ms]/[43.2m], memory [13.4gb]->[13.4gb]/[16gb], all_pools {[young] [648mb]->[8mb]/[0b]}{[survivor] [0b]->[0b]/[0b]}{[old] [12.8gb]->[13.4gb]/[16gb]}
[2014-07-30 15:03:01,070][WARN ][index.engine.internal] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][0] failed engine [out of memory]
[2014-07-30 15:03:10,324][WARN ][netty.channel.socket.nio.AbstractNioSelector] Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: Java heap space
[2014-07-30 15:03:10,335][WARN ][netty.channel.socket.nio.AbstractNioSelector] Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: Java heap space
[2014-07-30 15:03:10,324][WARN ][index.merge.scheduler] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][0] failed to merge
java.lang.OutOfMemoryError: Java heap space
[2014-07-30 15:03:28,595][WARN ][index.translog] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][0] failed to flush shard on translog threshold
org.elasticsearch.index.engine.FlushFailedEngineException: [derbysoft-20140730][0] Flush failed
    at org.elasticsearch.index.engine.internal.InternalEngine.flush(InternalEngine.java:805)
    at org.elasticsearch.index.shard.service.InternalIndexShard.flush(InternalIndexShard.java:604)
    at org.elasticsearch.index.translog.TranslogService$TranslogBasedFlush$1.run(TranslogService.java:202)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot commit
    at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4416)
    at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2989)
    at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3096)
    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3063)
    at org.elasticsearch.index.engine.internal.InternalEngine.flush(InternalEngine.java:797)
    ... 5 more
[2014-07-30 15:03:28,658][WARN ][cluster.action.shard] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][0] sending failed shard for [derbysoft-20140730][0], node[W-7FsjjZTyOXZdaJhhqxEA], [R], s[STARTED], indexUUID [QC5Sg0FDSnOGUiFg30qNxA], reason [engine failure, message [out of memory][IllegalStateException[this writer hit an OutOfMemoryError; cannot commit]]]
[2014-07-30 15:34:36,418][WARN ][netty.channel.socket.nio.AbstractNioSelector] Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: Java heap space
[2014-07-30 15:34:39,847][WARN ][netty.channel.socket.nio.AbstractNioSelector] Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: Java heap space
[2014-07-30 15:34:42,873][WARN ][index.merge.scheduler] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][1] failed to merge
java.lang.OutOfMemoryError: Java heap space
[2014-07-30 15:34:42,873][WARN ][index.engine.internal] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][1] failed engine [merge exception]
[2014-07-30 15:34:43,185][WARN ][cluster.action.shard] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][1] sending failed shard for [derbysoft-20140730][1], node[W-7FsjjZTyOXZdaJhhqxEA], [P], s[STARTED], indexUUID [QC5Sg0FDSnOGUiFg30qNxA], reason [engine failure, message [merge exception][MergeException[java.lang.OutOfMemoryError: Java heap space]; nested: OutOfMemoryError[Java heap space]; ]]
[2014-07-30 15:57:42,531][WARN ][indices.recovery] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][1] recovery from [[elasticsearch-ip-10-0-0-45][AjN-6_DHQK6B8NJgfphMvA][ip-10-0-0-45.us-west-2.compute.internal][inet[/10.0.0.45:9300]]] failed
org.elasticsearch.transport.RemoteTransportException: [elasticsearch-ip-10-0-0-45][inet[/10.0.0.45:9300]][index/shard/recovery/startRecovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: [derbysoft-20140730][1] Phase[2] Execution failed
    at org.elasticsearch.index.engine.internal.InternalEngine.recover(InternalEngine.java:1011)
    at org.elasticsearch.index.shard.service.InternalIndexShard.recover(InternalIndexShard.java:631)
    at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:122)
    at org.elasticsearch.indices.recovery.RecoverySource.access$1600(RecoverySource.java:62)
    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:351)
    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:337)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:270)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [elasticsearch-ip-10-0-0-41][inet[/10.0.0.41:9300]][index/shard/recovery/prepareTranslog] request_id [13988539] timed out after [900000ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:369)
    ... 3 more
[2014-07-30 15:57:42,534][WARN ][indices.cluster] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][1] failed to start shard
org.elasticsearch.indices.recovery.RecoveryFailedException: [derbysoft-20140730][1]: Recovery failed from [elasticsearch-ip-10-0-0-45][AjN-6_DHQK6B8NJgfphMvA][ip-10-0-0-45.us-west-2.compute.internal][inet[/10.0.0.45:9300]] into [elasticsearch-ip-10-0-0-41][W-7FsjjZTyOXZdaJhhqxEA][ip-10-0-0-41.us-west-2.compute.internal][inet[ip-10-0-0-41.us-west-2.compute.internal/10.0.0.41:9300]]
    at org.elasticsearch.indices.recovery.RecoveryTarget.doRecovery(RecoveryTarget.java:306)
    at org.elasticsearch.indices.recovery.RecoveryTarget.access$300(RecoveryTarget.java:65)
    at org.elasticsearch.indices.recovery.RecoveryTarget$3.run(RecoveryTarget.java:184)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Caused by: org.elasticsearch.transport.RemoteTransportException: [elasticsearch-ip-10-0-0-45][inet[/10.0.0.45:9300]][index/shard/recovery/startRecovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: [derbysoft-20140730][1] Phase[2] Execution failed
    at org.elasticsearch.index.engine.internal.InternalEngine.recover(InternalEngine.java:1011)
    at org.elasticsearch.index.shard.service.InternalIndexShard.recover(InternalIndexShard.java:631)
    at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:122)
    at org.elasticsearch.indices.recovery.RecoverySource.access$1600(RecoverySource.java:62)
    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:351)
    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:337)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:270)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [elasticsearch-ip-10-0-0-41][inet[/10.0.0.41:9300]][index/shard/recovery/prepareTranslog] request_id [13988539] timed out after [900000ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:369)
    ... 3 more
[2014-07-30 15:57:42,535][WARN ][cluster.action.shard] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][1] sending failed shard for [derbysoft-20140730][1], node[W-7FsjjZTyOXZdaJhhqxEA], [R], s[INITIALIZING], indexUUID [QC5Sg0FDSnOGUiFg30qNxA], reason [Failed to start shard, message [RecoveryFailedException[[derbysoft-20140730][1]: Recovery failed from [elasticsearch-ip-10-0-0-45][AjN-6_DHQK6B8NJgfphMvA][ip-10-0-0-45.us-west-2.compute.internal][inet[/10.0.0.45:9300]] into [elasticsearch-ip-10-0-0-41][W-7FsjjZTyOXZdaJhhqxEA][ip-10-0-0-41.us-west-2.compute.internal][inet[ip-10-0-0-41.us-west-2.compute.internal/10.0.0.41:9300]]]; nested: RemoteTransportException[[elasticsearch-ip-10-0-0-45][inet[/10.0.0.45:9300]][index/shard/recovery/startRecovery]]; nested: RecoveryEngineException[[derbysoft-20140730][1] Phase[2] Execution failed]; nested: ReceiveTimeoutTransportException[[elasticsearch-ip-10-0-0-41][inet[/10.0.0.41:9300]][index/shard/recovery/prepareTranslog] request_id [13988539] timed out after [900000ms]]; ]]
=====

I'm a bit at a loss as to what to try next to address this problem. Can anyone offer a suggestion?
Thanks for reading this.

Chris
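One more diagnostic worth a look: -XX:+HeapDumpOnOutOfMemoryError is already in the JVM args above, so each OOME should leave an .hprof file behind (in the process working directory, unless -XX:HeapDumpPath says otherwise). A quick way to see what actually dominates the heap, using the jhat tool that ships with JDK 7 (the dump file name here is hypothetical):

    # Give jhat a large heap of its own; a 16GB dump is slow to index
    jhat -J-Xmx20g java_pid29539.hprof
    # Then browse http://localhost:7000/ and inspect the largest retained
    # objects; for merge-time OOMEs, Lucene merge state would be the suspect.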
