Hi Jesse,

It looks like your nodes don't have enough memory to load the Lucene data structures into memory; the OutOfMemoryError in your stack trace below, raised while building an FST in BlockTreeTermsReader, is exactly Lucene loading a segment's terms index onto the heap. When you added the replicas you doubled the number of Lucene indices and thus doubled the memory requirements. To reduce memory you can consider using fewer shards per index (Lucene structures will be shared); this might make indexing slower, depending on your hardware and on what your shard settings are now.
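For scale: your logs show a shard [4], so you appear to be at the default of 5 shards per index. 90 daily indices x 5 shards x 2 copies (primary plus replica) is 900 shard copies, roughly 180 per node on your 5-node cluster, and each one holds its own terms index on the heap. Something like 2 shards per index would cut that to 360. A minimal sketch using an index template, assuming your daily indices all match the domain_url_* pattern from your logs (the template name and shard count are only illustrative; size them for your data volume):

curl -XPUT 'localhost:9200/_template/domain_url' -d '{
  "template": "domain_url_*",
  "settings": {
    "index.number_of_shards": 2,
    "index.number_of_replicas": 1
  }
}'

Note this only affects indices created after the template is in place; the existing 90 days keep their current shard count unless you reindex them.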
I also wonder why you tweaked the merge policy settings and thread pools. The defaults are good, and not merging correctly can also increase memory usage (see my note below your quoted elasticsearch.yml).

On Thursday, April 10, 2014 4:48:18 PM UTC+2, Jesse Davis wrote:
>
> I've also attached the output from running /_nodes/stats as well.
>
> On Wed, Apr 9, 2014 at 5:31 PM, Jesse Davis <[email protected]> wrote:
>
>> As well, the uncommented sections of our elasticsearch.yml:
>>
>> bootstrap.mlockall: true
>> gateway.type: local
>> gateway.recover_after_nodes: 4
>> gateway.recover_after_time: 5m
>> gateway.expected_nodes: 4
>> discovery.zen.minimum_master_nodes: 2
>> discovery.zen.ping.timeout: 20s
>> discovery.zen.ping.multicast.enabled: false
>>
>> index.search.slowlog.threshold.query.warn: 10s
>> index.search.slowlog.threshold.query.info: 5s
>> index.search.slowlog.threshold.query.debug: 2s
>> index.search.slowlog.threshold.query.trace: 500ms
>>
>> index.indexing.slowlog.threshold.index.warn: 10s
>> index.indexing.slowlog.threshold.index.info: 5s
>> index.indexing.slowlog.threshold.index.debug: 2s
>> index.indexing.slowlog.threshold.index.trace: 500ms
>>
>> threadpool:
>>   bulk:
>>     type: fixed
>>     min: 1
>>     size: 30
>>     wait_time: 30s
>>     queue_size: -1
>>   index:
>>     type: fixed
>>     min: 1
>>     size: 30
>>     wait_time: 30s
>>     queue_size: -1
>>
>> discovery.zen.fd.ping_interval: 10s
>> discovery.zen.fd.ping_timeout: 60s
>> discovery.zen.fd.ping_retries: 10
>>
>> index.translog.flush_threshold_ops: 20000
>> index.translog.flush_threshold_size: 400mb
>> index.translog.flush_threshold_period: 60m
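A note on the threadpool section just above: queue_size: -1 makes the bulk and index queues unbounded, so under pressure requests pile up on the heap instead of being rejected, which adds to your memory problem. I would simply delete both threadpool blocks from elasticsearch.yml (the defaults come back on restart, and the transient cluster settings you applied are dropped on a full cluster restart as well). For the per-index merge settings, as far as I know 1.x has no way to reset a setting back to "default", so you set the documented tiered-policy defaults explicitly. A sketch, with the index pattern only illustrative:

curl -XPUT 'localhost:9200/domain_url_*/_settings' -d '{
  "index.merge.policy.max_merge_at_once": 10,
  "index.merge.policy.segments_per_tier": 10
}'

And once your backfill is done, re-enable refreshing as you planned:

curl -XPUT 'localhost:9200/domain_url_*/_settings' -d '{
  "index.refresh_interval": "1s"
}'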
>> On Wednesday, April 9, 2014 5:02:11 PM UTC-4, Jesse Davis wrote:
>>
>>> We're attempting to create a new Elasticsearch cluster for indexing URLs, but have run into a memory leak when turning replication on for our indices.
>>>
>>> The current setup is: 5 x m2.2xlarge, 4 TB mounted on EBS per node (not Provisioned IOPS).
>>>
>>> We create one index per day, and will keep the past 90 days around for searching. We have been performing bulk inserts with routing enabled, 1 day at a time, and have been successful in loading all 90 days. This ended up being approximately 313 million documents. I had inserted with the number of replicas per index set to 0 to increase our bulk insertion rate. I then started changing the number of replicas per index to 1, one index at a time. I was able to successfully create the replicas for about 70 of the indices (i.e. about 65 or 70 days), but then ran out of heap space.
>>>
>>> We are planning to bulk insert about 2-4 million records per day in 10 minute intervals, so I would appreciate any advice on the validity of our configuration so far. In particular, we would like to know if there are any known memory leaks with shard replication or bulk inserts.
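One more thought on the replica roll-out you describe above: it is worth waiting for each index to go green before enabling the replica on the next one, so that only one recovery's worth of new segments is being opened at a time. A rough sketch, assuming the domain_url_* naming from your logs (the _cat column and the loop are illustrative; double-check them against the 1.0 docs):

for idx in $(curl -s 'localhost:9200/_cat/indices?h=index' | grep '^domain_url_'); do
  # add the replica for this one index
  curl -XPUT "localhost:9200/$idx/_settings" -d '{"index.number_of_replicas": 1}'
  # block until the replica is fully recovered before moving on
  curl -s "localhost:9200/_cluster/health/$idx?wait_for_status=green&timeout=30m" > /dev/null
done

That said, if the heap math above does not leave room for the replicas at all, pacing the roll-out only delays the OutOfMemoryError rather than preventing it.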
>>> Our configuration:
>>>
>>> Ubuntu 12.04 LTS
>>> Java 7 u51 (I am aware of https://groups.google.com/forum/#!msg/elasticsearch/D4WNQZSvqSU/zo7ancelKi4J and am doing a rolling restart of the cluster as we speak to move to Java 7 u25).
>>> Marvel was installed on each node, but in order to simplify our setup, I will be removing it during the aforementioned cluster restart.
>>>
>>> Elasticsearch 1.0.0
>>>
>>> "version" : {
>>>   "number" : "1.0.0",
>>>   "build_hash" : "a46900e9c72c0a623d71b54016357d5f94c8ea32",
>>>   "build_timestamp" : "2014-02-12T16:18:34Z",
>>>   "build_snapshot" : false,
>>>   "lucene_version" : "4.6"
>>> },
>>>
>>> Settings applied for our bulk insert:
>>>
>>> {
>>>   "index" : {
>>>     "merge.policy.max_merge_at_once" : 4,
>>>     "merge.policy.segments_per_tier" : 20,
>>>     "refresh_interval" : "-1"  # I will be setting this back to 1s when our backfill/replicas are done
>>>   }
>>> }
>>>
>>> {
>>>   "transient" : {
>>>     "index.merge.policy.merge_factor" : 30,
>>>     "threadpool.bulk.queue_size" : -1,
>>>     "index.merge.scheduler.max_thread_count" : 5
>>>   }
>>> }
>>>
>>> Our Java configuration variables (those that are different from the default /etc/default/elasticsearch in the .deb):
>>>
>>> JAVA_HOME=/usr/lib/jvm/java-1.7.0_25-oracle (this was Oracle's Java 7 u51, being backed down during the restart)
>>> ES_HEAP_SIZE=18g
>>> MAX_OPEN_FILES=256000
>>>
>>> From a running instance:
>>>
>>> /usr/lib/jvm/java-1.7.0_25-oracle/bin/java -Xms18g -Xmx18g -Xss256k -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -Delasticsearch -Des.pidfile=/var/run/elasticsearch.pid -Des.path.home=/usr/share/elasticsearch -cp :/usr/share/elasticsearch/lib/elasticsearch-1.0.0.jar:/usr/share/elasticsearch/lib/*:/usr/share/elasticsearch/lib/sigar/* -Des.default.config=/etc/elasticsearch/elasticsearch.yml -Des.default.path.home=/usr/share/elasticsearch -Des.default.path.logs=/var/log/elasticsearch -Des.default.path.data=/var/lib/elasticsearch -Des.default.path.work=/tmp/elasticsearch -Des.default.path.conf=/etc/elasticsearch org.elasticsearch.bootstrap.Elasticsearch
>>>
>>> The log message I saw during the OutOfMemoryError:
>>>
>>> [2014-04-09 14:17:28,393][WARN ][cluster.action.shard ] [esearch16] [.marvel-2014.04.09][0] received shard failed for [.marvel-2014.04.09][0], node[SK5okikgSWSdbrQdZWET8g], [R], s[INITIALIZING], indexUUID [K4IB1Px3RoOqcjPbta-fKw], reason [Failed to start shard, message [RecoveryFailedException[[.marvel-2014.04.09][0]: Recovery failed from [esearch16][EbYQ9HNzQtexkEZ1PgwpnQ][esearch16.tlys.us][inet[/10.145.167.184:9300]] into [esearch13][SK5okikgSWSdbrQdZWET8g][esearch13.tlys.us][inet[ip-10-185-195-69.ec2.internal/10.185.195.69:9300]]]; nested: RemoteTransportException[[esearch16][inet[/10.145.167.184:9300]][index/shard/recovery/startRecovery]]; nested: RecoveryEngineException[[.marvel-2014.04.09][0] Phase[2] Execution failed]; nested: RemoteTransportException[[esearch13][inet[/10.185.195.69:9300]][index/shard/recovery/prepareTranslog]]; nested: EngineCreationFailureException[[.marvel-2014.04.09][0] failed to create engine]; nested: LockObtainFailedException[Lock obtain timed out: NativeFSLock@/ebsmnt/data/elasticsearch/search-prod/nodes/0/indices/.marvel-2014.04.09/0/index/write.lock]; ]]
>>>
>>> [2014-04-09 14:24:47,111][WARN ][cluster.action.shard ] [esearch16] [domain_url_2014-01-03][4] received shard failed for [domain_url_2014-01-03][4], node[SK5okikgSWSdbrQdZWET8g], [R], s[STARTED], indexUUID [OkUadV5JSJGI2B9dKwgMLw], reason [engine failure, message [OutOfMemoryError[Java heap space]]]
>>>
>>> [2014-04-09 14:26:06,104][WARN ][cluster.action.shard ] [esearch16] [.marvel-2014.04.09][0] received shard failed for [.marvel-2014.04.09][0], node[SK5okikgSWSdbrQdZWET8g], [R], s[INITIALIZING], indexUUID [K4IB1Px3RoOqcjPbta-fKw], reason [Failed to start shard, message [RecoveryFailedException[[.marvel-2014.04.09][0]: Recovery failed from [esearch16][EbYQ9HNzQtexkEZ1PgwpnQ][esearch16.tlys.us][inet[/10.145.167.184:9300]] into [esearch13][SK5okikgSWSdbrQdZWET8g][esearch13.tlys.us][inet[ip-10-185-195-69.ec2.internal/10.185.195.69:9300]]]; nested: RemoteTransportException[[esearch16][inet[/10.145.167.184:9300]][index/shard/recovery/startRecovery]]; nested: RecoveryEngineException[[.marvel-2014.04.09][0] Phase[2] Execution failed]; nested: RemoteTransportException[[esearch13][inet[/10.185.195.69:9300]][index/shard/recovery/prepareTranslog]]; nested: EngineCreationFailureException[[.marvel-2014.04.09][0] failed to create engine]; nested: LockObtainFailedException[Lock obtain timed out: NativeFSLock@/ebsmnt/data/elasticsearch/search-prod/nodes/0/indices/.marvel-2014.04.09/0/index/write.lock]; ]]
>>>
>>> [2014-04-09 14:26:48,562][INFO ][cluster.metadata ] [esearch16] updating number_of_replicas to [0] for indices [.marvel-2014.04.09]
>>> [2014-04-09 14:27:27,235][INFO ][cluster.metadata ] [esearch16] updating number_of_replicas to [0] for indices [.marvel-2014.04.09]
>>> [2014-04-09 14:37:01,359][INFO ][cluster.metadata ] [esearch16] [.marvel-2014.04.09] update_mapping [shard_event] (dynamic)
>>> [2014-04-09 14:37:01,531][INFO ][cluster.metadata ] [esearch16] [.marvel-2014.04.09] update_mapping [routing_event] (dynamic)
>>>
>>> [2014-04-09 14:40:51,469][WARN ][cluster.action.shard ] [esearch16] [domain_url_2014-01-01][2] received shard failed for [domain_url_2014-01-01][2], node[BKCZOztRRP6FXVKJSkT_oA], [R], s[STARTED], indexUUID [jDcZyjUrSW6eD3_TH5v0_Q], reason [engine failure, message [OutOfMemoryError[Java heap space]]]
>>>
>>> [2014-04-09 14:41:00,353][WARN ][cluster.action.shard ] [esearch16] [domain_url_2014-03-11][2] received shard failed for [domain_url_2014-03-11][2], node[SK5okikgSWSdbrQdZWET8g], [R], s[STARTED], indexUUID [HuQzTDCmTMeS3He3DumnOg], reason [engine failure, message [OutOfMemoryError[Java heap space]]]
>>>
>>> [2014-04-09 15:04:32,504][WARN ][cluster.action.shard ] [esearch16] [domain_url_2014-01-03][2] received shard failed for [domain_url_2014-01-03][2], node[BKCZOztRRP6FXVKJSkT_oA], [R], s[STARTED], indexUUID [OkUadV5JSJGI2B9dKwgMLw], reason [engine failure, message [OutOfMemoryError[Java heap space]]]
>>>
>>> [2014-04-09 15:12:13,529][WARN ][cluster.action.shard ] [esearch16] [domain_url_2014-01-01][2] received shard failed for [domain_url_2014-01-01][2], node[BKCZOztRRP6FXVKJSkT_oA], [R], s[STARTED], indexUUID [jDcZyjUrSW6eD3_TH5v0_Q], reason [engine failure, message [OutOfMemoryError[Java heap space]]]
>>>
>>> [2014-04-09 15:39:24,021][WARN ][cluster.action.shard ] [esearch16] [domain_url_2014-01-03][1] received shard failed for [domain_url_2014-01-03][1], node[4ft2nd1lRE-BdvL2iYGIkg], relocating [BKCZOztRRP6FXVKJSkT_oA], [R], s[INITIALIZING], indexUUID [OkUadV5JSJGI2B9dKwgMLw], reason [Failed to start shard, message [RecoveryFailedException[[domain_url_2014-01-03][1]: Recovery failed from [esearch15][EkR2xgpURrunkxrRnpkzYQ][esearch15.tlys.us][inet[ip-10-185-171-146.ec2.internal/10.185.171.146:9300]] into [esearch14][4ft2nd1lRE-BdvL2iYGIkg][esearch14.tlys.us][inet[ip-10-184-39-23.ec2.internal/10.184.39.23:9300]]]; nested: RemoteTransportException[[esearch15][inet[/10.185.171.146:9300]][index/shard/recovery/startRecovery]]; nested: RecoveryEngineException[[domain_url_2014-01-03][1] Phase[2] Execution failed]; nested: RemoteTransportException[[esearch14][inet[/10.184.39.23:9300]][index/shard/recovery/prepareTranslog]]; nested: OutOfMemoryError[Java heap space]; ]]
>>> [2014-04-09 15:42:51,176][WARN ][cluster.action.shard ] [esearch16] [domain_url_2014-01-06][1] received shard failed for [domain_url_2014-01-06][1], node[4ft2nd1lRE-BdvL2iYGIkg], [R], s[STARTED], indexUUID [51jdwEMrTGKtTpA90ZjXiQ], reason [engine failure, message [OutOfMemoryError[Java heap space]]]
>>>
>>> [2014-04-09 15:54:42,711][DEBUG][action.admin.cluster.stats] [esearch16] failed to execute on node [4ft2nd1lRE-BdvL2iYGIkg]
>>> org.elasticsearch.transport.RemoteTransportException: [esearch14][inet[/10.184.39.23:9300]][cluster/stats/n]
>>> Caused by: org.elasticsearch.index.engine.EngineClosedException: [domain_url_2014-01-01][1] CurrentState[CLOSED]
>>>     at org.elasticsearch.index.engine.internal.InternalEngine.ensureOpen(InternalEngine.java:913)
>>>     at org.elasticsearch.index.engine.internal.InternalEngine.segmentsStats(InternalEngine.java:1130)
>>>     at org.elasticsearch.index.shard.service.InternalIndexShard.segmentStats(InternalIndexShard.java:532)
>>>     at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:161)
>>>     at org.elasticsearch.action.admin.indices.stats.ShardStats.<init>(ShardStats.java:49)
>>>     at org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:130)
>>>     at org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:54)
>>>     at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:281)
>>>     at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:272)
>>>     at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:270)
>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>     at java.lang.Thread.run(Thread.java:744)
>>> Caused by: java.lang.OutOfMemoryError: Java heap space
>>>     at org.apache.lucene.util.fst.BytesStore.<init>(BytesStore.java:62)
>>>     at org.apache.lucene.util.fst.FST.<init>(FST.java:366)
>>>     at org.apache.lucene.util.fst.FST.<init>(FST.java:301)
>>>     at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader.<init>(BlockTreeTermsReader.java:481)
>>>     at org.apache.lucene.codecs.BlockTreeTermsReader.<init>(BlockTreeTermsReader.java:175)
>>>     at org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat.fieldsProducer(Lucene41PostingsFormat.java:437)
>>>     at org.elasticsearch.index.codec.postingsformat.BloomFilterPostingsFormat$BloomFilteredFieldsProducer.<init>(BloomFilterPostingsFormat.java:131)
>>>     at org.elasticsearch.index.codec.postingsformat.BloomFilterPostingsFormat.fieldsProducer(BloomFilterPostingsFormat.java:102)
>>>     at org.elasticsearch.index.codec.postingsformat.Elasticsearch090PostingsFormat.fieldsProducer(Elasticsearch090PostingsFormat.java:79)
>>>     at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:195)
>>>     at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:244)
>>>     at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:115)
>>>     at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:95)
>>>     at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:141)
>>>     at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:235)
>>>     at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:100)
>>>     at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:382)
>>>     at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:111)
>>>     at org.apache.lucene.search.XSearcherManager.<init>(XSearcherManager.java:94)
>>>     at org.elasticsearch.index.engine.internal.InternalEngine.buildSearchManager(InternalEngine.java:1462)
>>>     at org.elasticsearch.index.engine.internal.InternalEngine.flush(InternalEngine.java:801)
>>>     at org.elasticsearch.index.engine.internal.InternalEngine.updateIndexingBufferSize(InternalEngine.java:223)
>>>     at org.elasticsearch.indices.memory.IndexingMemoryController$ShardsIndicesStatusChecker.run(IndexingMemoryController.java:201)
>>>     at org.elasticsearch.threadpool.ThreadPool$LoggingRunnable.run(ThreadPool.java:437)
>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>>>     ... 3 more
>>>
>>> [2014-04-09 15:54:51,827][WARN ][cluster.action.shard ] [esearch16] [domain_url_2014-01-01][1] received shard failed for [domain_url_2014-01-01][1], node[4ft2nd1lRE-BdvL2iYGIkg], [R], s[STARTED], indexUUID [jDcZyjUrSW6eD3_TH5v0_Q], reason [engine failure, message [OutOfMemoryError[Java heap space]]]
>
> --
> Jesse M. Davis
> [email protected]
> 206.226.9575
