Also, here are the uncommented portions of our elasticsearch.yml:
bootstrap.mlockall: true
gateway.type: local
gateway.recover_after_nodes: 4
gateway.recover_after_time: 5m
gateway.expected_nodes: 4
indices.recovery.max_size_per_sec: 500mb
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.timeout: 20s
discovery.zen.ping.multicast.enabled: false
index.search.slowlog.threshold.query.warn: 10s
index.search.slowlog.threshold.query.info: 5s
index.search.slowlog.threshold.query.debug: 2s
index.search.slowlog.threshold.query.trace: 500ms
index.indexing.slowlog.threshold.index.warn: 10s
index.indexing.slowlog.threshold.index.info: 5s
index.indexing.slowlog.threshold.index.debug: 2s
index.indexing.slowlog.threshold.index.trace: 500ms
threadpool:
  bulk:
    type: fixed
    min: 1
    size: 30
    wait_time: 30s
    queue_size: -1
  index:
    type: fixed
    min: 1
    size: 30
    wait_time: 30s
    queue_size: -1
discovery.zen.fd.ping_interval: 10s
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 10
index.translog.flush_threshold_ops: 20000
index.translog.flush_threshold_size: 400mb
index.translog.flush_threshold_period: 60m
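
As a back-of-the-envelope sanity check on the shard counts implied by the setup in the quoted message below (this assumes the Elasticsearch 1.0 default of 5 primary shards per index, which the thread doesn't state explicitly):

```python
# Rough shard math for the cluster described below:
# 90 daily indices on 5 nodes, replicas going from 0 to 1.
# 5 primaries per index is the Elasticsearch 1.0 default; the real
# per-index shard count isn't stated in the thread, so it's an assumption.
indices = 90
primaries_per_index = 5
replicas = 1
nodes = 5

total_shards = indices * primaries_per_index * (1 + replicas)
shards_per_node = total_shards // nodes

print(total_shards)      # 900
print(shards_per_node)   # 180
```

Each shard is a full Lucene index with its own fixed memory overhead, so on the order of 180 shards per 18 GB-heap node is at least worth keeping in mind when reading the OutOfMemoryErrors below.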
On Wednesday, April 9, 2014 5:02:11 PM UTC-4, Jesse Davis wrote:
>
> We're attempting to create a new Elasticsearch cluster for indexing URLs, but
> have run into a memory leak when turning replication on for our indices.
>
> The current setup is: 5 x m2.2xlarge, 4 TB mounted on EBS per node (not
> Provisioned IOPs).
>
> We create one index per day, and will keep the past 90 days around for
> searching. We have been performing bulk inserts with routing enabled, 1
> day at a time, and have been successful in loading all 90 days. This ended
> up being approximately 313 million documents. I had inserted with the number
> of replicas per index set to 0 to increase our bulk insertion rate.
> I then started changing the number of replicas per index to 1, one index at a
> time. I was able to successfully create the replicas for about 70 of the
> indices (i.e. about 65 or 70 days), but then ran out of heap space.
>
> We are planning to bulk insert about 2-4 million records per day in 10
> minute intervals, so I would appreciate any advice on the validity of our
> configuration so far. In particular, we would like to know if there are any
> known memory leaks with shard replication or bulk inserts.
>
> Our configuration:
>
> Ubuntu 12.04 LTS
> Java 7 u51 (I am aware of
> https://groups.google.com/forum/#!msg/elasticsearch/D4WNQZSvqSU/zo7ancelKi4J
> and am doing a rolling restart of the cluster as we speak to move to Java 7
> u25).
> Marvel was installed on each node, but in order to simplify our setup, I will
> be removing it during the aforementioned cluster restart.
>
> Elasticsearch 1.0.0
>
> "version" : {
> "number" : "1.0.0",
> "build_hash" : "a46900e9c72c0a623d71b54016357d5f94c8ea32",
> "build_timestamp" : "2014-02-12T16:18:34Z",
> "build_snapshot" : false,
> "lucene_version" : "4.6"
> },
>
> Settings applied for our bulk insert:
>
> {
> "index" : {
> "merge.policy.max_merge_at_once" : 4,
> "merge.policy.segments_per_tier" : 20,
> "refresh_interval" : "-1" # I will be setting this back to 1s when
> our backfill/replicas are done
> }
> }
>
> {
> "transient" : {
> "index.merge.policy.merge_factor" : 30,
> "threadpool.bulk.queue_size" : -1,
> "index.merge.scheduler.max_thread_count" : 5
> }
> }
>
> Our Java configuration variables (those that are different from the default
> /etc/default/elasticsearch in the .deb):
>
> JAVA_HOME=/usr/lib/jvm/java-1.7.0_25-oracle (this was Oracle's Java 7 u51,
> being backed down during the restart)
> ES_HEAP_SIZE=18g
> MAX_OPEN_FILES=256000
>
> From a running instance:
>
> /usr/lib/jvm/java-1.7.0_25-oracle/bin/java -Xms18g -Xmx18g -Xss256k
> -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
> -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
> -XX:+HeapDumpOnOutOfMemoryError -Delasticsearch
> -Des.pidfile=/var/run/elasticsearch.pid
> -Des.path.home=/usr/share/elasticsearch -cp
> :/usr/share/elasticsearch/lib/elasticsearch-1.0.0.jar:/usr/share/elasticsearch/lib/*:/usr/share/elasticsearch/lib/sigar/*
> -Des.default.config=/etc/elasticsearch/elasticsearch.yml
> -Des.default.path.home=/usr/share/elasticsearch
> -Des.default.path.logs=/var/log/elasticsearch
> -Des.default.path.data=/var/lib/elasticsearch
> -Des.default.path.work=/tmp/elasticsearch
> -Des.default.path.conf=/etc/elasticsearch
> org.elasticsearch.bootstrap.Elasticsearch
>
> The log message I saw during the OutOfMemoryError:
>
>
> [2014-04-09 14:17:28,393][WARN ][cluster.action.shard ] [esearch16]
> [.marvel-2014.04.09][0] received shard failed for [.marvel-2014.04.09][0],
> node[SK5okikgSWSdbrQdZWET8g], [R], s[INITIALIZING],
> indexUUID [K4IB1Px3RoOqcjPbta-fKw], reason [Failed to start shard, message
> [RecoveryFailedException[[.marvel-2014.04.09][0]: Recovery failed from
> [esearch16][EbYQ9HNzQtexkEZ1PgwpnQ][esearch16.tlys.us][inet[/10.145.167.184:9300]]
> into
> [esearch13][SK5okikgSWSdbrQdZWET8g][esearch13.tlys.us][inet[ip-10-185-195-69.ec2.internal/10.185.195.69:9300]]];
> nested: RemoteTransportException[[esearch16][inet[/10.145.167.184:9300]][index/shard/recovery/startRecovery]];
> nested: RecoveryEngineException[[.marvel-2014.04.09][0] Phase[2] Execution failed];
> nested: RemoteTransportException[[esearch13][inet[/10.185.195.69:9300]][index/shard/recovery/prepareTranslog]];
> nested: EngineCreationFailureException[[.marvel-2014.04.09][0] failed to create engine];
> nested: LockObtainFailedException[Lock obtain timed out: NativeFSLock@/ebsmnt/data/elasticsearch/search-prod/nodes/0/indices/.marvel-2014.04.09/0/index/write.lock];
> ]]
> [2014-04-09 14:24:47,111][WARN ][cluster.action.shard ] [esearch16]
> [domain_url_2014-01-03][4] received shard failed for
> [domain_url_2014-01-03][4], node[SK5okikgSWSdbrQdZWET8g], [R], s[STARTED],
> indexUUID [OkUadV5JSJGI2B9dKwgMLw], reason [engine failure, message
> [OutOfMemoryError[Java heap space]]]
> [2014-04-09 14:26:06,104][WARN ][cluster.action.shard ] [esearch16]
> [.marvel-2014.04.09][0] received shard failed for [.marvel-2014.04.09][0],
> node[SK5okikgSWSdbrQdZWET8g], [R], s[INITIALIZING],
> indexUUID [K4IB1Px3RoOqcjPbta-fKw], reason [Failed to start shard, message
> [RecoveryFailedException[[.marvel-2014.04.09][0]: Recovery failed from
> [esearch16][EbYQ9HNzQtexkEZ1PgwpnQ][esearch16.tlys.us][inet[/10.145.167.184:9300]]
> into
> [esearch13][SK5okikgSWSdbrQdZWET8g][esearch13.tlys.us][inet[ip-10-185-195-69.ec2.internal/10.185.195.69:9300]]];
> nested: RemoteTransportException[[esearch16][inet[/10.145.167.184:9300]][index/shard/recovery/startRecovery]];
> nested: RecoveryEngineException[[.marvel-2014.04.09][0] Phase[2] Execution failed];
> nested: RemoteTransportException[[esearch13][inet[/10.185.195.69:9300]][index/shard/recovery/prepareTranslog]];
> nested: EngineCreationFailureException[[.marvel-2014.04.09][0] failed to create engine];
> nested: LockObtainFailedException[Lock obtain timed out: NativeFSLock@/ebsmnt/data/elasticsearch/search-prod/nodes/0/indices/.marvel-2014.04.09/0/index/write.lock];
> ]]
> [2014-04-09 14:26:48,562][INFO ][cluster.metadata ] [esearch16]
> updating number_of_replicas to [0] for indices [.marvel-2014.04.09]
> [2014-04-09 14:27:27,235][INFO ][cluster.metadata ] [esearch16]
> updating number_of_replicas to [0] for indices [.marvel-2014.04.09]
> [2014-04-09 14:37:01,359][INFO ][cluster.metadata ] [esearch16]
> [.marvel-2014.04.09] update_mapping [shard_event] (dynamic)
> [2014-04-09 14:37:01,531][INFO ][cluster.metadata ] [esearch16]
> [.marvel-2014.04.09] update_mapping [routing_event] (dynamic)
> [2014-04-09 14:40:51,469][WARN ][cluster.action.shard ] [esearch16]
> [domain_url_2014-01-01][2] received shard failed for
> [domain_url_2014-01-01][2], node[BKCZOztRRP6FXVKJSkT_oA], [R], s[STARTED],
> indexUUID [jDcZyjUrSW6eD3_TH5v0_Q], reason [engine failure, message
> [OutOfMemoryError[Java heap space]]]
> [2014-04-09 14:41:00,353][WARN ][cluster.action.shard ] [esearch16]
> [domain_url_2014-03-11][2] received shard failed for
> [domain_url_2014-03-11][2], node[SK5okikgSWSdbrQdZWET8g], [R], s[STARTED],
> indexUUID [HuQzTDCmTMeS3He3DumnOg], reason [engine failure, message
> [OutOfMemoryError[Java heap space]]]
> [2014-04-09 15:04:32,504][WARN ][cluster.action.shard ] [esearch16]
> [domain_url_2014-01-03][2] received shard failed for
> [domain_url_2014-01-03][2], node[BKCZOztRRP6FXVKJSkT_oA], [R], s[STARTED],
> indexUUID [OkUadV5JSJGI2B9dKwgMLw], reason [engine failure, message
> [OutOfMemoryError[Java heap space]]]
> [2014-04-09 15:12:13,529][WARN ][cluster.action.shard ] [esearch16]
> [domain_url_2014-01-01][2] received shard failed for
> [domain_url_2014-01-01][2], node[BKCZOztRRP6FXVKJSkT_oA], [R], s[STARTED],
> indexUUID [jDcZyjUrSW6eD3_TH5v0_Q], reason [engine failure, message
> [OutOfMemoryError[Java heap space]]]
> [2014-04-09 15:39:24,021][WARN ][cluster.action.shard ] [esearch16]
> [domain_url_2014-01-03][1] received shard failed for
> [domain_url_2014-01-03][1], node[4ft2nd1lRE-BdvL2iYGIkg], relocating
> [BKCZOztRRP6FXVKJSkT_oA], [R], s[INITIALIZING], indexUUID [OkUadV5JSJGI2B9dKwgMLw],
> reason [Failed to start shard, message
> [RecoveryFailedException[[domain_url_2014-01-03][1]: Recovery failed from
> [esearch15][EkR2xgpURrunkxrRnpkzYQ][esearch15.tlys.us][inet[ip-10-185-171-146.ec2.internal/10.185.171.146:9300]]
> into
> [esearch14][4ft2nd1lRE-BdvL2iYGIkg][esearch14.tlys.us][inet[ip-10-184-39-23.ec2.internal/10.184.39.23:9300]]];
> nested: RemoteTransportException[[esearch15][inet[/10.185.171.146:9300]][index/shard/recovery/startRecovery]];
> nested: RecoveryEngineException[[domain_url_2014-01-03][1] Phase[2] Execution failed];
> nested: RemoteTransportException[[esearch14][inet[/10.184.39.23:9300]][index/shard/recovery/prepareTranslog]];
> nested: OutOfMemoryError[Java heap space]; ]]
> [2014-04-09 15:42:51,176][WARN ][cluster.action.shard ] [esearch16]
> [domain_url_2014-01-06][1] received shard failed for
> [domain_url_2014-01-06][1], node[4ft2nd1lRE-BdvL2iYGIkg], [R], s[STARTED],
> indexUUID [51jdwEMrTGKtTpA90ZjXiQ], reason [engine failure, message
> [OutOfMemoryError[Java heap space]]]
> [2014-04-09 15:54:42,711][DEBUG][action.admin.cluster.stats] [esearch16]
> failed to execute on node [4ft2nd1lRE-BdvL2iYGIkg]
> org.elasticsearch.transport.RemoteTransportException:
> [esearch14][inet[/10.184.39.23:9300]][cluster/stats/n]
> Caused by: org.elasticsearch.index.engine.EngineClosedException:
> [domain_url_2014-01-01][1] CurrentState[CLOSED]
> at org.elasticsearch.index.engine.internal.InternalEngine.ensureOpen(InternalEngine.java:913)
> at org.elasticsearch.index.engine.internal.InternalEngine.segmentsStats(InternalEngine.java:1130)
> at org.elasticsearch.index.shard.service.InternalIndexShard.segmentStats(InternalIndexShard.java:532)
> at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:161)
> at org.elasticsearch.action.admin.indices.stats.ShardStats.<init>(ShardStats.java:49)
> at org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:130)
> at org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:54)
> at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:281)
> at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:272)
> at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:270)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
> Caused by: java.lang.OutOfMemoryError: Java heap space
> at org.apache.lucene.util.fst.BytesStore.<init>(BytesStore.java:62)
> at org.apache.lucene.util.fst.FST.<init>(FST.java:366)
> at org.apache.lucene.util.fst.FST.<init>(FST.java:301)
> at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader.<init>(BlockTreeTermsReader.java:481)
> at org.apache.lucene.codecs.BlockTreeTermsReader.<init>(BlockTreeTermsReader.java:175)
> at org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat.fieldsProducer(Lucene41PostingsFormat.java:437)
> at org.elasticsearch.index.codec.postingsformat.BloomFilterPostingsFormat$BloomFilteredFieldsProducer.<init>(BloomFilterPostingsFormat.java:131)
> at org.elasticsearch.index.codec.postingsformat.BloomFilterPostingsFormat.fieldsProducer(BloomFilterPostingsFormat.java:102)
> at org.elasticsearch.index.codec.postingsformat.Elasticsearch090PostingsFormat.fieldsProducer(Elasticsearch090PostingsFormat.java:79)
> at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:195)
> at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:244)
> at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:115)
> at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:95)
> at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:141)
> at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:235)
> at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:100)
> at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:382)
> at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:111)
> at org.apache.lucene.search.XSearcherManager.<init>(XSearcherManager.java:94)
> at org.elasticsearch.index.engine.internal.InternalEngine.buildSearchManager(InternalEngine.java:1462)
> at org.elasticsearch.index.engine.internal.InternalEngine.flush(InternalEngine.java:801)
> at org.elasticsearch.index.engine.internal.InternalEngine.updateIndexingBufferSize(InternalEngine.java:223)
> at org.elasticsearch.indices.memory.IndexingMemoryController$ShardsIndicesStatusChecker.run(IndexingMemoryController.java:201)
> at org.elasticsearch.threadpool.ThreadPool$LoggingRunnable.run(ThreadPool.java:437)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> ... 3 more
> [2014-04-09 15:54:51,827][WARN ][cluster.action.shard ] [esearch16]
> [domain_url_2014-01-01][1] received shard failed for
> [domain_url_2014-01-01][1], node[4ft2nd1lRE-BdvL2iYGIkg], [R], s[STARTED],
> indexUUID [jDcZyjUrSW6eD3_TH5v0_Q], reason [engine failure, message
> [OutOfMemoryError[Java heap space]]]
>
>
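For what it's worth, flipping replicas on index by index can be scripted so that each index finishes recovering before the next one starts, which keeps only one index's worth of replica recovery in flight at a time. A minimal sketch of the request construction (it builds the URLs and body only and sends nothing; the index name is just an example, and you'd pair it with any HTTP client plus polling of /_cluster/health/{index}?wait_for_status=green between indices):

```python
import json

def replica_update(index, replicas, host="http://localhost:9200"):
    """Build the PUT request that bumps number_of_replicas for one index.

    Returns (url, body) without sending anything. Between indices, poll
    the per-index cluster health endpoint with wait_for_status=green so
    only one index is recovering replicas at a time.
    """
    url = "%s/%s/_settings" % (host, index)
    body = json.dumps({"index": {"number_of_replicas": replicas}})
    return url, body

def health_url(index, host="http://localhost:9200"):
    """Per-index health check URL to poll before moving to the next index."""
    return "%s/_cluster/health/%s?wait_for_status=green&timeout=60s" % (host, index)

url, body = replica_update("domain_url_2014-01-03", 1)
print(url)   # http://localhost:9200/domain_url_2014-01-03/_settings
print(body)
```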