We're attempting to create a new Elasticsearch cluster for indexing URLs, but 
have run into what appears to be a memory leak when turning replication on for 
our indices.

The current setup is: 5 x m2.2xlarge, each with 4 TB of EBS storage mounted 
(not Provisioned IOPS).

We create one index per day, and will keep the past 90 days around for 
searching.  We have been performing bulk inserts with routing enabled, one day 
at a time, and successfully loaded all 90 days; this ended up being 
approximately 313 million documents.  I had inserted with the number of 
replicas per index set to 0 to increase our bulk insertion rate.

I then started changing the number of replicas per index to 1, one index at a 
time (a sketch of the call follows).  I was able to successfully create the 
replicas for about 70 of the daily indices (i.e. about 65 or 70 days), but 
then ran out of heap space.
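
For concreteness, here is roughly the shape of that per-index call (the host 
is a placeholder; index names follow our domain_url_YYYY-MM-DD pattern), along 
with the health check one can use to pace the updates:

curl -XPUT 'http://localhost:9200/domain_url_2014-01-01/_settings' -d '
{ "index" : { "number_of_replicas" : 1 } }'

# wait for the new replicas to initialize before moving to the next index
curl 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=10m'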

We are planning to bulk insert about 2-4 million records per day, in 10-minute 
intervals, so I would appreciate any advice on the validity of our 
configuration so far.  In particular, we would like to know whether there are 
any known memory leaks with shard replication or bulk inserts.
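
For reference, each batch goes through the bulk API with an explicit routing 
value per action; a minimal sketch of a single-document batch (the type name, 
fields, and routing key here are illustrative placeholders, not our real 
mapping):

curl -XPOST 'http://localhost:9200/domain_url_2014-04-09/_bulk' --data-binary '
{ "index" : { "_type" : "url", "_id" : "1", "_routing" : "example.com" } }
{ "domain" : "example.com", "url" : "http://example.com/some/page" }
'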

Our configuration:

Ubuntu 12.04 LTS
Java 7 u51  (I am aware of 
https://groups.google.com/forum/#!msg/elasticsearch/D4WNQZSvqSU/zo7ancelKi4J 
and am doing a rolling restart of the cluster as we speak to move to Java 7 
u25).
Marvel was installed on each node, but in order to simplify our setup, I will 
be removing it during the aforementioned cluster restart.
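
For anyone following along, the removal itself is just the plugin script, plus 
optionally deleting the indices the Marvel agent has written (paths assume the 
standard .deb layout):

sudo /usr/share/elasticsearch/bin/plugin --remove marvel
# optional: drop Marvel's own data indices once the agent is gone
curl -XDELETE 'http://localhost:9200/.marvel-*'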

Elasticsearch 1.0.0

  "version" : {
    "number" : "1.0.0",
    "build_hash" : "a46900e9c72c0a623d71b54016357d5f94c8ea32",
    "build_timestamp" : "2014-02-12T16:18:34Z",
    "build_snapshot" : false,
    "lucene_version" : "4.6"
  },

Settings applied for our bulk insert:

{
  "index" : {
    "merge.policy.max_merge_at_once" : 4,
    "merge.policy.segments_per_tier" : 20,
    "refresh_interval" : "-1"
  }
}

(I will be setting refresh_interval back to "1s" when our backfill/replicas 
are done.)

{
  "transient" : {
    "index.merge.policy.merge_factor" : 30,
    "threadpool.bulk.queue_size" : -1,
    "index.merge.scheduler.max_thread_count" : 5
  }
}
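
For completeness, the first block was applied per index via the index settings 
API and the "transient" block via the cluster settings API, roughly as below 
(host and index name are placeholders; the index.* lines in the transient 
block would take effect per index the same way as the first call):

curl -XPUT 'http://localhost:9200/domain_url_2014-01-01/_settings' -d '
{
  "index" : {
    "merge.policy.max_merge_at_once" : 4,
    "merge.policy.segments_per_tier" : 20,
    "refresh_interval" : "-1"
  }
}'

curl -XPUT 'http://localhost:9200/_cluster/settings' -d '
{ "transient" : { "threadpool.bulk.queue_size" : -1 } }'

# once backfill/replication is done, restore near-real-time search:
curl -XPUT 'http://localhost:9200/domain_url_2014-01-01/_settings' -d '
{ "index" : { "refresh_interval" : "1s" } }'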

Our Java configuration variables (those that are different from the default 
/etc/default/elasticsearch in the .deb):

JAVA_HOME=/usr/lib/jvm/java-1.7.0_25-oracle   (this had been Oracle's Java 7 
u51; it is being backed down to u25 during the restart)
ES_HEAP_SIZE=18g
MAX_OPEN_FILES=256000

From a running instance:

/usr/lib/jvm/java-1.7.0_25-oracle/bin/java -Xms18g -Xmx18g -Xss256k 
-Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC 
-XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly 
-XX:+HeapDumpOnOutOfMemoryError -Delasticsearch 
-Des.pidfile=/var/run/elasticsearch.pid -Des.path.home=/usr/share/elasticsearch 
-cp 
:/usr/share/elasticsearch/lib/elasticsearch-1.0.0.jar:/usr/share/elasticsearch/lib/*:/usr/share/elasticsearch/lib/sigar/*
 -Des.default.config=/etc/elasticsearch/elasticsearch.yml 
-Des.default.path.home=/usr/share/elasticsearch 
-Des.default.path.logs=/var/log/elasticsearch 
-Des.default.path.data=/var/lib/elasticsearch 
-Des.default.path.work=/tmp/elasticsearch 
-Des.default.path.conf=/etc/elasticsearch 
org.elasticsearch.bootstrap.Elasticsearch
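
For anyone trying to reproduce this, per-node heap usage can be watched while 
the replicas initialize via the nodes stats API:

curl 'http://localhost:9200/_nodes/stats/jvm?pretty'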

The log messages I saw around the OutOfMemoryError:

[2014-04-09 14:17:28,393][WARN ][cluster.action.shard     ] [esearch16] [.marvel-2014.04.09][0] received shard failed for [.marvel-2014.04.09][0], node[SK5okikgSWSdbrQdZWET8g], [R], s[INITIALIZING], indexUUID [K4IB1Px3RoOqcjPbta-fKw], reason [Failed to start shard, message [RecoveryFailedException[[.marvel-2014.04.09][0]: Recovery failed from [esearch16][EbYQ9HNzQtexkEZ1PgwpnQ][esearch16.tlys.us][inet[/10.145.167.184:9300]] into [esearch13][SK5okikgSWSdbrQdZWET8g][esearch13.tlys.us][inet[ip-10-185-195-69.ec2.internal/10.185.195.69:9300]]]; nested: RemoteTransportException[[esearch16][inet[/10.145.167.184:9300]][index/shard/recovery/startRecovery]]; nested: RecoveryEngineException[[.marvel-2014.04.09][0] Phase[2] Execution failed]; nested: RemoteTransportException[[esearch13][inet[/10.185.195.69:9300]][index/shard/recovery/prepareTranslog]]; nested: EngineCreationFailureException[[.marvel-2014.04.09][0] failed to create engine]; nested: LockObtainFailedException[Lock obtain timed out: NativeFSLock@/ebsmnt/data/elasticsearch/search-prod/nodes/0/indices/.marvel-2014.04.09/0/index/write.lock]; ]]
[2014-04-09 14:24:47,111][WARN ][cluster.action.shard     ] [esearch16] [domain_url_2014-01-03][4] received shard failed for [domain_url_2014-01-03][4], node[SK5okikgSWSdbrQdZWET8g], [R], s[STARTED], indexUUID [OkUadV5JSJGI2B9dKwgMLw], reason [engine failure, message [OutOfMemoryError[Java heap space]]]
[2014-04-09 14:26:06,104][WARN ][cluster.action.shard     ] [esearch16] [.marvel-2014.04.09][0] received shard failed for [.marvel-2014.04.09][0], node[SK5okikgSWSdbrQdZWET8g], [R], s[INITIALIZING], indexUUID [K4IB1Px3RoOqcjPbta-fKw], reason [Failed to start shard, message [RecoveryFailedException[[.marvel-2014.04.09][0]: Recovery failed from [esearch16][EbYQ9HNzQtexkEZ1PgwpnQ][esearch16.tlys.us][inet[/10.145.167.184:9300]] into [esearch13][SK5okikgSWSdbrQdZWET8g][esearch13.tlys.us][inet[ip-10-185-195-69.ec2.internal/10.185.195.69:9300]]]; nested: RemoteTransportException[[esearch16][inet[/10.145.167.184:9300]][index/shard/recovery/startRecovery]]; nested: RecoveryEngineException[[.marvel-2014.04.09][0] Phase[2] Execution failed]; nested: RemoteTransportException[[esearch13][inet[/10.185.195.69:9300]][index/shard/recovery/prepareTranslog]]; nested: EngineCreationFailureException[[.marvel-2014.04.09][0] failed to create engine]; nested: LockObtainFailedException[Lock obtain timed out: NativeFSLock@/ebsmnt/data/elasticsearch/search-prod/nodes/0/indices/.marvel-2014.04.09/0/index/write.lock]; ]]
[2014-04-09 14:26:48,562][INFO ][cluster.metadata         ] [esearch16] updating number_of_replicas to [0] for indices [.marvel-2014.04.09]
[2014-04-09 14:27:27,235][INFO ][cluster.metadata         ] [esearch16] updating number_of_replicas to [0] for indices [.marvel-2014.04.09]
[2014-04-09 14:37:01,359][INFO ][cluster.metadata         ] [esearch16] [.marvel-2014.04.09] update_mapping [shard_event] (dynamic)
[2014-04-09 14:37:01,531][INFO ][cluster.metadata         ] [esearch16] [.marvel-2014.04.09] update_mapping [routing_event] (dynamic)
[2014-04-09 14:40:51,469][WARN ][cluster.action.shard     ] [esearch16] [domain_url_2014-01-01][2] received shard failed for [domain_url_2014-01-01][2], node[BKCZOztRRP6FXVKJSkT_oA], [R], s[STARTED], indexUUID [jDcZyjUrSW6eD3_TH5v0_Q], reason [engine failure, message [OutOfMemoryError[Java heap space]]]
[2014-04-09 14:41:00,353][WARN ][cluster.action.shard     ] [esearch16] [domain_url_2014-03-11][2] received shard failed for [domain_url_2014-03-11][2], node[SK5okikgSWSdbrQdZWET8g], [R], s[STARTED], indexUUID [HuQzTDCmTMeS3He3DumnOg], reason [engine failure, message [OutOfMemoryError[Java heap space]]]
[2014-04-09 15:04:32,504][WARN ][cluster.action.shard     ] [esearch16] [domain_url_2014-01-03][2] received shard failed for [domain_url_2014-01-03][2], node[BKCZOztRRP6FXVKJSkT_oA], [R], s[STARTED], indexUUID [OkUadV5JSJGI2B9dKwgMLw], reason [engine failure, message [OutOfMemoryError[Java heap space]]]
[2014-04-09 15:12:13,529][WARN ][cluster.action.shard     ] [esearch16] [domain_url_2014-01-01][2] received shard failed for [domain_url_2014-01-01][2], node[BKCZOztRRP6FXVKJSkT_oA], [R], s[STARTED], indexUUID [jDcZyjUrSW6eD3_TH5v0_Q], reason [engine failure, message [OutOfMemoryError[Java heap space]]]
[2014-04-09 15:39:24,021][WARN ][cluster.action.shard     ] [esearch16] [domain_url_2014-01-03][1] received shard failed for [domain_url_2014-01-03][1], node[4ft2nd1lRE-BdvL2iYGIkg], relocating [BKCZOztRRP6FXVKJSkT_oA], [R], s[INITIALIZING], indexUUID [OkUadV5JSJGI2B9dKwgMLw], reason [Failed to start shard, message [RecoveryFailedException[[domain_url_2014-01-03][1]: Recovery failed from [esearch15][EkR2xgpURrunkxrRnpkzYQ][esearch15.tlys.us][inet[ip-10-185-171-146.ec2.internal/10.185.171.146:9300]] into [esearch14][4ft2nd1lRE-BdvL2iYGIkg][esearch14.tlys.us][inet[ip-10-184-39-23.ec2.internal/10.184.39.23:9300]]]; nested: RemoteTransportException[[esearch15][inet[/10.185.171.146:9300]][index/shard/recovery/startRecovery]]; nested: RecoveryEngineException[[domain_url_2014-01-03][1] Phase[2] Execution failed]; nested: RemoteTransportException[[esearch14][inet[/10.184.39.23:9300]][index/shard/recovery/prepareTranslog]]; nested: OutOfMemoryError[Java heap space]; ]]
[2014-04-09 15:42:51,176][WARN ][cluster.action.shard     ] [esearch16] [domain_url_2014-01-06][1] received shard failed for [domain_url_2014-01-06][1], node[4ft2nd1lRE-BdvL2iYGIkg], [R], s[STARTED], indexUUID [51jdwEMrTGKtTpA90ZjXiQ], reason [engine failure, message [OutOfMemoryError[Java heap space]]]
[2014-04-09 15:54:42,711][DEBUG][action.admin.cluster.stats] [esearch16] failed to execute on node [4ft2nd1lRE-BdvL2iYGIkg]
org.elasticsearch.transport.RemoteTransportException: [esearch14][inet[/10.184.39.23:9300]][cluster/stats/n]
Caused by: org.elasticsearch.index.engine.EngineClosedException: [domain_url_2014-01-01][1] CurrentState[CLOSED]
        at org.elasticsearch.index.engine.internal.InternalEngine.ensureOpen(InternalEngine.java:913)
        at org.elasticsearch.index.engine.internal.InternalEngine.segmentsStats(InternalEngine.java:1130)
        at org.elasticsearch.index.shard.service.InternalIndexShard.segmentStats(InternalIndexShard.java:532)
        at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:161)
        at org.elasticsearch.action.admin.indices.stats.ShardStats.<init>(ShardStats.java:49)
        at org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:130)
        at org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:54)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:281)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:272)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:270)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.OutOfMemoryError: Java heap space
        at org.apache.lucene.util.fst.BytesStore.<init>(BytesStore.java:62)
        at org.apache.lucene.util.fst.FST.<init>(FST.java:366)
        at org.apache.lucene.util.fst.FST.<init>(FST.java:301)
        at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader.<init>(BlockTreeTermsReader.java:481)
        at org.apache.lucene.codecs.BlockTreeTermsReader.<init>(BlockTreeTermsReader.java:175)
        at org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat.fieldsProducer(Lucene41PostingsFormat.java:437)
        at org.elasticsearch.index.codec.postingsformat.BloomFilterPostingsFormat$BloomFilteredFieldsProducer.<init>(BloomFilterPostingsFormat.java:131)
        at org.elasticsearch.index.codec.postingsformat.BloomFilterPostingsFormat.fieldsProducer(BloomFilterPostingsFormat.java:102)
        at org.elasticsearch.index.codec.postingsformat.Elasticsearch090PostingsFormat.fieldsProducer(Elasticsearch090PostingsFormat.java:79)
        at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:195)
        at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:244)
        at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:115)
        at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:95)
        at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:141)
        at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:235)
        at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:100)
        at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:382)
        at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:111)
        at org.apache.lucene.search.XSearcherManager.<init>(XSearcherManager.java:94)
        at org.elasticsearch.index.engine.internal.InternalEngine.buildSearchManager(InternalEngine.java:1462)
        at org.elasticsearch.index.engine.internal.InternalEngine.flush(InternalEngine.java:801)
        at org.elasticsearch.index.engine.internal.InternalEngine.updateIndexingBufferSize(InternalEngine.java:223)
        at org.elasticsearch.indices.memory.IndexingMemoryController$ShardsIndicesStatusChecker.run(IndexingMemoryController.java:201)
        at org.elasticsearch.threadpool.ThreadPool$LoggingRunnable.run(ThreadPool.java:437)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        ... 3 more
[2014-04-09 15:54:51,827][WARN ][cluster.action.shard     ] [esearch16] [domain_url_2014-01-01][1] received shard failed for [domain_url_2014-01-01][1], node[4ft2nd1lRE-BdvL2iYGIkg], [R], s[STARTED], indexUUID [jDcZyjUrSW6eD3_TH5v0_Q], reason [engine failure, message [OutOfMemoryError[Java heap space]]]
