I recently built a new Elasticsearch cluster: three nodes running ES 0.90.7 and 
Java 7u45, each with 60G of memory and a 2T SSD. I loaded 3T of data 
into an index with 128 shards on this cluster. So far so good.

I then added three more nodes (identical hardware) and reconfigured the index to 
have one replica. Then all hell broke loose. The cluster starts to 
replicate shards, but before it manages to complete, one of the nodes gets 
stuck doing garbage collection and everything stops working. If I restart 
the cluster completely, the scenario repeats.

There is a light query load coming into the cluster while it is trying to 
stabilize, and I noticed that only some of the nodes were responding, even 
though all nodes had completed shards. Digging further into this, I 
noticed that a GET on _cluster/nodes/stats returns a list showing 
all six nodes, as expected. But a GET on 
_cluster/nodes/stats?indices=true returns only a subset of 
the nodes. Sometimes it is three nodes; at the moment it is just one.
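
For anyone who wants to reproduce the comparison, here is a minimal Python sketch. It assumes both responses are JSON objects with a top-level "nodes" map keyed by node ID, each entry carrying a "name" field. The helper names and the base URL are placeholders of mine, not part of Elasticsearch.

```python
import json
from urllib.request import urlopen


def fetch_json(url, timeout=10):
    """GET a URL and parse the response body as JSON."""
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def missing_nodes(all_stats, indices_stats):
    """Return {node_id: node_name} for nodes present in all_stats
    but absent from indices_stats.

    Assumes each response has a top-level "nodes" map keyed by node ID.
    """
    everyone = all_stats.get("nodes", {})
    responded = indices_stats.get("nodes", {})
    return {
        node_id: info.get("name", "?")
        for node_id, info in everyone.items()
        if node_id not in responded
    }


# Usage against a live cluster (the base URL is a placeholder):
#   base = "http://localhost:9200"
#   plain = fetch_json(base + "/_cluster/nodes/stats")
#   with_indices = fetch_json(base + "/_cluster/nodes/stats?indices=true")
#   print(missing_nodes(plain, with_indices))
```

Running this repeatedly while the cluster is recovering should show whether the set of missing nodes changes over time.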

Looking in the logs on the machine where I issue the GET of 
_cluster/nodes/stats?indices=true, I see exceptions like this:
[2013-12-18 07:19:55,707][DEBUG][action.admin.cluster.node.stats] [NODE1] failed to execute on node [fAblkA8gRiuWP5_IWFihrA]
org.elasticsearch.transport.RemoteTransportException: [NODE2][inet[/A.B.C.D:9300]][cluster/nodes/stats/n]
Caused by: org.elasticsearch.index.engine.EngineClosedException: [reference][24] CurrentState[CLOSED]
        at org.elasticsearch.index.engine.robin.RobinEngine.ensureOpen(RobinEngine.java:969)
        at org.elasticsearch.index.engine.robin.RobinEngine.segmentsStats(RobinEngine.java:1181)
        at org.elasticsearch.index.shard.service.InternalIndexShard.segmentStats(InternalIndexShard.java:509)
        at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:154)
        at org.elasticsearch.indices.InternalIndicesService.stats(InternalIndicesService.java:212)
        at org.elasticsearch.node.service.NodeService.stats(NodeService.java:165)
        at org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(TransportNodesStatsAction.java:100)
        at org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(TransportNodesStatsAction.java:43)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:273)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:264)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:270)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

When the response includes only one node I get five of these exceptions, 
presumably one per missing node, so the two very much seem related.
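
To check that correlation, the failing node IDs can be counted straight from the log. This is a rough sketch that assumes every failure is logged with the same "failed to execute on node [ID]" wording as the entry above. The function name is hypothetical.

```python
import re
from collections import Counter

# Matches the node ID in messages such as:
#   ... [NODE1] failed to execute on node [fAblkA8gRiuWP5_IWFihrA]
FAILED_NODE = re.compile(r"failed to execute on node \[([^\]]+)\]")


def failed_node_ids(log_text):
    """Count how often each node ID appears in a 'failed to execute' message."""
    return Counter(FAILED_NODE.findall(log_text))


# Usage (the log path is a placeholder):
#   with open("/var/log/elasticsearch/mycluster.log") as f:
#       print(failed_node_ids(f.read()))
```

If the number of distinct IDs matches the number of nodes missing from the stats response, that supports the idea that each exception corresponds to one dropped node.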

Does anybody have any idea what is happening here? Currently this cluster 
is completely unusable.
