I recently built a new Elasticsearch cluster: 3 nodes running ES 0.90.7 and
Java 7u45. Each node has 60G of memory and 2T of SSD disk. I loaded 3T of
data into an index with 128 shards on this cluster. So far so good.
I then added three more nodes (identical hardware) and reconfigured the
index to have one replica. Then all hell broke loose. The cluster starts to
replicate shards, but before it manages to complete, one of the nodes gets
stuck doing garbage collections and everything stops working. If I restart
the cluster completely, the scenario repeats.
There is a light load of queries coming into the cluster while it is trying
to stabilize, and I noticed that only some of the nodes were responding,
even though all nodes had completed shards. Digging further into this, I
noticed that when I do a GET on _cluster/nodes/stats I receive a list
showing all six nodes, as expected. But when I GET
_cluster/nodes/stats?indices=true the response only includes a subset of
the nodes. Sometimes it is three nodes; at the moment it is just one.
On the machine where I run the GET of _cluster/nodes/stats?indices=true,
I see exceptions like this in the log:
[2013-12-18 07:19:55,707][DEBUG][action.admin.cluster.node.stats] [NODE1] failed to execute on node [fAblkA8gRiuWP5_IWFihrA]
org.elasticsearch.transport.RemoteTransportException: [NODE2][inet[/A.B.C.D:9300]][cluster/nodes/stats/n]
Caused by: org.elasticsearch.index.engine.EngineClosedException: [reference][24] CurrentState[CLOSED]
    at org.elasticsearch.index.engine.robin.RobinEngine.ensureOpen(RobinEngine.java:969)
    at org.elasticsearch.index.engine.robin.RobinEngine.segmentsStats(RobinEngine.java:1181)
    at org.elasticsearch.index.shard.service.InternalIndexShard.segmentStats(InternalIndexShard.java:509)
    at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:154)
    at org.elasticsearch.indices.InternalIndicesService.stats(InternalIndicesService.java:212)
    at org.elasticsearch.node.service.NodeService.stats(NodeService.java:165)
    at org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(TransportNodesStatsAction.java:100)
    at org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(TransportNodesStatsAction.java:43)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:273)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:264)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:270)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
When the response only includes one node I get five of these, so they very
much feel related.
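
One way to check that the nodes missing from the indices=true response are
the same ones named in the log is a small script along these lines. This is
only a sketch: it assumes both stats responses have already been fetched and
parsed as JSON, that each carries a top-level "nodes" object keyed by node
ID, and that the log text contains the "failed to execute on node [...]"
lines shown above. The function names are made up for illustration.

```python
import re

# Matches the node ID in log lines such as:
#   ... [NODE1] failed to execute on node [fAblkA8gRiuWP5_IWFihrA]
FAILED_NODE = re.compile(r"failed to execute on node \[([^\]]+)\]")


def missing_nodes(all_stats, indices_stats):
    """Return node IDs present in the plain stats response but absent
    from the indices=true response (assumes a top-level "nodes" dict)."""
    all_ids = set(all_stats.get("nodes", {}))
    indices_ids = set(indices_stats.get("nodes", {}))
    return sorted(all_ids - indices_ids)


def failed_node_ids(log_text):
    """Return the distinct node IDs named in 'failed to execute' log lines."""
    return sorted(set(FAILED_NODE.findall(log_text)))


if __name__ == "__main__":
    # Hypothetical, heavily trimmed inputs; real responses carry far more.
    all_stats = {"nodes": {"id-1": {}, "id-2": {}, "id-3": {}}}
    indices_stats = {"nodes": {"id-1": {}}}
    log_text = (
        "[DEBUG][action.admin.cluster.node.stats] [NODE1] "
        "failed to execute on node [id-2]\n"
        "[DEBUG][action.admin.cluster.node.stats] [NODE1] "
        "failed to execute on node [id-3]\n"
    )

    missing = missing_nodes(all_stats, indices_stats)
    failed = failed_node_ids(log_text)
    print("missing from indices=true response:", missing)
    print("named in log failures:", failed)
    print("sets match:", missing == failed)
```

If the two lists match, the stats call is dropping exactly the nodes whose
shard engines report CLOSED, which would support the suspicion that the two
symptoms are related.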
Does anybody have any idea what is happening here? Currently this cluster
is completely unusable.