We are running into an issue where our client nodes will stop responding to 
requests that require checking with other nodes. The following is our setup:

1 dedicated master node

2 dedicated data nodes

3 client nodes (master: false; data: false)

Everything is fine and happy for a while, and then after 30-45 minutes, one 
of the client nodes will stop sending responses to queries that require 
talking with other nodes. We are using the HTTP REST API. When things go 
badly, the following will hang:

curl -XGET 'http://localhost:9200/_search?size=1'

curl -XGET 'http://localhost:9200/_cat/thread_pool?v'

But the following will succeed (as it can just use metadata on the node 
itself):

curl -XGET 'http://localhost:9200/_cluster/health?pretty=1'
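
To catch the moment a client node goes bad without tying up a terminal on a hung request, the same three calls can be wrapped with a timeout. This is only a rough sketch: the function names, the BASE_URL and MAX_TIME variables, and the 5-second limit are illustrative assumptions, not part of our actual setup. It relies on curl's --max-time option and on curl exiting with status 28 when that limit is hit.

```shell
# Assumed defaults: the client node's HTTP port from the commands above,
# and an arbitrary 5-second limit per request.
BASE_URL="${BASE_URL:-http://localhost:9200}"
MAX_TIME="${MAX_TIME:-5}"

# Map a curl exit status to a short label.
# 0 = request completed, 28 = curl gave up after --max-time.
classify() {
  case "$1" in
    0)  echo "OK" ;;
    28) echo "HANG" ;;
    *)  echo "ERROR($1)" ;;
  esac
}

# Issue one GET with a hard timeout and print a timestamped result line.
probe() {
  curl -s -o /dev/null --max-time "$MAX_TIME" "${BASE_URL}$1"
  rc=$?
  echo "$(date '+%H:%M:%S') $(classify "$rc") $1"
}

# Run the health call (answered locally) and the two calls that need
# other nodes, in that order.
probe_all() {
  probe '/_cluster/health?pretty=1'
  probe '/_search?size=1'
  probe '/_cat/thread_pool?v'
}
```

Calling probe_all once a minute (for example from a loop with sleep 60) would be expected to show OK for the health call and HANG for the other two once a node is in the bad state.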

The problem node doesn’t seem to have any CPU or IO load. We don’t seem to 
be running into heap issues. netstat doesn’t report any connections in 
TIME_WAIT on any of the nodes. If we run queries from the command prompt on 
the problem client node directly against a data node, everything works. So, 
if we instead run:

curl -XGET 'http://data.node.ip:9200/_search?size=1'

It works as expected. This tells me there isn’t a socket exhaustion issue 
since we can make new connections from the problem node to other nodes.
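
For reference, the TIME_WAIT check mentioned above can be done by tallying socket states from netstat output. The helper below is a hypothetical sketch (the name count_states is made up for illustration) and assumes Linux-style `netstat -ant` output, where the state is the last column on each tcp line.

```shell
# Read `netstat -ant` output on stdin and print one "STATE count" line
# per TCP state, sorted by state name. Header lines are skipped because
# they do not start with "tcp".
count_states() {
  awk '$1 ~ /^tcp/ { count[$NF]++ }
       END { for (s in count) print s, count[s] }' | sort
}
```

Typical use would be `netstat -ant | count_states` on each node. A large TIME_WAIT or CLOSE_WAIT count would point toward a socket problem, while a short list dominated by ESTABLISHED and LISTEN matches what we see.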

We turned logging all the way up (“ALL”) on one of the client nodes until it 
started failing, but there was nothing in there of interest. The last few 
minutes just had messages about the idle connection reaper running every 
minute.

We tried increasing the various connections_per_node values to:

transport.connections_per_node.bulk => 6

transport.connections_per_node.reg => 12

transport.connections_per_node.state => 2

transport.connections_per_node.ping => 2

This made no noticeable difference.


When one of the client nodes has started having problems, the cluster still 
sees the node as part of the cluster. When we kill the ES process on that 
node, all the other nodes then notice it went away as expected. When we 
restart ES on the problem node, it comes back up and everything works great 
for another 30-45 minutes.
