We are running into an issue where our client nodes will stop responding to requests that require checking with other nodes. The following is our setup:
1 dedicated master node
2 dedicated data nodes
3 client nodes (master: false; data: false)

Everything is fine and happy for a while, and then after 30-45 minutes, one of the client nodes will stop sending responses to queries that require talking with other nodes. We are using the HTTP REST API.

When things go badly, the following will hang:

    curl -XGET 'http://localhost:9200/_search?size=1'
    curl -XGET 'http://localhost:9200/_cat/thread_pool?v'

But the following will succeed (as it can just use metadata on the node itself):

    curl -XGET 'http://localhost:9200/_cluster/health?pretty=1'

The problem node doesn't seem to have any CPU or IO load. We don't seem to be running into heap issues. netstat doesn't report any connections in TIME_WAIT on any of the nodes.

If we run queries from the command prompt on the problem client node, aimed directly at a data node, everything works. So, if we instead run:

    curl -XGET 'http://data.node.ip:9200/_search?size=1'

It works as expected. This tells me there isn't a socket exhaustion issue, since we can make new connections from the problem node to other nodes.

We turned logging all the way up ("ALL") on one of the client nodes until it started failing, but there was nothing of interest in the output. The last few minutes just had messages about the idle connection reaper running every minute.

We tried increasing the various connections_per_node values to:

    transport.connections_per_node.bulk => 6
    transport.connections_per_node.reg => 12
    transport.connections_per_node.state => 2
    transport.connections_per_node.ping => 2

This made no noticeable difference.

When one of the client nodes has started having problems, the cluster still sees the node as part of the cluster. When we kill the ES process on that node, all the other nodes then notice it went away, as expected. When we restart ES on the problem node, it comes back up and everything works great for another 30-45 minutes.
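
For anyone trying to reproduce this, the symptom above (the local health request succeeds while a request that fans out to other nodes hangs) can be checked for with a small script. This is only a rough sketch, not something we run in production: the function names (probe, classify, check_node), the 5-second timeout, and the default host are illustrative assumptions, and "hung" here just means "health answered but search timed out".

```shell
#!/usr/bin/env bash
# Sketch: detect the "client node answers locally but hangs on fan-out" state.
# All names, the default host, and the default timeout are illustrative.

# probe URL [TIMEOUT_SECONDS] -> prints "ok", "timeout", or "error"
probe() {
  local url="$1" timeout="${2:-5}" rc=0
  # -s: quiet, -f: treat HTTP errors as failures, --max-time: overall time limit
  curl -s -f -o /dev/null --max-time "$timeout" "$url" || rc=$?
  case "$rc" in
    0)  echo ok ;;
    28) echo timeout ;;   # curl exit code 28 means the operation timed out
    *)  echo error ;;
  esac
}

# classify HEALTH_STATUS SEARCH_STATUS -> prints "healthy", "hung", or "unknown"
classify() {
  local health="$1" search="$2"
  if [ "$health" = ok ] && [ "$search" = ok ]; then
    echo healthy
  elif [ "$health" = ok ] && [ "$search" = timeout ]; then
    echo hung             # node-local request works, fan-out request hangs
  else
    echo unknown
  fi
}

# check_node [BASE_URL] [TIMEOUT_SECONDS] -> probes both endpoints and classifies
check_node() {
  local host="${1:-http://localhost:9200}" timeout="${2:-5}"
  classify "$(probe "$host/_cluster/health" "$timeout")" \
           "$(probe "$host/_search?size=1" "$timeout")"
}
```

Sourcing the file and running `check_node` on each client node from cron (or in a loop) would at least record roughly when a node enters the bad state, which could then be lined up against the logs on the data nodes.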
