A few days ago we started receiving a lot of timeouts across our cluster. They are causing shard allocation to fail and leaving the cluster in a perpetual red/yellow state.
Examples:

[2015-04-16 15:04:50,970][DEBUG][action.admin.cluster.node.stats] [coordinator02] failed to execute on node [1rfWT-mXTZmF_NzR_h1IZw]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [search01][inet[ip-172-30-11-161.ec2.internal/172.30.11.161:9300]][cluster:monitor/nodes/stats[n]] request_id [3680727] timed out after [15001ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

[2015-04-16 15:03:26,105][WARN ][gateway.local ] [coordinator02] [global.y2014m01d30.v2][0]: failed to list shard stores on node [1rfWT-mXTZmF_NzR_h1IZw]
org.elasticsearch.action.FailedNodeException: Failed node [1rfWT-mXTZmF_NzR_h1IZw]
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.onFailure(TransportNodesOperationAction.java:206)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.access$1000(TransportNodesOperationAction.java:97)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction$4.handleException(TransportNodesOperationAction.java:178)
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [search01][inet[ip-172-30-11-161.ec2.internal/172.30.11.161:9300]][internal:cluster/nodes/indices/shard/store[n]] request_id [3677537] timed out after [30001ms]
        ... 4 more

I believe I have tracked this down to the management thread pool being saturated on our data nodes and not responding to requests. Our cluster has three master-only nodes (no data) and three data-only worker nodes (no master). I increased the maximum pool size from 5 to 20, and the pools on the worker nodes immediately grew to 20 threads, but I'm still seeing the errors.

host management.type management.active management.size management.queue management.queueSize management.rejected management.largest management.completed management.min management.max management.keepAlive
coordinator01 scaling 1 2 0 0 2 37884 1 20 5m
search02 scaling 1 20 0 0 20 1945337 1 20 5m
search01 scaling 1 20 0 0 20 2034838 1 20 5m
search03 scaling 1 20 0 0 20 1862848 1 20 5m
coordinator03 scaling 1 2 0 0 2 37875 1 20 5m
coordinator02 scaling 2 5 0 0 5 44127 1 20 5m

How can I address this problem?

Thanks,
Charlie
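
P.S. For reference, the stats above came from the _cat thread_pool API (first command below). The pool-size bump can be set in elasticsearch.yml or applied live through the cluster settings API; a rough sketch of the dynamic form is the second command. I'm assuming threadpool.management.size is the right key for raising the ceiling of the scaling management pool, so please correct me if there's a better knob:

    # Per-node management pool stats (same columns as the table above)
    curl -s 'localhost:9200/_cat/thread_pool?v&h=host,management.type,management.active,management.size,management.queue,management.queueSize,management.rejected,management.largest,management.completed,management.min,management.max,management.keepAlive'

    # Raise the management pool limit from 5 to 20 without a restart
    # (threadpool.management.size is my best guess at the setting name)
    curl -s -XPUT 'localhost:9200/_cluster/settings' -d '{
      "transient": {
        "threadpool.management.size": 20
      }
    }'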