This was tracked down to a problem with Ubuntu 14.04 running under Xen (on AWS). The latest Ubuntu kernel fixes it, so I did a rolling "apt-get update; apt-get dist-upgrade; reboot" across all nodes, which resolved the issue.
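For anyone else needing to do the same, here is a minimal sketch of the rolling upgrade, assuming passwordless SSH to each node and Elasticsearch listening on port 9200. The host names and the health check are illustrative, not our exact script:

    #!/bin/bash
    # Hypothetical node list -- substitute your own hosts.
    NODES="search01 search02 search03 coordinator01 coordinator02 coordinator03"

    for node in $NODES; do
        echo "Upgrading $node ..."
        ssh "$node" 'sudo apt-get update && sudo apt-get -y dist-upgrade && sudo reboot'

        # Wait for the node to come back and the cluster to settle
        # before moving on, so only one node is ever down at a time.
        sleep 60
        until curl -s "http://$node:9200/_cluster/health" | grep -q '"status":"green"'; do
            echo "Waiting for green after rebooting $node ..."
            sleep 10
        done
    done

Waiting for green between reboots keeps shard recovery from piling up on top of the timeouts.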
For reference: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1317811

On Thursday, April 16, 2015 at 11:20:06 AM UTC-4, Charlie Moad wrote:
>
> A few days ago we started to receive a lot of timeouts across our cluster.
> This is causing shard allocation to fail and a perpetual red/yellow state.
>
> Examples:
>
> [2015-04-16 15:04:50,970][DEBUG][action.admin.cluster.node.stats] [coordinator02] failed to execute on node [1rfWT-mXTZmF_NzR_h1IZw]
> org.elasticsearch.transport.ReceiveTimeoutTransportException: [search01][inet[ip-172-30-11-161.ec2.internal/172.30.11.161:9300]][cluster:monitor/nodes/stats[n]] request_id [3680727] timed out after [15001ms]
>     at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
>
> [2015-04-16 15:03:26,105][WARN ][gateway.local ] [coordinator02] [global.y2014m01d30.v2][0]: failed to list shard stores on node [1rfWT-mXTZmF_NzR_h1IZw]
> org.elasticsearch.action.FailedNodeException: Failed node [1rfWT-mXTZmF_NzR_h1IZw]
>     at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.onFailure(TransportNodesOperationAction.java:206)
>     at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.access$1000(TransportNodesOperationAction.java:97)
>     at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction$4.handleException(TransportNodesOperationAction.java:178)
>     at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [search01][inet[ip-172-30-11-161.ec2.internal/172.30.11.161:9300]][internal:cluster/nodes/indices/shard/store[n]] request_id [3677537] timed out after [30001ms]
>     ... 4 more
>
> I believe I have tracked this down to the management thread pool being saturated on our data nodes and not responding to requests. Our cluster has 3 master nodes (no data) and 3 worker nodes (data only, no master). I increased the maximum pool size from 5 to 20 and the workers immediately jumped to 20. I'm still seeing the errors.
>
> host           management.type  management.active  management.size  management.queue  management.queueSize  management.rejected  management.largest  management.completed  management.min  management.max  management.keepAlive
> coordinator01  scaling          1                  2                0                                       0                    2                   37884                 1               20              5m
> search02       scaling          1                  20               0                                       0                    20                  1945337               1               20              5m
> search01       scaling          1                  20               0                                       0                    20                  2034838               1               20              5m
> search03       scaling          1                  20               0                                       0                    20                  1862848               1               20              5m
> coordinator03  scaling          1                  2                0                                       0                    2                   37875                 1               20              5m
> coordinator02  scaling          2                  5                0                                       0                    5                   44127                 1               20              5m
>
> How can I address this problem?
>
> Thanks,
> Charlie
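For anyone landing here with similar symptoms: the per-node pool stats quoted above come from the _cat thread pool API, and on 1.x the pool ceiling could be raised at runtime via a cluster settings update. A sketch only; "threadpool.management.size" is my best recollection of the 1.x setting key, so verify it against the docs for your version:

    # Dump management pool activity per node (the table quoted above).
    curl -s 'http://localhost:9200/_cat/thread_pool?v&h=host,management.type,management.active,management.size,management.queue,management.rejected,management.largest,management.completed'

    # Raise the ceiling without a restart (ES 1.x; assumes the setting
    # is dynamically updatable in your version).
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
      "transient": { "threadpool.management.size": 20 }
    }'

That said, as noted above, bumping the pool only masked the symptom for us; the real fix was the kernel upgrade.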