A few days ago we started receiving a lot of timeouts across our cluster. 
They're causing shard allocation to fail and leaving the cluster in a 
perpetual red/yellow state.
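
For context, this is how I'm watching the state (the standard cluster 
health API; localhost stands in for one of our coordinator nodes):

    # Overall status plus unassigned/initializing shard counts
    curl -s 'http://localhost:9200/_cluster/health?pretty'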

Examples:
[2015-04-16 15:04:50,970][DEBUG][action.admin.cluster.node.stats] [coordinator02] failed to execute on node [1rfWT-mXTZmF_NzR_h1IZw]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [search01][inet[ip-172-30-11-161.ec2.internal/172.30.11.161:9300]][cluster:monitor/nodes/stats[n]] request_id [3680727] timed out after [15001ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

[2015-04-16 15:03:26,105][WARN ][gateway.local            ] [coordinator02] [global.y2014m01d30.v2][0]: failed to list shard stores on node [1rfWT-mXTZmF_NzR_h1IZw]
org.elasticsearch.action.FailedNodeException: Failed node [1rfWT-mXTZmF_NzR_h1IZw]
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.onFailure(TransportNodesOperationAction.java:206)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.access$1000(TransportNodesOperationAction.java:97)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction$4.handleException(TransportNodesOperationAction.java:178)
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [search01][inet[ip-172-30-11-161.ec2.internal/172.30.11.161:9300]][internal:cluster/nodes/indices/shard/store[n]] request_id [3677537] timed out after [30001ms]
        ... 4 more

I believe I've tracked this down to the management thread pool being 
saturated on our data nodes, leaving them unable to respond to these 
requests. The cluster has 3 dedicated master nodes (data disabled) and 
3 worker/data nodes (master disabled). I increased the maximum management 
pool size from 5 to 20, and the workers' pools immediately grew to 20 
threads, but I'm still seeing the errors.
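
For reference, this is roughly how I raised the maximum; a sketch assuming 
the 1.x-era dynamic thread pool settings, where a scaling pool takes min 
and size (size being the ceiling), so the exact key may differ on other 
versions:

    # Raise the management pool ceiling cluster-wide; transient, so it
    # reverts on a full cluster restart (localhost is a placeholder)
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
      "transient": { "threadpool.management.size": 20 }
    }'

The thread pool stats after the change: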

host          management.type management.active management.size management.queue management.queueSize management.rejected management.largest management.completed management.min management.max management.keepAlive
coordinator01 scaling         1                 2               0                                     0                   2                  37884                1              20             5m
search02      scaling         1                 20              0                                     0                   20                 1945337              1              20             5m
search01      scaling         1                 20              0                                     0                   20                 2034838              1              20             5m
search03      scaling         1                 20              0                                     0                   20                 1862848              1              20             5m
coordinator03 scaling         1                 2               0                                     0                   2                  37875                1              20             5m
coordinator02 scaling         2                 5               0                                     0                   5                  44127                1              20             5m
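
(Those numbers come from the cat thread pool API; roughly the call I'm 
using, with h= selecting the management columns shown above:)

    # Per-node management pool stats; column names match the headers above
    curl -s 'http://localhost:9200/_cat/thread_pool?v&h=host,management.type,management.active,management.size,management.queue,management.queueSize,management.rejected,management.largest,management.completed,management.min,management.max,management.keepAlive'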

How can I address this problem?

Thanks,
     Charlie
