This was tracked down to a problem with Ubuntu 14.04 running under Xen (in 
AWS). The latest Ubuntu kernel fixes the bug, so I did a rolling 
"apt-get update; apt-get dist-upgrade; reboot" across all nodes. This 
appears to have resolved the issue.
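
In case it helps anyone, here is a rough sketch of that rolling upgrade. 
The node names, root SSH access, and the wait loop are assumptions; the 
actual commands are just the ones above:

```shell
#!/bin/sh
# Rolling kernel upgrade sketch: one node at a time, so the cluster
# keeps quorum while each node reboots. Node names are hypothetical.
NODES="coordinator01 coordinator02 coordinator03 search01 search02 search03"

upgrade_cmd() {
  # The same sequence mentioned above; -y skips interactive prompts.
  echo "apt-get update && apt-get -y dist-upgrade && reboot"
}

for node in $NODES; do
  echo "upgrading $node: $(upgrade_cmd)"
  # ssh "root@$node" "$(upgrade_cmd)"             # run for real
  # until ssh "root@$node" true 2>/dev/null; do   # wait for the node
  #   sleep 10                                    # to come back before
  # done                                          # moving on
done
```

The key point is the one-node-at-a-time loop: rebooting several data 
nodes at once would take shards offline together.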

For reference: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1317811
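
To confirm the management pools settled down after the upgrade, I 
re-checked the same stats as in the table quoted below. A sketch of the 
request (host and port are assumptions; the column names match the 
quoted output):

```shell
#!/bin/sh
# Sketch: query management thread pool stats via the _cat API.
# ES_HOST is a placeholder for whatever node you can reach.
ES_HOST="localhost:9200"
COLS="host,management.type,management.active,management.size,management.queue,management.rejected,management.completed"
URL="http://$ES_HOST/_cat/thread_pool?v&h=$COLS"
echo "$URL"
# curl -s "$URL"   # uncomment to run against a live cluster
```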

On Thursday, April 16, 2015 at 11:20:06 AM UTC-4, Charlie Moad wrote:
>
> A few days ago we started to receive a lot of timeouts across our cluster. 
> This is causing shard allocation to fail and a perpetual red/yellow state.
>
> Examples:
> [2015-04-16 15:04:50,970][DEBUG][action.admin.cluster.node.stats] [coordinator02] failed to execute on node [1rfWT-mXTZmF_NzR_h1IZw]
> org.elasticsearch.transport.ReceiveTimeoutTransportException: [search01][inet[ip-172-30-11-161.ec2.internal/172.30.11.161:9300]][cluster:monitor/nodes/stats[n]] request_id [3680727] timed out after [15001ms]
>         at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
>
> [2015-04-16 15:03:26,105][WARN ][gateway.local            ] [coordinator02] [global.y2014m01d30.v2][0]: failed to list shard stores on node [1rfWT-mXTZmF_NzR_h1IZw]
> org.elasticsearch.action.FailedNodeException: Failed node [1rfWT-mXTZmF_NzR_h1IZw]
>         at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.onFailure(TransportNodesOperationAction.java:206)
>         at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.access$1000(TransportNodesOperationAction.java:97)
>         at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction$4.handleException(TransportNodesOperationAction.java:178)
>         at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [search01][inet[ip-172-30-11-161.ec2.internal/172.30.11.161:9300]][internal:cluster/nodes/indices/shard/store[n]] request_id [3677537] timed out after [30001ms]
>         ... 4 more
>
> I believe I have tracked this down to the management thread pool being 
> saturated on our data nodes, which then stop responding to requests. Our 
> cluster has 3 master-only nodes (no data) and 3 data-only worker nodes 
> (no master). I increased the maximum pool size from 5 to 20, and the 
> workers' pools immediately jumped to 20 active threads. I'm still seeing 
> the errors.
>
> host            type     active  size  queue  queueSize  rejected  largest  completed  min  max  keepAlive
> coordinator01   scaling  1       2     0                 0         2        37884      1    20   5m
> search02        scaling  1       20    0                 0         20       1945337    1    20   5m
> search01        scaling  1       20    0                 0         20       2034838    1    20   5m
> search03        scaling  1       20    0                 0         20       1862848    1    20   5m
> coordinator03   scaling  1       2     0                 0         2        37875      1    20   5m
> coordinator02   scaling  2       5     0                 0         5        44127      1    20   5m
> (all columns after "host" are the management.* fields)
>
> How can I address this problem?
>
> Thanks,
>      Charlie
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/c49d0468-2d02-49f7-8356-4b9865842eb0%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
