Hi,

I have a 3-node cluster in EC2 and am seeing frequent 
NodeNotConnectedException-related errors, which cause intermittent failures 
during indexing. I'm hoping someone knows what this is about and can help.

Thanks in advance for your help. Here are the details:

There are 3 nodes (es1, es2 and es3). All are configured with 
node.master=true and node.data=true, and es1 is the current master. All 
three nodes are running ES 1.4.2 with a 15GB heap on r3.xlarge instances, 
JDK 1.7.0_72. We are using the AWS Cloud plugin for EC2 discovery. I think 
the discovery part works fine; we haven't had problems there.

What we are seeing is that the cluster runs fine most of the time, but 
periodically (roughly once every hour or two) we see failures in the logs 
on es1 (the master node), both for indexing and for the 
[indices:monitor/stats] APIs (these are debug messages). They seem to 
happen because the connection between the master node (es1) and one of 
the other nodes is lost.

I searched this mailing list and then configured the TCP keepalive 
settings. I think it helped, but I'm not really sure, since the "node not 
connected" errors are still happening.

Here is a section of the master log that shows the exceptions:



[2015-01-08 14:02:52,203][DEBUG][action.admin.indices.stats] [es1] [alert][0], node[jAhWlTiKTASdHDQaZGVncw], [P], s[STARTED]: failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@2a694684]
org.elasticsearch.transport.NodeDisconnectedException: [es2][inet[/10.109.172.201:9300]][indices:monitor/stats[s]] disconnected

<....deleted for brevity - Bunch of these exceptions on index stats for 
each of the indexes we have....>

[2015-01-08 14:02:52,205][WARN ][action.index             ] [es1] Failed to perform indices:data/write/index on remote replica [es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[/10.109.172.201:9300]][config][3]
org.elasticsearch.transport.NodeDisconnectedException: [es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]] disconnected
[2015-01-08 14:02:52,206][WARN ][cluster.action.shard     ] [es1] [config][3] sending failed shard for [config][3], node[jAhWlTiKTASdHDQaZGVncw], [R], s[STARTED], indexUUID [xnxor01lSTC8dY-0wwPXlQ], reason [Failed to perform [indices:data/write/index] on replica, message [NodeDisconnectedException[[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]] disconnected]]]
[2015-01-08 14:02:52,206][WARN ][cluster.action.shard     ] [es1] [config][3] received shard failed for [config][3], node[jAhWlTiKTASdHDQaZGVncw], [R], s[STARTED], indexUUID [xnxor01lSTC8dY-0wwPXlQ], reason [Failed to perform [indices:data/write/index] on replica, message [NodeDisconnectedException[[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]] disconnected]]]
....
[2015-01-08 14:02:52,206][WARN ][action.index             ] [es1] Failed to perform indices:data/write/index on remote replica [es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[/10.109.172.201:9300]][origin_v0101][0]
org.elasticsearch.transport.NodeDisconnectedException: [es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]] disconnected
[2015-01-08 14:02:52,206][WARN ][cluster.action.shard     ] [es1] [origin_v0101][0] sending failed shard for [origin_v0101][0], node[jAhWlTiKTASdHDQaZGVncw], [R], s[STARTED], indexUUID [_G8gVWViS6OoX59MHJtwhA], reason [Failed to perform [indices:data/write/index] on replica, message [NodeDisconnectedException[[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]] disconnected]]]
[2015-01-08 14:02:52,206][WARN ][cluster.action.shard     ] [es1] [origin_v0101][0] received shard failed for [origin_v0101][0], node[jAhWlTiKTASdHDQaZGVncw], [R], s[STARTED], indexUUID [_G8gVWViS6OoX59MHJtwhA], reason [Failed to perform [indices:data/write/index] on replica, message [NodeDisconnectedException[[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]] disconnected]]]
[2015-01-08 14:02:52,206][WARN ][action.index             ] [es1] Failed to perform indices:data/write/index on remote replica [es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[/10.109.172.201:9300]][origin_v0101][0]
org.elasticsearch.transport.NodeDisconnectedException: [es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]] disconnected
[2015-01-08 14:02:52,206][WARN ][cluster.action.shard     ] [es1] [origin_v0101][0] sending failed shard for [origin_v0101][0], node[jAhWlTiKTASdHDQaZGVncw], [R], s[STARTED], indexUUID [_G8gVWViS6OoX59MHJtwhA], reason [Failed to perform [indices:data/write/index] on replica, message [NodeDisconnectedException[[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]] disconnected]]]
[2015-01-08 14:02:52,207][WARN ][cluster.action.shard     ] [es1] [origin_v0101][0] received shard failed for [origin_v0101][0], node[jAhWlTiKTASdHDQaZGVncw], [R], s[STARTED], indexUUID [_G8gVWViS6OoX59MHJtwhA], reason [Failed to perform [indices:data/write/index] on replica, message [NodeDisconnectedException[[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]] disconnected]]]
....
org.elasticsearch.transport.NodeDisconnectedException: [es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]] disconnected
[2015-01-08 14:02:52,230][WARN ][cluster.action.shard     ] [es1] [origin_v0101][0] sending failed shard for [origin_v0101][0], node[jAhWlTiKTASdHDQaZGVncw], [R], s[STARTED], indexUUID [_G8gVWViS6OoX59MHJtwhA], reason [Failed to perform [indices:data/write/index] on replica, message [NodeDisconnectedException[[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]] disconnected]]]
[2015-01-08 14:02:52,230][WARN ][cluster.action.shard     ] [es1] [origin_v0101][0] received shard failed for [origin_v0101][0], node[jAhWlTiKTASdHDQaZGVncw], [R], s[STARTED], indexUUID [_G8gVWViS6OoX59MHJtwhA], reason [Failed to perform [indices:data/write/index] on replica, message [NodeDisconnectedException[[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]] disconnected]]]
[2015-01-08 14:02:52,230][DEBUG][action.admin.indices.stats] [es1] [event-v1-20141227][4], node[jAhWlTiKTASdHDQaZGVncw], [R], s[STARTED]: failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@d1974d5]
org.elasticsearch.transport.NodeDisconnectedException: [es2][inet[/10.109.172.201:9300]][indices:monitor/stats[s]] disconnected
[2015-01-08 14:02:52,227][WARN ][action.index             ] [es1] Failed to perform indices:data/write/index on remote replica [es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[/10.109.172.201:9300]][origin_v0101][0]
org.elasticsearch.transport.SendRequestTransportException: [es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:213)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnReplica(TransportShardReplicationOperationAction.java:669)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performReplicas(TransportShardReplicationOperationAction.java:641)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:512)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:419)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [es2][inet[/10.109.172.201:9300]] Node not connected
    at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:946)
    at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:640)
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:199)
    ... 7 more
org.elasticsearch.transport.NodeDisconnectedException: [es2][inet[/10.109.172.201:9300]][indices:monitor/stats[s]] disconnected
[2015-01-08 14:02:52,232][WARN ][action.index             ] [es1] Failed to perform indices:data/write/index on remote replica [es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[/10.109.172.201:9300]][config][3]
org.elasticsearch.transport.NodeDisconnectedException: [es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]] disconnected
[2015-01-08 14:02:52,232][WARN ][search.action            ] [es1] Failed to send release search context
org.elasticsearch.transport.SendRequestTransportException: [es2][inet[/10.109.172.201:9300]][indices:data/read/search[free_context]]
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:213)
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:183)
    at org.elasticsearch.search.action.SearchServiceTransportAction.sendFreeContext(SearchServiceTransportAction.java:143)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.releaseIrrelevantSearchContexts(TransportSearchTypeAction.java:341)
    at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction$2.run(TransportSearchQueryThenFetchAction.java:158)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [es2][inet[/10.109.172.201:9300]] Node not connected
    at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:946)
    at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:640)
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:199)
..
<..... Bunch of these failures then we get the connection and things settle 
down again.......>
[2015-01-08 14:02:54,165][INFO ][cluster.service          ] [es1] removed {[es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[/10.109.172.201:9300]],}, reason: zen-disco-node_failed([es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[/10.109.172.201:9300]]), reason transport disconnected
[2015-01-08 14:03:27,330][INFO ][cluster.service          ] [es1] added {[es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[/10.109.172.201:9300]],}, reason: zen-disco-receive(join from node[[es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[/10.109.172.201:9300]]])

At the same time on the disconnecting node - es2 - the logs are fairly 
minimal/quiet:

[2015-01-08 14:02:55,431][INFO ][discovery.ec2            ] [es2] master_left [[es1][MJ5njwAqQ_-10imG53WDGw][ip-10-152-16-37][inet[/10.152.16.37:9300]]], reason [do not exists on master, act as master failure]
[2015-01-08 14:02:55,431][WARN ][discovery.ec2            ] [es2] master left (reason = do not exists on master, act as master failure), current nodes: {[es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[/10.109.172.201:9300]],[es3][XVZWtpq7Sc28Cj6C2wd42A][ip-10-79-189-47][inet[/10.79.189.47:9300]]{master=true},}
[2015-01-08 14:02:55,432][INFO ][cluster.service          ] [es2] removed {[es1][MJ5njwAqQ_-10imG53WDGw][ip-10-152-16-37][inet[/10.152.16.37:9300]],}, reason: zen-disco-master_failed ([es1][MJ5njwAqQ_-10imG53WDGw][ip-10-152-16-37][inet[/10.152.16.37:9300]])
[2015-01-08 14:03:25,884][INFO ][cluster.service          ] [es2] detected_master [es1][MJ5njwAqQ_-10imG53WDGw][ip-10-152-16-37][inet[/10.152.16.37:9300]], added {[es1][MJ5njwAqQ_-10imG53WDGw][ip-10-152-16-37][inet[/10.152.16.37:9300]],}, reason: zen-disco-receive(from master [[es1][MJ5njwAqQ_-10imG53WDGw][ip-10-152-16-37][inet[/10.152.16.37:9300]]])
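
To put a number on how long es2 was out of the cluster, the timestamps in 
the "removed"/"added" lines on es1 and the "master_left"/"detected_master" 
lines on es2 can be subtracted. The following is only a rough sketch: the 
helper name seconds_between is made up for illustration, and the four 
timestamps are copied by hand from the log lines above.

```python
from datetime import datetime

# Timestamp layout used by the Elasticsearch log lines above,
# e.g. "2015-01-08 14:02:54,165" (comma before the milliseconds).
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def seconds_between(start, end):
    """Return the elapsed seconds between two log timestamps (hypothetical helper)."""
    t0 = datetime.strptime(start, LOG_TS_FORMAT)
    t1 = datetime.strptime(end, LOG_TS_FORMAT)
    return (t1 - t0).total_seconds()


# es1 (master): es2 "removed" -> es2 "added"
es1_gap = seconds_between("2015-01-08 14:02:54,165", "2015-01-08 14:03:27,330")

# es2: "master_left" -> "detected_master"
es2_gap = seconds_between("2015-01-08 14:02:55,431", "2015-01-08 14:03:25,884")

if __name__ == "__main__":
    print("es1 saw es2 gone for %.3f s" % es1_gap)
    print("es2 had no master for %.3f s" % es2_gap)
```

By this arithmetic es1 had es2 out of the cluster for roughly 33 seconds, 
and es2 was without a master for roughly 30 seconds. Whether that ~30 
second figure is related to discovery.zen.ping.timeout: 30s is only a 
guess.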


*sysctl.conf changes:*
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_probes = 6
net.ipv4.tcp_keepalive_intvl = 10
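
As a sanity check on what these three values mean together, here is a 
small sketch of the usual Linux keepalive arithmetic. It assumes the 
standard kernel behaviour (first probe after tcp_keepalive_time seconds of 
idle, then one probe every tcp_keepalive_intvl seconds, giving up after 
tcp_keepalive_probes unanswered probes) and that the socket has 
SO_KEEPALIVE enabled. The function names are made up for illustration.

```python
# Values copied from the sysctl.conf changes above.
KEEPALIVE_TIME_S = 60   # net.ipv4.tcp_keepalive_time
KEEPALIVE_PROBES = 6    # net.ipv4.tcp_keepalive_probes
KEEPALIVE_INTVL_S = 10  # net.ipv4.tcp_keepalive_intvl


def first_probe_after(time_s):
    """Idle seconds before the kernel sends the first keepalive probe."""
    return time_s


def dead_peer_detected_after(time_s, probes, intvl_s):
    """Approximate idle seconds before an unresponsive peer is dropped."""
    return time_s + probes * intvl_s


if __name__ == "__main__":
    print("first probe after %d s idle" % first_probe_after(KEEPALIVE_TIME_S))
    print("dead peer dropped after about %d s" % dead_peer_detected_after(
        KEEPALIVE_TIME_S, KEEPALIVE_PROBES, KEEPALIVE_INTVL_S))
```

With the values above, an idle connection gets its first probe after 60 
seconds, and a peer that never answers would be dropped after about 120 
seconds. With the Linux defaults (7200, 9, 75) the same formula gives 
7875 seconds.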


*Here are our elasticsearch.yml config parameters:*
action.disable_delete_all_indices: true
node.name: [es1 OR es2 OR es3]
path.data: <data paths....>
gateway.type: local
gateway.recover_after_nodes: 2
gateway.recover_after_time: 10m
gateway.expected_nodes: 3
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.timeout: 30s
discovery.zen.ping.multicast.enabled: false
cloud:
    aws:
        access_key: <our key>
        secret_key: <our key>
discovery.type: ec2
discovery.ec2.groups: <group_name>
discovery.ec2.tag.elasticsearch: true
repositories:
    s3:
        bucket: <bucketname>
        region: <region-name>
        base_path: <backuppath>
index.search.slowlog.threshold.query.warn: 10
#
# we plan to raise this but set currently lower than the RAM of 15GB would
# allow
#
indices.fielddata.cache.size: 4.8GB
indices.fielddata.breaker.limit: 5.5GB 
http.cors.enabled: true
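
For reference, the discovery.zen.minimum_master_nodes: 2 setting above 
matches the usual quorum rule of (master-eligible nodes / 2) + 1 for a 
3-node cluster where every node has node.master=true. A minimal sketch of 
that check follows; the function names are hypothetical and this is not 
part of any Elasticsearch API.

```python
def quorum(master_eligible_nodes):
    """Smallest majority of the master-eligible nodes: floor(n / 2) + 1."""
    return master_eligible_nodes // 2 + 1


def minimum_master_nodes_ok(master_eligible_nodes, configured):
    """True if the configured value is at least a majority (guards against split brain)."""
    return configured >= quorum(master_eligible_nodes)


if __name__ == "__main__":
    nodes = 3        # es1, es2, es3, all with node.master=true
    configured = 2   # discovery.zen.minimum_master_nodes in elasticsearch.yml
    print("quorum for %d nodes: %d" % (nodes, quorum(nodes)))
    print("configured value is a majority: %s" % minimum_master_nodes_ok(nodes, configured))
```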



