Hi,

Multiple times we have run into a problem where our search cluster ends up in an 
inconsistent state. We have 3 nodes (all running 1.0.1); nodes 2 and 3 hold the 
data (each holds all the shards, i.e. one replica per shard). Occasionally a 
long GC run on one of the nodes (here on node 3) causes it to be disconnected, 
because the GC pause exceeds the fault-detection timeout (here the GC took 
35.1s, while our timeout is currently 9s with 2 retries, so a node is declared 
failed after roughly 2 x 9s = 18s of unresponsiveness):


NODE 1
[2014-03-27 00:55:41,032][WARN ][discovery.zen            ] [node1] received cluster state from [[node2][A45sMYqtQsGrwY5exK0sEg][node2][inet[/10.216.32.81:9300]]{master=true}] which is also master but with an older cluster_state, telling [[node2][A45sMYqtQsGrwY5exK0sEg][node2][inet[/10.216.32.81:9300]]{master=true}] to rejoin the cluster
[2014-03-27 00:55:41,033][WARN ][discovery.zen            ] [node1] failed to send rejoin request to [[node2][A45sMYqtQsGrwY5exK0sEg][node2][inet[/10.216.32.81:9300]]{master=true}]
org.elasticsearch.transport.SendRequestTransportException: [node2][inet[/10.216.32.81:9300]][discovery/zen/rejoin]
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:202)
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:173)
        at org.elasticsearch.discovery.zen.ZenDiscovery$7.execute(ZenDiscovery.java:556)
        at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:308)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:134)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [node2][inet[/10.216.32.81:9300]] Node not connected
        at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:859)
        at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:540)
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:189)
        ... 7 more
[2014-03-27 01:54:45,722][WARN ][discovery.zen            ] [node1] received cluster state from [[node2][A45sMYqtQsGrwY5exK0sEg][node2][inet[/10.216.32.81:9300]]{master=true}] which is also master but with an older cluster_state, telling [[node2][A45sMYqtQsGrwY5exK0sEg][node2][inet[/10.216.32.81:9300]]{master=true}] to rejoin the cluster
[2014-03-27 01:54:45,723][WARN ][discovery.zen            ] [node1] failed to send rejoin request to [[node2][A45sMYqtQsGrwY5exK0sEg][node2][inet[/10.216.32.81:9300]]{master=true}]
org.elasticsearch.transport.SendRequestTransportException: [node2][inet[/10.216.32.81:9300]][discovery/zen/rejoin]
        ... (same NodeNotConnectedException stack trace as above)
[2014-03-27 07:19:02,889][WARN ][discovery.zen            ] [node1] received cluster state from [[node2][A45sMYqtQsGrwY5exK0sEg][node2][inet[/10.216.32.81:9300]]{master=true}] which is also master but with an older cluster_state, telling [[node2][A45sMYqtQsGrwY5exK0sEg][node2][inet[/10.216.32.81:9300]]{master=true}] to rejoin the cluster
[2014-03-27 07:19:02,889][WARN ][discovery.zen            ] [node1] failed to send rejoin request to [[node2][A45sMYqtQsGrwY5exK0sEg][node2][inet[/10.216.32.81:9300]]{master=true}]
org.elasticsearch.transport.SendRequestTransportException: [node2][inet[/10.216.32.81:9300]][discovery/zen/rejoin]
        ... (same NodeNotConnectedException stack trace as above)


NODE 2
[2014-03-27 07:19:02,871][INFO ][cluster.service          ] [node2] removed {[node3][RRqWlTWnQ7ygvsOaJS0_mA][node3][inet[/10.235.38.84:9300]]{master=true},}, reason: zen-disco-node_failed([node3][RRqWlTWnQ7ygvsOaJS0_mA][node3][inet[/10.235.38.84:9300]]{master=true}), reason failed to ping, tried [2] times, each with maximum [9s] timeout


NODE 3
[2014-03-27 07:19:20,055][WARN ][monitor.jvm              ] [node3] [gc][old][539697][754] duration [35.1s], collections [1]/[35.8s], total [35.1s]/[2.7m], memory [4.9gb]->[4.2gb]/[7.9gb], all_pools {[young] [237.8mb]->[7.4mb]/[266.2mb]}{[survivor] [25.5mb]->[0b]/[33.2mb]}{[old] [4.6gb]->[4.2gb]/[7.6gb]}
[2014-03-27 07:19:20,112][INFO ][discovery.zen            ] [node3] master_left [[node2][A45sMYqtQsGrwY5exK0sEg][node2][inet[/10.216.32.81:9300]]{master=true}], reason [do not exists on master, act as master failure]
[2014-03-27 07:19:20,117][INFO ][cluster.service          ] [node3] master {new [node1][DxlcpaqOTmmpNSRoqt1sZg][node1.example][inet[/10.252.78.88:9300]]{data=false, master=true}, previous [node2][A45sMYqtQsGrwY5exK0sEg][node2][inet[/10.216.32.81:9300]]{master=true}}, removed {[node2][A45sMYqtQsGrwY5exK0sEg][node2][inet[/10.216.32.81:9300]]{master=true},}, reason: zen-disco-master_failed ([node2][A45sMYqtQsGrwY5exK0sEg][node2][inet[/10.216.32.81:9300]]{master=true})


After this scenario, the cluster does not recover properly. Worst of all, node 1 
sees nodes 1+3, node 2 sees nodes 1+2, and node 3 sees nodes 1+3. Since 
minimum_master_nodes is set to 2 and each of these views contains two nodes, 
both data nodes 2 and 3 keep accepting indexing and search requests, which 
produces inconsistent results and forces us to do a full cluster restart and 
reindex all production data to get the cluster consistent again.


NODE 1 (GET /_nodes):
{
  "cluster_name" : "elasticsearch",
  "nodes" : {
    "DxlcpaqOTmmpNSRoqt1sZg" : {
      "name" : "node1",
      ...
    },
    "RRqWlTWnQ7ygvsOaJS0_mA" : {
      "name" : "node3",
      ...
    }
  }
}

NODE 2 (GET /_nodes):
{
  "cluster_name" : "elasticsearch",
  "nodes" : {
    "A45sMYqtQsGrwY5exK0sEg" : {
      "name" : "node2",
      ...
    },
    "DxlcpaqOTmmpNSRoqt1sZg" : {
      "name" : "node1",
      ...
    }
  }
}

NODE 3 (GET /_nodes):
{
  "cluster_name" : "elasticsearch",
  "nodes" : {
    "DxlcpaqOTmmpNSRoqt1sZg" : {
      "name" : "node1",
      ...
    },
    "RRqWlTWnQ7ygvsOaJS0_mA" : {
      "name" : "node3",
      ...
    }
  }
}
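

To spot this divergence quickly, here is a minimal sketch (Python 3) of a check 
that queries GET /_nodes on every host and compares the membership each one 
reports. It assumes HTTP is reachable on port 9200 on the hosts from our 
unicast list; adjust hosts/port as needed:

import json
import urllib.request

HOSTS = ["node1.example", "node2.example", "node3.example"]

def node_names(host):
    """Return the set of node names this host currently sees in the cluster."""
    url = "http://%s:9200/_nodes" % host
    with urllib.request.urlopen(url, timeout=5) as resp:
        body = json.load(resp)
    return {info["name"] for info in body["nodes"].values()}

views = {host: node_names(host) for host in HOSTS}
for host, seen in sorted(views.items()):
    print("%s sees: %s" % (host, sorted(seen)))

# If the hosts disagree about who is in the cluster, something is wrong.
if len({frozenset(seen) for seen in views.values()}) > 1:
    print("WARNING: nodes report different cluster memberships (split brain?)")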


Here are the configurations:

BASE CONFIG (for all nodes):
action:
  disable_delete_all_indices: true
discovery:
  zen:
    fd:
      ping_retries: 2
      ping_timeout: 9s
    minimum_master_nodes: 2
    ping:
      multicast:
        enabled: false
      unicast:
        hosts: ["node1.example", "node2.example", "node3.example"]
index:
  fielddata:
    cache: node
indices:
  fielddata:
    cache:
      size: 40%
  memory:
    index_buffer_size: 20%
threadpool:
  bulk:
    queue_size: 100
    type: fixed
transport:
  tcp:
    connect_timeout: 3s

NODE 1:
node:
  data: false
  master: true
  name: node1

NODE 2:
node:
  data: true
  master: true
  name: node2

NODE 3:
node:
  data: true
  master: true
  name: node3
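

For reference, my understanding is that the nested discovery settings above are 
equivalent to these flat keys in elasticsearch.yml (listing them mainly to rule 
out a nesting mistake on our side):

discovery.zen.fd.ping_retries: 2
discovery.zen.fd.ping_timeout: 9s
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["node1.example", "node2.example", "node3.example"]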


Questions:
1) What can we do to minimize long GC runs, so the nodes don't become 
unresponsive and get disconnected in the first place? (FYI: our index is 
currently about 80 GB in size with over 2M docs per node, 60 shards, and an 
8 GB heap. We run both searches and aggregations on it; see the log-scan sketch 
after the questions for the kind of pauses we mean.)
2) Obviously, having the cluster in a state like the one above is unacceptable, 
so we want to make sure that even if a node is disconnected because of GC, the 
cluster can fully recover, and only one of the two data nodes accepts data and 
searches while the other is disconnected. Does anything need to be changed in 
the Elasticsearch code to fix this issue?
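

Regarding question 1, purely to illustrate what we are seeing: a rough sketch 
(Python 3) that pulls old-gen GC pauses like the 35.1s one above out of an 
Elasticsearch log file. The 20s threshold and the log path are placeholders:

import re
import sys

# Matches monitor.jvm lines such as:
# [...][WARN ][monitor.jvm ] [node3] [gc][old][539697][754] duration [35.1s], ...
GC_LINE = re.compile(r"\[monitor\.jvm\s*\].*\[gc\]\[old\].*?duration \[([\d.]+)(s|m)\]")
THRESHOLD_SECONDS = 20.0  # arbitrary; anything near the ~18s fd window is trouble

def long_gc_pauses(path):
    """Yield (seconds, log line) for old-gen GC pauses above the threshold."""
    with open(path) as log:
        for line in log:
            match = GC_LINE.search(line)
            if not match:
                continue
            value, unit = float(match.group(1)), match.group(2)
            seconds = value * 60 if unit == "m" else value
            if seconds >= THRESHOLD_SECONDS:
                yield seconds, line.rstrip()

if __name__ == "__main__":
    for seconds, line in long_gc_pauses(sys.argv[1]):
        print("GC pause of %.1fs: %s" % (seconds, line))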

Thanks,
Thomas
