Hello,

I am doing a bit of work to harden our Elasticsearch cluster configuration. 
We've had issues with split-brain scenarios in the past. We have clusters 
with 3 or 5 nodes where any node can be handling index and search 
operations. We have used the minimum_master_nodes setting to try to 
alleviate the problem, but it did not work for us; we still saw split brains.
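
For reference, the usual guidance for that setting is a strict majority of the master-eligible nodes. Below is a minimal sketch of the arithmetic for the two cluster sizes mentioned, assuming every node is master-eligible. The function name is made up for illustration; the result is the value one would put in discovery.zen.minimum_master_nodes.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Strict majority (quorum) of the master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


for size in (3, 5):
    # Assumption: all nodes in the cluster are master-eligible.
    print(size, "nodes -> quorum of", minimum_master_nodes(size))
```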

We are using ES 1.1.0 with the latest ZooKeeper plugin. I have a test 
cluster of 3 nodes running on m3.large instances with 500 GB EBS volumes. I 
have tested a few scenarios so far, all of which have performed as expected:

   1. Communication between a node and ZK going down. After a short timeout 
   (~30 seconds) the node is eliminated from the cluster.
   2. The sudden death of a node (master or otherwise) via 'kill -9'. 
   Rebalancing and election worked out very well here. 
   3. Stopping a node cleanly. Nothing odd here; it works every time and ZK 
   makes cluster state updates really fast.
   4. Adding new nodes. Again, quick cluster state updates via ZK.
   
The last scenario I am interested in is network partitions. In this case I 
am trying to sever the communication between two of the nodes and a third. 
I have been using iptables to DROP all inbound and outbound traffic between 
one of the 3 nodes in the test cluster and the other 2. I basically make 
four entries on the node I want to cut off.
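
The exact rules are not shown in this post. The sketch below is one guess at what the four entries could look like, assuming one INPUT and one OUTPUT DROP rule per peer. The helper name and the second peer address are hypothetical; 10.176.41.55 is taken from the log excerpts further down. It only prints the commands, it does not apply them.

```python
def partition_rules(peer_ips):
    """Build iptables commands that drop all traffic to and from each peer."""
    rules = []
    for ip in peer_ips:
        # Drop everything arriving from the peer...
        rules.append(["iptables", "-A", "INPUT", "-s", ip, "-j", "DROP"])
        # ...and everything this node tries to send to it.
        rules.append(["iptables", "-A", "OUTPUT", "-d", ip, "-j", "DROP"])
    return rules


# Assumption: two peers, so 2 peers x 2 directions = four entries.
# 10.0.0.2 is a placeholder address, not one from the real cluster.
for rule in partition_rules(["10.176.41.55", "10.0.0.2"]):
    print(" ".join(rule))  # print only; the commands would need root to apply
```

Replacing -A with -D in the same commands would remove the rules again and heal the partition.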

After doing so it takes a very long time for the node to finally be evicted 
from the cluster state. During this time a number of API methods stop 
working, including /_stats and /_nodes, and searches also time out on the 
node where communication was severed. The good news is no split brains; the 
bad news is that eviction of the bad node takes a looooong time.

Any help with explaining what is going on or how I can better test this 
sort of scenario is much appreciated.
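
One way to put a number on the eviction delay is to poll a healthy node's view of the cluster state and time how long the severed node stays listed. This is a rough sketch only. It assumes that GET /_cluster/state/nodes on the healthy node returns a top-level "nodes" object keyed by node ID and that the call keeps answering during the partition, neither of which has been verified on this setup. The function names and default values are made up for illustration.

```python
import json
import time
import urllib.request


def fetch_node_ids(base_url, timeout=5.0):
    """Return the node IDs in the cluster state as seen by the node at base_url."""
    url = base_url.rstrip("/") + "/_cluster/state/nodes"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        state = json.load(resp)
    return set(state.get("nodes", {}))


def seconds_until_evicted(node_id, fetch, poll_interval=1.0, max_wait=1800.0,
                          clock=time.monotonic, sleep=time.sleep):
    """Poll fetch() until node_id is gone; return elapsed seconds, or None on max_wait."""
    start = clock()
    while clock() - start < max_wait:
        try:
            if node_id not in fetch():
                return clock() - start
        except OSError:
            # Request failed or timed out; treat it as "no answer yet" and keep polling.
            pass
        sleep(poll_interval)
    return None


# Hypothetical usage, started right after the iptables rules go in
# (the host name is a placeholder, 9200 is the default HTTP port):
#   seconds_until_evicted("_xgiPJYmSuecN0--yDBmlg",
#                         lambda: fetch_node_ids("http://healthy-node:9200"))
```

The clock and sleep parameters are injectable so the loop can be exercised without a live cluster.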

cheers,
Rob


The following is some of the info on the exceptions I see when things 
finally start to time out. Until the node is removed from the cluster state 
everything works kinda wonky.

The exception on a node trying to talk to the iptable'd node looks like 
this:

[2014-04-28 21:15:36,488][WARN ][cluster.service          ] [Kukulcan] failed to reconnect to node [Harold "Happy" Hogan][_xgiPJYmSuecN0--yDBmlg][zookeeper-test-builders-us-west-1-i-c9f6b095][inet[/10.168.250.15:9300]]{availabilityzone=us-west-1b}
org.elasticsearch.transport.ConnectTransportException: [Harold "Happy" Hogan][inet[/10.168.250.15:9300]] connect_timeout[30s]
        at org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:773)
        at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:702)
        at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:670)
        at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:129)
        at org.elasticsearch.cluster.service.InternalClusterService$ReconnectToNodes.run(InternalClusterService.java:515)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)
Caused by: org.elasticsearch.common.netty.channel.ConnectTimeoutException: connection timed out: /10.168.250.15:9300
        at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.processConnectTimeout(NioClientBoss.java:137)
        at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:83)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
        at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
        at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
        at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)

I get these messages on the iptable'd node:

[2014-04-28 21:08:37,831][DEBUG][action.admin.indices.stats] [Harold "Happy" Hogan] [events-2014.04.27][0], node[Cin_0uRIQwubm585lpkYnQ], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@bc812]
org.elasticsearch.transport.NodeDisconnectedException: [Herr Kleiser][inet[/10.176.41.55:9300]][indices/stats/s] disconnected
[2014-04-28 21:08:37,832][DEBUG][action.admin.indices.stats] [Harold "Happy" Hogan] [events-2014.04.28][0], node[Cin_0uRIQwubm585lpkYnQ], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@7c82439e]
org.elasticsearch.transport.NodeDisconnectedException: [Herr Kleiser][inet[/10.176.41.55:9300]][indices/stats/s] disconnected
[2014-04-28 21:08:37,831][DEBUG][action.admin.indices.stats] [Harold "Happy" Hogan] [_river][0], node[Cin_0uRIQwubm585lpkYnQ], [R], s[STARTED]: Failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@5e30581e]
org.elasticsearch.transport.NodeDisconnectedException: [Herr Kleiser][inet[/10.176.41.55:9300]][indices/stats/s] disconnected
[2014-04-28 21:08:37,832][DEBUG][action.admin.indices.stats] [Harold "Happy" Hogan] [events-2014.04.27][2], node[Cin_0uRIQwubm585lpkYnQ], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@7c82439e]
org.elasticsearch.transport.NodeDisconnectedException: [Herr Kleiser][inet[/10.176.41.55:9300]][indices/stats/s] disconnected
[2014-04-28 21:08:37,833][DEBUG][action.admin.indices.stats] [Harold "Happy" Hogan] [events-2014.04.25][0], node[Cin_0uRIQwubm585lpkYnQ], [R], s[STARTED]: Failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@bc812]
org.elasticsearch.transport.NodeDisconnectedException: [Herr Kleiser][inet[/10.176.41.55:9300]][indices/stats/s] disconnected
[2014-04-28 21:08:37,833][DEBUG][action.admin.indices.stats] [Harold "Happy" Hogan] [events-2014.04.25][3], node[Cin_0uRIQwubm585lpkYnQ], [R], s[STARTED]: Failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@bc812]
org.elasticsearch.transport.NodeDisconnectedException: [Herr Kleiser][inet[/10.176.41.55:9300]][indices/stats/s] disconnected
[2014-04-28 21:08:37,833][DEBUG][action.admin.indices.stats] [Harold "Happy" Hogan] [events-2014.04.27][2], node[Cin_0uRIQwubm585lpkYnQ], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@5e30581e]
org.elasticsearch.transport.NodeDisconnectedException: [Herr Kleiser][inet[/10.176.41.55:9300]][indices/stats/s] disconnected
[2014-04-28 21:08:37,834][DEBUG][action.admin.indices.stats] [Harold "Happy" Hogan] [events-2014.04.26][1], node[Cin_0uRIQwubm585lpkYnQ], relocating [_xgiPJYmSuecN0--yDBmlg], [R], s[RELOCATING]: Failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@bc812]

and on one of the nodes trying to talk with the iptable'd node:

[2014-04-28 21:09:29,346][DEBUG][action.admin.cluster.node.info] [Herr Kleiser] failed to execute on node [_xgiPJYmSuecN0--yDBmlg]
org.elasticsearch.transport.NodeDisconnectedException: [Harold "Happy" Hogan][inet[/10.168.250.15:9300]][cluster/nodes/info/n] disconnected
[2014-04-28 21:09:29,349][DEBUG][action.admin.cluster.node.info] [Herr Kleiser] failed to execute on node [_xgiPJYmSuecN0--yDBmlg]
org.elasticsearch.transport.NodeDisconnectedException: [Harold "Happy" Hogan][inet[/10.168.250.15:9300]][cluster/nodes/info/n] disconnected
[2014-04-28 21:09:29,349][DEBUG][action.admin.cluster.node.info] [Herr Kleiser] failed to execute on node [_xgiPJYmSuecN0--yDBmlg]
org.elasticsearch.transport.NodeDisconnectedException: [Harold "Happy" Hogan][inet[/10.168.250.15:9300]][cluster/nodes/info/n] disconnected
[2014-04-28 21:09:29,351][DEBUG][action.admin.cluster.node.info] [Herr Kleiser] failed to execute on node [_xgiPJYmSuecN0--yDBmlg]
org.elasticsearch.transport.NodeDisconnectedException: [Harold "Happy" Hogan][inet[/10.168.250.15:9300]][cluster/nodes/info/n] disconnected
[2014-04-28 21:09:29,350][DEBUG][action.admin.cluster.node.info] [Herr Kleiser] failed to execute on node [_xgiPJYmSuecN0--yDBmlg]
org.elasticsearch.transport.NodeDisconnectedException: [Harold "Happy" Hogan][inet[/10.168.250.15:9300]][cluster/nodes/info/n] disconnected
[2014-04-28 21:09:29,349][DEBUG][action.admin.cluster.node.info] [Herr Kleiser] failed to execute on node [_xgiPJYmSuecN0--yDBmlg]
org.elasticsearch.transport.NodeDisconnectedException: [Harold "Happy" Hogan][inet[/10.168.250.15:9300]][cluster/nodes/info/n] disconnected
[2014-04-28 21:12:26,195][DEBUG][action.admin.cluster.node.info] [Herr Kleiser] failed to execute on node [_xgiPJYmSuecN0--yDBmlg]
org.elasticsearch.transport.SendRequestTransportException: [Harold "Happy" Hogan][inet[/10.168.250.15:9300]][cluster/nodes/info/n]
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:202)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.start(TransportNodesOperationAction.java:170)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.access$300(TransportNodesOperationAction.java:102)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction.doExecute(TransportNodesOperationAction.java:73)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction.doExecute(TransportNodesOperationAction.java:43)
        at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:63)
        at org.elasticsearch.client.node.NodeClusterAdminClient.execute(NodeClusterAdminClient.java:72)
        at org.elasticsearch.client.support.AbstractClusterAdminClient.nodesInfo(AbstractClusterAdminClient.java:183)
        at org.elasticsearch.rest.action.admin.cluster.node.info.RestNodesInfoAction.handleRequest(RestNodesInfoAction.java:105)
        at org.elasticsearch.rest.RestController.executeHandler(RestController.java:159)
        at org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:142)
        at org.elasticsearch.http.HttpServer.internalDispatchRequest(HttpServer.java:121)
        at org.elasticsearch.http.HttpServer$Dispatcher.dispatchRequest(HttpServer.java:83)
        at org.elasticsearch.http.netty.NettyHttpServerTransport.dispatchRequest(NettyHttpServerTransport.java:291)
        at org.elasticsearch.http.netty.HttpRequestHandler.messageReceived(HttpRequestHandler.java:43)
        at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
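
The timestamps in the excerpts above give a rough feel for how long things stay wonky. The sketch below parses the leading timestamp of two of the log lines (shortened here to their prefix) and prints the gap between the first disconnect logged on the iptable'd node and the later reconnect failure logged on Kukulcan, which comes to roughly seven minutes. The two lines come from different nodes, so clock skew is ignored, and the gap is not necessarily the eviction time itself. The names STAMP and log_time are made up for illustration.

```python
import re
from datetime import datetime

# Leading "[YYYY-MM-DD HH:MM:SS,mmm]" stamp as it appears in the log excerpts.
STAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\]")


def log_time(line):
    """Parse the leading timestamp of a log line, or return None if absent."""
    match = STAMP.match(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")


# Only the prefixes of the two log lines are reproduced; the rest is cut off here.
first_disconnect = log_time("[2014-04-28 21:08:37,831][DEBUG][action.admin.indices.stats]")
reconnect_failure = log_time("[2014-04-28 21:15:36,488][WARN ][cluster.service          ]")

# Gap in seconds between the two log entries.
print((reconnect_failure - first_disconnect).total_seconds())
```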

