Re: EC2 + ZooKeeper Disco: Tips on Simulating Cluster Failures

[email protected] Wed, 30 Apr 2014 00:11:29 -0700

Why did minimum master nodes not work for you? It effectively prevents split
brain except in one scenario: when the minimum number of master-eligible
nodes is set low, the nodes disagree with the current master about how many
nodes in the cluster are reachable, and all of those counts are still higher
than the minimum. So you could raise the minimum master nodes setting until
it fits the purpose.
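
As a rough illustration of "raise it until it fits", the usual rule of thumb
is a strict majority of the master-eligible nodes, floor(n/2) + 1. The sketch
below is not taken from Elasticsearch; both function names are made up for
this example, and the second check only covers the simple case of two
completely separate groups of nodes.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Smallest strict majority of the master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def two_groups_can_both_elect(master_eligible: int, minimum: int) -> bool:
    """True if two disjoint groups of nodes could each reach `minimum`.

    This models only a clean partition into two separate groups; it says
    nothing about nodes that merely disagree on who is reachable.
    """
    return 2 * minimum <= master_eligible


if __name__ == "__main__":
    for n in (3, 5):
        # Cluster sizes mentioned in the original question.
        print(n, "master-eligible nodes -> minimum", minimum_master_nodes(n))
```

With 3 master-eligible nodes and a minimum of 1, two separate groups can each
elect a master; with a minimum of 2 they cannot.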


Even better, it is sufficient to launch just 3 small non-data nodes that
are master eligible. All the other nodes can hold data, be as big as you
want, and be non-master eligible. With just 3 master eligible nodes you can
set minimum master nodes to 2 and let data nodes come and go as they want to.
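
A minimal sketch of that layout, assuming the stock zen discovery and the
Elasticsearch 1.x setting names node.master, node.data and
discovery.zen.minimum_master_nodes (whether the ZooKeeper plugin honours the
last one is not something this sketch claims). The helper names are
hypothetical; the script only prints the lines one might put in each node's
elasticsearch.yml.

```python
def node_settings(role: str, master_eligible: int = 3) -> dict:
    """Return illustrative elasticsearch.yml settings for one node role."""
    quorum = master_eligible // 2 + 1  # strict majority, e.g. 2 of 3
    if role == "master":
        master, data = True, False   # small, master eligible, holds no data
    elif role == "data":
        master, data = False, True   # big, holds data, never elected master
    else:
        raise ValueError(f"unknown role: {role!r}")
    return {
        "node.master": master,
        "node.data": data,
        "discovery.zen.minimum_master_nodes": quorum,
    }


def to_yaml_lines(settings: dict) -> list:
    """Render flat settings as 'key: value' lines (booleans lowercased)."""
    return [f"{key}: {str(value).lower()}" for key, value in settings.items()]


if __name__ == "__main__":
    for role in ("master", "data"):
        print(f"# {role} node")
        print("\n".join(to_yaml_lines(node_settings(role))))
```

The quorum depends only on the number of master-eligible nodes, so adding or
removing data nodes does not change it.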

And can you tell me more about the "loooong time" with ZooKeeper - is it
seconds, minutes, hours?

Just out of curiosity. I do not use EC2 because it is too unreliable and
slow for me, but it could also be ZooKeeper specific.

Jörg


On Wed, Apr 30, 2014 at 1:42 AM, Rob Ottaway <[email protected]> wrote:

> Hello,
>
> I am doing a bit of work to harden our Elasticsearch cluster
> configuration. We've had issues with split brain scenarios in the past. We
> have clusters with 3 or 5 nodes where any node can be handling index and
> search operations. We have used the minimum master nodes setting to try to
> alleviate the issues, but it did not work for us; we still got split brains.
>
> We are using ES 1.1.0 with the latest ZooKeeper plugin. I have a test
> cluster of 3 nodes running on m3.large instances w/ 500 GB EBS volumes. I
> have tested a few scenarios so far which have performed as expected:
>
>    1. Communication between a node and ZK going down. After a short
>    timeout (~30 seconds) the node is removed from the cluster.
>    2. The sudden death of a node (master or otherwise) via 'kill -9'.
>    Rebalancing and election worked out very well here.
>    3. Stopping a node cleanly. Nothing odd here; it works every time and
>    ZK makes cluster state updates really fast.
>    4. Adding new nodes. Again, quick cluster state updates via ZK.
>
> The last scenario I am interested in is network partitions. In this case I
> am trying to sever the communication between two of the nodes and a third.
> I have been using iptables to DROP all inbound and outbound traffic between
> one of the 3 nodes in the test cluster and the other 2. I basically make
> four entries on the node I want to cease communication with.
>
> After doing so it takes a very long time for the node to finally be
> evicted from cluster state. During this time a number of API methods
> stop working, including /_stats and /_nodes, and searches also time out
> on the node whose communication was severed. The good news is no split
> brains; the bad news is that eviction of the bad node takes a looooong time.
>
> Any help with explaining what is going on or how I can better test this
> sort of scenario is much appreciated.
>
> cheers,
> Rob
>
>
> The following is a bunch of the info on the exceptions I see when things
> start to time out finally. Until the node is removed from cluster state
> everything works kinda wonky.
>
> The exception on a node trying to talk to the iptable'd node looks like
> this:
>
> [2014-04-28 21:15:36,488][WARN ][cluster.service          ] [Kukulcan]
> failed to reconnect to node [Harold "Happy" Hogan][_xgiPJYmSuecN0--yDBmlg]
> [zookeeper-test-builders-us-west-1-i-c9f6b095][inet[/10.168.250.15:9300]]
> {availabilityzone=us-west-1b}
> org.elasticsearch.transport.ConnectTransportException: [Harold "Happy"
> Hogan][inet[/10.168.250.15:9300]] connect_timeout[30s]
>         at
> org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:773)
>         at
> org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:702)
>         at
> org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:670)
>         at
> org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:129)
>         at
> org.elasticsearch.cluster.service.InternalClusterService$ReconnectToNodes.run(InternalClusterService.java:515)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:724)
> Caused by: org.elasticsearch.common.netty.channel.ConnectTimeoutException:
> connection timed out: /10.168.250.15:9300
>         at
> org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.processConnectTimeout(NioClientBoss.java:137)
>         at
> org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:83)
>         at
> org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
>         at
> org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
>         at
> org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
>         at
> org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
>
> I get these messages on the iptable'd node:
>
> [2014-04-28 21:08:37,831][DEBUG][action.admin.indices.stats] [Harold
> "Happy" Hogan] [events-2014.04.27][0], node[Cin_0uRIQwubm585lpkYnQ], [P],
> s[STARTED]: Failed to execute
> [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@bc812]
> org.elasticsearch.transport.NodeDisconnectedException: [Herr
> Kleiser][inet[/10.176.41.55:9300]][indices/stats/s] disconnected
> [2014-04-28 21:08:37,832][DEBUG][action.admin.indices.stats] [Harold
> "Happy" Hogan] [events-2014.04.28][0], node[Cin_0uRIQwubm585lpkYnQ], [P],
> s[STARTED]: Failed to execute
> [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@7c82439e]
> org.elasticsearch.transport.NodeDisconnectedException: [Herr
> Kleiser][inet[/10.176.41.55:9300]][indices/stats/s] disconnected
> [2014-04-28 21:08:37,831][DEBUG][action.admin.indices.stats] [Harold
> "Happy" Hogan] [_river][0], node[Cin_0uRIQwubm585lpkYnQ], [R], s[STARTED]:
> Failed to execute
> [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@5e30581e]
> org.elasticsearch.transport.NodeDisconnectedException: [Herr
> Kleiser][inet[/10.176.41.55:9300]][indices/stats/s] disconnected
> [2014-04-28 21:08:37,832][DEBUG][action.admin.indices.stats] [Harold
> "Happy" Hogan] [events-2014.04.27][2], node[Cin_0uRIQwubm585lpkYnQ], [P],
> s[STARTED]: Failed to execute
> [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@7c82439e]
> org.elasticsearch.transport.NodeDisconnectedException: [Herr
> Kleiser][inet[/10.176.41.55:9300]][indices/stats/s] disconnected
> [2014-04-28 21:08:37,833][DEBUG][action.admin.indices.stats] [Harold
> "Happy" Hogan] [events-2014.04.25][0], node[Cin_0uRIQwubm585lpkYnQ], [R],
> s[STARTED]: Failed to execute
> [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@bc812]
> org.elasticsearch.transport.NodeDisconnectedException: [Herr
> Kleiser][inet[/10.176.41.55:9300]][indices/stats/s] disconnected
> [2014-04-28 21:08:37,833][DEBUG][action.admin.indices.stats] [Harold
> "Happy" Hogan] [events-2014.04.25][3], node[Cin_0uRIQwubm585lpkYnQ], [R],
> s[STARTED]: Failed to execute
> [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@bc812]
> org.elasticsearch.transport.NodeDisconnectedException: [Herr
> Kleiser][inet[/10.176.41.55:9300]][indices/stats/s] disconnected
> [2014-04-28 21:08:37,833][DEBUG][action.admin.indices.stats] [Harold
> "Happy" Hogan] [events-2014.04.27][2], node[Cin_0uRIQwubm585lpkYnQ], [P],
> s[STARTED]: Failed to execute
> [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@5e30581e]
> org.elasticsearch.transport.NodeDisconnectedException: [Herr
> Kleiser][inet[/10.176.41.55:9300]][indices/stats/s] disconnected
> [2014-04-28 21:08:37,834][DEBUG][action.admin.indices.stats] [Harold
> "Happy" Hogan] [events-2014.04.26][1], node[Cin_0uRIQwubm585lpkYnQ],
> relocating [_xgiPJYmSuecN0--yDBmlg], [R], s[RELOCATING]: Failed to execute
> [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@bc812]
>
> and on one of the nodes trying to talk with the iptable'd node:
>
> [2014-04-28 21:09:29,346][DEBUG][action.admin.cluster.node.info] [Herr
> Kleiser] failed to execute on node [_xgiPJYmSuecN0--yDBmlg]
> org.elasticsearch.transport.NodeDisconnectedException: [Harold "Happy"
> Hogan][inet[/10.168.250.15:9300]][cluster/nodes/info/n] disconnected
> [2014-04-28 21:09:29,349][DEBUG][action.admin.cluster.node.info] [Herr
> Kleiser] failed to execute on node [_xgiPJYmSuecN0--yDBmlg]
> org.elasticsearch.transport.NodeDisconnectedException: [Harold "Happy"
> Hogan][inet[/10.168.250.15:9300]][cluster/nodes/info/n] disconnected
> [2014-04-28 21:09:29,349][DEBUG][action.admin.cluster.node.info] [Herr
> Kleiser] failed to execute on node [_xgiPJYmSuecN0--yDBmlg]
> org.elasticsearch.transport.NodeDisconnectedException: [Harold "Happy"
> Hogan][inet[/10.168.250.15:9300]][cluster/nodes/info/n] disconnected
> [2014-04-28 21:09:29,351][DEBUG][action.admin.cluster.node.info] [Herr
> Kleiser] failed to execute on node [_xgiPJYmSuecN0--yDBmlg]
> org.elasticsearch.transport.NodeDisconnectedException: [Harold "Happy"
> Hogan][inet[/10.168.250.15:9300]][cluster/nodes/info/n] disconnected
> [2014-04-28 21:09:29,350][DEBUG][action.admin.cluster.node.info] [Herr
> Kleiser] failed to execute on node [_xgiPJYmSuecN0--yDBmlg]
> org.elasticsearch.transport.NodeDisconnectedException: [Harold "Happy"
> Hogan][inet[/10.168.250.15:9300]][cluster/nodes/info/n] disconnected
> [2014-04-28 21:09:29,349][DEBUG][action.admin.cluster.node.info] [Herr
> Kleiser] failed to execute on node [_xgiPJYmSuecN0--yDBmlg]
> org.elasticsearch.transport.NodeDisconnectedException: [Harold "Happy"
> Hogan][inet[/10.168.250.15:9300]][cluster/nodes/info/n] disconnected
> [2014-04-28 21:12:26,195][DEBUG][action.admin.cluster.node.info] [Herr
> Kleiser] failed to execute on node [_xgiPJYmSuecN0--yDBmlg]
> org.elasticsearch.transport.SendRequestTransportException: [Harold "Happy"
> Hogan][inet[/10.168.250.15:9300]][cluster/nodes/info/n]
>         at
> org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:202)
>         at
> org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.start(TransportNodesOperationAction.java:170)
>         at
> org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.access$300(TransportNodesOperationAction.java:102)
>         at
> org.elasticsearch.action.support.nodes.TransportNodesOperationAction.doExecute(TransportNodesOperationAction.java:73)
>         at
> org.elasticsearch.action.support.nodes.TransportNodesOperationAction.doExecute(TransportNodesOperationAction.java:43)
>         at
> org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:63)
>         at
> org.elasticsearch.client.node.NodeClusterAdminClient.execute(NodeClusterAdminClient.java:72)
>         at
> org.elasticsearch.client.support.AbstractClusterAdminClient.nodesInfo(AbstractClusterAdminClient.java:183)
>         at
> org.elasticsearch.rest.action.admin.cluster.node.info.RestNodesInfoAction.handleRequest(RestNodesInfoAction.java:105)
>         at
> org.elasticsearch.rest.RestController.executeHandler(RestController.java:159)
>         at
> org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:142)
>         at
> org.elasticsearch.http.HttpServer.internalDispatchRequest(HttpServer.java:121)
>         at
> org.elasticsearch.http.HttpServer$Dispatcher.dispatchRequest(HttpServer.java:83)
>         at
> org.elasticsearch.http.netty.NettyHttpServerTransport.dispatchRequest(NettyHttpServerTransport.java:291)
>         at
> org.elasticsearch.http.netty.HttpRequestHandler.messageReceived(HttpRequestHandler.java:43)
>         at
> org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
>         at
> org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>         at
> org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>
>
>  --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/82935f3d-2f01-42a9-afcb-5496e96daf42%40googlegroups.com<https://groups.google.com/d/msgid/elasticsearch/82935f3d-2f01-42a9-afcb-5496e96daf42%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>

