Just to re-affirm what Justin wrote there.

Auto downing is "auto". It's dumb. That's why it's not safe.
The safer automatic downing modes are documented at
doc.akka.io/docs/akka/rp-16s01p05/scala/split-brain-resolver.html
Yes, that's a commercial thing.
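
For reference, enabling it is mostly configuration. The fragment below is a sketch based on the docs of that era; the exact provider class and strategy names depend on your subscription version, so check the page linked above:

```hocon
# Use the commercial Split Brain Resolver as the cluster's downing provider
akka.cluster.downing-provider-class = "com.lightbend.akka.sbr.SplitBrainResolverProvider"

akka.cluster.split-brain-resolver {
  # strategy to apply when the cluster partitions,
  # e.g. keep-majority, static-quorum, keep-oldest
  active-strategy = keep-majority

  # how long membership must be stable before a downing decision is taken
  stable-after = 20s
}
```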

If you don't want to use these, use EC2's APIs - they expose instance-state
information from which you can decide when a node is really gone and should
be downed.

-- 
Konrad `ktoso` Malawski
Akka <http://akka.io> @ Lightbend <http://lightbend.com>

On 4 August 2016 at 04:00:34, Justin du coeur (jduco...@gmail.com) wrote:

The keyword here is "auto".  Autodowning is an *incredibly braindead*
algorithm for dealing with nodes coming out of service, and if you use it
in production you more or less guarantee disaster, because that algorithm
can't cope with cluster partitions.  You *do* need to deal with downing, but
you have to get something smarter than that.

Frankly, if you're already hooking into AWS, I *suspect* the best approach
is to leverage that -- when a node goes offline, you have some code to
detect that through the ECS APIs, react to it, and manually down that node.
 (I'm planning on something along those lines for my system, but haven't
actually tried yet.)  But whether you do that or something else, you've got
to add *something* that does downing.
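
The manual-downing part of that is small, for what it's worth.  The sketch
below uses the real Cluster(system).down(...) and cluster-event subscription
APIs; the isTerminated check is a placeholder you'd implement against the
ECS/EC2 APIs (I haven't battle-tested this, treat it as a starting point):

```scala
import akka.actor.{Actor, Address}
import akka.cluster.Cluster
import akka.cluster.ClusterEvent.UnreachableMember

// One of these per node: watches for unreachable members and downs them
// only after AWS confirms the instance is actually terminated.
class EcsDowner extends Actor {
  val cluster = Cluster(context.system)

  override def preStart(): Unit =
    cluster.subscribe(self, classOf[UnreachableMember])

  override def postStop(): Unit =
    cluster.unsubscribe(self)

  def receive = {
    case UnreachableMember(member) =>
      // Only down the node if AWS says the instance is gone -- an
      // unreachable-but-running node may just be partitioned from us.
      if (isTerminated(member.address))
        cluster.down(member.address)
  }

  // Hypothetical: implement with the AWS SDK (e.g. DescribeInstances /
  // DescribeTasks), mapping the member's host back to an instance/task.
  def isTerminated(address: Address): Boolean = ???
}
```

Note this still races if both sides of a partition decide to down each
other, which is exactly the class of problem the Split Brain Resolver
exists to handle.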

I believe the official party line is "Buy a Lightbend Subscription",
through which you can get their Split Brain Resolver, which is a fairly
battle-hardened module for dealing with this problem.  That's not strictly
necessary, but you *do* need to have a reliable solution...

On Wed, Aug 3, 2016 at 8:42 PM, Eric Swenson <e...@swenson.org> wrote:

> We have an akka-cluster/sharding application deployed on AWS/ECS, where
> each instance of the application is a Docker container.  An ECS service
> launches N instances of the application based on configuration data.  It is
> not possible to know, for certain, the IP addresses of the cluster
> members.  Upon startup, before the ActorSystem is created, the code
> currently polls AWS and determines the IP addresses of all the Docker hosts
> (which potentially could run the akka application).  It sets these IP
> addresses as the seed nodes before bringing up the akka cluster system. The
> configuration for these has, up until yesterday always included the
> akka.cluster.auto-down-unreachable-after configuration setting.  And it has
> always worked.  Furthermore, it supports two very critical requirements:
>
> a) an instance of the application can be removed at any time, due to
> scaling or rolling updates
> b) an instance of the application can be added at any time, due to scaling
> or rolling updates
>
> On the advice of an Akka expert on the Gitter channel, I removed the
> auto-down-unreachable-after setting, which, as documented, is dangerous for
> production.  As a result the system no longer supports rolling updates.  A
> rolling update occurs thus:  a new version of the application is deployed
> (a new ECS task definition is created with a new Docker image).  The ECS
> service launches a new task (Docker container running on an available host)
> and once that container becomes stable, it kills one of the remaining
> instances (cluster members) to bring the number of instances to some
> configured value.
>
> When this happens, akka-cluster becomes very unhappy and becomes
> unresponsive.  Without the auto-down-unreachable-after setting, it keeps
> trying to talk to the old cluster member, which is no longer present.  It
> appears NOT to recover from this.  There is a constant barrage of messages
> of the form:
>
> [DEBUG] [08/04/2016 00:19:27.126]
> [ClusterSystem-cassandra-plugin-default-dispatcher-27]
> [akka.actor.LocalActorRefProvider(akka://ClusterSystem)] resolve of path
> sequence [/system/sharding/ExperimentInstance#-389574371] failed
>
> (the same entry repeats continuously)
>
> and of the form:
>
> [WARN] [08/04/2016 00:19:16.787]
> [ClusterSystem-akka.actor.default-dispatcher-9] [akka.tcp://
> ClusterSystem@10.0.3.103:2552/system/sharding/ExperimentInstance] Retry
> request for shard [5] homes from coordinator at [Actor[akka.tcp://
> ClusterSystem@10.0.3.100:2552/system/sharding/ExperimentInstanceCoordinator/singleton/coordinator#1679517511]].
> [1] buffered messages.
>
> (the same warning repeats for shards [23], [1], [14], [5], ...)
>
> and then a message like this:
>
> [WARN] [08/03/2016 23:50:34.690]
> [ClusterSystem-akka.remote.default-remote-dispatcher-11] [akka.tcp://
> ClusterSystem@10.0.3.103:2552/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FClusterSystem%4010.0.3.100%3A2552-0]
> Association with remote system [akka.tcp://ClusterSystem@10.0.3.100:2552]
> has failed, address is now gated for [5000] ms. Reason: [Association failed
> with [akka.tcp://ClusterSystem@10.0.3.100:2552]] Caused by: [Connection
> refused: /10.0.3.100:2552]
>
> The 10.0.3.100 host is the one that was taken out of service.  The
> 10.0.3.103 host is the remaining instance (to narrow down the issue, we
> have one node, to which a second was added and then the first removed).
>
> But the fact remains that no messages to the sharding region are handled
> -- they all timeout like this:
>
> [ERROR] [08/03/2016 23:50:35.821]
> [ClusterSystem-akka.actor.default-dispatcher-6]
> [akka.actor.ActorSystemImpl(ClusterSystem)] Error during processing of
> request HttpRequest(HttpMethod(POST),
> http://eim.dev.genecloud.com/eim/v1/experimentinstance,List(Host:
> eim.dev.genecloud.com, X-Real-Ip: 10.0.3.157, X-Forwarded-For:
> 10.0.3.157, Connection: upgrade, Accept-Encoding: gzip, deflate, Accept:
> */*, User-Agent: python-requests/2.10.0, Accept-Type: application/json,
> Authorization: Bearer 9wDoBBnbzD7XHmWt8Qk-OaanGlaHPzul8PQrnzPrwW4,
> Timeout-Access:
> <function1>),HttpEntity.Strict(application/json,{}),HttpProtocol(HTTP/1.1))
> akka.pattern.AskTimeoutException: Ask timed out on
> [Actor[akka://ClusterSystem/system/sharding/ExperimentInstance#-2107641834]]
> after [10000 ms]. Sender[null] sent message of type
> "com.genecloud.eim.ExperimentInstance$Commands$NewExperimentInstanceV1".
> at
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
> at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
> at
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> at
> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> at
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> at
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:331)
> at
> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:282)
> at
> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:286)
> at
> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:238)
> at java.lang.Thread.run(Thread.java:745)
>
> And even though the cluster of one node (10.0.3.103) should be perfectly
> able to handle messages from akka-http, it doesn't.  They all timeout.  And
> the barrage of unhappy messages continues forever.  It never recovers and
> the service cannot handle any akka-http requests that send messages to the
> cluster-sharding region.
>
> I "fixed" everything, by simply adding back
> the auto-down-unreachable-after parameter.  Now, when a new node comes into
> the cluster, it is recognized and used, and when an old node goes away,
> after the configured time period it is removed from the cluster and
> everything is happy again.
>
> What is the recommended way to deal with cluster nodes being added and
> removed outside the control of the application?
>
> -- Eric
>
> --
> >>>>>>>>>> Read the docs: http://akka.io/docs/
> >>>>>>>>>> Check the FAQ:
> http://doc.akka.io/docs/akka/current/additional/faq.html
> >>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
> ---
> You received this message because you are subscribed to the Google Groups
> "Akka User List" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to akka-user+unsubscr...@googlegroups.com.
> To post to this group, send email to akka-user@googlegroups.com.
> Visit this group at https://groups.google.com/group/akka-user.
> For more options, visit https://groups.google.com/d/optout.
>

