In my organisation we had quite a number of problems with cluster
quarantine behaviour, both in AWS and on our own local PCs (we launched a
cluster of five actor systems during tests so that they emulated our
intended production environment as closely as possible).
What we found was that quarantine happened when CPU usage on the box was
very high. In essence the JVM was suffering from high CPU, and for whatever
reason the threads which would normally be communicating back and forth in
the cluster did not get a chance to run at the interval they would need to
in order to avoid quarantine.
We also found that this occurred if there was a particularly long
stop-the-world garbage collection pause.
Finally, another cause can be starving the default fork-join thread pool
(for instance by using Java 8 parallel streams), or starving the general
Akka thread pool by consuming threads in async processes without defining
dispatchers appropriately.
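To illustrate the parallel-streams point with a plain-JDK sketch (not our actual code): Java 8 parallel streams run on the JVM-wide `ForkJoinPool.commonPool()`, so blocking inside one ties up threads that every other library in the process may also be sharing.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.stream.IntStream;

public class CommonPoolDemo {
    public static void main(String[] args) {
        // Parallel streams share the single JVM-wide common pool; blocking
        // calls inside them starve everything else that uses that pool.
        int parallelism = ForkJoinPool.commonPool().getParallelism();
        long sum = IntStream.rangeClosed(1, 100).parallel().sum();
        System.out.println("commonPool parallelism=" + parallelism + " sum=" + sum);
    }
}
```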
What helped us identify this issue was:
1) Historic CPU and JVM garbage collection information [leave JConsole or a
similar free JMX monitoring tool connected to your prod cluster overnight -
chances are JMX will disconnect at the same time as the quarantine happens,
which means the useful info is right at the end of the graph when you come
in the next day :)]
2) Lots of logging on various processes of the form timestamp|what job you
are engaged in|what part of the process you are in [Start/Middle/End etc].
We then processed this information into graphs and could see that some
processes had lots of work available but were not actually processing
[thread starvation]. We solved this by using Akka streams more and
specifying the dispatcher explicitly wherever we felt possible.
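A minimal sketch of the log-line shape described above (names are made up for illustration; our real logging was richer than this):

```java
import java.time.Instant;

public class PhaseLog {
    // Emits lines of the form timestamp|job|phase, which can later be
    // parsed into graphs of work-available vs. work-actually-processing.
    static String line(String job, String phase) {
        return Instant.now() + "|" + job + "|" + phase;
    }

    public static void main(String[] args) {
        System.out.println(line("nightly-reconcile", "Start"));
        System.out.println(line("nightly-reconcile", "Middle"));
        System.out.println(line("nightly-reconcile", "End"));
    }
}
```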
3) Ensuring that all Futures were handled with defined dispatchers when
doing things such as mapAsync or the like. At least then, if starvation
were to occur, it would be isolated and easier to identify, and wouldn't
compromise the default features of the cluster.
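The same isolation idea expressed in plain-JDK terms (a hedged analogue, not the Akka API itself): pass an explicit executor to async work instead of relying on the shared default pool, so that if that pool is exhausted the damage stays contained.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class IsolatedAsync {
    public static void main(String[] args) {
        // Dedicated pool for potentially-blocking work, analogous to
        // configuring an explicit Akka dispatcher for mapAsync stages.
        ExecutorService blockingPool = Executors.newFixedThreadPool(4);
        CompletableFuture<Integer> result =
            CompletableFuture.supplyAsync(() -> {
                // stand-in for a slow back-end call
                return 21;
            }, blockingPool)
            .thenApplyAsync(x -> x * 2, blockingPool);
        System.out.println(result.join()); // prints 42
        blockingPool.shutdown();
    }
}
```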
Finally - you can mess around with the transport-failure-detector to set
acceptable pauses and heartbeat intervals [I'm no expert so this may not be
directly related to quarantine, but I believe it helped us]:
akka.remote.transport-failure-detector {
  acceptable-heartbeat-pause = 1 s
  heartbeat-interval = 200 ms
}
We did feel, however, that quarantined services were just a symptom of
another hard-to-debug issue, and config like this just makes the symptom
appear less often (great if you want production to work for the next day or
two, but bad next week when it properly collapses anyway because the root
cause got worse).
I would also 100% recommend not using auto-downing. Coming up with a good
strategy for downing can either be quite easy (Can I reach the 'leader
node' [the node in my cluster which claims to have the earliest startup
time]? Yes -> great, I'm part of the cluster! No -> I should really suicide
myself, as I'm part of a split brain on the bad side of a network
partition) or incredibly difficult if you are in an auto-scaling
environment like AWS and you wish to ensure that the majority side of a
network partition always survives and everyone doesn't decide to suicide
themselves.
You can subscribe to events such as MemberUp or MemberDown to then initiate
your detection/suicide strategy quite easily :)
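The easy variant of that strategy can be sketched with plain data (hypothetical names, no real Akka types): on each membership change, check whether you can still reach the node with the earliest startup time, and down yourself if you can't.

```java
import java.util.Set;

public class DowningDecision {
    /**
     * Decide whether this node should stay up after a reachability change.
     * reachable: addresses this node can currently reach, including itself.
     * oldest: the member claiming the earliest startup time (the 'leader node').
     */
    static boolean shouldStayUp(Set<String> reachable, String oldest) {
        return reachable.contains(oldest);
    }

    public static void main(String[] args) {
        // Good side of the partition: we can still see the oldest node.
        System.out.println(shouldStayUp(Set.of("a", "b", "c"), "a")); // true
        // Bad side: cut off from the oldest node, so we down ourselves.
        System.out.println(shouldStayUp(Set.of("b", "c"), "a"));      // false
    }
}
```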
[My team had a lot of fun setting up meeting rooms with bookings for
'Suicide Pact Discussion' but it wasn't nearly as fun as the meetings on
what we do when orphaned children in actor trees need to be sent Poison
commands].
On Wednesday, 4 January 2017 17:02:55 UTC, Tyler Brummett wrote:
>
> Thank you for that suggestion. We are stuck on 2.3.x for the time being,
> but have plans to move to 2.4.x in the not so distant future, but it would
> still be beneficial to solve this problem now.
>
> We've upgraded our Akka version to 2.3.15 and it seems that the quarantine
> messages are no longer showing up in our logs, which is great!
> However, some of the acknowledgement messages that we normally get during
> one of our nightly processes are not making it back to the requesting
> actor. This behavior is different since the upgrade happened and seems to
> be the only thing preventing us from moving forward.
>
> Context:
> A request message is sent from one actor in actorSystemA to another actor
> in actorSystemB. The actor in actorSystemB delegates work to be done by
> some other back end process. Said back end process may take a few minutes
> depending on the size of the data result coming back. Therefore, we have
> 'hand-rolled' a solution to use a private ActorRef on the responding actor
> in actorSystemB to periodically send 'update' messages every few seconds to
> let the requesting actor in actorSystemA know that the back end process is
> still processing its request.
>
> In every case, before the upgrade, we were always getting these 'update'
> messages or acknowledgements that the backend process is still working. Now
> it seems to fail on simple .tell() back to the requesting actor in
> actorSystemA. The line of code below shows where it seems to fail in the
> actor of actorSystemB:
>
> *originalSender.tell(updateMessage, getSelf());*
> where *originalSender* is the ActorRef that is set to *getSender()* when
> the requesting actor sends a message to the actor in actorSystemB (where
> the above code lives). During this nightly process, we do this many times
> with different parameters and it works most of the time. How could it work
> most of the time, but not all, after upgrading?
>
> Nothing in the logs seems to indicate why this might have failed, but we
> did find that some of the nodes in the cluster system became unreachable
> for a little while, then reachable again. This is consistent with how often
> the nodes went into quarantine before. We just aren't sure why it is
> happening and were wondering if we could attribute our lost acknowledgement
> messages to this:
>
> ===
> 01/04/2017 06:44:44,840 INFO
> [AppClusterSystem-akka.actor.default-dispatcher-2]
> Cluster(akka://AppClusterSystem) - Cluster Node
> [akka.tcp://[email protected]:12345] - Ignoring received
> gossip from unreachable
> [UniqueAddress(akka.tcp://[email protected]:12345,-1669472141)]
> 01/04/2017 06:44:45,355 INFO
> [AppClusterSystem-akka.actor.default-dispatcher-2]
> Cluster(akka://AppClusterSystem) - Cluster Node
> [akka.tcp://[email protected]:12345] - Marking node(s) as
> REACHABLE [Member(address = akka.tcp://[email protected]:12345,
> status = Up)]
> ===
>
> Let me know if I have not provided enough context or information to
> determine what our problem could be.
> Thanks again!
>
>
> On Saturday, December 10, 2016 at 2:12:28 AM UTC-6, Patrik Nordwall wrote:
>>
>> First step is to use latest version. Preferably 2.4.14, but if you are
>> stuck on 2.3.x it is 2.3.15. Updating to 2.4.x should be fairly easy, see
>> migration guide in docs.
>>
>> You need a version with this fix
>> https://github.com/akka/akka/issues/13909 and there are many other bug
>> fixes since 2.3.11
>>
>> /Patrik
>>
>> fre 9 dec. 2016 kl. 22:58 skrev Justin du coeur <[email protected]>:
>>
>>> Hmm. I'm not sufficiently expert in large-cluster behavior to guess
>>> about the problem, but note that you should *never* use
>>> auto-down-unreachable-after in production code. (I actually don't even
>>> recommend it in test code.) While I don't *think* it causes the problem
>>> you're describing, it can cause much more severe "split-brain" issues that
>>> can lead to data corruption. You're going to need to come up with a more
>>> nuanced approach to the problem of downing; I recommend reading the
>>> documentation sections on Downing
>>> <http://doc.akka.io/docs/akka/2.4.14/scala/cluster-usage.html#Downing>
>>> and Split Brain
>>> <http://doc.akka.io/docs/akka/akka-commercial-addons-1.0/scala/split-brain-resolver.html>
>>>
>>> -- it's important to get this stuff right to have a stable environment.
>>>
>>> On Fri, Dec 9, 2016 at 3:44 PM, Tyler Brummett <[email protected]>
>>> wrote:
>>>
>>>> Hey Akka experts, I need your help! Currently my company is using Akka
>>>> as a part of a partial CQRS pattern. We have service adapters that consume
>>>> source system events in the form of JMS messages, while producing commands
>>>> to be asynchronously distributed to our command service. Our command
>>>> service consumes all of these messages asynchronously based on a given
>>>> group ID, so that no two commands with the same group ID are being
>>>> processed at the same time.
>>>>
>>>> We have designed an approach that allows us to have each deployable
>>>> component in its own cluster and use a clusterClient to talk across
>>>> clusters. Below is another diagram illustrating the service architecture
>>>> with the Akka configuration reflecting separate clusters.
>>>>
>>>> [diagram]
>>>> (see attached please)
>>>>
>>>> Errors we are seeing on appbox01: UI sends commands to command service
>>>> 11/11/2016 09:48:46,056 INFO
>>>> [AppClusterSystem-akka.actor.default-dispatcher-29] CommandHandlerActor -
>>>> received master ack.
>>>> 11/11/2016 09:48:52,045 INFO
>>>> [AppClusterSystem-akka.actor.default-dispatcher-35] CommandHandlerActor
>>>> work timeout. For commandX
>>>> 11/11/2016 09:48:52,046 ERROR [tomcat-http--33] AppController - X
>>>> update failed
>>>> com.company.appA.package.AkkaWorkFailedException: Timeout for X
>>>>
>>>> Errors we are seeing on servicebox01: UI sends commands to command
>>>> service
>>>> 11/11/2016 09:48:46,715 WARN
>>>> [CommandClusterSystem-akka.actor.default-dispatcher-2]
>>>> ClusterStatusListenerActor - Problem has occurred associating local host:
>>>> servicebox01.company.com and remote host: appbox01.company.com
>>>> 11/11/2016 09:48:46,716 WARN
>>>> [CommandClusterSystem-akka.actor.default-dispatcher-2]
>>>> ClusterStatusListenerActor - Problem has occurred associating local host:
>>>> servicebox01.company.com and remote host: appbox01.company.com
>>>> 11/11/2016 09:48:46,716 WARN
>>>> [CommandClusterSystem-akka.actor.default-dispatcher-2] Remoting - Tried to
>>>> associate with unreachable remote address [akka.tcp://
>>>> [email protected]:12345]. Address is now gated for
>>>> 5000 ms, all messages to this address will be delivered to dead letters.
>>>> Reason: [The remote system has quarantined this system. No further
>>>> associations to the remote system are possible until this system is
>>>> restarted.]
>>>>
>>>>
>>>> We are interested in seeing this new implementation through and finding
>>>> solutions where we can decouple our services and apps from one another as
>>>> we move towards a micro-service architecture. So if you have
>>>> suggestions/solutions, we are all ears!
>>>>
>>>> So the main question is: why are our nodes being quarantined? We have
>>>> restarted nodes and stabilized the environment over and over, but the
>>>> quarantine problem resurfaces after a few hours. Typically it's in a bad
>>>> state by the next day. As part of this post I have provided our typical
>>>> application.conf file for a given service, which corresponds with our new
>>>> "separate cluster" implementation (diagram). Hopefully someone out there
>>>> can help us shed some light on this problem. Please see the
>>>> application.conf below.
>>>>
>>>> Thanks!
>>>>
>>>> =====================
>>>> application.conf
>>>> =====================
>>>>
>>>> # bulkhead workers
>>>> my-worker-exec-dispatcher {
>>>>   type = Dispatcher
>>>>   executor = "fork-join-executor"
>>>>   fork-join-executor {
>>>>     parallelism-min = 2
>>>>     parallelism-factor = 2.0
>>>>     parallelism-max = 10
>>>>   }
>>>>   throughput = 1
>>>> }
>>>>
>>>> # dedicate resources to the master actor
>>>> my-master-dispatcher {
>>>>   type = Dispatcher
>>>>   executor = "fork-join-executor"
>>>>   fork-join-executor {
>>>>     parallelism-min = 2
>>>>     parallelism-factor = 2.0
>>>>     parallelism-max = 10
>>>>   }
>>>>   throughput = 20
>>>> }
>>>>
>>>> akka {
>>>> loggers = ["akka.event.slf4j.Slf4jLogger"]
>>>> loglevel = "INFO"
>>>> stdout-loglevel = "OFF"
>>>>
>>>> actor.provider = "akka.cluster.ClusterActorRefProvider"
>>>>
>>>> # Log the complete configuration at INFO level when the actor system
>>>> is started.
>>>> # This is useful when you are uncertain of what configuration is used.
>>>> log-config-on-start = off
>>>>
>>>> remote {
>>>> log-remote-lifecycle-events = off
>>>>
>>>> # If this is "on", Akka will log all outbound messages at DEBUG level,
>>>> # if off then they are not logged
>>>> log-sent-messages = off
>>>> # If this is "on", Akka will log all inbound messages at DEBUG level,
>>>> # if off then they are not logged
>>>> log-received-messages = off
>>>> netty.tcp {
>>>> # hostname is injected programmatically in AppConfiguration.
>>>> port = ${akka.node.port}
>>>> send-buffer-size = 10240000b
>>>> receive-buffer-size = 10240000b
>>>> maximum-frame-size = 5120000b
>>>> }
>>>> }
>>>>
>>>> contrib {
>>>> cluster {
>>>> pub-sub {
>>>> # How often the DistributedPubSubMediator should send out gossip
>>>> information
>>>> gossip-interval = 5s
>>>> }
>>>> }
>>>> }
>>>>
>>>> cluster {
>>>> # seed-nodes is injected programmatically
>>>> # seed-nodes = [${akka.seed.nodes}]
>>>> # 30 minute auto down for a crashed master
>>>> # a long network outage requires restarting the cluster after 30
>>>> minutes
>>>> auto-down-unreachable-after = 1800s
>>>> roles = [${akka.cluster.roles}]
>>>> }
>>>>
>>>> actor {
>>>> bounded-mailbox {
>>>> mailbox-type = "akka.dispatch.BoundedMailbox"
>>>> mailbox-capacity = 3000
>>>> mailbox-push-timeout-time = 100ms
>>>> }
>>>>
>>>> debug {
>>>> # enable function of LoggingReceive, which is to log any received
>>>> message at
>>>> # DEBUG level
>>>> receive = off
>>>> # enable DEBUG logging of all AutoReceiveMessages (Kill, PoisonPill
>>>> et.c.)
>>>> autoreceive = off
>>>> # enable DEBUG logging of actor lifecycle changes
>>>> lifecycle = off
>>>> # enable DEBUG logging of all LoggingFSMs for events, transitions and
>>>> timers
>>>> fsm = off
>>>> # enable DEBUG logging of subscription changes on the eventStream
>>>> event-stream = off
>>>> }
>>>> }
>>>> }
>>>>
>>>> akka.extensions = ["akka.contrib.pattern.ClusterReceptionistExtension"]
>>>>
>>>> akka.contrib.cluster.receptionist {
>>>> name = receptionist
>>>> number-of-contacts = 3
>>>> response-tunnel-receive-timeout = 30s
>>>> }
>>>>
>>>> akka.cluster.client {
>>>> heartbeat-interval = 2s
>>>> acceptable-heartbeat-pause = 10s
>>>> buffer = 0
>>>> }
>>>>
>>>> --
>>>> >>>>>>>>>> Read the docs: http://akka.io/docs/
>>>> >>>>>>>>>> Check the FAQ:
>>>> http://doc.akka.io/docs/akka/current/additional/faq.html
>>>> >>>>>>>>>> Search the archives:
>>>> https://groups.google.com/group/akka-user
>>>> ---
>>>> You received this message because you are subscribed to the Google
>>>> Groups "Akka User List" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at https://groups.google.com/group/akka-user.
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>