Thanks for sharing great advice, Daniel.

I would not change the transport-failure-detector to 1 second. The purpose
of this failure detector is to find broken TCP connections and restart
them, and TCP handles this by itself in most cases. I would guess 1 second
could result in a lot of false positives, causing unnecessary message loss.
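For comparison, the detector settings that ship with Akka are far looser than the values quoted further down in this thread; the exact numbers vary by version, so treat this fragment as illustrative only and check the reference.conf for your release:

```
# Illustrative values only -- consult reference.conf for your Akka version.
# A loose transport failure detector tolerates GC pauses and CPU spikes
# without tearing down healthy TCP connections.
akka.remote.transport-failure-detector {
  heartbeat-interval = 4 s
  acceptable-heartbeat-pause = 120 s
}
```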

Apart from that I agree with all you said.

/Patrik

On Thu, Jan 5, 2017 at 10:34 AM, Daniel Stoner <[email protected]>
wrote:

> In my organisation we had quite a number of problems with Cluster
> quarantine behaviour, both in AWS and on our own local PCs (we launched a
> cluster of five actor systems during tests so that they emulated our
> intended production environment as closely as possible).
>
> What we found was that quarantining happened when CPU usage on the box was
> very high. In essence the JVM was suffering from CPU starvation, and the
> threads that would normally exchange cluster heartbeats did not get a
> chance to run at the interval required to avoid quarantine.
>
> We also found that this occurred if there was a particularly long
> stop-the-world garbage collection pause.
>
> Finally, another cause can be starving the default fork-join thread pool
> (for instance by using Java 8 parallel streams), or starving the general
> Akka thread pool by consuming threads in async processes without defining
> dispatchers appropriately.
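As a sketch of that last point, the usual fix is to declare a dedicated dispatcher in configuration and run the thread-hungry work there. The dispatcher name below is made up for illustration:

```
# application.conf -- a bulkheaded pool for blocking or thread-hungry work;
# "blocking-io-dispatcher" is an example name, not a built-in setting.
blocking-io-dispatcher {
  type = Dispatcher
  executor = "thread-pool-executor"
  thread-pool-executor {
    fixed-pool-size = 16
  }
  throughput = 1
}
```

Work submitted via `system.dispatchers().lookup("blocking-io-dispatcher")` then competes only with itself, leaving the default fork-join pool free for cluster heartbeats.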
>
> What helped us identify this issue was:
> 1) Historic CPU and JVM garbage collection information. [Leave JConsole or
> a similar free JMX monitoring tool connected to your prod cluster overnight
> - chances are JMX will disconnect at the same time as the quarantine
> happens, which means the useful info is right at the end of the graph when
> you come in the next day :)]
>
> 2) Lots of logging on various processes of the form: timestamp | what job
> you are engaged in | which part of the process you are in [Start/Middle/End
> etc.]. We then processed this information into graphs and saw that some
> processes had lots of work available but were not actually processing it
> [thread starvation]. We solved this by using Akka Streams more and
> specifying the dispatcher explicitly wherever we felt possible.
>
> 3) Ensuring that all Futures were handled with defined dispatchers when
> doing things such as mapAsync or the like. At least then, if starvation
> were to occur, it would be isolated and easier to identify, and wouldn't
> compromise the default features of the cluster.
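That isolation can be seen with plain java.util.concurrent standing in for a named Akka dispatcher. This is a sketch, not the poster's actual code; with Akka you would pass the result of `system.dispatchers().lookup(...)` (an Executor) in place of the local pool:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class IsolatedFutures {
    // Runs a stand-in "slow backend call" on a dedicated pool instead of
    // the shared default pool, so starvation there cannot stall anything
    // else. With Akka you would pass
    // system.dispatchers().lookup("my-dispatcher") (a hypothetical
    // dispatcher name) as the executor argument.
    static int callBackendIsolated() throws Exception {
        ExecutorService blockingPool = Executors.newFixedThreadPool(4);
        try {
            CompletableFuture<Integer> result =
                CompletableFuture.supplyAsync(() -> 21 * 2, blockingPool);
            return result.get();
        } finally {
            blockingPool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(callBackendIsolated());  // prints 42
    }
}
```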
>
> Finally - you can tune the transport-failure-detector to set acceptable
> pauses and heartbeat intervals [I'm no expert, so this may not be directly
> related to quarantine, but I believe it helped us]:
> akka.remote.transport-failure-detector {
>    acceptable-heartbeat-pause = 1 s
>    heartbeat-interval = 200 ms
> }
>
> We did feel, however, that quarantined services were just a symptom of
> another hard-to-debug issue, and config like this just makes that symptom
> appear less often (great if you want production to work for the next day
> or two, but bad next week when it properly collapses anyway because the
> root cause got worse).
>
> And I would also 100% recommend not using auto-downing. Coming up with a
> good strategy for downing can be quite easy (can I reach the 'leader
> node', i.e. the node in my cluster that claims the earliest startup time?
> Yes -> great, I'm part of the cluster! No -> I should really suicide
> myself, as I'm on the bad side of a network partition), or incredibly
> difficult if you are in an auto-scaling environment like AWS and you wish
> to ensure that the majority side of a network partition always survives
> and that the nodes don't all decide to suicide themselves.
> You can subscribe to events such as MemberUp or MemberDown to then
> initiate your detection/suicide strategy quite easily :)
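The majority-side strategy described above reduces to a small pure function: a node stays up only if it can still reach a strict majority of the last known membership. This is a sketch under the assumption that all nodes agree on the last cluster size; real resolvers must also handle ties and membership changes:

```java
import java.util.Set;

public final class PartitionDecision {
    /**
     * Decide whether this node should stay up after a partition.
     * totalMembers is the last agreed cluster size (including self);
     * reachable is the set of members this node can still contact
     * (including itself). Staying up requires a strict majority, so
     * at most one side of any partition can survive; on an even
     * split, both sides down themselves.
     */
    public static boolean shouldStayUp(int totalMembers, Set<String> reachable) {
        return reachable.size() * 2 > totalMembers;
    }

    public static void main(String[] args) {
        // 5-node cluster split 3/2: the 3-side survives, the 2-side downs itself.
        System.out.println(shouldStayUp(5, Set.of("a", "b", "c")));  // true
        System.out.println(shouldStayUp(5, Set.of("d", "e")));       // false
    }
}
```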
>
> [My team had a lot of fun setting up meeting rooms with bookings for
> 'Suicide Pact Discussion' but it wasn't nearly as fun as the meetings on
> what we do when orphaned children in actor trees need to be sent Poison
> commands].
>
> On Wednesday, 4 January 2017 17:02:55 UTC, Tyler Brummett wrote:
>>
>> Thank you for that suggestion. We are stuck on 2.3.x for the time being
>> and have plans to move to 2.4.x in the not-so-distant future, but it would
>> still be beneficial to solve this problem now.
>>
>> We've upgraded our Akka version to 2.3.15, and it seems that the
>> quarantine messages are no longer showing up in our logs. Which is great!
>> However, some of the acknowledgement messages that we normally get during
>> one of our nightly processes are not making it back to the requesting
>> actor. This behavior started after the upgrade and seems to be the only
>> thing preventing us from moving forward.
>>
>> Context:
>> A request message is sent from one actor in actorSystemA to another actor
>> in actorSystemB. The actor in actorSystemB delegates work to be done by
>> some other back end process. Said back end process may take a few minutes
>> depending on the size of the data result coming back. Therefore, we have
>> 'hand-rolled' a solution to use a private ActorRef on the responding actor
>> in actorSystemB to periodically send 'update' messages every few seconds to
>> let the requesting actor in actorSystemA know that the back end process is
>> still processing its request.
>>
>> In every case before the upgrade, we always got these 'update'
>> messages acknowledging that the backend process was still working. Now
>> it seems to fail on a simple .tell() back to the requesting actor in
>> actorSystemA. The line of code below shows where it seems to fail in the
>> actor of actorSystemB:
>>
>> originalSender.tell(updateMessage, getSelf());
>>
>> where originalSender is the ActorRef captured from getSender() when the
>> requesting actor sends a message to the actor in actorSystemB (where the
>> above code lives). During this nightly process, we do this many times
>> with different parameters and it works most of the time. How could it work
>> most of the time, but not all, after upgrading?
>>
>> Nothing in the logs seems to indicate why this might have failed, but we
>> did find that some of the nodes in the cluster system became unreachable
>> for a little while, then reachable again. This is consistent with how often
>> the nodes went into quarantine before. We just aren't sure why it is
>> happening and were wondering if we could attribute our lost acknowledgement
>> messages to this:
>>
>> ===
>> 01/04/2017 06:44:44,840  INFO
>> [AppClusterSystem-akka.actor.default-dispatcher-2]
>> Cluster(akka://AppClusterSystem) - Cluster Node
>> [akka.tcp://[email protected]:12345] - Ignoring received
>> gossip from unreachable
>> [UniqueAddress(akka.tcp://[email protected]:12345,-1669472141)]
>> 01/04/2017 06:44:45,355  INFO
>> [AppClusterSystem-akka.actor.default-dispatcher-2]
>> Cluster(akka://AppClusterSystem) - Cluster Node
>> [akka.tcp://[email protected]:12345] - Marking node(s) as
>> REACHABLE [Member(address =
>> akka.tcp://[email protected]:12345, status = Up)]
>> ===
>>
>> Let me know if I have not provided enough context or information to
>> determine what our problem could be.
>> Thanks again!
>>
>>
>> On Saturday, December 10, 2016 at 2:12:28 AM UTC-6, Patrik Nordwall wrote:
>>>
>>> First step is to use latest version. Preferably 2.4.14, but if you are
>>> stuck on 2.3.x it is 2.3.15. Updating to 2.4.x should be fairly easy, see
>>> migration guide in docs.
>>>
>>> You need a version with this fix: https://github.com/akka/akka/issues/13909.
>>> There are many other bug fixes since 2.3.11 as well.
>>>
>>> /Patrik
>>>
>>> fre 9 dec. 2016 kl. 22:58 skrev Justin du coeur <[email protected]>:
>>>
>>>> Hmm.  I'm not sufficiently expert in large-cluster behavior to guess
>>>> about the problem, but note that you should *never* use
>>>> auto-down-unreachable-after in production code. (I actually don't even
>>>> recommend it in test code.)  While I don't *think* it causes the problem
>>>> you're describing, it can cause much more severe "split-brain" issues that
>>>> can lead to data corruption.  You're going to need to come up with a more
>>>> nuanced approach to the problem of downing; I recommend reading the
>>>> documentation sections on Downing
>>>> <http://doc.akka.io/docs/akka/2.4.14/scala/cluster-usage.html#Downing>
>>>> and Split Brain
>>>> <http://doc.akka.io/docs/akka/akka-commercial-addons-1.0/scala/split-brain-resolver.html>
>>>> -- it's important to get this stuff right to have a stable environment.
>>>>
>>>> On Fri, Dec 9, 2016 at 3:44 PM, Tyler Brummett <[email protected]>
>>>> wrote:
>>>>
>>>>> Hey Akka experts, I need your help! Currently my company is using Akka
>>>>> as a part of a partial CQRS pattern. We have service adapters that consume
>>>>> source system events in the form of JMS messages, while producing commands
>>>>> to be asynchronously distributed to our command service. Our command
>>>>> service consumes all of these messages asynchronously based on a given
>>>>> group ID, so that no two commands with the same group ID are being
>>>>> processed at the same time.
>>>>>
>>>>> We have designed an approach that allows us to have each deployable
>>>>> component in its own cluster and use a clusterClient to talk across
>>>>> clusters. Below is another diagram illustrating the service architecture
>>>>> with the Akka configuration reflecting separate clusters.
>>>>>
>>>>> [diagram]
>>>>> (see attached please)
>>>>>
>>>>> Errors we are seeing on appbox01: UI sends commands to command service
>>>>> 11/11/2016 09:48:46,056  INFO 
>>>>> [AppClusterSystem-akka.actor.default-dispatcher-29]
>>>>> CommandHandlerActor - received master ack.
>>>>> 11/11/2016 09:48:52,045  INFO 
>>>>> [AppClusterSystem-akka.actor.default-dispatcher-35]
>>>>> CommandHandlerActor work timeout. For commandX
>>>>> 11/11/2016 09:48:52,046 ERROR [tomcat-http--33] AppController - X
>>>>> update failed
>>>>> com.company.appA.package.AkkaWorkFailedException: Timeout for X
>>>>>
>>>>> Errors we are seeing on servicebox01: UI sends commands to command
>>>>> service
>>>>> 11/11/2016 09:48:46,715  WARN 
>>>>> [CommandClusterSystem-akka.actor.default-dispatcher-2]
>>>>> ClusterStatusListenerActor - Problem has occurred associating local host:
>>>>> servicebox01.company.com and remote host: appbox01.company.com
>>>>> 11/11/2016 09:48:46,716  WARN 
>>>>> [CommandClusterSystem-akka.actor.default-dispatcher-2]
>>>>> ClusterStatusListenerActor - Problem has occurred associating local host:
>>>>> servicebox01.company.com and remote host: appbox01.company.com
>>>>> 11/11/2016 09:48:46,716  WARN 
>>>>> [CommandClusterSystem-akka.actor.default-dispatcher-2]
>>>>> Remoting - Tried to associate with unreachable remote address [akka.tcp://
>>>>> [email protected]:12345]. Address is now gated
>>>>> for 5000 ms, all messages to this address will be delivered to dead
>>>>> letters. Reason: [The remote system has quarantined this system. No 
>>>>> further
>>>>> associations to the remote system are possible until this system is
>>>>> restarted.]
>>>>>
>>>>>
>>>>> We are interested in seeing this new implementation through and
>>>>> finding solutions where we can decouple our services and apps from one
>>>>> another as we move towards a micro-service architecture. So if you have
>>>>> suggestions/solutions, we are all ears!
>>>>>
>>>>> So the main question is: why are our nodes being quarantined? We have
>>>>> restarted nodes and stabilized the environment over and over, but the
>>>>> quarantine problem resurfaces after a few hours. Typically it's in a bad
>>>>> state by the next day. As part of this post I have provided our typical
>>>>> application.conf file for a given service, which corresponds with our new
>>>>> "separate cluster" implementation (diagram). Hopefully someone out there
>>>>> can help us shed some light on this problem. Please see the
>>>>> application.conf below.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> =====================
>>>>> application.conf
>>>>> =====================
>>>>>
>>>>> # bulkhead workers
>>>>> my-worker-exec-dispatcher {
>>>>>    type = Dispatcher
>>>>>    executor = "fork-join-executor"
>>>>>    fork-join-executor {
>>>>>       parallelism-min = 2
>>>>>       parallelism-factor = 2.0
>>>>>       parallelism-max = 10
>>>>>    }
>>>>>    throughput = 1
>>>>> }
>>>>>
>>>>> # dedicate resources to the master actor
>>>>> my-master-dispatcher {
>>>>>    type = Dispatcher
>>>>>    executor = "fork-join-executor"
>>>>>    fork-join-executor {
>>>>>       parallelism-min = 2
>>>>>       parallelism-factor = 2.0
>>>>>       parallelism-max = 10
>>>>>    }
>>>>>    throughput = 20
>>>>> }
>>>>>
>>>>> akka {
>>>>>    loggers = ["akka.event.slf4j.Slf4jLogger"]
>>>>>    loglevel = "INFO"
>>>>>    stdout-loglevel = "OFF"
>>>>>
>>>>>    actor.provider = "akka.cluster.ClusterActorRefProvider"
>>>>>
>>>>>    # Log the complete configuration at INFO level when the actor
>>>>>    # system is started. This is useful when you are uncertain of
>>>>>    # what configuration is used.
>>>>>    log-config-on-start = off
>>>>>
>>>>>    remote {
>>>>>       log-remote-lifecycle-events = off
>>>>>
>>>>>       # If this is "on", Akka will log all outbound messages at
>>>>>       # DEBUG level; if off then they are not logged
>>>>>       log-sent-messages = off
>>>>>       # If this is "on", Akka will log all inbound messages at
>>>>>       # DEBUG level; if off then they are not logged
>>>>>       log-received-messages = off
>>>>>       netty.tcp {
>>>>>          # hostname is injected programmatically in AppConfiguration.
>>>>>          port = ${akka.node.port}
>>>>>          send-buffer-size = 10240000b
>>>>>          receive-buffer-size = 10240000b
>>>>>          maximum-frame-size = 5120000b
>>>>>       }
>>>>>    }
>>>>>
>>>>>    contrib {
>>>>>       cluster {
>>>>>          pub-sub {
>>>>>             # How often the DistributedPubSubMediator should send
>>>>>             # out gossip information
>>>>>             gossip-interval = 5s
>>>>>          }
>>>>>       }
>>>>>    }
>>>>>
>>>>>    cluster {
>>>>>       # seed-nodes is injected programmatically
>>>>>       # seed-nodes = [${akka.seed.nodes}]
>>>>>       # 30 minute auto down for a crashed master
>>>>>       # a long network outage requires restarting the cluster after 30
>>>>> minutes
>>>>>       auto-down-unreachable-after = 1800s
>>>>>       roles = [${akka.cluster.roles}]
>>>>>    }
>>>>>
>>>>>    actor {
>>>>>       bounded-mailbox {
>>>>>          mailbox-type = "akka.dispatch.BoundedMailbox"
>>>>>          mailbox-capacity = 3000
>>>>>          mailbox-push-timeout-time = 100ms
>>>>>       }
>>>>>
>>>>>       debug {
>>>>>          # enable function of LoggingReceive, which is to log any
>>>>>          # received message at DEBUG level
>>>>>          receive = off
>>>>>          # enable DEBUG logging of all AutoReceiveMessages
>>>>>          # (Kill, PoisonPill etc.)
>>>>>          autoreceive = off
>>>>>          # enable DEBUG logging of actor lifecycle changes
>>>>>          lifecycle = off
>>>>>          # enable DEBUG logging of all LoggingFSMs for events,
>>>>>          # transitions and timers
>>>>>          fsm = off
>>>>>          # enable DEBUG logging of subscription changes on the
>>>>>          # eventStream
>>>>>          event-stream = off
>>>>>       }
>>>>>    }
>>>>> }
>>>>>
>>>>> akka.extensions = ["akka.contrib.pattern.ClusterReceptionistExtension"]
>>>>>
>>>>> akka.contrib.cluster.receptionist {
>>>>>    name = receptionist
>>>>>    number-of-contacts = 3
>>>>>    response-tunnel-receive-timeout = 30s
>>>>> }
>>>>>
>>>>> akka.cluster.client {
>>>>>    heartbeat-interval = 2s
>>>>>    acceptable-heartbeat-pause = 10s
>>>>>    buffer = 0
>>>>> }
>>>>>
>>>>> --
>>>>> >>>>>>>>>> Read the docs: http://akka.io/docs/
>>>>> >>>>>>>>>> Check the FAQ: http://doc.akka.io/docs/akka/current/additional/faq.html
>>>>> >>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
>>>>> ---
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "Akka User List" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To post to this group, send email to [email protected].
>>>>> Visit this group at https://groups.google.com/group/akka-user.
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>>
>



-- 

Patrik Nordwall
Akka Tech Lead
Lightbend <http://www.lightbend.com/> -  Reactive apps on the JVM
Twitter: @patriknw

