Re: [akka-user] Akka clusters using cluster client causing quarantined nodes (v2.3.11)

Tyler Brummett Wed, 04 Jan 2017 09:03:24 -0800

Thank you for that suggestion. We are stuck on 2.3.x for the time being and 
plan to move to 2.4.x in the not-so-distant future, but it would still be 
beneficial to solve this problem now.


We've upgraded our Akka version to 2.3.15 and it seems that the quarantine 
messages are no longer showing up in our logs, which is great!
However, some of the acknowledgement messages that we normally get during 
one of our nightly processes are not making it back to the requesting 
actor. This behavior is new since the upgrade and seems to be the only 
thing preventing us from moving forward.

Context:
A request message is sent from one actor in actorSystemA to another actor 
in actorSystemB. The actor in actorSystemB delegates work to be done by 
some other back end process. Said back end process may take a few minutes 
depending on the size of the data result coming back. Therefore, we have 
'hand-rolled' a solution to use a private ActorRef on the responding actor 
in actorSystemB to periodically send 'update' messages every few seconds to 
let the requesting actor in actorSystemA know that the back end process is 
still processing its request.

Before the upgrade, we always received these 'update' messages 
(acknowledgements that the back end process is still working). Now the 
simple .tell() back to the requesting actor in actorSystemA seems to fail 
some of the time. The line of code below shows where it seems to fail in 
the actor of actorSystemB:

*originalSender.tell(updateMessage, getSelf());*
where *originalSender* is the ActorRef that is set to *getSender()* when 
the requesting actor sends a message to the actor in actorSystemB (where 
the above code lives). During this nightly process, we do this many times 
with different parameters and it works most of the time. How could it work 
most of the time, but not all, after upgrading?
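
Akka's remote message delivery is at-most-once, so a single 'update' sent with 
.tell() can be lost without any error on the sending side. Below is a minimal 
sketch, assuming the requesting actor keeps its own deadline, of how the 
requester could tolerate a few missed updates instead of failing on the first 
gap. UpdateWatchdog and its parameters are hypothetical names for illustration 
only; they are not part of Akka or of the code discussed in this thread.

```java
/**
 * Hypothetical requester-side watchdog (not an Akka class). It remembers when
 * the last 'update' message arrived and reports a timeout only after more than
 * (toleratedMisses + 1) update intervals have passed without one.
 */
final class UpdateWatchdog {
    private final long updateIntervalMillis;
    private final int toleratedMisses;
    private long lastUpdateMillis;

    UpdateWatchdog(long updateIntervalMillis, int toleratedMisses, long startMillis) {
        if (updateIntervalMillis <= 0 || toleratedMisses < 0) {
            throw new IllegalArgumentException("interval must be > 0 and misses >= 0");
        }
        this.updateIntervalMillis = updateIntervalMillis;
        this.toleratedMisses = toleratedMisses;
        this.lastUpdateMillis = startMillis;
    }

    /** Call whenever an 'update' message arrives from the responding actor. */
    void onUpdate(long nowMillis) {
        // Ignore timestamps that are older than the newest update already seen.
        lastUpdateMillis = Math.max(lastUpdateMillis, nowMillis);
    }

    /** Longest silence accepted before the request is treated as failed. */
    long timeoutMillis() {
        return updateIntervalMillis * (toleratedMisses + 1L);
    }

    /** True once the silence since the last update exceeds timeoutMillis(). */
    boolean isTimedOut(long nowMillis) {
        return nowMillis - lastUpdateMillis > timeoutMillis();
    }
}
```

With a 5 second update interval and two tolerated misses, the requester would 
only give up after more than 15 seconds of silence, so one dropped 'update' 
during a short unreachable period would not by itself fail the request.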

Nothing in the logs seems to indicate why this might have failed, but we did 
find that some of the nodes in the cluster system became unreachable for a 
little while, then reachable again. This is consistent with how often the 
nodes went into quarantine before. We just aren't sure why it is happening 
and were wondering if we could attribute our lost acknowledgement messages 
to this:

===
*01/04/2017 06:44:44,840  INFO 
[AppClusterSystem-akka.actor.default-dispatcher-2] 
Cluster(akka://AppClusterSystem) - Cluster Node 
[akka.tcp://[email protected]:12345] - Ignoring received 
gossip from unreachable 
[UniqueAddress(akka.tcp://[email protected]:12345,-1669472141)]*
*01/04/2017 06:44:45,355  INFO 
[AppClusterSystem-akka.actor.default-dispatcher-2] 
Cluster(akka://AppClusterSystem) - Cluster Node 
[akka.tcp://[email protected]:12345] - Marking node(s) as 
REACHABLE [Member(address = 
akka.tcp://[email protected]:12345, status = Up)]*
===
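
To check whether the missing 'update' messages coincide with these unreachable 
periods, the remoting and dead-letter logging that is switched off in the 
application.conf quoted below could be turned on temporarily. This is only a 
sketch, assuming DEBUG logging is acceptable for one nightly run; the values 
are illustrative and not a recommendation for permanent use.

```
akka {
   # log-sent-messages and log-received-messages only produce output at DEBUG
   loglevel = "DEBUG"
   # log every dead letter instead of only the first few
   log-dead-letters = on

   remote {
      log-remote-lifecycle-events = on
      log-sent-messages = on
      log-received-messages = on
   }
}
```

If an 'update' shows up as sent on the actorSystemB side but never as received 
or as a dead letter on the actorSystemA side, that would point at the 
connection between the two systems rather than at the actors themselves.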

Let me know if I have not provided enough context or information to 
determine what our problem could be.
Thanks again!


On Saturday, December 10, 2016 at 2:12:28 AM UTC-6, Patrik Nordwall wrote:
>
> First step is to use latest version. Preferably 2.4.14, but if you are 
> stuck on 2.3.x it is 2.3.15. Updating to 2.4.x should be fairly easy, see 
> migration guide in docs.
>
> You need a version with this fix https://github.com/akka/akka/issues/13909 
> and there are many other bug fixes since 2.3.11
>
> /Patrik
>
> On Fri, Dec 9, 2016 at 22:58, Justin du coeur <[email protected]> wrote:
>
>> Hmm.  I'm not sufficiently expert in large-cluster behavior to guess 
>> about the problem, but note that you should *never* use 
>> auto-down-unreachable-after in production code. (I actually don't even 
>> recommend it in test code.)  While I don't *think* it causes the problem 
>> you're describing, it can cause much more severe "split-brain" issues that 
>> can lead to data corruption.  You're going to need to come up with a more 
>> nuanced approach to the problem of downing; I recommend reading the 
>> documentation sections on Downing 
>> <http://doc.akka.io/docs/akka/2.4.14/scala/cluster-usage.html#Downing> 
>> and Split Brain 
>> <http://doc.akka.io/docs/akka/akka-commercial-addons-1.0/scala/split-brain-resolver.html>
>> -- it's important to get this stuff right to have a stable environment.
>>
>> On Fri, Dec 9, 2016 at 3:44 PM, Tyler Brummett <[email protected]> wrote:
>>
>>> Hey Akka experts, I need your help! Currently my company is using Akka 
>>> as a part of a partial CQRS pattern. We have service adapters that consume 
>>> source system events in the form of JMS messages, while producing commands 
>>> to be asynchronously distributed to our command service. Our command 
>>> service consumes all of these messages asynchronously based on a given 
>>> group ID, so that no two commands with the same group ID are being 
>>> processed at the same time. 
>>>
>>> We have designed an approach that allows us to have each deployable 
>>> component in its own cluster and use a clusterClient to talk across 
>>> clusters. Below is another diagram illustrating the service architecture 
>>> with the Akka configuration reflecting separate clusters.
>>>
>>> [diagram]
>>> (see attached please)
>>>
>>> Errors we are seeing on appbox01: UI sends commands to command service
>>> 11/11/2016 09:48:46,056  INFO 
>>> [AppClusterSystem-akka.actor.default-dispatcher-29] CommandHandlerActor - 
>>> received master ack.
>>> 11/11/2016 09:48:52,045  INFO 
>>> [AppClusterSystem-akka.actor.default-dispatcher-35] CommandHandlerActor 
>>> work timeout. For commandX
>>> 11/11/2016 09:48:52,046 ERROR [tomcat-http--33] AppController - X update 
>>> failed
>>> com.company.appA.package.AkkaWorkFailedException: Timeout for X
>>>  
>>> Errors we are seeing on servicebox01: UI sends commands to command 
>>> service
>>> 11/11/2016 09:48:46,715  WARN 
>>> [CommandClusterSystem-akka.actor.default-dispatcher-2] 
>>> ClusterStatusListenerActor - Problem has occurred associating local host: 
>>> servicebox01.company.com and remote host: appbox01.company.com
>>> 11/11/2016 09:48:46,716  WARN 
>>> [CommandClusterSystem-akka.actor.default-dispatcher-2] 
>>> ClusterStatusListenerActor - Problem has occurred associating local host: 
>>> servicebox01.company.com and remote host: appbox01.company.com
>>> 11/11/2016 09:48:46,716  WARN 
>>> [CommandClusterSystem-akka.actor.default-dispatcher-2] Remoting - Tried to 
>>> associate with unreachable remote address [akka.tcp://
>>> [email protected]:12345]. Address is now gated for 
>>> 5000 ms, all messages to this address will be delivered to dead letters. 
>>> Reason: [The remote system has quarantined this system. No further 
>>> associations to the remote system are possible until this system is 
>>> restarted.]
>>>
>>>
>>> We are interested in seeing this new implementation through and finding 
>>> solutions where we can decouple our services and apps from one another as 
>>> we move towards a micro-service architecture. So if you have 
>>> suggestions/solutions, we are all ears!
>>>
>>> So the main question is: why are our nodes being quarantined? We have 
>>> restarted nodes and stabilized the environment over and over, but the 
>>> quarantine problem resurfaces after a few hours. Typically it's in a bad 
>>> state by the next day. As part of this post I have provided our typical 
>>> application.conf file for a given service, which corresponds with our new 
>>> "separate cluster" implementation (diagram). Hopefully someone out there 
>>> can help us shed some light on this problem. Please see the 
>>> application.conf below.
>>>
>>> Thanks!
>>>
>>> =====================
>>> application.conf
>>> =====================
>>>
>>> # bulkhead workers
>>> my-worker-exec-dispatcher {
>>>    type = Dispatcher
>>>    executor = "fork-join-executor"
>>>    fork-join-executor {
>>>       parallelism-min = 2
>>>       parallelism-factor = 2.0
>>>       parallelism-max = 10
>>>    }
>>>    throughput = 1
>>> }
>>>
>>> # dedicate resources to the master actor
>>> my-master-dispatcher {
>>>    type = Dispatcher
>>>    executor = "fork-join-executor"
>>>    fork-join-executor {
>>>       parallelism-min = 2
>>>       parallelism-factor = 2.0
>>>       parallelism-max = 10
>>>    }
>>>    throughput = 20
>>> }
>>>
>>> akka {
>>>    loggers = ["akka.event.slf4j.Slf4jLogger"]
>>>    loglevel = "INFO"
>>>    stdout-loglevel = "OFF"
>>>
>>>    actor.provider = "akka.cluster.ClusterActorRefProvider"
>>>
>>>    # Log the complete configuration at INFO level when the actor system
>>>    # is started.
>>>    # This is useful when you are uncertain of what configuration is used.
>>>    log-config-on-start = off
>>>
>>>    remote {
>>>       log-remote-lifecycle-events = off
>>>
>>>       # If this is "on", Akka will log all outbound messages at DEBUG level,
>>>       # if off then they are not logged
>>>       log-sent-messages = off
>>>       # If this is "on", Akka will log all inbound messages at DEBUG level,
>>>       # if off then they are not logged
>>>       log-received-messages = off
>>>       netty.tcp {
>>>          # hostname is injected programmatically in AppConfiguration.
>>>          port = ${akka.node.port}
>>>          send-buffer-size = 10240000b
>>>          receive-buffer-size = 10240000b
>>>          maximum-frame-size = 5120000b
>>>       }
>>>    }
>>>
>>>    contrib {
>>>       cluster {
>>>          pub-sub {
>>>             # How often the DistributedPubSubMediator should send out gossip
>>>             # information
>>>             gossip-interval = 5s
>>>          }
>>>       }
>>>    }
>>>
>>>    cluster {
>>>       # seed-nodes is injected programmatically
>>>       # seed-nodes = [${akka.seed.nodes}]
>>>       # 30 minute auto down for a crashed master
>>>       # a long network outage requires restarting the cluster after 30
>>>       # minutes
>>>       auto-down-unreachable-after = 1800s
>>>       roles = [${akka.cluster.roles}]
>>>    }
>>>
>>>    actor {
>>>       bounded-mailbox {
>>>          mailbox-type = "akka.dispatch.BoundedMailbox"
>>>          mailbox-capacity = 3000
>>>          mailbox-push-timeout-time = 100ms
>>>       }
>>>
>>>       debug {
>>>          # enable function of LoggingReceive, which is to log any received
>>>          # message at DEBUG level
>>>          receive = off
>>>          # enable DEBUG logging of all AutoReceiveMessages (Kill, PoisonPill
>>>          # etc.)
>>>          autoreceive = off
>>>          # enable DEBUG logging of actor lifecycle changes
>>>          lifecycle = off
>>>          # enable DEBUG logging of all LoggingFSMs for events, transitions and
>>>          # timers
>>>          fsm = off
>>>          # enable DEBUG logging of subscription changes on the eventStream
>>>          event-stream = off
>>>       }
>>>    }
>>> }
>>>
>>> akka.extensions = ["akka.contrib.pattern.ClusterReceptionistExtension"]
>>>
>>> akka.contrib.cluster.receptionist {
>>>    name = receptionist
>>>    number-of-contacts = 3
>>>    response-tunnel-receive-timeout = 30s
>>> }
>>>
>>> akka.cluster.client {
>>>    heartbeat-interval = 2s
>>>    acceptable-heartbeat-pause = 10s
>>>    buffer = 0
>>> }
>>>
>>> -- 
>>> >>>>>>>>>> Read the docs: http://akka.io/docs/
>>> >>>>>>>>>> Check the FAQ: 
>>> http://doc.akka.io/docs/akka/current/additional/faq.html
>>> >>>>>>>>>> Search the archives: 
>>> https://groups.google.com/group/akka-user
>>> --- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "Akka User List" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/akka-user.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>
