Re: [akka-user] Consistent Disassociation, small cluster

Jordan Messec Thu, 05 Jan 2017 12:58:23 -0800

As you suspected might be the case, Patrik, adding the 
akka.cluster.use-dispatcher configuration was not enough to solve the 
problem. I have now followed the advice on the dispatchers documentation 
page and set up our actors that have blocking behavior to use either a 
pinned dispatcher or, in the case of a router actor, a 
pool-dispatcher.executor = "thread-pool-executor" setting. 
I will follow up again with the results.
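
A minimal sketch of what such a dispatcher setup might look like in 
application.conf. The dispatcher name, the actor paths /blockingWorker and 
/blockingRouter, the router type and the pool sizes are all placeholders 
chosen for illustration, not the names or values used in the actual 
application:

```hocon
# Hypothetical names and sizes throughout; adjust to the real actor paths.

# A pinned dispatcher gives each actor that uses it its own dedicated thread.
blocking-pinned-dispatcher {
  type = PinnedDispatcher
  executor = "thread-pool-executor"
}

akka.actor.deployment {
  # A single blocking actor is moved off the default dispatcher.
  /blockingWorker {
    dispatcher = blocking-pinned-dispatcher
  }

  # The routees of a pool router run on their own thread pool
  # instead of sharing the default dispatcher.
  /blockingRouter {
    router = round-robin-pool
    nr-of-instances = 4
    pool-dispatcher {
      executor = "thread-pool-executor"
      thread-pool-executor {
        # Assumed size: one thread per routee.
        fixed-pool-size = 4
      }
    }
  }
}
```

Instead of the deployment section, a dispatcher can also be selected in 
code with Props.withDispatcher("blocking-pinned-dispatcher").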


Francesco, it looks like you've started a separate thread for your issues, 
but I will note that we also have cluster singletons in our setup.
Serg, we have not had issues with nodes shutting down, but I do suggest you 
turn on Akka debug logging as well as the verbose heartbeat logging.
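
Both switches are plain configuration settings that already appear 
elsewhere in this thread. A minimal fragment combining them might look 
like this (as far as I know, the heartbeat events are written at debug 
level, so the two settings are meant to be used together):

```hocon
akka {
  # General Akka debug logging, as in the Web configuration quoted below.
  loglevel = "DEBUG"

  # Log cluster heartbeat events; very verbose, so best enabled only
  # while investigating heartbeat problems.
  cluster.debug.verbose-heartbeat-logging = on
}
```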

Jordan

On Thursday, January 5, 2017 at 5:49:29 AM UTC-8, Patrik Nordwall wrote:
>
> That is an excellent analysis, Jordan. The verbose-heartbeat-logging is 
> useful for exactly this kind of debugging. You need to find why NODE-1 was 
> "paused". You said that you might be doing some blocking activity in your 
> actors. I strongly recommend that you eliminate such blocking or assign a 
> dedicated dispatcher for the actors that are blocking. Blocking must not be 
> done on the default-dispatcher, since it might starve other Akka internal 
> tasks. It is normally not enough to configure akka.cluster.use-dispatcher. 
> It's better to use dedicated dispatchers for the things in the application 
> that are blocking, because there might always be some other thing that will 
> be starved on the default-dispatcher.
>
> Here is how to configure a dispatcher for blocking: 
> http://doc.akka.io/docs/akka/2.4/scala/dispatchers.html#More_dispatcher_configuration_examples
>
> On Thu, Jan 5, 2017 at 12:25 PM, 'Francesco laTorre' via Akka User List <
> [email protected]> wrote:
>
>> Hi Jordan,
>>
>> It looks very related to the issue we are facing, with the difference that 
>> we are not able to recover from the UNREACHABLE mark, probably because the 
>> cluster specs are different: in our scenario we have 3 cluster singletons 
>> and <outrageous> we use auto-downing </outrageous>.
>>
>> Cheers,
>> Francesco
>>
>> On 4 January 2017 at 21:01, Jordan Messec <[email protected]> wrote:
>>
>>> Here is an update:
>>>
>>> I moved to Akka 2.4.16 and still encountered the problem. 
>>>
>>> Therefore, I turned on "akka.cluster.debug.verbose-heartbeat-logging = 
>>> on".
>>>
>>> This allowed me to notice that when nodes started marking each other as 
>>> UNREACHABLE, *outgoing* heartbeat messages (the initial message, not the 
>>> response) were suddenly failing to send from one node, let's call it 
>>> NODE-1. Some significant time later, NODE-1 would get a handful of:
>>> [INFO] [12/30/2016 01:36:06.681] [akka.remote.transport.ProtocolStateActor] 
>>> [$X{akkaSource:-*}] No response from remote. 
>>> Transport failure detector triggered. (internal state was Open)
>>> messages, which I assume means that a heartbeat response was not 
>>> received within the window set by the 'acceptable-heartbeat-pause' 
>>> parameter.
>>>
>>> Looking at the logs of the other nodes, I can see that around the time 
>>> that NODE-1 stopped sending outgoing heartbeats, a peer, NODE-2, stopped 
>>> receiving responses to its own outgoing heartbeats to NODE-1. 
>>> The first thing NODE-2 does is mark NODE-1 as UNREACHABLE, and then a short 
>>> while later it outputs one of the above 'No response from remote' messages.
>>>
>>> At the moment NODE-1 resumes sending outgoing heartbeat messages, NODE-2 
>>> gets flooded with heartbeat responses from NODE-1, shortly afterwards 
>>> moves NODE-1 back to REACHABLE, and the cluster heals.
>>>
>>> It seems that something is causing a hiccup in my nodes which derails 
>>> the cluster monitoring threads. I am using the "-XX:MaxGCPauseMillis=300" 
>>> option in my startup script, although judging by the GC logs it does not 
>>> seem to be honored. Even so, none of the GC pauses last anywhere near as 
>>> long as the hiccup in NODE-1. It could be that I am doing some blocking 
>>> activity in my actors which is conflicting with the heartbeat monitor 
>>> actor. I have now added the 'akka.cluster.use-dispatcher' lines to 
>>> my configuration.
>>>
>>> I'll keep monitoring and report back as I get more information.
>>>
>>>
>>> On Tuesday, December 27, 2016 at 9:04:32 AM UTC-8, Serg wrote:
>>>>
>>>> Hello Jordan,
>>>>
>>>> I would also like to hear whether updating to the latest version 
>>>> has fixed the problem. 
>>>>
>>>> We have a similar issue where cluster nodes suddenly become unreachable 
>>>> (though they are running on the same host, with no CPU/memory/GC spikes) 
>>>> and then shut themselves down for no apparent reason (auto-shutdown is 
>>>> disabled for all nodes). We are running on Akka 2.4.10, old Netty transport.
>>>>
>>>>
>>>> On Thursday, December 22, 2016 at 11:59:22 PM UTC+2, Jordan Messec 
>>>> wrote:
>>>>>
>>>>> Thank you for your response and time. I have updated to version 2.4.16 
>>>>> and have Akka debug logging enabled. I will keep an eye on this and 
>>>>> update as appropriate.
>>>>>
>>>>>
>>>>> On Saturday, December 17, 2016 at 3:28:22 AM UTC-8, √ wrote:
>>>>>>
>>>>>> Hi!
>>>>>>
>>>>>> Update to most recent version and report back.
>>>>>>
>>>>>> -- 
>>>>>> Cheers,
>>>>>> √
>>>>>>
>>>>>> On Dec 17, 2016 08:20, "Jordan Messec" <[email protected]> wrote:
>>>>>>
>>>>>>> Hello, I am struggling with a problem I have spent days trying to 
>>>>>>> resolve. I was hoping someone here may have some input that could help 
>>>>>>> me look in the right direction.
>>>>>>>
>>>>>>> I am running a small cluster with 3 nodes. Two nodes reside on one 
>>>>>>> machine, while the third resides on a separate machine. This cluster is 
>>>>>>> formed between two applications. Call them Web and DataDig. DataDig and 
>>>>>>> Web co-reside on Machine1, and Web is duplicated on Machine2.
>>>>>>>
>>>>>>> Both use Akka 2.4.4, with Web's dependencies being transitive 
>>>>>>> through Play 2.5.4 
>>>>>>>
>>>>>>> My problem is that after some time of running without issue, the 
>>>>>>> nodes start having trouble communicating with each other. Within 24 
>>>>>>> hours of bringing the cluster members online, the logs start to display 
>>>>>>> the following:
>>>>>>>
>>>>>>> [WARN] [12/16/2016 21:07:32.645] [a.r.ReliableDeliverySupervisor] 
>>>>>>> [akka.tcp://application@host1:2552/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fapplication%40host2%3A2552-588] 
>>>>>>> Association with remote system [akka.tcp://host2:2552] has failed, 
>>>>>>> address is now gated for [5000] ms. Reason: [Disassociated]
>>>>>>>
>>>>>>> A service is used to monitor cluster health; at this point it starts 
>>>>>>> to report that cluster members are unreachable from each other.
>>>>>>>
>>>>>>> Obviously this starts to cause problems with cluster behavior, and 
>>>>>>> also results in messages stating that the Leader can currently not 
>>>>>>> perform its duties:
>>>>>>>
>>>>>>> [INFO] [12/16/2016 21:05:48.440] [a.c.Cluster(akka://application)] 
>>>>>>> [akka.cluster.Cluster(akka://application)] Cluster Node 
>>>>>>> [akka.tcp://application@host1:2552] - Leader can currently not perform 
>>>>>>> its duties, reachability status: 
>>>>>>> [akka.tcp://application@host1:2552 -> akka.tcp://application@host2:2552: Unreachable [Unreachable] (328), 
>>>>>>> akka.tcp://application@host1:37770 -> akka.tcp://application@host2:2552: Unreachable [Unreachable] (1), 
>>>>>>> akka.tcp://application@host2:2552 -> akka.tcp://application@host1:2552: Reachable [Reachable] (616), 
>>>>>>> akka.tcp://application@host2:2552 -> akka.tcp://application@host1:37770: Unreachable [Unreachable] (617)], 
>>>>>>> member status: 
>>>>>>> [akka.tcp://application@host1:2552 Up seen=true, 
>>>>>>> akka.tcp://application@host1:37770 Up seen=false, 
>>>>>>> akka.tcp://application@host2:2552 Leaving seen=false]
>>>>>>>
>>>>>>>
>>>>>>> I have turned on Akka debug logging but the only further messages 
>>>>>>> around the time of Disassociation I see are:
>>>>>>>
>>>>>>> [DEBUG] [12/16/2016 21:42:58.893] [application-akka.actor.default-dispatcher-24] 
>>>>>>> [akka.tcp://application@host1:37770/system/cluster/core/daemon] 
>>>>>>> Cluster Node [akka.tcp://application@host1:37770] - Receiving gossip from 
>>>>>>> [UniqueAddress(akka.tcp://application@host2:2552,921200398)]
>>>>>>>
>>>>>>> and
>>>>>>>
>>>>>>> [DEBUG] [12/16/2016 21:06:03.310] [a.r.EndpointWriter] 
>>>>>>> [akka.tcp://application@host1:2552/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fapplication%40host2%3A2552-588/endpointWriter] 
>>>>>>> Drained buffer with maxWriteCount: 50, fullBackoffCount: 1, 
>>>>>>> smallBackoffCount: 0, noBackoffCount: 0 , adaptiveBackoff: 1000
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Here is the configuration being used for Web:
>>>>>>>
>>>>>>> akka {
>>>>>>>   actor {
>>>>>>>     provider = "akka.cluster.ClusterActorRefProvider"
>>>>>>>   }
>>>>>>>
>>>>>>>   remote {
>>>>>>>     secure-cookie = "9C7BBB890AB2C39691FC7B2A34F616C1D87FCC5B"
>>>>>>>     require-cookie = on
>>>>>>>     netty.tcp {
>>>>>>>       hostname = "localhost"
>>>>>>>       hostname = ${?HOSTNAME}
>>>>>>>       port = 2552
>>>>>>>     }
>>>>>>>     log-remote-lifecycle-events = off
>>>>>>>   }
>>>>>>>
>>>>>>>   cluster {
>>>>>>>     failure-detector.threshold = 10
>>>>>>>     pub-sub {
>>>>>>>       name = distributedPubSubMediator
>>>>>>>       routing-logic = round-robin
>>>>>>>       gossip-interval = 1s
>>>>>>>       removed-time-to-live = 60s
>>>>>>>       max-delta-elements = 3000
>>>>>>>     }
>>>>>>>
>>>>>>>     roles = ["Web"]
>>>>>>>
>>>>>>>     seed-nodes = ["akka.tcp://application@host1:2552"]
>>>>>>>
>>>>>>>   }
>>>>>>>
>>>>>>>   loglevel = "DEBUG"
>>>>>>>   log-dead-letters-during-shutdown = off
>>>>>>>   log-dead-letters = off
>>>>>>>
>>>>>>>   extensions = ["akka.cluster.pubsub.DistributedPubSub"]
>>>>>>> }
>>>>>>>
>>>>>>> with JAVA_OPTS="
>>>>>>> ...
>>>>>>>
>>>>>>> -XX:HeapDumpPath=$HOME/log/ \
>>>>>>> -XX:+UseG1GC \
>>>>>>> -XX:MaxGCPauseMillis=300 \
>>>>>>> -XX:G1HeapWastePercent=20 \
>>>>>>> -XX:InitiatingHeapOccupancyPercent=75 \
>>>>>>> -XX:ConcGCThreads=32 \
>>>>>>> -XX:ParallelGCThreads=48 \
>>>>>>> -XX:NewRatio=1 \
>>>>>>> -verbose:gc \
>>>>>>> -XX:+UseGCLogFileRotation \
>>>>>>> -XX:NumberOfGCLogFiles=1 \
>>>>>>> -XX:GCLogFileSize=512M \
>>>>>>> -XX:+PrintGCDetails \
>>>>>>> -XX:+PrintGCTimeStamps \
>>>>>>> -Xloggc:$HOME/log/services_web_gc.log
>>>>>>> "
>>>>>>>
>>>>>>> With *very* similar config for DataDig.
>>>>>>>
>>>>>>> These hosts are very powerful machines that are not running any 
>>>>>>> other resource-heavy processes (in fact they're barely running anything 
>>>>>>> else at all). There are a few GC pauses that are longer than I would 
>>>>>>> expect.
>>>>>>>
>>>>>>>
>>>>>>> Any help is appreciated, and I can provide any further 
>>>>>>> context/information.
>>>>>>>
>>>>>>> -- 
>>>>>>> >>>>>>>>>> Read the docs: http://akka.io/docs/
>>>>>>> >>>>>>>>>> Check the FAQ: 
>>>>>>> http://doc.akka.io/docs/akka/current/additional/faq.html
>>>>>>> >>>>>>>>>> Search the archives: 
>>>>>>> https://groups.google.com/group/akka-user
>>>>>>> --- 
>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>> Groups "Akka User List" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>> send an email to [email protected].
>>>>>>> To post to this group, send email to [email protected].
>>>>>>> Visit this group at https://groups.google.com/group/akka-user.
>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>
>>>
>>
>>
>
>
>
> -- 
>
> Patrik Nordwall
> Akka Tech Lead
> Lightbend <http://www.lightbend.com/> -  Reactive apps on the JVM
> Twitter: @patriknw
>
>

