Hi all,

Robert, you mentioned that you have already tried changing the
"heartbeat-interval" setting. Did you change it on
"watch-failure-detector" or on "transport-failure-detector"? Could you try
changing it on "transport-failure-detector"?
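
For reference, here is a rough sketch of where the two settings live in
application.conf, assuming the Akka 2.3 remoting configuration layout. The
values are placeholders to experiment with, not recommendations and not
necessarily the defaults:

```
akka.remote {
  # Failure detector that monitors the connection between two actor systems.
  transport-failure-detector {
    # Illustrative values only; tune for your environment.
    heartbeat-interval = 2 s
    acceptable-heartbeat-pause = 20 s
  }

  # Failure detector used for remote death watch (context.watch on remote actors).
  watch-failure-detector {
    # Illustrative value only.
    heartbeat-interval = 1 s
  }
}
```

Changing only the "watch-failure-detector" block would not affect how the
transport itself is monitored, which is why it matters which of the two was
modified.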

If you can still reproduce it, can you provide either a reproducible code
sample or logs from both of the systems while messages can only propagate in
one direction?
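
If it helps with collecting the logs, a configuration along these lines should
make remoting more verbose on both nodes. This is only a sketch, assuming the
standard Akka 2.3 remoting settings; expect a lot of output under load, so it
is probably best enabled only for the reproduction run:

```
akka {
  # Message logging below is emitted at DEBUG level.
  loglevel = "DEBUG"

  remote {
    # Log association and disassociation events.
    log-remote-lifecycle-events = on
    # Log every inbound and outbound remote message at DEBUG level.
    log-received-messages = on
    log-sent-messages = on
  }
}
```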

I have tried various scenarios, restarting one Akka node under a decent
load, and I can't reproduce it.

On Fri, Nov 7, 2014 at 10:00 PM, Dragisa Krsmanovic <[email protected]>
wrote:

>  Martynas & Robert,
>
> To me, the most suspicious thing is that, in this case, connection only
> works one way. A can talk to B but B can’t reply back to A.
>
> There is not that much custom code in that class. It’s not subscribed to
> Association/Disassociation or any other Akka system event.
>
> An actor on one node is sending messages to an actor
> (ActorRef/ActorSelection) on another node. The receiver clearly receives
> the messages and replies with "sender ! msg". We can see that from our logs.
>
> But the sender does not receive the message because the link in that
> direction is disassociated.
>
> This is just plain akka-remote. Not clustering.
>
> It seems like connection health checks that you added in 2.2 are causing
> us trouble with false negatives.
>
> What configuration options can we try? Can we assign a different
> dispatcher to the heartbeat failure detector to rule out thread starvation?
>
> Dragisa Krsmanovic
> Ticketfly
>
> On Friday, Nov 7, 2014 at 9:47 AM, Martynas Mickevičius <
> [email protected]>, wrote:
>
>> Hello Martynas,
>>
>> This test is already the simplest scenario we can come up with. It comes
>> from a cluster simulation test framework we have developed to simulate our
>> business needs.
>>
>> If we find the time we can write a simple ping-pong test. but not sure if
>> this is possible. is there any more logging we can try? or changing
>> parameters? (however, i already tried changing the "heartbeat-interval",
>> etc.)
>>
>> Are there any plans to make this more robust in Akka 2.3.7? i fear we
>> need to revert back to older Akka versions.
>>
>> and, can it be that we experience similar issues as reported in this
>> ticket?
>> https://github.com/akka/akka/issues/13860
>>
>> Thanks,
>> Robert
>>
>> On Friday, November 7, 2014 9:47:42 AM UTC-8, Martynas Mickevičius wrote:
>>>
>>> I think these messages are fine. After node{1,2} come back up, node3
>>> should associate with the new nodes.
>>>
>>> From the logs I see that there is quite a lot of custom code running
>>> (such as diva.core.engine.PaxosDistributedKeyManager) which is listening
>>> for Association/Disassociation Events. Have you tried the restart scenario
>>> with some load with simplest actors possible and see if you can reproduce
>>> the issue?
>>>
>>> On Fri, Nov 7, 2014 at 7:37 PM, Robert Preissl <[email protected]>
>>> wrote:
>>>
>>>> Hello Martynas,
>>>>
>>>> Well, I think I can rule this option out because:
>>>> - without any load on the system (my scenario 1 in my orig. post) a
>>>> restart works fine.
>>>> - also, most of the time node3 can send back messages to node2. but
>>>> node3 does not send to node1. (however, sometimes both nodes, node1 and
>>>> node2 do not hear back from node3)
>>>> - and we also tried with ActorSelection and it did not work.
>>>>
>>>> is it suspicious to see Disassociated messages? or is this just a
>>>> symptom?
>>>>
>>>> Thanks,
>>>> Robert
>>>>
>>>> On Friday, November 7, 2014 9:15:42 AM UTC-8, Martynas Mickevičius
>>>> wrote:
>>>>>
>>>>> Hi Robert,
>>>>>
>>>>> as you mentioned, and from the logs you provided, it seems that
>>>>> messages are flowing from node{1,2} to node3 after the restart, but not
>>>>> in the other direction.
>>>>>
>>>>> Would it be possible that your application tries to send messages from
>>>>> node3 to node{1,2} using ActorRefs which were resolved before the restart
>>>>> of node{1,2}? ActorRef includes actor UID which changes after Actor is
>>>>> stopped and started again, which happens upon node restart. Here is
>>>>> <https://github.com/2m/sandbox-akka-remote/blob/fd61036875bcb622e6a5657c16d879bcc7b6b21b/src/main/scala/NodeRestart.scala>
>>>>> a quick example code that illustrates that situation.
>>>>>
>>>>> If so, you should send messages using ActorSelection or re-resolve
>>>>> ActorRefs after node restart or periodically.
>>>>>
>>>>> On Fri, Nov 7, 2014 at 3:58 AM, Robert Preissl <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hello Endre,
>>>>>>
>>>>>> First of all, thanks for replying so quickly!
>>>>>>
>>>>>> Second, I need to mention that we use Akka remoting. and not Akka
>>>>>> clustering (yet). not sure if this makes a difference.
>>>>>>
>>>>>> What I mean is that in our restart scenario (where node1 and node2
>>>>>> are restarted simultaneously first, and then node3), when node1 and
>>>>>> node2 come back up, it seems that the connection node1 -> node3 works
>>>>>> fine, but the connection node3 -> node1 does not.
>>>>>>
>>>>>> So, to answer your question, yes, it seems we lose messages from
>>>>>> node3.
>>>>>>
>>>>>> I attached more detail logs below. (and please excuse the many log
>>>>>> lines; i tried to clean it up as much as possible)
>>>>>>
>>>>>> what is interesting to see is this line:
>>>>>> *processing Event(Disassociated
>>>>>> [akka.tcp://[email protected]:8900] ->
>>>>>> [akka.tcp://[email protected]:8900]*
>>>>>>
>>>>>> 10.57.0.43 is node3. and 10.57.0.41 is node1, by the way.
>>>>>>
>>>>>> so, the connection between node3 and node1 is Disassociated, which
>>>>>> maybe explains why node1 never hears back from node3 when it tries to
>>>>>> sync up.
>>>>>>
>>>>>> We looked a bit in the akka source code and found that stopping an
>>>>>> EndpointWriter (I think) triggers a "Disassociated" to be fired, right?
>>>>>> and we can see this stop a few log lines above:
>>>>>>
>>>>>> *[akka://DivaPCluster/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FDivaPCluster%4010.57.0.41%3A8900-0/endpointWriter]
>>>>>> akka.remote.EndpointWriter - stopping*
>>>>>>
>>>>>> so, why is it stopping? is this our problem here?
>>>>>>
>>>>>> the logs from node3 are attached as a file.
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>> Robert
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wednesday, November 5, 2014 12:55:04 PM UTC-8, Robert Preissl
>>>>>> wrote:
>>>>>>>
>>>>>>> hello!
>>>>>>>
>>>>>>> I am having a problem in my remote Akka production system, which
>>>>>>> consists of 3 nodes running with the latest version of Akka (2.3.6.):
>>>>>>>
>>>>>>> In more detail, I am experiencing errors with "*rolling restarts*"
>>>>>>> of the cluster (for deployment, etc.; we cannot afford any downtime),
>>>>>>> where a restart happens in the following sequence:
>>>>>>> 1.) restart node1 and node2.
>>>>>>> 2.) once 1. completed, restart node3.
>>>>>>>
>>>>>>> *but we only observe failures once there is a load (even small load)
>>>>>>> on the system*. So, I want to describe two scenarios:
>>>>>>>
>>>>>>>
>>>>>>> *Scenario 1 - no load on the system: Restart works.*
>>>>>>>
>>>>>>> if there is no load on the system at all, the restarting seems to
>>>>>>> work fine. I.e., with detailed logging I can observe that node3 logs the
>>>>>>> following events: (in chronological order)
>>>>>>>
>>>>>>> 13:09:48.769 WARN
>>>>>>> [akka.tcp://DivaPCluster@NODE_3:8900/system/endpointManager/reliableEndpointWriter-akka.tcp0-1]
>>>>>>> akka.remote.ReliableDeliverySupervisor - Association with remote
>>>>>>> system [akka.tcp://DivaPCluster@NODE_2:8900] has failed, address is
>>>>>>> now gated for [5000] ms. Reason is: [Disassociated].
>>>>>>> 13:09:48.823 WARN
>>>>>>> [akka.tcp://DivaPCluster@NODE_3:8900/system/endpointManager/reliableEndpointWriter-akka.tcp0-0]
>>>>>>> akka.remote.ReliableDeliverySupervisor - Association with remote
>>>>>>> system [akka.tcp://DivaPCluster@NODE_1:8900] has failed, address is
>>>>>>> now gated for [5000] ms. Reason is: [Disassociated].
>>>>>>>
>>>>>>> 13:10:10.661 DEBUG [Remoting] Remoting - Associated
>>>>>>> [akka.tcp://DivaPCluster@NODE_3:8900] <-
>>>>>>> [akka.tcp://DivaPCluster@NODE_2:8900]
>>>>>>> 13:10:10.987 DEBUG [Remoting] Remoting - Associated
>>>>>>> [akka.tcp://DivaPCluster@NODE_3:8900] <-
>>>>>>> [akka.tcp://DivaPCluster@NODE_1:8900]
>>>>>>>
>>>>>>> Since node1 and node2 restart, it is fine that the association is
>>>>>>> gated between node3 -> node1 (and between node3 -> node2) for a while.
>>>>>>> And I assume it becomes active again since "a successful inbound
>>>>>>> connection is accepted from a remote system during Gate it automatically
>>>>>>> transitions to Active" (as you describe in
>>>>>>> http://doc.akka.io/docs/akka/snapshot/java/remoting.html)
>>>>>>>
>>>>>>> this can be verified since I can see in the logs on node1 that it tries
>>>>>>> to connect at this point in time after the restart: 13:10:10.861 (and
>>>>>>> the connection managing node3 -> node1 becomes active on node3 at
>>>>>>> 13:10:10.987, as you can see above)
>>>>>>>
>>>>>>> so, everything cool here and the system restarts fine!
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *Scenario 2 - easy load on the system: Restart fails due to
>>>>>>> Unrecoverable "gated" state*
>>>>>>>
>>>>>>> Similar to Scenario 1 above, I can observe the "gated" messages for
>>>>>>> links  node3 -> node1 and node3 -> node2.
>>>>>>>
>>>>>>> However, I never see that the links become active again! and the
>>>>>>> restart never recovers and I need to manually stop my nodes and start up
>>>>>>> again.
>>>>>>>
>>>>>>> This is surprising since I clearly see that node1 and node2 (after
>>>>>>> they restarted) send messages to node3, and node3 successfully logs the
>>>>>>> reception of these messages.
>>>>>>>
>>>>>>> So, why does in this scenario the connection not become active
>>>>>>> again?? It is a successful inbound connection that should make the link
>>>>>>> active again as you describe on your site?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Any help on this is greatly appreciated. otherwise we need to roll
>>>>>>> back to Scala 2.10 (or 2.9) and an older version of Akka.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Robert
>>>>>>>
>>>>>>    --
>>>>>> >>>>>>>>>> Read the docs: http://akka.io/docs/
>>>>>> >>>>>>>>>> Check the FAQ: http://doc.akka.io/docs/akka/
>>>>>> current/additional/faq.html
>>>>>> >>>>>>>>>> Search the archives: https://groups.google.com/
>>>>>> group/akka-user
>>>>>> ---
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "Akka User List" group.
>>>>>>  To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to [email protected].
>>>>>> To post to this group, send email to [email protected].
>>>>>> Visit this group at http://groups.google.com/group/akka-user.
>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Martynas Mickevičius
>>>>> Typesafe <http://typesafe.com/> – Reactive
>>>>> <http://www.reactivemanifesto.org/> Apps on the JVM
>>>>>
>>>
>>>
>>>
>>> --
>>> Martynas Mickevičius
>>> Typesafe <http://typesafe.com/> – Reactive
>>> <http://www.reactivemanifesto.org/> Apps on the JVM
>>>



-- 
Martynas Mickevičius
Typesafe <http://typesafe.com/> – Reactive
<http://www.reactivemanifesto.org/> Apps on the JVM
