Hi Jim,

Thanks for sharing. Good luck hunting down the issue. I've added some comments below.
B/

On 8 December 2014 at 22:23:29, Jim Washburn ([email protected]) wrote:

Hi again,

I ran more experiments, and have now seen a scenario where we execute the XXX code path, yet don't have a problem with node-to-node remoting. So my hypothesis that this code path is correlated with our issue must have been incorrect. Sorry for the false alarm. Back to square one...

Jim

On Mon, Dec 8, 2014 at 9:18 AM, Jim Washburn <[email protected]> wrote:

Hi Björn,

Thanks for getting back to us!

On Mon, Dec 8, 2014 at 6:11 AM, Björn Antonsson <[email protected]> wrote:

Hi Jim,

The UID in question is the UID of the remote actor system that is trying to connect in to this actor system. It will only be changed on a complete restart of the actor system.

Just to make sure we are using the same terminology, referring to the "Actor LifeCycle" diagram in http://doc.akka.io/docs/akka/2.3.7/java/untyped-actors.html:

- a restart of an actor, creating a new "instance", will *not* change the UID
- a Stop or PoisonPill of an actor, followed by an actorOf(), resulting in a new actor "incarnation", *will* change the UID

The UID in the code in the remoting is the UID of the ActorSystem, not the UID of a single ActorRef. The ActorSystem UID is changed on restart of the ActorSystem.

Those two scenarios are separate in my mind from a restart of the entire actor system. Is this in accord with your view?

If so, in order to avoid the XXX code path by making the UID equality false, we could consider creating a new incarnation of the actor on the remote system. Since this happens to be an important actor in our app (a Paxos leader), we would have to consider all the repercussions of doing that. The other alternative for making the UID equality false would be to prevent the local actor system from having the remote system's UID stored in its endpoint database.
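Björn's instance/incarnation distinction above can be boiled down to a toy model. This is plain Java, not Akka's actual types; the names (`Incarnation`, `restart`) are made up for illustration: a supervisor restart swaps the instance but keeps the UID, while stop + actorOf produces a new incarnation with a new UID.

```java
// Toy model (not Akka code) of the instance vs. incarnation distinction:
// a restart replaces the "instance" but keeps the UID; stop + actorOf
// creates a new "incarnation" with a fresh UID.
public class IncarnationDemo {
    static long nextUid = 1;

    static class Incarnation {
        final long uid = nextUid++;   // fixed for the lifetime of this incarnation
        int instanceCount = 1;        // bumped on restart; the UID is unchanged

        void restart() { instanceCount++; }   // supervisor restart: new instance, same UID
    }

    public static void main(String[] args) {
        Incarnation first = new Incarnation();
        long uidBefore = first.uid;
        first.restart();                               // restart keeps the UID
        System.out.println(first.uid == uidBefore);    // true

        Incarnation second = new Incarnation();        // stop + actorOf: new incarnation
        System.out.println(second.uid == uidBefore);   // false
    }
}
```

The same logic applies one level up: the ActorSystem UID only changes when the whole system (here, the whole JVM) is restarted.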
At this point I do not know enough about the remoting to understand why it has this UID stored in the first place, given that the whole JVM was restarted.

Which branch did you build exactly? The latest stable bugfix branch is release-2.3.

From github, for akka, when I do a 'git log', the version I have is:

    commit 1312ca61396d2f4e4bb38318bca333e6ec6b62d8
    Merge: a570872 6e6f92f
    Author: Martynas Mickevičius <[email protected]>
    Date: Fri Nov 14 10:22:14 2014 +0200

and from AkkaBuild.scala: version := "2.3-SNAPSHOT". Hope that works as a means of letting you know what I am running. When I did a git clone from github, that is what I got.

If you do "git checkout release-2.3" you will get the latest stable bugfix branch.

Also, which of the machines in your failure scenario is running the XXX code path?

OK, for our app we currently run three nodes on three JVMs. For this discussion, it comes down to an interaction between two nodes. One node we can call the Proposer, which stays running in our case. The other node, which we can call the Acceptor, is stopped and restarted at the JVM level. That is the machine which is running the XXX code path.

Regards,
Jim

B/

On 6 December 2014 at 08:54:29, Jim Washburn ([email protected]) wrote:

Hi Martynas,

I discussed this with Robert and Dragisa and took a look into it. I cloned the Akka repo (2.3-SNAPSHOT) and put some debug logs into the akka remoting project. Not to imply that there is a bug in akka remoting, but to get a more precise idea of what is happening in terms of the interaction between our application and akka. I found a difference in a certain code path taken in Remoting.scala between the cases when we have a problem restarting our app and when we don't. (The restart problem occurs intermittently for us.)
The function is:

    def handleInboundAssociation(ia: InboundAssociation): Unit = ia match {
      case ia @ InboundAssociation(handle: AkkaProtocolHandle) ⇒
        endpoints.readOnlyEndpointFor(handle.remoteAddress) match {
          case Some(endpoint) ⇒
            pendingReadHandoffs.get(endpoint) foreach (_.disassociate())
            pendingReadHandoffs += endpoint -> handle
            endpoint ! EndpointWriter.TakeOver(handle, self)
          case None ⇒
            if (endpoints.isQuarantined(handle.remoteAddress, handle.handshakeInfo.uid)) {
              handle.disassociate(AssociationHandle.Quarantined)
            } else endpoints.writableEndpointWithPolicyFor(handle.remoteAddress) match {
              case Some(Pass(ep, None, _)) ⇒
                stashedInbound += ep -> (stashedInbound.getOrElse(ep, Vector.empty) :+ ia)
              case Some(Pass(ep, Some(uid), _)) ⇒
                if (handle.handshakeInfo.uid == uid) {
                  pendingReadHandoffs.get(ep) foreach (_.disassociate()) // XXX we have a restart problem
                  pendingReadHandoffs += ep -> handle
                  ep ! EndpointWriter.StopReading(ep, self)
                } else {
                  context.stop(ep) // YYY we don't have a restart problem
                  endpoints.unregisterEndpoint(ep)
                  pendingReadHandoffs -= ep
                  createAndRegisterEndpoint(handle, refuseUid = Some(uid))
                }
              case state ⇒
                createAndRegisterEndpoint(handle, refuseUid = endpoints.refuseUid(handle.remoteAddress))
            }
        }
    }

When akka executes the code path labelled XXX above, we have a problem with our nodes communicating after a restart. That is the case when the UIDs being compared are equal. In the other case, labelled YYY, when the UIDs are unequal, a new Endpoint is created. It appears that when the UIDs are equal, akka thinks that the endpoint is already set up.

I understand that the UIDs get recreated when actors are restarted, and that we should therefore use the actorSelection API instead of actorFor. We have gone ahead and replaced actorFor with actorSelection in our app. Still, unfortunately, the problem persists. My hypothesis is that there is still some so-to-speak "stale" UID information somewhere in the system.
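The uid comparison Jim is pointing at can be reduced to a toy decision function. This is plain Java, not Akka's code; `EndpointDecision` and its names are made up to show only the branch logic: a matching stored UID takes the XXX (take-over) path, a mismatch takes the YYY (fresh endpoint) path. A fully restarted remote JVM should present a new system UID, so after a full restart the YYY branch is the one you would expect.

```java
// Toy model (not Akka code) of the uid check in handleInboundAssociation:
// given the uid stored for a remote address and the uid presented by an
// incoming handshake, decide which branch the quoted code would take.
public class EndpointDecision {
    public enum Path { TAKEOVER_XXX, NEW_ENDPOINT_YYY }

    // Mirrors `if (handle.handshakeInfo.uid == uid)` in the quoted snippet.
    public static Path decide(long storedUid, long handshakeUid) {
        return storedUid == handshakeUid ? Path.TAKEOVER_XXX : Path.NEW_ENDPOINT_YYY;
    }

    public static void main(String[] args) {
        System.out.println(decide(42L, 42L)); // TAKEOVER_XXX
        System.out.println(decide(42L, 7L));  // NEW_ENDPOINT_YYY
    }
}
```

Seen this way, the puzzle in the thread is not the branch itself but why the two sides ever agree on a UID after one JVM has fully restarted.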
I can understand why an actor on one JVM node (the one which was not restarted) would have the same UID as before. I don't know, however, how the other node (the one which is restarted) would have that UID in its EndpointRegistry map, leading to a UID equality. Any other ideas about how to get around this would be much appreciated.

Regards,
Jim Washburn

On Wednesday, November 12, 2014 4:29:59 PM UTC-8, Robert Preissl wrote:

Hi Martynas,

Yes, I also tried manipulating the "transport-failure-detector" values (basically multiplied the default interval & pause settings by a factor of 2 and also by 4) and it did not change anything.

I am attaching the original server logs for all three nodes. This is the scenario where node1 and node2 are restarting and node3 stays up. Once node1 and node2 come back up, only node2 successfully syncs up with node3. node1 sends messages to node3 but never gets responses back.

I appreciate you taking a look, and am curious if you see something suspicious.

Thanks,
Robert

On Tuesday, November 11, 2014 5:36:40 AM UTC-8, Martynas Mickevičius wrote:

Hi all,

Robert, you mentioned that you have already tried to change the "heartbeat-interval" setting. Did you change that on "watch-failure-detector" or on "transport-failure-detector"? Could you try changing it on "transport-failure-detector"?

If you can still reproduce it, can you provide either a reproducible code sample or logs from both of the systems when messages can only propagate in one direction? I have tried various situations while restarting one akka node under some decent load, and I can't reproduce it.

On Fri, Nov 7, 2014 at 10:00 PM, Dragisa Krsmanovic <[email protected]> wrote:

Martynas & Robert,

To me, the most suspicious thing is that, in this case, the connection only works one way. A can talk to B but B can't reply back to A.

There is not that much custom code in that class. It's not subscribed to Association/Disassociation or any other Akka system event.
An actor on one node is sending messages to an actor (ActorRef/ActorSelection) on another node. The receiver clearly receives the messages and replies with "sender ! msg". We can see that from our logs. But the sender does not receive the reply, because the link in that direction is disassociated. This is just plain akka-remote, not clustering.

It seems like the connection health checks that you added in 2.2 are causing us trouble with false negatives. What are the configuration options we can try? Can we assign a different dispatcher to the heartbeat failure detector to rule out thread starvation?

Dragisa Krsmanovic
Ticketfly
— Sent from Mailbox

On Friday, Nov 7, 2014 at 9:47 AM, Martynas Mickevičius <[email protected]> wrote:

Hello Martynas,

This test is already the simplest scenario we can come up with. It comes from a cluster simulation test framework we have developed to simulate our business needs. If we find the time we can write a simple ping-pong test, but I'm not sure if this is possible.

Is there any more logging we can try? Or changing parameters? (However, I already tried changing the "heartbeat-interval", etc.) Are there any plans to make this more robust in Akka 2.3.7? I fear we need to revert back to older Akka versions. And can it be that we are experiencing similar issues as reported in this ticket? https://github.com/akka/akka/issues/13860

Thanks,
Robert

On Friday, November 7, 2014 9:47:42 AM UTC-8, Martynas Mickevičius wrote:

I think these messages are fine. After node{1,2} come back, node3 should associate with the new nodes.

From the logs I see that there is quite a lot of custom code running (such as diva.core.engine.PaxosDistributedKeyManager) which is listening for Association/Disassociation events. Have you tried the restart scenario with some load, with the simplest actors possible, to see if you can reproduce the issue?
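For concreteness, the failure-detector settings discussed in this thread live under akka.remote in application.conf. This is only a sketch: the key names are the ones used by Akka 2.3's reference.conf, but the values below are illustrative (e.g. a multiple of whatever your current settings are), not recommended defaults — check the reference.conf shipped with your exact Akka version.

```
# application.conf sketch -- key names per Akka 2.3, values illustrative only
akka.remote {
  transport-failure-detector {
    heartbeat-interval = 8 s           # e.g. 2x your current value
    acceptable-heartbeat-pause = 40 s  # e.g. 4x your current value
  }
  watch-failure-detector {
    heartbeat-interval = 1 s
    acceptable-heartbeat-pause = 10 s
  }
}
```

As Martynas notes above, the transport-failure-detector (link-level heartbeats) and the watch-failure-detector (remote DeathWatch) are separate knobs, so it matters which one was actually changed.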
On Fri, Nov 7, 2014 at 7:37 PM, Robert Preissl <[email protected]> wrote:

Hello Martynas,

Well, I think I can rule this option out because:

- without any load on the system (scenario 1 in my original post), a restart works fine.
- also, most of the time node3 can send back messages to node2, but node3 does not send to node1. (However, sometimes both node1 and node2 do not hear back from node3.)
- and we also tried with ActorSelection and it did not work.

Is it suspicious to see Disassociated messages? Or is this just a symptom?

Thanks,
Robert

On Friday, November 7, 2014 9:15:42 AM UTC-8, Martynas Mickevičius wrote:

Hi Robert,

As you mentioned, and from the logs you provided, it seems that messages are flowing from node{1,2} to node3 after the restart, but not in the other direction. Would it be possible that your application tries to send messages from node3 to node{1,2} using ActorRefs which were resolved before the restart of node{1,2}? An ActorRef includes the actor UID, which changes after an actor is stopped and started again, which happens upon node restart. Here is a quick example code that illustrates that situation. If so, you should send messages using ActorSelection, or re-resolve ActorRefs after node restart or periodically.

On Fri, Nov 7, 2014 at 3:58 AM, Robert Preissl <[email protected]> wrote:

Hello Endre,

First of all, thanks for replying so quickly! Second, I need to mention that we use Akka remoting, not Akka clustering (yet). Not sure if this makes a difference.

What I mean is that in our restart scenario (where first node1 and node2 are simultaneously restarted, and then node3), when node1 and node2 are coming back up, it seems that the connection node1 -> node3 works fine, but the connection node3 -> node1 does not. So, to answer your question: yes, it seems we lose messages from node3. I attached more detailed logs below.
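The example code Martynas mentions earlier in this thread isn't included in the archive, but his stale-ActorRef point can be sketched outside Akka. This is plain Java with made-up names (`spawn`, `sendViaRef`, `sendViaSelection`), modeling only the idea: a cached "ref" pins the UID captured at resolve time and goes dead across a restart, while a name-based "selection" resolves the current incarnation on every send.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model (plain Java, not Akka's API) of why a cached ActorRef goes
// stale across a remote restart while a name-based selection keeps working.
public class StaleRefDemo {
    static Map<String, Long> registry = new HashMap<>(); // actor name -> current incarnation UID
    static long nextUid = 1;

    // (Re)create the actor under a name: each incarnation gets a fresh UID.
    static long spawn(String name) {
        long uid = nextUid++;
        registry.put(name, uid);
        return uid;
    }

    // An "ActorRef" pins the UID captured at resolve time; delivery fails on mismatch.
    static boolean sendViaRef(String name, long pinnedUid) {
        Long current = registry.get(name);
        return current != null && current == pinnedUid;
    }

    // An "ActorSelection" resolves by name on every send.
    static boolean sendViaSelection(String name) {
        return registry.containsKey(name);
    }

    public static void main(String[] args) {
        long uid = spawn("worker");                      // resolved before the restart
        spawn("worker");                                 // node restart: new incarnation, new UID
        System.out.println(sendViaRef("worker", uid));   // false: ref pinned to the old UID
        System.out.println(sendViaSelection("worker"));  // true: the name still resolves
    }
}
```

Robert's follow-up above is what makes the thread interesting: the team already switched to ActorSelection, so this simple staleness story cannot be the whole explanation.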
(And please excuse the many log lines; I tried to clean it up as much as possible.)

What is interesting to see is this line:

    processing Event(Disassociated [akka.tcp://[email protected]:8900] -> [akka.tcp://[email protected]:8900]

10.57.0.43 is node3, and 10.57.0.41 is node1, by the way. So the connection between node3 and node1 is Disassociated, which maybe explains why node1 never hears back from node3 when it tries to sync up.

We looked a bit into the akka source code and found that stopping an EndpointWriter (I think) triggers a "Disassociated" to be fired, right? And we can see this stop a few log lines above:

    [akka://DivaPCluster/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FDivaPCluster%4010.57.0.41%3A8900-0/endpointWriter] akka.remote.EndpointWriter - stopping

So, why is it stopping? Is this our problem here? The logs from node3 are attached as a file.

Thank you!
Robert

On Wednesday, November 5, 2014 12:55:04 PM UTC-8, Robert Preissl wrote:

Hello!

I am having a problem in my remote Akka production system, which consists of 3 nodes running the latest version of Akka (2.3.6). In more detail, I am experiencing errors with "rolling restarts" of the cluster (for deployment, etc., we cannot afford any downtime), where a restart happens in the following sequence:

1.) restart node1 and node2
2.) once 1.) has completed, restart node3

But we only observe failures once there is load (even a small load) on the system. So, I want to describe two scenarios:

Scenario 1 - no load on the system: Restart works.

If there is no load on the system at all, the restarting seems to work fine.
I.e., with detailed logging I can observe that node3 logs the following events (in chronological order):

    13:09:48.769 WARN [akka.tcp://DivaPCluster@NODE_3:8900/system/endpointManager/reliableEndpointWriter-akka.tcp0-1] akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://DivaPCluster@NODE_2:8900] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
    13:09:48.823 WARN [akka.tcp://DivaPCluster@NODE_3:8900/system/endpointManager/reliableEndpointWriter-akka.tcp0-0] akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://DivaPCluster@NODE_1:8900] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
    13:10:10.661 DEBUG [Remoting] Remoting - Associated [akka.tcp://DivaPCluster@NODE_3:8900] <- [akka.tcp://DivaPCluster@NODE_2:8900]
    13:10:10.987 DEBUG [Remoting] Remoting - Associated [akka.tcp://DivaPCluster@NODE_3:8900] <- [akka.tcp://DivaPCluster@NODE_1:8900]

Since node1 and node2 restart, it is fine that the association is gated from node3 -> node1 (and from node3 -> node2) for a while. And I assume it becomes active again since "a successful inbound connection is accepted from a remote system during Gate it automatically transitions to Active" (as you describe in http://doc.akka.io/docs/akka/snapshot/java/remoting.html). This can be verified since I can see in the logs on node1 that it tries to connect at this point in time after the restart (13:10:10.861), and the connection becomes active on node3 (managing node3 -> node1) at time 13:10:10.987, as you can see above. So, everything is cool here and the system restarts fine!

Scenario 2 - easy load on the system: Restart fails due to an unrecoverable "gated" state.

Similar to Scenario 1 above, I can observe the "gated" messages for the links node3 -> node1 and node3 -> node2. However, I never see the links become active again! The restart never recovers, and I need to manually stop my nodes and start them up again.
This is surprising, since I clearly see that node1 and node2 (after they restarted) send messages to node3, and node3 successfully logs the reception of these messages. So why, in this scenario, does the connection not become active again? It is a successful inbound connection, which should make the link active again, as you describe on your site.

Any help on this is greatly appreciated; otherwise we need to roll back to Scala 2.10 (or 2.9) and an older version of Akka.

Thanks,
Robert

--
>>>>>>>>>> Read the docs: http://akka.io/docs/
>>>>>>>>>> Check the FAQ: http://doc.akka.io/docs/akka/current/additional/faq.html
>>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
---
You received this message because you are subscribed to the Google Groups "Akka User List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/akka-user.
For more options, visit https://groups.google.com/d/optout.

--
Martynas Mickevičius
Typesafe – Reactive Apps on the JVM
--
Björn Antonsson
Typesafe – Reactive Apps on the JVM
twitter: @bantonsson
