I'm following up on this topic after upgrading to Akka 2.3.15. I'm reasonably confident that the issue is the result of using Akka along with another library that causes the Netty dependency to be upgraded from 3.9.2.Final to 3.10.0.Final. For now I have removed the dependency on the newer version of Netty, but I thought I'd report what I was seeing in the logs. I ran five nodes for a few hours with no issue, and then two nodes fell out of the cluster. Here are the logs from each node:

IP: 160
13:59:57.252 INFO [geyser-akka.actor.default-dispatcher-6] AngelOfTheAbyss - Unreachable member (Member(address = akka.tcp://[email protected]:7000, status = Up)|Size:4)
13:59:58.541 INFO [geyser-akka.actor.default-dispatcher-306] AngelOfTheAbyss - Unreachable member (Member(address = akka.tcp://[email protected]:7000, status = Up)|Size:3)
14:00:11.540 INFO [geyser-akka.actor.default-dispatcher-282] AngelOfTheAbyss - Member removed (Member(address = akka.tcp://[email protected]:7000, status = Removed)|Size:3)
14:00:11.541 INFO [geyser-akka.actor.default-dispatcher-282] AngelOfTheAbyss - Member removed (Member(address = akka.tcp://[email protected]:7000, status = Removed)|Size:3)
14:00:11.545 WARN [geyser-akka.remote.default-remote-dispatcher-8] Remoting - Association to [akka.tcp://[email protected]:7000] having UID [-477546934] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
14:00:11.546 WARN [geyser-akka.remote.default-remote-dispatcher-8] Remoting - Association to [akka.tcp://[email protected]:7000] having UID [-1471771858] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.

IP: 42
13:59:57.326 WARN [geyser-cluster-dispatcher-15] a.c.ClusterCoreDaemon - Cluster Node [akka.tcp://[email protected]:7000] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://[email protected]:7000, status = Up)]
13:59:57.328 INFO [geyser-akka.actor.default-dispatcher-46] AngelOfTheAbyss - Unreachable member (Member(address = akka.tcp://[email protected]:7000, status = Up)|Size:4)
14:00:07.345 INFO [geyser-cluster-dispatcher-15] Cluster(akka://geyser) - Cluster Node [akka.tcp://[email protected]:7000] - Leader is auto-downing unreachable node [akka.tcp://[email protected]:7000]
14:00:07.346 INFO [geyser-cluster-dispatcher-15] Cluster(akka://geyser) - Cluster Node [akka.tcp://[email protected]:7000] - Marking unreachable node [akka.tcp://[email protected]:7000] as [Down]
14:00:07.694 INFO [geyser-cluster-dispatcher-15] Cluster(akka://geyser) - Cluster Node [akka.tcp://[email protected]:7000] - Shutting down...
14:00:07.695 INFO [geyser-cluster-dispatcher-15] Cluster(akka://geyser) - Cluster Node [akka.tcp://[email protected]:7000] - Successfully shut down
14:00:07.703 WARN [geyser-akka.remote.default-remote-dispatcher-27] Remoting - Association to [akka.tcp://[email protected]:7000] having UID [-1471771858] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
14:00:10.360 WARN [geyser-akka.remote.default-remote-dispatcher-7] a.r.ReliableDeliverySupervisor - Association with remote system [akka.tcp://[email protected]:7000] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
14:00:11.361 WARN [geyser-akka.remote.default-remote-dispatcher-7] a.r.ReliableDeliverySupervisor - Association with remote system [akka.tcp://[email protected]:7000] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
14:00:11.544 WARN [geyser-akka.remote.default-remote-dispatcher-7] a.r.ReliableDeliverySupervisor - Association with remote system [akka.tcp://[email protected]:7000] has failed, address is now gated for [5000] ms. Reason: [Disassociated]

IP: 13
13:59:57.244 WARN [geyser-cluster-dispatcher-17] a.c.ClusterCoreDaemon - Cluster Node [akka.tcp://[email protected]:7000] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://[email protected]:7000, status = Up)]
13:59:57.245 INFO [geyser-akka.actor.default-dispatcher-61] AngelOfTheAbyss - Unreachable member (Member(address = akka.tcp://[email protected]:7000, status = Up)|Size:4)
13:59:57.326 INFO [geyser-cluster-dispatcher-15] Cluster(akka://geyser) - Cluster Node [akka.tcp://[email protected]:7000] - Ignoring received gossip status from unreachable [UniqueAddress(akka.tcp://[email protected]:7000,-477546934)]
14:00:07.711 WARN [geyser-akka.remote.default-remote-dispatcher-7] a.r.ReliableDeliverySupervisor - Association with remote system [akka.tcp://[email protected]:7000] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
14:00:09.243 INFO [geyser-cluster-dispatcher-17] Cluster(akka://geyser) - Cluster Node [akka.tcp://[email protected]:7000] - Shutting down...
14:00:09.246 INFO [geyser-cluster-dispatcher-17] Cluster(akka://geyser) - Cluster Node [akka.tcp://[email protected]:7000] - Successfully shut down
14:00:09.253 WARN [geyser-akka.remote.default-remote-dispatcher-7] Remoting - Association to [akka.tcp://[email protected]:7000] having UID [-477546934] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
14:00:10.361 WARN [geyser-akka.remote.default-remote-dispatcher-7] a.r.ReliableDeliverySupervisor - Association with remote system [akka.tcp://[email protected]:7000] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
14:00:10.394 ERROR [geyser-akka.remote.default-remote-dispatcher-26] a.r.EndpointWriter - AssociationError [akka.tcp://[email protected]:7000] <- [akka.tcp://[email protected]:7000]: Error [Invalid address: akka.tcp://[email protected]:7000] [
akka.remote.InvalidAssociation: Invalid address: akka.tcp://[email protected]:7000
Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.
]
14:00:10.394 WARN [geyser-akka.remote.default-remote-dispatcher-26] Remoting - Tried to associate with unreachable remote address [akka.tcp://[email protected]:7000]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.]
14:00:11.364 WARN [geyser-akka.remote.default-remote-dispatcher-7] a.r.ReliableDeliverySupervisor - Association with remote system [akka.tcp://[email protected]:7000] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
14:00:11.546 WARN [geyser-akka.remote.default-remote-dispatcher-26] a.r.ReliableDeliverySupervisor - Association with remote system [akka.tcp://[email protected]:7000] has failed, address is now gated for [5000] ms. Reason: [Disassociated]

IP: 46
13:59:57.358 INFO [geyser-akka.actor.default-dispatcher-2] AngelOfTheAbyss - Unreachable member (Member(address = akka.tcp://[email protected]:7000, status = Up)|Size:4)
13:59:58.329 INFO [geyser-akka.actor.default-dispatcher-7] AngelOfTheAbyss - Unreachable member (Member(address = akka.tcp://[email protected]:7000, status = Up)|Size:3)
14:00:07.372 INFO [geyser-cluster-dispatcher-21] Cluster(akka://geyser) - Cluster Node [akka.tcp://[email protected]:7000] - Leader is auto-downing unreachable node [akka.tcp://[email protected]:7000]
14:00:07.373 INFO [geyser-cluster-dispatcher-21] Cluster(akka://geyser) - Cluster Node [akka.tcp://[email protected]:7000] - Marking unreachable node [akka.tcp://[email protected]:7000] as [Down]
14:00:08.342 INFO [geyser-cluster-dispatcher-21] Cluster(akka://geyser) - Cluster Node [akka.tcp://[email protected]:7000] - Leader is auto-downing unreachable node [akka.tcp://[email protected]:7000]
14:00:08.342 INFO [geyser-cluster-dispatcher-21] Cluster(akka://geyser) - Cluster Node [akka.tcp://[email protected]:7000] - Marking unreachable node [akka.tcp://[email protected]:7000] as [Down]
14:00:10.352 INFO [geyser-cluster-dispatcher-21] Cluster(akka://geyser) - Cluster Node [akka.tcp://[email protected]:7000] - Leader is removing unreachable node [akka.tcp://[email protected]:7000]
14:00:10.353 INFO [geyser-cluster-dispatcher-21] Cluster(akka://geyser) - Cluster Node [akka.tcp://[email protected]:7000] - Leader is removing unreachable node [akka.tcp://[email protected]:7000]
14:00:10.353 INFO [geyser-akka.actor.default-dispatcher-2] AngelOfTheAbyss - Member removed (Member(address = akka.tcp://[email protected]:7000, status = Removed)|Size:3)
14:00:10.353 INFO [geyser-akka.actor.default-dispatcher-2] AngelOfTheAbyss - Member removed (Member(address = akka.tcp://[email protected]:7000, status = Removed)|Size:3)
14:00:10.353 INFO [geyser-akka.actor.default-dispatcher-5] a.c.p.ClusterSingletonManager - Member removed [akka.tcp://[email protected]:7000]
14:00:10.354 INFO [geyser-akka.actor.default-dispatcher-5] a.c.p.ClusterSingletonManager - Member removed [akka.tcp://[email protected]:7000]
14:00:10.356 WARN [geyser-akka.remote.default-remote-dispatcher-9] Remoting - Association to [akka.tcp://[email protected]:7000] having UID [-477546934] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
14:00:10.356 WARN [geyser-akka.remote.default-remote-dispatcher-9] Remoting - Association to [akka.tcp://[email protected]:7000] having UID [-1471771858] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
14:00:10.385 WARN [geyser-akka.remote.default-remote-dispatcher-10] a.r.EndpointWriter - AssociationError [akka.tcp://[email protected]:7000] -> [akka.tcp://[email protected]:7000]: Error [Invalid address: akka.tcp://[email protected]:7000] [
akka.remote.InvalidAssociation: Invalid address: akka.tcp://[email protected]:7000
Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system has a UID that has been quarantined. Association aborted.
]
14:00:10.386 INFO [geyser-akka.remote.default-remote-dispatcher-27] Remoting - Quarantined address [akka.tcp://[email protected]:7000] is still unreachable or has not been restarted. Keeping it quarantined.

IP: 139
13:59:57.544 INFO [geyser-akka.actor.default-dispatcher-187] AngelOfTheAbyss - Unreachable member (Member(address = akka.tcp://[email protected]:7000, status = Up)|Size:4)
13:59:58.359 INFO [geyser-akka.actor.default-dispatcher-178] AngelOfTheAbyss - Unreachable member (Member(address = akka.tcp://[email protected]:7000, status = Up)|Size:3)
14:00:11.358 INFO [geyser-akka.actor.default-dispatcher-32] AngelOfTheAbyss - Member removed (Member(address = akka.tcp://[email protected]:7000, status = Removed)|Size:3)
14:00:11.359 INFO [geyser-akka.actor.default-dispatcher-32] AngelOfTheAbyss - Member removed (Member(address = akka.tcp://[email protected]:7000, status = Removed)|Size:3)
14:00:11.361 WARN [geyser-akka.remote.default-remote-dispatcher-27] Remoting - Association to [akka.tcp://[email protected]:7000] having UID [-477546934] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
14:00:11.361 WARN [geyser-akka.remote.default-remote-dispatcher-27] Remoting - Association to [akka.tcp://[email protected]:7000] having UID [-1471771858] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.

Is there anything abnormal in the logs?

Regards,
Ben

On Wednesday, March 23, 2016 at 9:33:02 AM UTC-4, Benjamin Black wrote:
>
> I look forward to trying out the new version. Not totally sure it is the
> same issue: I'm seeing this happen on a cluster where no node is being
> restarted. I shall continue to investigate what has changed on my side,
> because I wasn't seeing this before I upgraded other libraries.
>
> On Wednesday, March 23, 2016 at 2:08:10 AM UTC-4, Patrik Nordwall wrote:
>>
>> We have fixed the issue that is noticed as
>> "Error encountered while processing system message acknowledgement
>> buffer: [-1 {}] ack: ACK[6, {}]"
>>
>> https://github.com/akka/akka/pull/20093
>>
>> It will be released in 2.4.3 and 2.3.15, probably by end of next week.
>>
>> /Patrik
>>
>> On Tue, 22 March 2016 at 23:39, Guido Medina <[email protected]> wrote:
>>
>>> Yeah sorry I thought it was related with rolling restart.
>>>
>>> As for Netty, I'm using a *non-published yet* Netty with the following
>>> fixes:
>>>
>>> https://github.com/netty/netty/issues?q=milestone%3A3.10.6.Final+is%3Aclosed
>>>
>>> You can just get it from Git and:
>>>
>>> $ git checkout 3.10
>>> $ mvn versions:set -DnewVersion=3.10.6.Final -DgenerateBackupPoms=false
>>> $ mvn clean install
>>>
>>> And see if your problem goes away,
>>>
>>> Guido.
>>>
>>> On Tuesday, March 22, 2016 at 10:27:26 PM UTC, Benjamin Black wrote:
>>>>
>>>> Hi Guido, yes I'm aware of the leaving cluster conversation as I
>>>> started it :-) This is a separate issue. I am observing this behavior whilst
>>>> the cluster seems stable with no nodes being added/removed. I suspect that
>>>> this issue was first observed when I upgraded a different library that
>>>> brought in a new version of the netty library.
>>>>
>>>> On Tuesday, March 22, 2016 at 6:23:14 PM UTC-4, Guido Medina wrote:
>>>>>
>>>>> Hi Benjamin,
>>>>>
>>>>> You have nodes with predefined ports, one thing I have which
>>>>> eliminates that problem for these nodes is that
>>>>> only my seed node(s) have the port set, the rest will just get a
>>>>> dynamic and available port, making it get a different port when you
>>>>> do a rolling restart.
>>>>>
>>>>> I suspect you are doing a rolling restart right? so you need to wait
>>>>> for that node with that address to completely leave the cluster (I'm also
>>>>> doing that),
>>>>> basically you terminate your system when you receive the message
>>>>> *MemberRemoved* for *_self_* address.
>>>>>
>>>>> I think I saw a discussion related to quarantine nodes when they are
>>>>> re-joining using the same address, not sure if here or if it is an actual
>>>>> Git ticket.
>>>>>
>>>>> HTH,
>>>>>
>>>>> Guido.
>>>>>
>>>> --
>>> >>>>>>>>>> Read the docs: http://akka.io/docs/
>>> >>>>>>>>>> Check the FAQ:
>>> http://doc.akka.io/docs/akka/current/additional/faq.html
>>> >>>>>>>>>> Search the archives:
>>> https://groups.google.com/group/akka-user
>>> ---
>>> You received this message because you are subscribed to the Google
>>> Groups "Akka User List" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/akka-user.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
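
For anyone reading this thread later, Guido's suggestion quoted above (terminate the actor system once the node sees *MemberRemoved* for its own address) could look roughly like the following. This is only a minimal sketch, assuming the Akka 2.3 Scala API; the actor name `SelfRemovalWatcher` is hypothetical and does not come from this thread, and whether a node reliably receives the event for itself in every failure mode (for example after being quarantined) is not something this sketch establishes.

```scala
import akka.actor.{Actor, ActorLogging}
import akka.cluster.Cluster
import akka.cluster.ClusterEvent.{InitialStateAsEvents, MemberRemoved}

// Hypothetical helper actor (name is an assumption, not from the thread):
// shuts the local ActorSystem down once this node itself has been removed
// from the cluster, so a restarted process does not rejoin too early.
class SelfRemovalWatcher extends Actor with ActorLogging {
  private val cluster = Cluster(context.system)

  // Subscribe only to MemberRemoved events.
  override def preStart(): Unit =
    cluster.subscribe(self, InitialStateAsEvents, classOf[MemberRemoved])

  override def postStop(): Unit =
    cluster.unsubscribe(self)

  def receive = {
    case MemberRemoved(member, _) if member.address == cluster.selfAddress =>
      log.info("This node was removed from the cluster, shutting down")
      // Akka 2.3 API; in Akka 2.4 this method is deprecated in favour of terminate().
      context.system.shutdown()

    case _: MemberRemoved =>
      // Another member was removed; nothing to do here.
  }
}
```

It would be started once per node with something like `system.actorOf(Props[SelfRemovalWatcher], "self-removal-watcher")`, after which the process can exit and be restarted by whatever supervises it.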
