Here is the log.

2015-05-05 09:51:15,029 WARN [channelservice-akka.actor.default-dispatcher-4] Association with remote system [akka.tcp://system@host2] has failed, address is now gated for [3000] ms. Reason: [Disassociated]
2015-05-05 09:51:17,697 WARN [channelservice-akka.actor.default-dispatcher-54] Detected unreachable: [akka.tcp://system@host2]
2015-05-05 09:51:17,699 WARN [channelservice-akka.actor.default-dispatcher-56] Association to [akka.tcp://system@host2] having UID [-648515237] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
2015-05-05 09:51:17,731 WARN [channelservice-akka.actor.default-dispatcher-3] AssociationError [akka.tcp://system@host1] -> [akka.tcp://system@host2]: Error [Invalid address: akka.tcp://system@host2] [
akka.remote.InvalidAssociation: Invalid address: akka.tcp://system@host2
Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system has a UID that has been quarantined. Association aborted.
]

The Gated event usually happens first.

There is no backpressure right now. Don't heartbeats have higher priority than normal remote messages? We did see ResendBufferCapacityReachedException before, and we increased the buffer size then. Does this mean the receiver is overwhelmed?

GC should not be a problem in this case. We have been monitoring the GC overhead.

I am about ready to remove the remote watch code to rule out the Quarantine event entirely.

On Tuesday, May 5, 2015 at 12:27:17 PM UTC-7, drewhk wrote:
>
> Also, are you sure that you are backpressuring the sender properly and not
> overwhelming remoting itself? If remoting is building up buffer size due to
> it not being able to send messages fast enough, then heartbeats can get
> delayed arbitrarily long (although we take some measures to mitigate that).
>
> You can also try increasing the dispatcher thread pool size for remoting
> and Netty.
>
> You should also look into GC activity, since you mentioned that you see
> this under load.
> Many cases similar to yours turn out to be caused by actor
> mailbox buildup (lack of backpressure) and resulting high GC pauses.
>
> We can give much deeper help with access to source, but that is a
> commercial service.
>
> -Endre
>
> On Tue, May 5, 2015 at 9:13 PM, Endre Varga <[email protected]> wrote:
>
>> What is the actual log message when the quarantine happens? Can you show
>> snippets of your logs around the quarantine event? Can it be that your
>> system message redelivery buffer gets filled because of Terminated messages?
>>
>> Without seeing a log snippet it is impossible to say anything more
>> concrete.
>>
>> -Endre
>>
>> On Tue, May 5, 2015 at 9:11 PM, Zhuchen Wang <[email protected]> wrote:
>>
>>> Upgrading to akka 2.3.10 doesn't help a lot.
>>>
>>> As I mentioned in
>>> https://groups.google.com/forum/#!topic/akka-user/NGLi9GTZ42o, we do
>>> not actually rely on akka to form the cluster.
>>>
>>> We use Zookeeper to do cluster management and partition allocation but
>>> use akka-remote to communicate between nodes.
>>>
>>> Let's say we have node1, node2, node3 and partition P contains (node1
>>> and node2)
>>>
>>> Each node has a partitionManager actor.
>>>
>>> In node1
>>> partitionManager will have a child actor
>>> akka://node1/actorsystem/partitionManager/P and an ActorSelectionRoutee for
>>> akka://node2/actorsystem/partitionManager/P
>>>
>>> In node2
>>> partitionManager will have a child actor
>>> akka://node2/actorsystem/partitionManager/P and an ActorSelectionRoutee for
>>> akka://node1/actorsystem/partitionManager/P
>>>
>>> In node3
>>> partitionManager will have 2 ActorSelectionRoutees for
>>> akka://node1/actorsystem/partitionManager/P and
>>> akka://node2/actorsystem/partitionManager/P
>>>
>>> All the actors are started locally thus no remote deployment is involved.
>>>
>>> Channels can be created under a partition, and the channel actor is
>>> replicated under all partition actors.
>>>
>>> For example chnl1
>>>
>>> There will be akka://node1/actorsystem/partitionManager/P/chnl1 and
>>> akka://node2/actorsystem/partitionManager/P/chnl1 created in node1 and node2
>>>
>>> Now subscribers can subscribe to the channel. If the subscribers come to
>>> node1 and node2, no remoting is involved.
>>>
>>> If subscribers come to node3, the partitionManager will pick one
>>> ActorSelectionRoutee to forward the subscription.
>>>
>>> In this case we have remote death watch involved.
>>>
>>> akka://node3/actorsystem/subA *watches*
>>> akka://node1/actorsystem/partitionManager/P/chnl1 and vice versa: if
>>> the channel actor dies, the subscribers can be notified and
>>> re-subscribe to another partition member, and in the graceful stop case
>>> the channel actor needs to wait for all subscribers to terminate and then
>>> stop itself.
>>>
>>> Now the main logic is creating channel, subscribing to channel,
>>> publishing to channel and stopping channel.
>>>
>>> In this use case, we get the Quarantined event almost daily.
>>>
>>> And our settings for the failure detector are
>>>
>>> watch-failure-detector {
>>>   heartbeat-interval = 10s
>>>   acceptable-heartbeat-pause = 30s
>>>   min-std-deviation = 200ms
>>>   threshold = 12.0
>>> }
>>>
>>> Thanks,
>>>
>>> --
>>> >>>>>>>>>> Read the docs: http://akka.io/docs/
>>> >>>>>>>>>> Check the FAQ:
>>> http://doc.akka.io/docs/akka/current/additional/faq.html
>>> >>>>>>>>>> Search the archives:
>>> https://groups.google.com/group/akka-user
>>> ---
>>> You received this message because you are subscribed to the Google
>>> Groups "Akka User List" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at http://groups.google.com/group/akka-user.
>>> For more options, visit https://groups.google.com/d/optout.
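
For reference, here is a rough sketch of where the settings mentioned in this thread would live in application.conf. It assumes the Akka 2.3.x key names, so please check it against the reference.conf of the akka-remote version actually in use. Only the watch-failure-detector values come from the thread, and retry-gate-closed-for is inferred from the "gated for [3000] ms" log line. The buffer and thread pool numbers are placeholders for illustration, not recommendations.

```
akka.remote {
  # Values quoted in the thread
  watch-failure-detector {
    heartbeat-interval = 10s
    acceptable-heartbeat-pause = 30s
    min-std-deviation = 200ms
    threshold = 12.0
  }

  # Inferred from the "address is now gated for [3000] ms" log line
  retry-gate-closed-for = 3s

  # System message redelivery buffer; assumed to be the buffer behind
  # ResendBufferCapacityReachedException. Placeholder value.
  system-message-buffer-size = 2000

  # Dispatcher used by remoting (assumed 2.3.x key). Placeholder sizes.
  default-remote-dispatcher.fork-join-executor {
    parallelism-min = 4
    parallelism-max = 4
  }

  # Netty worker pools for the TCP transport. Placeholder sizes.
  netty.tcp {
    server-socket-worker-pool.pool-size-max = 4
    client-socket-worker-pool.pool-size-max = 4
  }
}
```

Larger buffers and pools only buy headroom. If senders are not backpressured, the buildup described above can still delay heartbeats.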
