Hi Chris,

you can subscribe to the EventStream <http://doc.akka.io/docs/akka/2.3.7/scala/event-bus.html#event-stream> for QuarantinedEvent, which is published when a remote system gets quarantined <https://github.com/akka/akka/blob/6e0d2a5522324f26a0af6332628af25257d00698/akka-remote/src/main/scala/akka/remote/Remoting.scala#L464>.
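
A minimal sketch of such a subscription, assuming Akka 2.3.x with akka-remote on the classpath. The QuarantineListener name, the actor name and the onQuarantined callback are made up for illustration and are not part of Akka; only eventStream.subscribe and QuarantinedEvent come from the library.

```scala
import akka.actor.{Actor, ActorLogging, ActorRef, ActorSystem, Address, Props}
import akka.remote.QuarantinedEvent

// Hypothetical listener actor. `onQuarantined` is an illustrative callback
// (an assumption, not an Akka API) that lets the application react.
class QuarantineListener(onQuarantined: Address => Unit) extends Actor with ActorLogging {
  def receive = {
    case event: QuarantinedEvent =>
      // `address` identifies the remote system that was quarantined.
      log.warning("Remote system {} has been quarantined", event.address)
      onQuarantined(event.address)
  }
}

object QuarantineListener {
  // Creates the listener and subscribes it to QuarantinedEvent
  // on the EventStream of the given actor system.
  def subscribe(system: ActorSystem)(onQuarantined: Address => Unit): ActorRef = {
    val listener =
      system.actorOf(Props(new QuarantineListener(onQuarantined)), "quarantine-listener")
    system.eventStream.subscribe(listener, classOf[QuarantinedEvent])
    listener
  }
}
```

You could call it once at startup, for example QuarantineListener.subscribe(system)(address => ...), and use the callback to tell your canary that a restart is needed rather than waiting for the Identify timeout.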
You can also increase the acceptable-heartbeat-pause of the transport failure detector on the local system so that it is marginally longer than the longest GC pause you get on the remote system. That should prevent the quarantines.

On Wed, Nov 12, 2014 at 4:44 PM, oxbow_lakes <[email protected]> wrote:

> Hi -
>
> I'm being bitten by a client/server application where multiple clients
> subscribe for updates from a server where, upon a long GC pause, the
> clients are being quarantined. Here are some client logs for an attempt to
> rediscover the server actor using "server ? Identify", which times out. I
> can see that this is because the client has quarantined the server.
>
> 24-Oct-2014 11:30:28:310: [(akka)Remoting - WARNING] [31]: Tried to
> associate with unreachable remote address [akka.tcp://
> [email protected]:35411]. Address is now gated
> for 5000 ms, all messages to this address will be delivered to dead
> letters. Reason: The remote system has quarantined this system. No further
> associations to the remote system are possible until this system is
> restarted.
>
> I can guess (it's not a guess, the times line up perfectly) in the server
> when the original disconnect happened:
>
> 1.611: [Full GC (Metadata GC Threshold) 87M->14M(52M), 0.1731620 secs]
> 7.622: [Full GC (Metadata GC Threshold) 175M->85M(220M), 0.4527592 secs]
> 8658.035: [Full GC (Metadata GC Threshold) 3638M->419M(4192M), 1.8916884
> secs]
> 244257.679: [Full GC (Allocation Failure) 6516M->2622M(14G), 9.2884735
> secs]
> *391857.856: [Full GC (Allocation Failure) 7390M->3758M(12G), 13.7533193
> secs]*
>
> the server's akka logs state the following happened at this time (the
> server quarantines my client):
>
> [WARN] [10/24/2014 10:57:22.416]
> [gekkoRemoting-akka.remote.default-remote-dispatcher-7] [akka.tcp://
> [email protected]:35411/system/remote-watcher]
> Detected unreachable: [akka.tcp://[email protected]:60091]
> [WARN] [10/24/2014 10:57:22.416]
> [gekkoRemoting-akka.remote.default-remote-dispatcher-11015] [Remoting]
> Association to [akka.tcp://[email protected]:60091] having UID
> [1196983173] is irrecoverably failed. UID is now quarantined and all
> messages to this UID will be delivered to dead letters. Remote actorsystem
> must be restarted to recover from this situation.
>
> My question is "how on earth do I code against this?" - currently I have a
> "canary" which creates the client connection. The client connection picks
> up the Terminated message from the server and stops itself; the canary
> picks this Termination up and spools up (after a pause for a few minutes) a
> new client to attempt to connect to the server. Except the new client
> cannot connect to the server because it has been quarantined. How is my
> client supposed to know this? There's no "you've been quarantined"
> callback, just a timeout looking up a server. Do I need to just assume that
> the failure to lookup a server might indicate a quarantine?
>
> here's the server's remote configuration:
>
>   remote {
>     log-remote-lifecycle-events = on
>     retry-gate-closed-for = 5 s
>     enabled-transports = ["akka.remote.netty.tcp"]
>     netty.tcp {
>       maximum-frame-size = 100 MiB
>     }
>     watch-failure-detector {
>       acceptable-heartbeat-pause = 20 s
>       heartbeat-interval = 5 s
>     }
>     transport-failure-detector {
>       acceptable-heartbeat-pause = 10 s
>       heartbeat-interval = 3 s
>     }
>   }
>
> here's the client's remote configuration:
>
>   remote {
>     log-remote-lifecycle-events = on
>     gate-invalid-addresses-for = 5 s
>
>     enabled-transports = ["akka.remote.netty.tcp"]
>     netty.tcp {
>       port = 0
>       maximum-frame-size = 100 MiB
>     }
>     watch-failure-detector {
>       acceptable-heartbeat-pause = 20 s
>       heartbeat-interval = 5 s
>     }
>     transport-failure-detector {
>       acceptable-heartbeat-pause = 12 s
>       heartbeat-interval = 3 s
>     }
>   }
>
> Chris
>
> --
> >>>>>>>>>> Read the docs: http://akka.io/docs/
> >>>>>>>>>> Check the FAQ:
> http://doc.akka.io/docs/akka/current/additional/faq.html
> >>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
> ---
> You received this message because you are subscribed to the Google Groups
> "Akka User List" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/akka-user.
> For more options, visit https://groups.google.com/d/optout.
>

--
Martynas Mickevičius
Typesafe <http://typesafe.com/> – Reactive <http://www.reactivemanifesto.org/> Apps on the JVM
