Hmm. I'm not sufficiently expert in large-cluster behavior to guess about the problem, but note that you should *never* use auto-down-unreachable-after in production code. (I actually don't even recommend it in test code.) While I don't *think* it causes the problem you're describing, it can cause much more severe "split-brain" issues that can lead to data corruption. You're going to need to come up with a more nuanced approach to the problem of downing; I recommend reading the documentation sections on Downing <http://doc.akka.io/docs/akka/2.4.14/scala/cluster-usage.html#Downing> and Split Brain <http://doc.akka.io/docs/akka/akka-commercial-addons-1.0/scala/split-brain-resolver.html> -- it's important to get this stuff right to have a stable environment.
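As a minimal sketch of that advice (not part of the original configuration), the cluster section below has automatic downing switched off. `auto-down-unreachable-after = off` is the stock Akka setting. The commented-out Split Brain Resolver lines are an assumption based on the commercial add-on documentation linked above: they need Akka 2.4+ and the add-on on the classpath, and the choice of `keep-majority` is only illustrative, so check the exact keys against that page before using them.

```
akka.cluster {
  # Never let the leader auto-down unreachable members; with a timeout set,
  # both sides of a network partition can down each other and split the cluster.
  auto-down-unreachable-after = off

  # Assumption: requires the commercial Split Brain Resolver add-on (Akka 2.4+).
  # downing-provider-class = "com.lightbend.akka.sbr.SplitBrainResolverProvider"
  # split-brain-resolver.active-strategy = keep-majority
}
```

With auto-downing off, unreachable members stay in the cluster until something downs them explicitly, either an operator (JMX or the HTTP management tooling) or a downing strategy such as the resolver above.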
On Fri, Dec 9, 2016 at 3:44 PM, Tyler Brummett <[email protected]> wrote:
> Hey Akka experts, I need your help! Currently my company is using Akka as
> a part of a partial CQRS pattern. We have service adapters that consume
> source system events in the form of JMS messages, while producing commands
> to be asynchronously distributed to our command service. Our command
> service consumes all of these messages asynchronously based on a given
> group ID, so that no two commands with the same group ID are being
> processed at the same time.
>
> We have designed an approach that allows us to have each deployable
> component in its own cluster and use a clusterClient to talk across
> clusters. Below is another diagram illustrating the service architecture
> with the Akka configuration reflecting separate clusters.
>
> [diagram]
> (see attached please)
>
> Errors we are seeing on appbox01: UI sends commands to command service
> 11/11/2016 09:48:46,056 INFO [AppClusterSystem-akka.actor.default-dispatcher-29] CommandHandlerActor - received master ack.
> 11/11/2016 09:48:52,045 INFO [AppClusterSystem-akka.actor.default-dispatcher-35] CommandHandlerActor work timeout. For commandX
> 11/11/2016 09:48:52,046 ERROR [tomcat-http--33] AppController - X update failed
> com.company.appA.package.AkkaWorkFailedException: Timeout for X
>
> Errors we are seeing on servicebox01: UI sends commands to command service
> 11/11/2016 09:48:46,715 WARN [CommandClusterSystem-akka.actor.default-dispatcher-2] ClusterStatusListenerActor - Problem has occurred associating local host: servicebox01.company.com and remote host: appbox01.company.com
> 11/11/2016 09:48:46,716 WARN [CommandClusterSystem-akka.actor.default-dispatcher-2] ClusterStatusListenerActor - Problem has occurred associating local host: servicebox01.company.com and remote host: appbox01.company.com
> 11/11/2016 09:48:46,716 WARN [CommandClusterSystem-akka.actor.default-dispatcher-2] Remoting - Tried to associate with unreachable remote address [akka.tcp://[email protected]:12345]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.]
>
> We are interested in seeing this new implementation through and finding
> solutions where we can decouple our services and apps from one another as
> we move towards a micro-service architecture. So if you have
> suggestions/solutions, we are all ears!
>
> So the main question is: why are our nodes being quarantined? We have
> restarted nodes and stabilized the environment over and over, but the
> quarantine problem resurfaces after a few hours. Typically it's in a bad
> state by the next day. As part of this post I have provided our typical
> application.conf file for a given service, which corresponds with our new
> "separate cluster" implementation (diagram). Hopefully someone out there
> can help us shed some light on this problem. Please see the
> application.conf below.
>
> Thanks!
>
> =====================
> application.conf
> =====================
>
> # bulkhead workers
> my-worker-exec-dispatcher {
>   type = Dispatcher
>   executor = "fork-join-executor"
>   fork-join-executor {
>     parallelism-min = 2
>     parallelism-factor = 2.0
>     parallelism-max = 10
>   }
>   throughput = 1
> }
>
> # dedicate resources to the master actor
> my-master-dispatcher {
>   type = Dispatcher
>   executor = "fork-join-executor"
>   fork-join-executor {
>     parallelism-min = 2
>     parallelism-factor = 2.0
>     parallelism-max = 10
>   }
>   throughput = 20
> }
>
> akka {
>   loggers = ["akka.event.slf4j.Slf4jLogger"]
>   loglevel = "INFO"
>   stdout-loglevel = "OFF"
>
>   actor.provider = "akka.cluster.ClusterActorRefProvider"
>
>   # Log the complete configuration at INFO level when the actor system is started.
>   # This is useful when you are uncertain of what configuration is used.
>   log-config-on-start = off
>
>   remote {
>     log-remote-lifecycle-events = off
>
>     # If this is "on", Akka will log all outbound messages at DEBUG level,
>     # if off then they are not logged
>     log-sent-messages = off
>     # If this is "on", Akka will log all inbound messages at DEBUG level,
>     # if off then they are not logged
>     log-received-messages = off
>     netty.tcp {
>       # hostname is injected programmatically in AppConfiguration.
>       port = ${akka.node.port}
>       send-buffer-size = 10240000b
>       receive-buffer-size = 10240000b
>       maximum-frame-size = 5120000b
>     }
>   }
>
>   contrib {
>     cluster {
>       pub-sub {
>         # How often the DistributedPubSubMediator should send out gossip information
>         gossip-interval = 5s
>       }
>     }
>   }
>
>   cluster {
>     # seed-nodes is injected programmatically
>     # seed-nodes = [${akka.seed.nodes}]
>     # 30 minute auto down for a crashed master
>     # a long network outage requires restarting the cluster after 30 minutes
>     auto-down-unreachable-after = 1800s
>     roles = [${akka.cluster.roles}]
>   }
>
>   actor {
>     bounded-mailbox {
>       mailbox-type = "akka.dispatch.BoundedMailbox"
>       mailbox-capacity = 3000
>       mailbox-push-timeout-time = 100ms
>     }
>
>     debug {
>       # enable function of LoggingReceive, which is to log any received message at
>       # DEBUG level
>       receive = off
>       # enable DEBUG logging of all AutoReceiveMessages (Kill, PoisonPill et.c.)
>       autoreceive = off
>       # enable DEBUG logging of actor lifecycle changes
>       lifecycle = off
>       # enable DEBUG logging of all LoggingFSMs for events, transitions and timers
>       fsm = off
>       # enable DEBUG logging of subscription changes on the eventStream
>       event-stream = off
>     }
>   }
> }
>
> akka.extensions = ["akka.contrib.pattern.ClusterReceptionistExtension"]
>
> akka.contrib.cluster.receptionist {
>   name = receptionist
>   number-of-contacts = 3
>   response-tunnel-receive-timeout = 30s
> }
>
> akka.cluster.client {
>   heartbeat-interval = 2s
>   acceptable-heartbeat-pause = 10s
>   buffer = 0
> }
>
> --
> >>>>>>>>>> Read the docs: http://akka.io/docs/
> >>>>>>>>>> Check the FAQ: http://doc.akka.io/docs/akka/current/additional/faq.html
> >>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
> ---
> You received this message because you are subscribed to the Google Groups
> "Akka User List" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/akka-user.
> For more options, visit https://groups.google.com/d/optout.
