In the meantime, I have looked at the logs a bit more. If I run my test with 3 nodes and then kill 1 node, the ClusterClient seems to always recover correctly, i.e. it continues using the 2 surviving nodes and there are no timeouts. If I start with 4 nodes and then kill 1 node, it nearly always causes problems. The logs of the 3 surviving nodes show the following error repeating forever:

10:27:35.862 [VolArbService-akka.actor.default-dispatcher-23] ERROR akka.remote.EndpointWriter - AssociationError [akka.tcp://[email protected]:2552] -> [akka.tcp://[email protected]:51801]: Error [Association failed with [akka.tcp://[email protected]:51801]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://[email protected]:51801]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: /127.0.0.1:51801
]

where port 51801 is the killed node. Despite the fact that the error continues forever, I can join 2 new nodes to the 3 surviving nodes, and the new nodes correctly report the new cluster size as 4 and 5 using Cluster.readView().members().size(). In other words, it seems the killed node was correctly downed, and yet the association errors never cease.

T.

On Wednesday, February 5, 2014 9:40:41 PM UTC+1, Björn Antonsson wrote:
>
> Hi,
>
> On 5 February 2014 at 16:42:48, Tycho Lamerigts ([email protected]) wrote:
>
> I have a client that fires many requests to my cluster using
> ClusterClient's Send(), one request at a time, with a 1 second timeout
> waiting for a response. While all the cluster nodes are up, the requests
> are correctly (randomly) distributed across the nodes and promptly receive
> a response, no timeouts. If I kill a cluster node, then I expect one or two
> of the requests to time out because they end up being sent to the now-dead
> node. After that, I expect ClusterClient to realize that the node has died
> and I expect to no longer get timeouts (the workload can easily be handled
> by the remaining nodes). Sometimes it works. Unfortunately, more often than
> not it doesn't work and I continue getting timed out requests until I
> restart every node in the cluster and the client.
>
> Any idea what causes this behavior?
>
>
> Which version of Akka are you using? Is the failing node correctly downed?
> Have you enabled any debug logging to diagnose the behavior?
>
> B/
>
>
> My ClusterClient is initialized with two receptionist addresses. My
> cluster actually has more than 2 nodes and each node has a receptionist
> with a registered destination actor. I tried playing with
> contrib.cluster.receptionist.number-of-contacts but it did not seem to make
> any difference.
> --
> >>>>>>>>>> Read the docs: http://akka.io/docs/
> >>>>>>>>>> Check the FAQ: http://akka.io/faq/
> >>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
> ---
> You received this message because you are subscribed to the Google Groups
> "Akka User List" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/akka-user.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>
> --
> Björn Antonsson
> Typesafe <http://typesafe.com/> – Reactive Apps on the JVM
> twitter: @bantonsson <http://twitter.com/#!/bantonsson>
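
For reference, here is a minimal configuration sketch covering the two settings discussed in this thread: the debug logging asked about and the number-of-contacts setting from the original question. It is only a sketch under assumptions. The "akka." prefix on the receptionist key, the choice of logging switches, and the value 3 are not confirmed anywhere in this thread, so check them against the reference.conf of the Akka version in use.

```
# Sketch only: verify every key against the reference.conf of your Akka version.
akka {
  # Log at DEBUG level so cluster and remoting events show up in the node logs.
  loglevel = "DEBUG"

  remote {
    # Log association and disassociation events of remote endpoints.
    log-remote-lifecycle-events = on
  }

  contrib.cluster.receptionist {
    # Setting mentioned in the thread; the "akka." prefix is assumed here.
    # The value 3 is illustrative only, not a recommendation.
    number-of-contacts = 3
  }
}
```

With these enabled on every node, the logs around the moment the node is killed may show whether the surviving nodes actually mark it as unreachable and then down it, which is the question raised above.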
