I tried making the scenario a bit simpler:

- start 4 nodes
- start client, observe no timeouts
- kill 1 node
- observe node is downed
- observe timeouts on client continue indefinitely
- kill client
- start new client
- observe timeouts on newly started client continue indefinitely

The fact that I start a new client means that ClusterClient state longer 
plays any role. So the problem seems to be with the state of Receptionist 
on the server-side. I have copied the bits of the logs that I thought most 
relevant below, they are easily correlated by timestamp.

NODE1 (seed, port 2551)
-----------------
10:51:01.315 [VolArbService-akka.actor.default-dispatcher-13] DEBUG 
akka.cluster.ClusterHeartbeatSender - Cluster Node 
[akka.tcp://[email protected]:2551] - Heartbeat to 
[akka.tcp://[email protected]:2552]
10:51:01.315 [VolArbService-akka.actor.default-dispatcher-13] DEBUG 
akka.cluster.ClusterHeartbeatSender - Cluster Node 
[akka.tcp://[email protected]:2551] - Heartbeat to 
[akka.tcp://[email protected]:37209]
10:51:01.315 [VolArbService-akka.actor.default-dispatcher-13] DEBUG 
akka.cluster.ClusterHeartbeatSender - Cluster Node 
[akka.tcp://[email protected]:2551] - Heartbeat to 
[akka.tcp://[email protected]:34165]
...
10:51:17.666 [VolArbService-akka.actor.default-dispatcher-15] DEBUG 
akka.remote.EndpointWriter - Disassociated 
[akka.tcp://[email protected]:2551] -> 
[akka.tcp://[email protected]:34165]
10:51:17.666 [VolArbService-akka.actor.default-dispatcher-15] INFO  
a.r.RemoteActorRefProvider$RemoteDeadLetterActorRef - Message 
[com.imc.medusa.server.messages.Update] from 
Actor[akka://VolArbService/user/receptionist/akka.tcp%3A%2F%2FABC%40127.0.0.1%3A48362%2Ftemp%2F%24Ed#-1362311011]
 
to Actor[akka://VolArbService/deadLetters] was not delivered. [6] dead 
letters encountered. This logging can be turned off or adjusted with 
configuration settings 'akka.log-dead-letters' and 
'akka.log-dead-letters-during-shutdown'.
...
10:51:22.279 [VolArbService-akka.actor.default-dispatcher-26] INFO  
Cluster(akka://VolArbService) - Cluster Node 
[akka.tcp://[email protected]:2551] - Leader is marking unreachable 
node [akka.tcp://[email protected]:34165] as [Down]
...
10:51:22.315 [VolArbService-akka.actor.default-dispatcher-14] DEBUG 
akka.cluster.ClusterHeartbeatSender - Cluster Node 
[akka.tcp://[email protected]:2551] - Heartbeat to 
[akka.tcp://[email protected]:2552]
10:51:22.315 [VolArbService-akka.actor.default-dispatcher-14] DEBUG 
akka.cluster.ClusterHeartbeatSender - Cluster Node 
[akka.tcp://[email protected]:2551] - Heartbeat to 
[akka.tcp://[email protected]:37209]
10:51:22.315 [VolArbService-akka.actor.default-dispatcher-14] DEBUG 
akka.cluster.ClusterHeartbeatSender - Cluster Node 
[akka.tcp://[email protected]:2551] - EndHeartbeat to 
[akka.tcp://[email protected]:34165]
..
10:51:24.276 [VolArbService-akka.actor.default-dispatcher-24] INFO  
Cluster(akka://VolArbService) - Cluster Node 
[akka.tcp://[email protected]:2551] - Leader is removing unreachable 
node [akka.tcp://[email protected]:34165]
10:51:24.277 [VolArbService-akka.actor.default-dispatcher-29] INFO  
Remoting - Address [akka.tcp://[email protected]:34165] is now 
quarantined, all messages to this address will be delivered to dead letters.

NODE2 (seed, port 2552)
-----------------
10:51:00.860 [VolArbService-akka.actor.default-dispatcher-16] DEBUG 
akka.cluster.ClusterHeartbeatSender - Cluster Node 
[akka.tcp://[email protected]:2552] - Heartbeat to 
[akka.tcp://[email protected]:2551]
10:51:00.861 [VolArbService-akka.actor.default-dispatcher-16] DEBUG 
akka.cluster.ClusterHeartbeatSender - Cluster Node 
[akka.tcp://[email protected]:2552] - Heartbeat to 
[akka.tcp://[email protected]:37209]
10:51:00.861 [VolArbService-akka.actor.default-dispatcher-16] DEBUG 
akka.cluster.ClusterHeartbeatSender - Cluster Node 
[akka.tcp://[email protected]:2552] - Heartbeat to 
[akka.tcp://[email protected]:34165]
...
10:51:17.729 [VolArbService-akka.actor.default-dispatcher-2] DEBUG 
akka.remote.EndpointWriter - Disassociated 
[akka.tcp://[email protected]:2552] -> 
[akka.tcp://[email protected]:34165]
10:51:17.730 [VolArbService-akka.actor.default-dispatcher-2] INFO  
a.r.RemoteActorRefProvider$RemoteDeadLetterActorRef - Message 
[com.imc.medusa.server.messages.Update] from 
Actor[akka.tcp://[email protected]:2551/user/receptionist/akka.tcp%3A%2F%2FABC%40127.0.0.1%3A48362%2Ftemp%2F%24Hd#-1292428908]
 
to Actor[akka://VolArbService/deadLetters] was not delivered. [2] dead 
letters encountered. This logging can be turned off or adjusted with 
configuration settings 'akka.log-dead-letters' and 
'akka.log-dead-letters-during-shutdown'.
...
10:51:24.646 [VolArbService-akka.actor.default-dispatcher-16] INFO  
Remoting - Address [akka.tcp://[email protected]:34165] is now 
quarantined, all messages to this address will be delivered to dead letters.
10:51:24.860 [VolArbService-akka.actor.default-dispatcher-4] DEBUG 
akka.cluster.ClusterHeartbeatSender - Cluster Node 
[akka.tcp://[email protected]:2552] - Heartbeat to 
[akka.tcp://[email protected]:2551]
10:51:24.860 [VolArbService-akka.actor.default-dispatcher-4] DEBUG 
akka.cluster.ClusterHeartbeatSender - Cluster Node 
[akka.tcp://[email protected]:2552] - Heartbeat to 
[akka.tcp://[email protected]:37209]
10:51:24.860 [VolArbService-akka.actor.default-dispatcher-4] DEBUG 
akka.cluster.ClusterHeartbeatSender - Cluster Node 
[akka.tcp://[email protected]:2552] - EndHeartbeat to 
[akka.tcp://[email protected]:34165]
...

NODE3 (port 34165)
---------------
10:51:17.500 [VolArbService-akka.actor.default-dispatcher-24] DEBUG 
akka.cluster.ClusterHeartbeatSender - Cluster Node 
[akka.tcp://[email protected]:34165] - Heartbeat to 
[akka.tcp://[email protected]:2551]
10:51:17.500 [VolArbService-akka.actor.default-dispatcher-24] DEBUG 
akka.cluster.ClusterHeartbeatSender - Cluster Node 
[akka.tcp://[email protected]:34165] - Heartbeat to 
[akka.tcp://[email protected]:2552]
10:51:17.501 [VolArbService-akka.actor.default-dispatcher-24] DEBUG 
akka.cluster.ClusterHeartbeatSender - Cluster Node 
[akka.tcp://[email protected]:34165] - Heartbeat to 
[akka.tcp://[email protected]:37209]
...
***KILLED by me at 10:51:17:592***

NODE4 (port 37209)
--------------
10:51:01.317 [VolArbService-akka.actor.default-dispatcher-14] DEBUG 
akka.cluster.ClusterHeartbeatSender - Cluster Node 
[akka.tcp://[email protected]:37209] - Heartbeat to 
[akka.tcp://[email protected]:2551]
10:51:01.317 [VolArbService-akka.actor.default-dispatcher-14] DEBUG 
akka.cluster.ClusterHeartbeatSender - Cluster Node 
[akka.tcp://[email protected]:37209] - Heartbeat to 
[akka.tcp://[email protected]:2552]
10:51:01.317 [VolArbService-akka.actor.default-dispatcher-14] DEBUG 
akka.cluster.ClusterHeartbeatSender - Cluster Node 
[akka.tcp://[email protected]:37209] - Heartbeat to 
[akka.tcp://[email protected]:34165]
...
10:51:17.666 [VolArbService-akka.actor.default-dispatcher-3] DEBUG 
akka.remote.EndpointWriter - Disassociated 
[akka.tcp://[email protected]:37209] -> 
[akka.tcp://[email protected]:34165]
10:51:17.666 [VolArbService-akka.actor.default-dispatcher-3] INFO  
a.r.RemoteActorRefProvider$RemoteDeadLetterActorRef - Message 
[com.imc.medusa.server.messages.Update] from 
Actor[akka.tcp://[email protected]:2551/user/receptionist/akka.tcp%3A%2F%2FABC%40127.0.0.1%3A48362%2Ftemp%2F%24Dd#868424619]
 
to Actor[akka://VolArbService/deadLetters] was not delivered. [2] dead 
letters encountered. This logging can be turned off or adjusted with 
configuration settings 'akka.log-dead-letters' and 
'akka.log-dead-letters-during-shutdown'.
...
10:51:25.175 [VolArbService-akka.actor.default-dispatcher-16] INFO  
Remoting - Address [akka.tcp://[email protected]:34165] is now 
quarantined, all messages to this address will be delivered to dead letters.
10:51:25.317 [VolArbService-akka.actor.default-dispatcher-18] DEBUG 
akka.cluster.ClusterHeartbeatSender - Cluster Node 
[akka.tcp://[email protected]:37209] - Heartbeat to 
[akka.tcp://[email protected]:2551]
10:51:25.317 [VolArbService-akka.actor.default-dispatcher-18] DEBUG 
akka.cluster.ClusterHeartbeatSender - Cluster Node 
[akka.tcp://[email protected]:37209] - Heartbeat to 
[akka.tcp://[email protected]:2552]
10:51:25.317 [VolArbService-akka.actor.default-dispatcher-18] DEBUG 
akka.cluster.ClusterHeartbeatSender - Cluster Node 
[akka.tcp://[email protected]:37209] - EndHeartbeat to 
[akka.tcp://[email protected]:34165]

Client
-------
*** STARTED new client after NODE3 has been downed ***
[DEBUG] [02/06/2014 10:52:02.646] [main] [EventStream] StandardOutLogger 
started
10:52:02.882 [ABC-akka.actor.default-dispatcher-3] INFO  
akka.event.slf4j.Slf4jLogger - Slf4jLogger started
[DEBUG] [02/06/2014 10:52:02.888] [main] [EventStream(akka://ABC)] logger 
log1-Slf4jLogger started
[DEBUG] [02/06/2014 10:52:02.888] [main] [EventStream(akka://ABC)] Default 
Loggers started
10:52:02.889 [ABC-akka.actor.default-dispatcher-3] DEBUG 
akka.event.EventStream - logger log1-Slf4jLogger started
10:52:02.889 [ABC-akka.actor.default-dispatcher-3] DEBUG 
akka.event.EventStream - Default Loggers started
10:52:02.930 [ABC-akka.actor.default-dispatcher-3] INFO  Remoting - 
Starting remoting
10:52:03.063 [ABC-akka.actor.default-dispatcher-2] INFO  Remoting - 
Remoting started; listening on addresses :[akka.tcp://[email protected]:42625]
10:52:03.189 [ABC-akka.actor.default-dispatcher-15] DEBUG 
akka.remote.EndpointWriter - Associated [akka.tcp://[email protected]:42625] -> 
[akka.tcp://[email protected]:2552]
10:52:03.190 [ABC-akka.actor.default-dispatcher-15] DEBUG 
akka.remote.EndpointWriter - Associated [akka.tcp://[email protected]:42625] -> 
[akka.tcp://[email protected]:2551]
10:52:03.191 [ABC-akka.actor.default-dispatcher-15] DEBUG 
a.s.Serialization(akka://ABC) - Using 
serializer[akka.remote.serialization.MessageContainerSerializer] for 
message [akka.actor.SelectChildName]
10:52:03.197 [ABC-akka.actor.default-dispatcher-4] DEBUG 
a.s.Serialization(akka://ABC) - Using 
serializer[akka.serialization.JavaSerializer] for message 
[akka.contrib.pattern.ClusterReceptionist$Internal$GetContacts$]
10:52:03.238 [ABC-akka.actor.default-dispatcher-3] DEBUG 
a.s.Serialization(akka://ABC) - Using 
serializer[akka.serialization.JavaSerializer] for message 
[akka.actor.Identify]
10:52:03.248 [ABC-akka.actor.default-dispatcher-12] DEBUG 
akka.remote.EndpointWriter - Associated [akka.tcp://[email protected]:42625] -> 
[akka.tcp://[email protected]:37209]
10:52:03.249 [ABC-akka.actor.default-dispatcher-2] INFO  
akka.contrib.pattern.ClusterClient - Connected to 
[akka.tcp://[email protected]:2551/user/receptionist]
10:52:03.249 [ABC-akka.actor.default-dispatcher-2] DEBUG 
a.s.Serialization(akka://ABC) - Using 
serializer[akka.serialization.JavaSerializer] for message 
[akka.dispatch.sysmsg.Watch]
10:52:03.250 [ABC-akka.actor.default-dispatcher-2] DEBUG 
akka.remote.RemoteWatcher - Watching: [akka://ABC/user/$a -> 
akka.tcp://[email protected]:2551/user/receptionist]
10:52:04.085 [ABC-akka.actor.default-dispatcher-2] DEBUG 
akka.remote.RemoteWatcher - Sending first Heartbeat to 
[akka.tcp://[email protected]:2551]
10:52:04.086 [ABC-akka.actor.default-dispatcher-3] DEBUG 
a.s.Serialization(akka://ABC) - Using 
serializer[akka.serialization.JavaSerializer] for message 
[akka.remote.RemoteWatcher$Heartbeat$]
10:52:04.090 [ABC-akka.actor.default-dispatcher-2] DEBUG 
akka.remote.RemoteWatcher - Received first heartbeat rsp from 
[akka.tcp://[email protected]:2551]
10:52:04.090 [ABC-akka.actor.default-dispatcher-2] DEBUG 
akka.remote.RemoteWatcher - Re-watch [Actor[akka://ABC/user/$a#-1761251245] 
-> 
Actor[akka.tcp://[email protected]:2551/user/receptionist#-1054121998]]
10:52:04.091 [ABC-akka.actor.default-dispatcher-2] DEBUG 
akka.remote.RemoteWatcher - Re-watch 
[Actor[akka://ABC/system/remote-watcher#449313331] -> 
Actor[akka.tcp://[email protected]:2551/user/receptionist#-1054121998]]
10:52:04.092 [ABC-akka.actor.default-dispatcher-4] DEBUG 
akka.remote.RemoteWatcher - Watching: [akka://ABC/user/$a -> 
akka.tcp://[email protected]:2551/user/receptionist]
10:52:05.085 [ABC-akka.actor.default-dispatcher-3] DEBUG 
akka.remote.RemoteWatcher - Sending Heartbeat to 
[akka.tcp://[email protected]:2551]
10:52:05.087 [ABC-akka.actor.default-dispatcher-12] DEBUG 
akka.remote.RemoteWatcher - Received heartbeat rsp from 
[akka.tcp://[email protected]:2551]
10:52:06.085 [ABC-akka.actor.default-dispatcher-12] DEBUG 
akka.remote.RemoteWatcher - Sending Heartbeat to 
[akka.tcp://[email protected]:2551]
10:52:06.089 [ABC-akka.actor.default-dispatcher-12] DEBUG 
akka.remote.RemoteWatcher - Received heartbeat rsp from 
[akka.tcp://[email protected]:2551]
10:52:06.125 [ABC-akka.actor.default-dispatcher-3] DEBUG 
a.s.Serialization(akka://ABC) - Using 
serializer[akka.serialization.JavaSerializer] for message 
[akka.contrib.pattern.ClusterReceptionist$Internal$Ping$]
10:52:07.085 [ABC-akka.actor.default-dispatcher-4] DEBUG 
akka.remote.RemoteWatcher - Sending Heartbeat to 
[akka.tcp://[email protected]:2551]
10:52:07.088 [ABC-akka.actor.default-dispatcher-3] DEBUG 
akka.remote.RemoteWatcher - Received heartbeat rsp from 
[akka.tcp://[email protected]:2551]
10:52:08.085 [ABC-akka.actor.default-dispatcher-3] DEBUG 
akka.remote.RemoteWatcher - Sending Heartbeat to 
[akka.tcp://[email protected]:2551]
10:52:08.087 [ABC-akka.actor.default-dispatcher-3] DEBUG 
akka.remote.RemoteWatcher - Received heartbeat rsp from 
[akka.tcp://[email protected]:2551]
10:52:08.107 [ABC-akka.actor.default-dispatcher-3] DEBUG 
a.s.Serialization(akka://ABC) - Using 
serializer[akka.serialization.JavaSerializer] for message 
[akka.contrib.pattern.DistributedPubSubMediator$Send]
10:52:09.085 [ABC-akka.actor.default-dispatcher-2] DEBUG 
akka.remote.RemoteWatcher - Sending Heartbeat to 
[akka.tcp://[email protected]:2551]
10:52:09.087 [ABC-akka.actor.default-dispatcher-2] DEBUG 
akka.remote.RemoteWatcher - Received heartbeat rsp from 
[akka.tcp://[email protected]:2551]
*** first message is sent and times out ****

T.

On Thursday, February 6, 2014 10:53:44 AM UTC+1, Patrik Nordwall wrote:
>
>
>
>
> On Thu, Feb 6, 2014 at 10:31 AM, Tycho Lamerigts 
> <[email protected]<javascript:>
> > wrote:
>
>> In the meantime, I have looked at the logs a bit more. If I do my test 
>> with 3 nodes, then kill 1 node, the ClusterClient seems to always recover 
>> correctly, i.e. it continues using the 2 surviving nodes and there are no 
>> timeouts. If I start with 4 nodes, then kill 1 node, it nearly always 
>> causes problems. The logs of the 3 surviving nodes show the following error 
>> forever repeating
>>
>>  10:27:35.862 [VolArbService-akka.actor.default-dispatcher-23] ERROR 
>> akka.remote.EndpointWriter - AssociationError [akka.tcp://
>> [email protected]:2552] -> [akka.tcp://
>> [email protected]:51801]: Error [Association failed with 
>> [akka.tcp://[email protected]:51801]] [
>> akka.remote.EndpointAssociationException: Association failed with 
>> [akka.tcp://[email protected]:51801]
>> Caused by: 
>> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
>> Connection refused: /127.0.0.1:51801
>> ]
>>
>> where port 51801 is the killed node. Despite the fact that the error 
>> continues forever, I can join a 2 new nodes to the 3 surviving nodes and 
>> the new nodes correctly report the new cluster size as 4 and 5 using 
>> Cluster.readView().members().size(). In other words, it seems the killed 
>> node was correctly downed and yet the associationerrors never cease.
>>
>
> Do you see "Leader is removing unreachable node" in the cluster logs?
>
> On the client side you should see the following if it is the connected 
> receptionist node that is killed.
> "Lost contact with"
> "Connected to"
>  
> /Patrik
>
>
>>
>> T.
>>
>>
>>
>> On Wednesday, February 5, 2014 9:40:41 PM UTC+1, Björn Antonsson wrote:
>>
>>> Hi,
>>>
>>> On 5 February 2014 at 16:42:48, Tycho Lamerigts ([email protected]) 
>>> wrote:
>>>
>>> I have a client that fires many requests to my cluster using 
>>> ClusterClient's Send(), one request at a time, with a 1 second timeout 
>>> waiting for a response.  While all the cluster nodes are up, the requests 
>>> are correctly (randomly) distributed across the nodes and promptly receive 
>>> a response, no timeouts. If I kill a cluster node, then I expect one or two 
>>> of the requests to timeout because they end up being sent to the now-dead 
>>> node. After that, I expect ClusterClient to realize that the node has died 
>>> and I expect to no longer get timeouts (the workload can easily be handled 
>>> by the remaining nodes). Sometimes it works. Unfortunately, more often than 
>>> not it doesn't work and I continue getting timed out requests until I 
>>> restart every node in the cluster and the client.
>>>
>>> Any idea what causes this behavior?
>>>
>>>
>>> Which version of akka are you using? Is the failing node corrcetly 
>>> downed? Have you enabled any debug logging to diagnose the behavior?
>>>
>>> B/
>>>
>>>
>>> My ClusterClient is initialized with two receptionist addresses. My 
>>> cluster actually has more than 2 nodes and each node has a receptionist 
>>> with a registered destination actor. I tried playing with 
>>> contrib.cluster.receptionist.number-of-contacts but it did not seem to 
>>> make any difference.
>>> --
>>> >>>>>>>>>> Read the docs: http://akka.io/docs/
>>> >>>>>>>>>> Check the FAQ: http://akka.io/faq/
>>> >>>>>>>>>> Search the archives: https://groups.google.com/
>>> group/akka-user
>>> ---
>>> You received this message because you are subscribed to the Google 
>>> Groups "Akka User List" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>>
>>> Visit this group at http://groups.google.com/group/akka-user.
>>> For more options, visit https://groups.google.com/groups/opt_out.
>>>
>>>
>>> -- 
>>> Björn Antonsson
>>> Typesafe <http://typesafe.com/> – Reactive Apps on the JVM
>>> twitter: @bantonsson <http://twitter.com/#!/bantonsson>
>>>
>>>  -- 
>> >>>>>>>>>> Read the docs: http://akka.io/docs/
>> >>>>>>>>>> Check the FAQ: http://akka.io/faq/
>> >>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
>> --- 
>> You received this message because you are subscribed to the Google Groups 
>> "Akka User List" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected] <javascript:>.
>> To post to this group, send email to [email protected]<javascript:>
>> .
>> Visit this group at http://groups.google.com/group/akka-user.
>> For more options, visit https://groups.google.com/groups/opt_out.
>>
>
>
>
> -- 
>
> Patrik Nordwall
> Typesafe <http://typesafe.com/> -  Reactive apps on the JVM
> Twitter: @patriknw
>
> 

-- 
>>>>>>>>>>      Read the docs: http://akka.io/docs/
>>>>>>>>>>      Check the FAQ: http://akka.io/faq/
>>>>>>>>>>      Search the archives: https://groups.google.com/group/akka-user
--- 
You received this message because you are subscribed to the Google Groups "Akka 
User List" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/akka-user.
For more options, visit https://groups.google.com/groups/opt_out.

Reply via email to