Endre, could it be due to pending-to-send system message overflow?

On Thu, Jan 22, 2015 at 11:45 AM, Johannes Berg <jberg...@gmail.com> wrote:

> Okay, I increased the load further and now I see the same problem again.
> It seems to just have gotten a bit better in that it doesn't happen as
> fast, but with enough load it happens.
>
> To re-iterate, I have Akka 2.3.9 on all (8) nodes and
> auto-down-unreachable-after = off on all nodes and I don't do any manual
> downing anywhere, still the leader log prints this:
>
> 2015-01-22 10:35:37 +0000 - [INFO] - from Cluster(akka://system) in
> system-akka.actor.default-dispatcher-2
> Cluster Node [akka.tcp://system@ip1:port1] - Leader is removing
> unreachable node [akka.tcp://system@ip2:port2]
>
> and the node(s) under load is(are) removed from the cluster (quarantined).
> How is this possible?
>
> On Wednesday, January 21, 2015 at 5:53:06 PM UTC+2, drewhk wrote:
>>
>> Hi Johannes,
>>
>> See the milestone here:
>> https://github.com/akka/akka/issues?q=milestone%3A2.3.9+is%3Aclosed
>>
>> The tickets cross reference the PRs, too, so you can look at the code
>> changes. The issue that probably hit you is
>> https://github.com/akka/akka/issues/16623 which manifested as system
>> message delivery errors on some systems, but actually was caused by
>> accidentally duplicated internal actors (a regression).
>>
>> -Endre
>>
>> On Wed, Jan 21, 2015 at 4:47 PM, Johannes Berg <jber...@gmail.com> wrote:
>>
>>> Upgrading to 2.3.9 does indeed seem to solve my problem. At least I
>>> haven't experienced it again yet.
>>>
>>> Now I'm curious what the fixes were. Is there a change summary between
>>> versions somewhere, or a list of which bugs have been fixed in which
>>> versions?
>>>
>>> On Wednesday, January 21, 2015 at 11:31:02 AM UTC+2, drewhk wrote:
>>>>
>>>> Hi Johannes,
>>>>
>>>> We just released 2.3.9 with important bugfixes. I recommend updating
>>>> to see whether the problem still persists.
>>>>
>>>> -Endre
>>>>
>>>> On Wed, Jan 21, 2015 at 10:29 AM, Johannes Berg <jber...@gmail.com>
>>>> wrote:
>>>>
>>>>> Many connections seem to be formed when the node has been marked down
>>>>> for unreachability even though it's still alive and tries to connect
>>>>> back into the cluster. The removed node prints:
>>>>>
>>>>> "Address is now gated for 5000 ms, all messages to this address will
>>>>> be delivered to dead letters. Reason: The remote system has quarantined
>>>>> this system. No further associations to the remote system are possible
>>>>> until this system is restarted."
>>>>>
>>>>> It doesn't seem to close the connections properly even though it opens
>>>>> new ones continuously.
>>>>>
>>>>> Anyway, that's a separate issue that I'm not that concerned about right
>>>>> now. I've now realized I don't want to use automatic downing. Instead I
>>>>> would like to allow nodes to go unreachable and come back to reachable,
>>>>> even if it takes quite some time, and to manually stop the process and
>>>>> down the node in case of an actual crash.
>>>>>
>>>>> Consequently I've put
>>>>>
>>>>> auto-down-unreachable-after = off
>>>>>
>>>>> in the config. Now I have the problem that nodes still are removed,
>>>>> this is from the leader node log:
>>>>>
>>>>> 08:50:14.087UTC INFO [system-akka.actor.default-dispatcher-4]
>>>>> Cluster(akka://system) - Cluster Node [akka.tcp://system@ip1:port1] -
>>>>> Leader is removing unreachable node [akka.tcp://system@ip2:port2]
>>>>>
>>>>> I can understand my node is marked unreachable because it's under heavy
>>>>> load, but I don't understand what could cause it to be removed. I'm not
>>>>> doing any manual downing and have auto-down set to off, so what else
>>>>> could trigger the removal?
>>>>>
>>>>> Using the akka-cluster script I can see that the node has most other
>>>>> nodes marked as unreachable (including the leader) and that it sees a
>>>>> different leader than the other nodes do.
>>>>>
>>>>> My test system consists of 8 nodes.
>>>>>
>>>>> About the unreachability: I'm not having long GC pauses and not sending
>>>>> large blobs, but I'm sending very many smaller messages as fast as I can.
>>>>> If I just hammer it fast enough it will end up unreachable, which I can
>>>>> accept, but I need to get it back to reachable.
>>>>>
>>>>> On Thursday, December 11, 2014 at 11:22:41 AM UTC+2, Björn Antonsson
>>>>> wrote:
>>>>>>
>>>>>> Hi Johannes,
>>>>>>
>>>>>> On 9 December 2014 at 15:29:53, Johannes Berg (jber...@gmail.com)
>>>>>> wrote:
>>>>>>
>>>>>> Hi! I'm doing some load tests in our system and getting problems where
>>>>>> some of my nodes are marked as unreachable even though the processes are
>>>>>> up. I'm seeing a node go from reachable to unreachable and back a few
>>>>>> times before staying unreachable, saying the connection is gated for
>>>>>> 5000ms, and silently staying that way.
>>>>>>
>>>>>> Looking at the connections made to one of the seed nodes I see that I
>>>>>> have several hundreds of connections from other nodes except the failing
>>>>>> ones. Is this normal? There are several (hundreds) just between two
>>>>>> nodes. When are connections formed between cluster nodes and when are
>>>>>> they taken down?
>>>>>>
>>>>>>
>>>>>> Several hundred connections between two nodes seems very wrong. There
>>>>>> should only be one connection between two nodes that communicate over
>>>>>> akka remoting or are part of a cluster. How many nodes do you have in
>>>>>> your cluster?
>>>>>>
>>>>>> If you are using cluster aware routers then there should be one
>>>>>> connection between the router node and the routee nodes (can be the same
>>>>>> connection that is used for the cluster communication).
>>>>>>
>>>>>> The connections between the nodes don't get torn down. They stay
>>>>>> open and are reused for all remoting communication between the
>>>>>> nodes.
>>>>>>
>>>>>> Also is there some limit on how many connections a node with default
>>>>>> settings will accept?
>>>>>>
>>>>>> We have auto-down-unreachable-after = 10s set in our config, does
>>>>>> this mean if the node is busy and doesn't respond in 10 seconds it
>>>>>> becomes unreachable?
>>>>>>
>>>>>> Is there any reason why it would stay unreachable and not re-try to
>>>>>> join the cluster?
>>>>>>
>>>>>>
>>>>>> The auto-down setting is actually just what it says. If the node is
>>>>>> considered unreachable for 10 seconds, it will be moved to DOWN and won't
>>>>>> be able to come back into the cluster. The different states of the
>>>>>> cluster and the settings are explained in the documentation.
>>>>>>
>>>>>> http://doc.akka.io/docs/akka/2.3.7/common/cluster.html
>>>>>> http://doc.akka.io/docs/akka/2.3.7/scala/cluster-usage.html
>>>>>>
>>>>>> If you are having problems with nodes becoming unreachable then you
>>>>>> could check if you are doing one of these things:
>>>>>> 1) sending too large blobs as messages, which effectively block out the
>>>>>> heartbeats going over the same connection
>>>>>> 2) having long GC pauses that trigger the failure detector since
>>>>>> nodes don't reply to heartbeats
>>>>>>
>>>>>> B/
>>>>>>
>>>>>> We are using Akka 2.3.6 and rely heavily on cluster aware routers,
>>>>>> with a lot of remote messages going around.
>>>>>>
>>>>>> Anyone that can shed some light on this or that can point me at some
>>>>>> documentation about these things?
>>>>>> --
>>>>>> >>>>>>>>>> Read the docs: http://akka.io/docs/
>>>>>> >>>>>>>>>> Check the FAQ:
>>>>>> http://doc.akka.io/docs/akka/current/additional/faq.html
>>>>>> >>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
>>>>>> ---
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "Akka User List" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to akka-user+...@googlegroups.com.
>>>>>> To post to this group, send email to akka...@googlegroups.com.
>>>>>> Visit this group at http://groups.google.com/group/akka-user.
>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Björn Antonsson
>>>>>> Typesafe <http://typesafe.com/> – Reactive Apps on the JVM
>>>>>> twitter: @bantonsson <http://twitter.com/#!/bantonsson>
>>>>>>
>>>>>
>>>>
>>>
>>
>



-- 
Cheers,
√

