Hi,

For the last couple of days I have been facing a strange issue with my setup.
Sometimes the mco client is unable to connect to the central ActiveMQ server
and keeps logging onconnect_fail errors in the mco log. When I check
activemq.log for the same period there are no errors, but if I restart the
central ActiveMQ it starts working properly again.

Any idea what could cause this?



On Thu, Mar 9, 2017 at 8:48 PM, seabos <[email protected]> wrote:

> I would suggest that you first validate that this condition is occurring
> by following the steps I gave in my previous note, i.e. collect two
> consecutive lists, compare them, find a non-responding node, restart the
> agent on that node, and see if it responds. If you're patient, you could
> wait for a period of, say, 1 hour after connectivity has been restored by
> the agent restart and check whether the connection is truncated again only
> after 1 hour of no traffic.
>
> But to directly answer your question: the keepalive would have to be set
> on every node running the mc agent, or wherever these long-running unused
> connections can exist. You need to understand that if this condition is
> occurring (firewall/network device truncation), it could be happening
> at 1 hour of inactivity, 30 minutes of inactivity, 15 minutes of inactivity,
> etc., perhaps even less. So you would need to set the keepalive to be less
> than the truncation period. This will require some detective work.
> For example, setting it to 1000 seconds when the truncation period is 15
> minutes will produce some strange behavior in the 1.666 minutes between the
> time the truncation occurs and the time the keepalive activates the TCP
> connection.
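
The arithmetic in the example above can be sketched as follows. This is only an illustration: the function name `exposure_window_seconds` is made up here, and it assumes both values are expressed in seconds.

```python
def exposure_window_seconds(keepalive_s: int, idle_timeout_s: int) -> int:
    """Seconds during which an idle connection may already have been dropped
    by the firewall but has not yet been probed by TCP keepalive.

    Returns 0 when the keepalive fires before the idle timeout is reached.
    """
    return max(0, keepalive_s - idle_timeout_s)


# Numbers from the message above: keepalive of 1000 s, truncation after 15 min.
window = exposure_window_seconds(1000, 15 * 60)
print(f"{window} s = {window / 60:.2f} min")
```

A keepalive shorter than the idle timeout gives a window of 0, which is the goal.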
>
> hope that helps
>
> On Wednesday, March 8, 2017 at 11:07:00 PM UTC-5, Ravi Kumar wrote:
>>
>> One thing: should I change the TCP keepalive timeout on the central
>> broker, on each remote broker, on the mco console, or on each end server
>> that connects to ActiveMQ?
>>
>>
>> On Thu, Mar 9, 2017 at 8:07 AM, <[email protected]> wrote:
>>
>>> Thanks, let me try that.
>>>
>>>
>>> On Wednesday, March 8, 2017 at 5:08:44 AM UTC+5:30, seabos wrote:
>>>>
>>>> So just a thought here... When you're dealing with geographically
>>>> distributed nodes, at least in the case of RMQ, you can run into TCP
>>>> timeouts/truncation on long-running TCP connections that carry no
>>>> traffic. The TCP keepalive time is a kernel-defined value. See
>>>> http://unix.stackexchange.com/questions/316020/in-linux-does-proc-sys-net-ipv4-tcp-keepalive-time-has-impact-on-both-client
>>>>
>>>> What can happen is that, if your keepalive time is longer than the
>>>> firewall's truncation period (idle timeout), you will see all sorts of
>>>> randomness in responses as services disconnect and reconnect according
>>>> to when the firewall truncates them and when the default TCP keepalive
>>>> "re-vivifies" the connection.
>>>>
>>>> We have very large geographically distributed collectives and so this
>>>> is one of the things we've had to address. (also we have very large
>>>> collectives of over 70K, and have tested up to 250K nodes in single DCs)
>>>>
>>>> Here's a suggestion for troubleshooting: set the TCP keepalive time to
>>>> less than 1000 seconds (a little more than 16 minutes) in the kernel
>>>> (pretty sure the default value is 7200 seconds, or about 2 hours).
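
Before changing anything, it may help to check what a node is currently using. The sketch below assumes a Linux host, where the kernel-wide value is exposed as the sysctl `net.ipv4.tcp_keepalive_time`; the helper names are hypothetical, and the 1000-second target is just the figure suggested above. Note that the kernel setting only affects sockets that have SO_KEEPALIVE enabled; whether a given client or agent enables it is not something this thread establishes.

```python
from pathlib import Path

# Linux exposes the idle time (in seconds) before the first keepalive probe here.
KEEPALIVE_PATH = "/proc/sys/net/ipv4/tcp_keepalive_time"


def read_keepalive_time(path: str = KEEPALIVE_PATH) -> int:
    """Return the keepalive idle time in seconds stored at `path`."""
    return int(Path(path).read_text().strip())


def needs_lowering(current_s: int, target_s: int = 1000) -> bool:
    """True when the current value is not yet below the suggested target."""
    return current_s >= target_s


if __name__ == "__main__":
    try:
        current = read_keepalive_time()
        print(f"tcp_keepalive_time={current}s, lower it: {needs_lowering(current)}")
    except (OSError, ValueError):
        # Not Linux, or the value is not readable in this environment.
        print("could not read tcp_keepalive_time on this host")
```
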
>>>>
>>>> A way you could test this assertion is to collect two sample lists
>>>> using mco ping
>>>>
>>>> mco ping --nodes=bla > /tmp/ListA
>>>> sleep 300   # wait 5 minutes
>>>> mco ping --nodes=bla > /tmp/ListB
>>>>
>>>> Check what is in List A but NOT in List B and try to mco ping that host
>>>> directly to see if it still does not respond. If it does not, go onto
>>>> the host and restart the mcollective service (agent), then try mco ping
>>>> to that host again. If it works the second time, that's a reasonable
>>>> indication that the long-running TCP connection is being truncated by a
>>>> firewall of some sort.
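
The "in List A but not in List B" comparison could be scripted roughly like this. It is a sketch under an assumption about the saved output: that each responding host appears on its own line as a hostname followed by a `time=...` field, and that everything else (such as the statistics summary) can be ignored. The function names are hypothetical.

```python
from typing import List, Set


def responding_hosts(ping_output: str) -> Set[str]:
    """Collect hostnames from lines that look like '<host> time=<n> ms'."""
    hosts = set()
    for line in ping_output.splitlines():
        parts = line.split()
        # Skip blank lines and the trailing statistics block.
        if len(parts) >= 2 and parts[1].startswith("time="):
            hosts.add(parts[0])
    return hosts


def dropped_hosts(list_a: str, list_b: str) -> List[str]:
    """Hosts that answered in the first sample but not in the second."""
    return sorted(responding_hosts(list_a) - responding_hosts(list_b))


if __name__ == "__main__":
    # Made-up sample data standing in for /tmp/ListA and /tmp/ListB.
    sample_a = "node1.example.com   time=45.12 ms\nnode2.example.com   time=50.00 ms\n"
    sample_b = "node1.example.com   time=44.80 ms\n"
    print(dropped_hosts(sample_a, sample_b))  # prints ['node2.example.com']
```

With the real files, the two arguments would be the contents of /tmp/ListA and /tmp/ListB; each host in the result is a candidate for the direct ping and agent restart described above.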
>>>>
>>>> hope that helps
>>>>
>>>>
>>>>
>>>>
>>>> On Friday, March 3, 2017 at 11:29:41 PM UTC-5, [email protected]
>>>> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I want to use mcollective across geographically distributed locations
>>>>> and have configured an ActiveMQ network of brokers for that. After the
>>>>> configuration everything works as expected; the only problem is that
>>>>> when I run any mco query against multiple servers, it gets disconnected
>>>>> for some time when many of them have not responded.
>>>>>
>>>>> For example, I have connected 1000 clients to the ActiveMQ broker at
>>>>> each location, and I am deploying mcollective on many servers; I have
>>>>> the entire list of servers (458) that I am deploying it to. When I run
>>>>> mco ping or any mco query with --nodes=<server_list_file> across
>>>>> locations, I get responses from 243 servers, which means mcollective is
>>>>> configured successfully on those servers, and no response from the
>>>>> others, which means mcollective is not configured on them yet. The
>>>>> problem is that, because many servers do not respond, when I then run
>>>>> an mco query against the connected servers I get no response. But if I
>>>>> wait a few minutes and run the same query against the working servers,
>>>>> I do get a response. I suspect that when I run an mco query and many
>>>>> servers do not respond, the middleware takes time to clear pending
>>>>> queues, or something similar is happening. Any idea how to tune this?
>>>>>
>>>>>
>>>>> Regards
>>>>> Ravi
>>>>>
>>>> --
>>>
>>> ---
>>> You received this message because you are subscribed to a topic in the
>>> Google Groups "mcollective-users" group.
>>> To unsubscribe from this topic, visit
>>> https://groups.google.com/d/topic/mcollective-users/_bH542nBZSo/unsubscribe.
>>> To unsubscribe from this group and all its topics, send an email to
>>> [email protected].
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>> --

