I would suggest that you first validate that this condition is occurring by
following the steps I gave in my previous note, i.e. collect two
consecutive lists, compare them, find a non-responding node, restart the
agent on that node, and see if it responds. If you're patient, you could
wait for a period of, say, 1 hour after connectivity has been restored by
the agent restart and check whether the connection gets truncated again
only after that hour of no traffic.
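
The list comparison can be sketched in shell. This is only a rough sketch:
it assumes each sample file has already been reduced to one hostname per
line (raw mco ping output would need trimming first), and the function name
missing_hosts is made up for illustration.

```shell
# Print hosts that are present in the first sample but absent from the second.
# Assumption: both files contain one hostname per line.
missing_hosts() {
    list_a=$1
    list_b=$2
    tmp_a=$(mktemp) || return 1
    tmp_b=$(mktemp) || { rm -f "$tmp_a"; return 1; }
    # comm needs sorted input; -u also drops duplicate hostnames
    sort -u "$list_a" > "$tmp_a"
    sort -u "$list_b" > "$tmp_b"
    # -23 suppresses lines unique to file 2 and lines common to both,
    # leaving only the hosts that stopped responding
    comm -23 "$tmp_a" "$tmp_b"
    rm -f "$tmp_a" "$tmp_b"
}
```

Running `missing_hosts /tmp/ListA /tmp/ListB` would then list the nodes that
answered the first ping but not the second; those are the ones to ping
directly and, if still silent, to restart the agent on.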

But to directly answer your question: the keepalive would have to be set on
every node running the mc agent, or wherever these long-running, unused
connections can exist. You need to understand that if this condition is
occurring (firewall/network device truncation), it could be happening
at 1 hour of inactivity, 30 minutes, 15 minutes, or perhaps even less. So
you would need to set the keepalive to less than the truncation period.
This will require some detective work. For example, setting it to 1000
seconds when the truncation period is 15 minutes (900 seconds) will produce
some strange behavior in the 100 seconds (about 1.67 minutes) between the
time the truncation occurs and the time the keepalive reactivates the TCP
connection.
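
To make that arithmetic concrete, here is a small shell sketch. The function
name keepalive_gap and the 600 second value are illustrative assumptions,
not recommendations. The sysctl key is the standard Linux one, and it only
affects sockets that have SO_KEEPALIVE enabled.

```shell
# Seconds a connection can sit dead between the firewall dropping it and the
# first keepalive probe. 0 means the probe fires no later than the timeout
# (the advice above is to stay strictly below the timeout).
keepalive_gap() {
    keepalive_s=$1      # value of net.ipv4.tcp_keepalive_time, in seconds
    idle_timeout_s=$2   # suspected firewall idle timeout, in seconds
    if [ "$keepalive_s" -gt "$idle_timeout_s" ]; then
        echo $((keepalive_s - idle_timeout_s))
    else
        echo 0
    fi
}

# 1000 s keepalive against a 15 minute (900 s) idle timeout:
#   keepalive_gap 1000 900    # prints 100

# Inspecting and lowering the kernel value on Linux (the write needs root;
# 600 is a hypothetical value chosen to sit below a 900 s timeout):
#   sysctl net.ipv4.tcp_keepalive_time
#   sysctl -w net.ipv4.tcp_keepalive_time=600
```

Since the real idle timeout is unknown, one approach is to start with a low
value and raise it until nodes start dropping out of the ping lists again.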

hope that helps

On Wednesday, March 8, 2017 at 11:07:00 PM UTC-5, Ravi Kumar wrote:
>
> One thing: should I change the TCP keepalive timeout on the central
> broker, on each remote broker, on the mco console, or on each end server
> that connects to ActiveMQ?
>
>
> On Thu, Mar 9, 2017 at 8:07 AM, <[email protected]> wrote:
>
>> Thanks let me try the same 
>>
>>
>> On Wednesday, March 8, 2017 at 5:08:44 AM UTC+5:30, seabos wrote:
>>>
>>> So just a thought here... When you're dealing with geographically
>>> distributed nodes, at least in the case of RMQ, you can run into TCP
>>> timeouts/truncation on long-running TCP connections that carry no
>>> traffic. The TCP keepalive time is a kernel-defined value. See
>>> http://unix.stackexchange.com/questions/316020/in-linux-does-proc-sys-net-ipv4-tcp-keepalive-time-has-impact-on-both-client
>>>
>>> What can happen is if you have a keepalive time longer than the value of 
>>> the firewall truncation (idle timeout) you will see all sorts of randomness 
>>> in response as services disconnect and reconnect in accordance with when 
>>> the firewall truncates them and when the default tcp keep alive 
>>> "re-vivifies" the connection. 
>>>
>>> We have very large geographically distributed collectives and so this is 
>>> one of the things we've had to address. (also we have very large 
>>> collectives of over 70K, and have tested up to 250K nodes in single DCs)
>>>
>>> Here's a suggestion to try and tshoot: set the tcp keepalive time to 
>>> less than 1000 seconds (little more than 16 minutes) in the kernel (pretty 
>>> sure the default value is 7200 seconds or about 2 hours) 
>>>
>>> A way you could test this assertion is to collect two sample lists using 
>>> mco ping 
>>>
>>> mco ping --nodes=bla > /tmp/ListA
>>> sleep 300   # wait 5 minutes
>>> mco ping --nodes=bla > /tmp/ListB
>>>
>>> Check which hosts are in List A but NOT in List B, then try to mco ping
>>> each of those hosts directly and see if it still does not respond. If it
>>> still does not respond, go onto the host and restart the mcollective
>>> service (agent), then try mco ping to that host again. If it works the
>>> second time, that's a reasonable indication that the long-running TCP
>>> connection is being truncated by a firewall of some sort.
>>>
>>> hope that helps
>>>
>>>
>>>
>>>
>>> On Friday, March 3, 2017 at 11:29:41 PM UTC-5, [email protected] wrote:
>>>>
>>>> Hi,
>>>>
>>>> I want to use mcollective across geographically distributed locations
>>>> and have configured an ActiveMQ network of brokers for that. After the
>>>> configuration everything works as expected; the only problem is that
>>>> when I run an mco query against multiple servers and many of them do
>>>> not respond, it gets disconnected for some time.
>>>>
>>>> For example, I have connected 1000 clients to the ActiveMQ broker at
>>>> each location. I am deploying mcollective on many servers and have the
>>>> full list of servers (458) that I am deploying it to. When I run mco
>>>> ping or any other mco query with --nodes=<server_list_file> across the
>>>> locations, I get responses from 243 servers, which means mcollective is
>>>> configured successfully on those, and no response from the others,
>>>> which means mcollective is not configured on them yet. The problem is
>>>> that because many servers do not respond, a subsequent mco query
>>>> against the connected servers also gets no response. If I wait a few
>>>> minutes and then run the same query against the working servers, I do
>>>> get responses. I suspect that when I run an mco query and many servers
>>>> do not respond, the middleware takes time to clear pending queues, or
>>>> something similar is happening. Any idea how to tune this?
>>>>
>>>>
>>>> Regards
>>>> Ravi
>>>>
>>> -- 
>>
>> --- 
>> You received this message because you are subscribed to a topic in the 
>> Google Groups "mcollective-users" group.
>> To unsubscribe from this topic, visit 
>> https://groups.google.com/d/topic/mcollective-users/_bH542nBZSo/unsubscribe
>> .
>> To unsubscribe from this group and all its topics, send an email to 
>> [email protected].
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
