Just a thought here. When you're dealing with geographically distributed 
nodes, at least in the case of RabbitMQ (RMQ), long-running TCP connections 
that sit idle with no traffic on them can be timed out or truncated. How 
long an idle connection waits before the first keepalive probe is sent is a 
kernel-defined value (net.ipv4.tcp_keepalive_time on Linux). See 
http://unix.stackexchange.com/questions/316020/in-linux-does-proc-sys-net-ipv4-tcp-keepalive-time-has-impact-on-both-client

What can happen: if the keepalive time is longer than the firewall's idle 
timeout, the firewall drops the idle connection before any keepalive probe 
is sent. You then see all sorts of randomness in responses as services 
disconnect and reconnect, depending on when the firewall truncates them and 
when the default TCP keepalive "re-vivifies" the connection. 

We run very large geographically distributed collectives, so this is one of 
the things we've had to address. (Our collectives exceed 70K nodes, and we 
have tested up to 250K nodes in single DCs.)

Here's a suggestion to try for troubleshooting: set the TCP keepalive time 
in the kernel to less than 1000 seconds (a little under 17 minutes). The 
Linux default is 7200 seconds, or 2 hours. 
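
To make that comparison concrete, here is a rough sketch that reads the current keepalive time and checks it against a firewall idle timeout. The 3600-second firewall value and the 900-second suggestion are placeholders I picked for illustration, not values from any real setup, and keepalive_is_safe is just a helper name made up for this sketch. Note that this kernel setting only affects sockets that have SO_KEEPALIVE enabled.

```shell
#!/bin/sh
# keepalive_is_safe KEEPALIVE_SECONDS FIREWALL_IDLE_SECONDS
# Succeeds when the first keepalive probe would be sent before the
# firewall's idle timeout expires. (Hypothetical helper for this sketch.)
keepalive_is_safe() {
    [ "$1" -lt "$2" ]
}

# Current kernel value; fall back to the Linux default if /proc is unreadable.
current=$(cat /proc/sys/net/ipv4/tcp_keepalive_time 2>/dev/null || echo 7200)

# Assumed firewall idle timeout in seconds. Replace with your firewall's value.
firewall_idle=3600

if keepalive_is_safe "$current" "$firewall_idle"; then
    echo "ok: keepalive ${current}s is below firewall idle timeout ${firewall_idle}s"
else
    echo "warning: keepalive ${current}s is not below firewall idle timeout ${firewall_idle}s"
    echo "consider (as root): sysctl -w net.ipv4.tcp_keepalive_time=900"
fi
```

A value set with sysctl -w does not survive a reboot; to keep it, add the same key to /etc/sysctl.conf or a file under /etc/sysctl.d/.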

A way you could test this assertion is to collect two sample lists using 
mco ping:

mco ping --nodes=bla > /tmp/ListA
sleep 300   # wait 5 minutes
mco ping --nodes=bla > /tmp/ListB

Check what is in List A but NOT in List B, and try to mco ping that host 
directly to see if it still does not respond. If it doesn't, go onto the 
host and restart the mcollective service (agent), then try mco ping to that 
host again. If it works the second time, that's a reasonable indication 
that the long-running TCP connection is being truncated by a firewall of 
some sort. 
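
The List A / List B comparison can be sketched roughly as below. It assumes each responder line in the mco ping output looks like "<hostname>   time=<n> ms", which may differ in your version, and missing_hosts is a helper name invented for this sketch.

```shell
#!/bin/sh
# missing_hosts LIST_A LIST_B
# Prints the hosts that answered in LIST_A but not in LIST_B, one per line.
# Assumption: responder lines contain "time=" and start with the hostname.
missing_hosts() {
    a=$(mktemp)
    b=$(mktemp)
    # Keep only responder lines, take the hostname column, sort for comm.
    awk '/time=/ {print $1}' "$1" | sort -u > "$a"
    awk '/time=/ {print $1}' "$2" | sort -u > "$b"
    # comm -23 prints lines unique to the first file.
    comm -23 "$a" "$b"
    rm -f "$a" "$b"
}

# Run the comparison only if both sample files from the steps above exist.
if [ -f /tmp/ListA ] && [ -f /tmp/ListB ]; then
    missing_hosts /tmp/ListA /tmp/ListB
fi
```

Each host it prints is a candidate for the direct mco ping and service restart test described above.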

Hope that helps.




On Friday, March 3, 2017 at 11:29:41 PM UTC-5, [email protected] wrote:
>
> Hi,
>
> I want to use mcollective across geographically distributed sites and 
> have configured an ActiveMQ network of brokers for that. After the 
> configuration everything works as expected; the only problem is that when 
> I run any mco query against multiple servers, it gets disconnected for 
> some time when many of them have not responded. 
>
> For example, I have connected 1000 clients to the ActiveMQ broker at each 
> location. I am deploying mcollective on many servers, and I have the 
> entire list of servers (458) I am deploying it to. When I run mco ping or 
> any mco query with --nodes=<server_list_file> across locations, I get a 
> response from 243 servers, which means mcollective is configured 
> successfully on those servers, and no response from the others, which 
> means mcollective is not configured on them yet. The problem is that many 
> servers do not respond, and because of this, when I then run an mco query 
> against the connected servers I get no response. But if I wait a few 
> minutes and run the same query against the working servers, I do get a 
> response. I suspect that when I run an mco query and many servers do not 
> respond, the middleware takes time to clear pending queues, or something 
> similar is happening. Any idea how to tune this?
>
>
> Regards
> Ravi
>
