Thanks, let me try the same.

On Wednesday, March 8, 2017 at 5:08:44 AM UTC+5:30, seabos wrote:
>
> So just a thought here... When you're dealing with geographically
> distributed nodes, at least in the case of RMQ, you can run into TCP
> timeouts/truncation on long-running TCP connections that carry no
> traffic. The keepalive interval is a kernel-defined value. See
> http://unix.stackexchange.com/questions/316020/in-linux-does-proc-sys-net-ipv4-tcp-keepalive-time-has-impact-on-both-client
>
> What can happen is that if your keepalive time is longer than the
> firewall's truncation value (idle timeout), you will see all sorts of
> randomness in responses as services disconnect and reconnect, depending
> on when the firewall truncates them and when the default TCP keepalive
> "re-vivifies" the connection.
>
> We have very large geographically distributed collectives, so this is
> one of the things we've had to address. (We also have very large
> collectives of over 70K nodes, and have tested up to 250K nodes in
> single DCs.)
>
> Here's a suggestion to try and troubleshoot: set the TCP keepalive time
> in the kernel to less than 1000 seconds (a little more than 16
> minutes). I'm pretty sure the default value is 7200 seconds, or 2 hours.
>
> A way you could test this assertion is to collect two sample lists
> using mco ping:
>
>     mco ping --nodes=bla > /tmp/ListA
>     wait 5 minutes
>     mco ping --nodes=bla > /tmp/ListB
>
> Check what is in List A but NOT in List B and try to mco ping that host
> directly; see if it still does not respond. If it still does not
> respond, go onto the host and restart the mcollective service (agent),
> then try mco ping to that host again. If it works the second time,
> that's a reasonable indication that the long-running TCP connection is
> being truncated by a firewall of some sort.
>
> Hope that helps.
>
> On Friday, March 3, 2017 at 11:29:41 PM UTC-5, [email protected] wrote:
>>
>> Hi,
>>
>> I want to use mcollective across geographically distributed sites and
>> have configured an ActiveMQ network of brokers for that. After the
>> configuration everything works as expected; the only problem is that
>> when I run any mco query against multiple servers, it gets
>> disconnected for some time if many of them do not respond.
>>
>> For example, I have 1000 clients connected to the ActiveMQ broker at
>> each location, and I am deploying mcollective to many servers; I have
>> the full list of the servers (458) I am deploying to. When I run mco
>> ping or any mco query with --nodes=<server_list_file> across
>> locations, I get responses from 243 servers, which means mcollective
>> is configured successfully on those, and no response from the rest,
>> which means mcollective is not configured on them yet. The problem is
>> that because so many servers do not respond, the next mco query I run
>> against the connected servers gets no response either. If I wait a few
>> minutes and then run the same query against the working servers, I do
>> get responses. I suspect that when I run an mco query and many servers
>> do not respond, the middleware takes time to clear pending queues, or
>> something similar is happening. Any idea how to tune this?
>>
>> Regards
>> Ravi
>>
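
For anyone trying the ListA/ListB comparison suggested in the quoted reply, here is a rough Python sketch of the "what is in List A but not in List B" step. It assumes each reply line in the captured mco ping output looks like "<hostname>   time=<n> ms" and that the statistics footer does not; check that against your own captures before trusting it. The function names are made up for this example.

```python
import re


def responding_hosts(ping_output):
    """Return the set of host names that replied in one captured ping run."""
    hosts = set()
    for line in ping_output.splitlines():
        # Assumed reply format: "<hostname>   time=<n> ms".
        # Blank lines and the statistics footer do not match and are skipped.
        match = re.match(r"^(\S+)\s+time=", line)
        if match:
            hosts.add(match.group(1))
    return hosts


def dropped_hosts(first_run, second_run):
    """Hosts that replied in the first capture but not in the second."""
    return sorted(responding_hosts(first_run) - responding_hosts(second_run))
```

To use it on the two files from the procedure, read both and print the result, for example `print(dropped_hosts(open("/tmp/ListA").read(), open("/tmp/ListB").read()))`. Each host it prints is a candidate for the direct mco ping and agent restart check described in the reply.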
