Hi,

For the last couple of days I have been facing a strange issue with my setup. Sometimes the mco client is unable to connect to the central ActiveMQ server and I keep getting onconnect_fail errors in the mco log. When I check activemq.log for the same period there are no errors, but if I restart the central ActiveMQ it starts working properly again.
Any idea on this, please?

On Thu, Mar 9, 2017 at 8:48 PM, seabos <[email protected]> wrote:

> I would suggest that you first validate that this condition is occurring
> by following the steps I gave in my previous note, i.e. collect two
> consecutive lists, compare them, find a non-responding node, restart the
> agent on that node, and see if it responds. If you're patient you could
> wait for a period of, say, 1 hour after connectivity has been restored by
> the agent restart and check whether the connection has been truncated
> again after 1 hour of no traffic.
>
> But to directly answer your question: setting the keepalive would have to
> be done on every node running the mc agent, or wherever these long-running
> unused connections can exist. You need to understand that if this
> condition is occurring (firewall/network device truncation), it could be
> happening at 1 hour of inactivity, 30 minutes of inactivity, 15 minutes of
> inactivity, etc., perhaps even less. So you would need to set the
> keepalive to be less than the truncation period. This will require some
> detective work. Setting it to 1000 seconds when the truncation period is
> 15 minutes will produce some strange behavior in the 1.666 minutes between
> the time the truncation occurs and the time the keepalive activates the
> TCP connection.
>
> hope that helps
>
> On Wednesday, March 8, 2017 at 11:07:00 PM UTC-5, Ravi Kumar wrote:
>>
>> One thing: should I change the TCP keepalive timeout on the central
>> broker, on each remote broker, on the mco console, or on each end server
>> that connects to ActiveMQ?
>>
>> On Thu, Mar 9, 2017 at 8:07 AM, <[email protected]> wrote:
>>
>>> Thanks, let me try the same.
>>>
>>> On Wednesday, March 8, 2017 at 5:08:44 AM UTC+5:30, seabos wrote:
>>>>
>>>> So just a thought here... When you're dealing with geographically
>>>> distributed nodes, at least in the case of RMQ, you can run into TCP
>>>> timeouts/truncation on long-running TCP connections where there is no
>>>> communication between the endpoints. The TCP keepalive time is a
>>>> kernel-defined value. See
>>>> http://unix.stackexchange.com/questions/316020/in-linux-does-proc-sys-net-ipv4-tcp-keepalive-time-has-impact-on-both-client
>>>>
>>>> What can happen is that if your keepalive time is longer than the
>>>> firewall truncation value (idle timeout), you will see all sorts of
>>>> randomness in responses as services disconnect and reconnect, depending
>>>> on when the firewall truncates them and when the default TCP keepalive
>>>> "re-vivifies" the connection.
>>>>
>>>> We have very large geographically distributed collectives, so this is
>>>> one of the things we've had to address. (We also have very large
>>>> collectives of over 70K nodes, and have tested up to 250K nodes in
>>>> single DCs.)
>>>>
>>>> Here's a suggestion to try and tshoot: set the TCP keepalive time to
>>>> less than 1000 seconds (a little more than 16 minutes) in the kernel
>>>> (pretty sure the default value is 7200 seconds, i.e. 2 hours).
>>>>
>>>> A way you could test this assertion is to collect two sample lists
>>>> using mco ping:
>>>>
>>>> mco ping --nodes=bla > /tmp/ListA
>>>> wait 5 minutes
>>>> mco ping --nodes=bla > /tmp/ListB
>>>>
>>>> Check what is in List A but NOT in List B and try to mco ping that host
>>>> directly to see if it still does not respond. If it still does not
>>>> respond, go onto the host and restart the mcollective service (agent),
>>>> then try mco ping to that host again.
>>>> If it works the second time, that's a reasonable indication that the
>>>> long-running TCP connection is being truncated by a firewall of some
>>>> sort.
>>>>
>>>> hope that helps
>>>>
>>>> On Friday, March 3, 2017 at 11:29:41 PM UTC-5, [email protected]
>>>> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I want to use mcollective across geographically distributed servers
>>>>> and have configured an ActiveMQ network of brokers for that. After
>>>>> the configuration everything works as expected; the only problem is
>>>>> that when I run an mco query against multiple servers, it gets
>>>>> disconnected for some time when many of them do not respond.
>>>>>
>>>>> For example, I have connected 1000 clients to the ActiveMQ broker at
>>>>> each location, and I am deploying mcollective on many servers. I have
>>>>> the entire list of servers (458) on which I am deploying it. When I
>>>>> run mco ping or any mco query with --nodes=<server_list_file>, I get
>>>>> responses from 243 servers, which means mcollective is configured
>>>>> successfully on those servers, and no response from the other
>>>>> servers, which means mcollective is not configured on them yet. The
>>>>> problem is that because many servers do not respond, when I then run
>>>>> an mco query against the connected servers I get no response. But if
>>>>> I wait a few minutes and run the same query against the working
>>>>> servers, I do get responses. I suspect that when I run an mco query
>>>>> and many servers do not respond, the middleware takes time to clear
>>>>> pending queues, or something similar is happening. Any idea how to
>>>>> tune this?
>>>>>
>>>>> Regards
>>>>> Ravi
>>>>>
>>>> --
>>>
>>> ---
>>> You received this message because you are subscribed to a topic in the
>>> Google Groups "mcollective-users" group.
>>> To unsubscribe from this topic, visit
>>> https://groups.google.com/d/topic/mcollective-users/_bH542nBZSo/unsubscribe.
>>> To unsubscribe from this group and all its topics, send an email to
>>> [email protected].
>>> For more options, visit https://groups.google.com/d/optout.
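
For anyone following the thread: the ListA/ListB comparison that seabos describes could be scripted roughly as below. This is only a sketch and not something posted in the thread. The helper names `responders` and `dropped_hosts` are made up for illustration, and the parsing assumes each reply line of `mco ping` looks like `<hostname>  time=<n> ms`, which may differ between versions, so please check it against your own output first.

```python
def responders(ping_output):
    """Return the set of hostnames that replied in one saved `mco ping` run."""
    hosts = set()
    for line in ping_output.splitlines():
        fields = line.split()
        # Assumed reply format: "<hostname>  time=<n> ms". Blank lines and
        # the trailing statistics summary do not match and are skipped.
        if len(fields) >= 2 and fields[1].startswith("time="):
            hosts.add(fields[0])
    return hosts


def dropped_hosts(list_a, list_b):
    """Hosts that replied in the first run (List A) but not in the second (List B)."""
    return sorted(responders(list_a) - responders(list_b))
```

With the two files from the earlier note, something like `dropped_hosts(open("/tmp/ListA").read(), open("/tmp/ListB").read())` would give the hosts to ping directly and, if they still do not answer, to restart the agent on.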
