What are the specific values you upped? I'd like to preemptively take a look at these myself - would prefer HA not going crazy. :)
On Mon, Nov 12, 2012 at 3:12 PM, Arnaud Gaillard <arnaud.gaill...@xtendsys.net> wrote:
> Thanks, we did increase the timeout value, and everything seems to be back in
> order.
>
> On Mon, Nov 12, 2012 at 8:21 PM, Caleb Call <calebc...@me.com> wrote:
>
>> I had the same thing happen to my environment. I thought it was just my
>> older hardware. I ended up upping the check values in the global settings
>> and it hasn't recurred since (it happened three times before making these
>> changes). One thing that helps slightly is adding a hosts entry for my
>> hypervisors to the management server.
>>
>> On Nov 12, 2012, at 12:07 PM, Arnaud Gaillard
>> <arnaud.gaill...@xtendsys.net> wrote:
>>
>> > Hello,
>> >
>> > We rebooted the management server to check if it had an impact on a
>> > little display bug we spotted. Since that moment all our infra is going
>> > crazy.
>> >
>> > After the reboot all our nodes (17) went to the
>> > disconnected/alert/connecting state and the HA-Worker is complaining that
>> > the various hosts are unreachable.
>> > (Please note that no other changes were made, the network is fine, and no
>> > iptables/firewall rules are preventing the communication.)
>> >
>> > For instance:
>> > Unable to reach the agent for VM[ConsoleProxy|v-189-VM]: Resource [Host:78]
>> > is unreachable: Host 78: Host is not in the right state: Disconnected
>> >
>> > and
>> >
>> > 2012-11-12 15:22:47,849 INFO [agent.manager.AgentMonitor] (AgentMonitor:null) Found the following agents behind on ping: [75, 52, 4]
>> > 2012-11-12 15:22:47,851 DEBUG [cloud.host.Status] (AgentMonitor:null) Ping timeout for host 75, do invstigation
>> > 2012-11-12 15:22:47,853 DEBUG [cloud.host.Status] (AgentMonitor:null) Ping timeout for host 52, do invstigation
>> > 2012-11-12 15:22:47,853 INFO [agent.manager.AgentManagerImpl] (AgentTaskPool-5:null) Investigating why host 75 has disconnected with event PingTimeout
>> > 2012-11-12 15:22:47,854 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-5:null) checking if agent (75) is alive
>> > 2012-11-12 15:22:47,855 INFO [agent.manager.AgentManagerImpl] (AgentTaskPool-6:null) Investigating why host 52 has disconnected with event PingTimeout
>> > 2012-11-12 15:22:47,855 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-6:null) checking if agent (52) is alive
>> > 2012-11-12 15:22:47,856 DEBUG [cloud.host.Status] (AgentMonitor:null) Ping timeout for host 4, do invstigation
>> >
>> > It seems that the ping failed because all hosts are seen as down....
>> > All the nodes are running fine and are connected (TCP status connected) to
>> > the server; however, the management server does not seem to see them.
>> >
>> > The interface shows the status "connecting", but the nodes never go back
>> > to the connected state.
>> > The client is saying:
>> >
>> > 2012-11-12 15:39:36,202 INFO [utils.nio.NioClient] (Agent-Selector:null) Connecting to 172.16.11.10:8250
>> > 2012-11-12 15:39:36,295 INFO [utils.nio.NioClient] (Agent-Selector:null) SSL: Handshake done
>> > 2012-11-12 15:39:41,296 INFO [cloud.agent.Agent] (Agent Timer:null) Connected to the server
>> > 2012-11-12 15:45:37,635 INFO [cloud.agent.Agent] (Agent Timer:null) The startup command is now cancelled
>> > 2012-11-12 15:45:42,636 INFO [cloud.agent.Agent] (Agent Timer:null) Lost connection to the server. Dealing with the remaining commands...
>> >
>> > Here the agent tries to connect to the management server but gets
>> > disconnected for an unknown reason.
>> >
>> > The only error that catches my eye in the log is:
>> >
>> > 2012-11-12 15:17:48,945 ERROR [cloud.servlet.CloudStartupServlet] (main:null) Exception starting management server
>> > java.lang.NumberFormatException: For input string: "false"
>> >   at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>> >   at java.lang.Integer.parseInt(Integer.java:481)
>> >   at java.lang.Integer.parseInt(Integer.java:514)
>> >   at com.cloud.api.ApiServer.init(ApiServer.java:282)
>> >   at com.cloud.api.ApiServer.initApiServer(ApiServer.java:159)
>> >   at com.cloud.servlet.CloudStartupServlet.init(CloudStartupServlet.java:46)
>> >   at javax.servlet.GenericServlet.init(GenericServlet.java:212)
>> >   at org.apache.catalina.core.StandardWrapper.loadServlet(StandardWrapper.java:1173)
>> >   at org.apache.catalina.core.StandardWrapper.load(StandardWrapper.java:993)
>> >   at org.apache.catalina.core.StandardContext.loadOnStartup(StandardContext.java:4187)
>> >   at org.apache.catalina.core.StandardContext.start(StandardContext.java:4496)
>> >   at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:791)
>> >   at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:771)
>> >   at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:526)
>> >   at org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:1041)
>> >   at org.apache.catalina.startup.HostConfig.deployDirectories(HostConfig.java:964)
>> >   at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:502)
>> >   at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1277)
>> >   at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:321)
>> >   at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:119)
>> >   at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1053)
>> >   at org.apache.catalina.core.StandardHost.start(StandardHost.java:722)
>> >   at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045)
>> >   at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:443)
>> >   at org.apache.catalina.core.StandardService.start(StandardService.java:516)
>> >   at org.apache.catalina.core.StandardServer.start(StandardServer.java:710)
>> >   at org.apache.catalina.startup.Catalina.start(Catalina.java:593)
>> >   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> >   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> >   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> >   at java.lang.reflect.Method.invoke(Method.java:616)
>> >   at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:289)
>> >   at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:414)
>> >
>> > We are clueless about what is causing this issue, as all the nodes (from 3
>> > different zones) are seen as down but they are in fact running fine...
>> >
>> > As this is creating a big mess in our production, any idea may be useful!
>> >
>> > Thanks,
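
A side note on the stack trace quoted above: a NumberFormatException with the message 'For input string: "false"' is what Integer.parseInt raises when it is handed a non-numeric string, so the trace suggests that some value read during ApiServer.init was expected to be an integer but held the string "false". Which setting that is cannot be told from the thread. Below is a minimal sketch of the failure and of a defensive parse. The class name SettingParseSketch, the helper parseIntSetting, and the sample value "60" are all hypothetical illustrations, not CloudStack code.

```java
import java.util.OptionalInt;

public class SettingParseSketch {

    // Hypothetical helper (not part of CloudStack): parse a setting value
    // that is expected to be an integer, returning empty instead of throwing
    // when the stored string is missing or not a number.
    public static OptionalInt parseIntSetting(String raw) {
        if (raw == null) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(raw.trim()));
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        // Reproduce the exception type seen in the trace: parseInt on "false".
        try {
            Integer.parseInt("false");
        } catch (NumberFormatException e) {
            System.out.println("parseInt failed: " + e.getMessage());
        }

        // The defensive variant reports "not a number" without throwing.
        System.out.println("\"false\" parses: " + parseIntSetting("false").isPresent());

        // "60" is an arbitrary sample value, not taken from the thread.
        System.out.println("\"60\" parses: " + parseIntSetting("60").isPresent());
    }
}
```

If this reading of the trace is right, looking through the global settings for an integer-typed entry whose stored value is "false" would be a reasonable first check; that is only a guess from the trace, not something confirmed in the thread.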