Hello,

We rebooted the management server to check whether it had an impact on a small display bug we had spotted. Since then, all our infrastructure has been going crazy.
After the reboot, all 17 of our nodes went to the Disconnected/Alert/Connecting state, and the HA-Worker is complaining that the various hosts are unreachable. (Please note that no other change was made, the network is fine, and no iptables/firewall rules are blocking the communication.) For instance:

Unable to reach the agent for VM[ConsoleProxy|v-189-VM]: Resource [Host:78] is unreachable: Host 78: Host is not in the right state: Disconnected

and

2012-11-12 15:22:47,849 INFO [agent.manager.AgentMonitor] (AgentMonitor:null) Found the following agents behind on ping: [75, 52, 4]
2012-11-12 15:22:47,851 DEBUG [cloud.host.Status] (AgentMonitor:null) Ping timeout for host 75, do invstigation
2012-11-12 15:22:47,853 DEBUG [cloud.host.Status] (AgentMonitor:null) Ping timeout for host 52, do invstigation
2012-11-12 15:22:47,853 INFO [agent.manager.AgentManagerImpl] (AgentTaskPool-5:null) Investigating why host 75 has disconnected with event PingTimeout
2012-11-12 15:22:47,854 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-5:null) checking if agent (75) is alive
2012-11-12 15:22:47,855 INFO [agent.manager.AgentManagerImpl] (AgentTaskPool-6:null) Investigating why host 52 has disconnected with event PingTimeout
2012-11-12 15:22:47,855 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-6:null) checking if agent (52) is alive
2012-11-12 15:22:47,856 DEBUG [cloud.host.Status] (AgentMonitor:null) Ping timeout for host 4, do invstigation

It seems the pings fail because all the hosts are seen as down. All the nodes are running fine and are connected to the server (TCP status connected); however, the management server does not seem to see them. The interface shows the status Connecting, but the nodes never go back to the Connected state.
The client is saying:

2012-11-12 15:39:36,202 INFO [utils.nio.NioClient] (Agent-Selector:null) Connecting to 172.16.11.10:8250
2012-11-12 15:39:36,295 INFO [utils.nio.NioClient] (Agent-Selector:null) SSL: Handshake done
2012-11-12 15:39:41,296 INFO [cloud.agent.Agent] (Agent Timer:null) Connected to the server
2012-11-12 15:45:37,635 INFO [cloud.agent.Agent] (Agent Timer:null) The startup command is now cancelled
2012-11-12 15:45:42,636 INFO [cloud.agent.Agent] (Agent Timer:null) Lost connection to the server. Dealing with the remaining commands...

The agent connects to the management server but gets disconnected for an unknown reason. The only error in the log that catches my eye is:

2012-11-12 15:17:48,945 ERROR [cloud.servlet.CloudStartupServlet] (main:null) Exception starting management server
java.lang.NumberFormatException: For input string: "false"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:481)
    at java.lang.Integer.parseInt(Integer.java:514)
    at com.cloud.api.ApiServer.init(ApiServer.java:282)
    at com.cloud.api.ApiServer.initApiServer(ApiServer.java:159)
    at com.cloud.servlet.CloudStartupServlet.init(CloudStartupServlet.java:46)
    at javax.servlet.GenericServlet.init(GenericServlet.java:212)
    at org.apache.catalina.core.StandardWrapper.loadServlet(StandardWrapper.java:1173)
    at org.apache.catalina.core.StandardWrapper.load(StandardWrapper.java:993)
    at org.apache.catalina.core.StandardContext.loadOnStartup(StandardContext.java:4187)
    at org.apache.catalina.core.StandardContext.start(StandardContext.java:4496)
    at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:791)
    at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:771)
    at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:526)
    at org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:1041)
    at org.apache.catalina.startup.HostConfig.deployDirectories(HostConfig.java:964)
    at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:502)
    at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1277)
    at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:321)
    at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:119)
    at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1053)
    at org.apache.catalina.core.StandardHost.start(StandardHost.java:722)
    at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045)
    at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:443)
    at org.apache.catalina.core.StandardService.start(StandardService.java:516)
    at org.apache.catalina.core.StandardServer.start(StandardServer.java:710)
    at org.apache.catalina.startup.Catalina.start(Catalina.java:593)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:616)
    at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:289)
    at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:414)

We are clueless about what is causing this issue, as all the nodes (from 3 different zones) are seen as down while they are in fact running fine. This is creating a big mess in our production, so any idea would be useful!

Thanks,
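For reference, the exception at the top of the stack trace can be reproduced in isolation. The trace shows `Integer.parseInt` being called from `ApiServer.init` with the string "false", so one possible reading (an assumption, not confirmed by the logs) is that a value expected to be an integer is set to "false" wherever the server reads it at startup. The sketch below is not CloudStack code; the class and method names are hypothetical and only illustrate the parsing behaviour.

```java
// Hypothetical stand-alone demo; not taken from the CloudStack source.
class ParseFalseDemo {

    // Attempts to parse a raw setting value as an int and reports the outcome.
    static String tryParse(String raw) {
        try {
            return "parsed " + Integer.parseInt(raw);
        } catch (NumberFormatException e) {
            // For non-numeric input the message names the offending string.
            return "NumberFormatException: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // A numeric string parses normally (the value is an arbitrary example).
        System.out.println(tryParse("8096"));
        // A boolean-looking string fails with the same message as in the trace.
        System.out.println(tryParse("false"));
    }
}
```

If this reading is right, looking for an integer setting whose stored value is "false" would be a reasonable first step; which setting `ApiServer.java:282` actually reads cannot be told from the trace alone.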