HA getting crazy after management server restart (3.0.2)

Arnaud Gaillard Mon, 12 Nov 2012 11:08:04 -0800

Hello,

We rebooted the management server to check whether it had any effect on a
small display bug we had spotted. Since then, our whole infrastructure has
been going crazy.


After the reboot, all our nodes (17) went into the
disconnected/alert/connecting state, and the HA-Worker is complaining that
the various hosts are unreachable. (Please note that no other changes were
made, the network is fine, and no iptables/firewall rules are blocking the
communication.)

For instance:
Unable to reach the agent for VM[ConsoleProxy|v-189-VM]: Resource [Host:78]
is unreachable: Host 78: Host is not in the right state: Disconnected

and

2012-11-12 15:22:47,849 INFO  [agent.manager.AgentMonitor]
(AgentMonitor:null) Found the following agents behind on ping: [75, 52, 4]
2012-11-12 15:22:47,851 DEBUG [cloud.host.Status] (AgentMonitor:null) Ping
timeout for host 75, do invstigation
2012-11-12 15:22:47,853 DEBUG [cloud.host.Status] (AgentMonitor:null) Ping
timeout for host 52, do invstigation
2012-11-12 15:22:47,853 INFO  [agent.manager.AgentManagerImpl]
(AgentTaskPool-5:null) Investigating why host 75 has disconnected with
event PingTimeout
2012-11-12 15:22:47,854 DEBUG [agent.manager.AgentManagerImpl]
(AgentTaskPool-5:null) checking if agent (75) is alive
2012-11-12 15:22:47,855 INFO  [agent.manager.AgentManagerImpl]
(AgentTaskPool-6:null) Investigating why host 52 has disconnected with
event PingTimeout
2012-11-12 15:22:47,855 DEBUG [agent.manager.AgentManagerImpl]
(AgentTaskPool-6:null) checking if agent (52) is alive
2012-11-12 15:22:47,856 DEBUG [cloud.host.Status] (AgentMonitor:null) Ping
timeout for host 4, do invstigation

It seems the ping check fails because all the hosts are seen as down.
All the nodes are running fine and are connected to the server (TCP status
connected); however, the management server does not seem to see them.

The interface shows the status as connecting, but the nodes never go back
to the connected state. The client log says:


2012-11-12 15:39:36,202 INFO  [utils.nio.NioClient] (Agent-Selector:null)
Connecting to 172.16.11.10:8250
2012-11-12 15:39:36,295 INFO  [utils.nio.NioClient] (Agent-Selector:null)
SSL: Handshake done
2012-11-12 15:39:41,296 INFO  [cloud.agent.Agent] (Agent Timer:null)
Connected to the server
2012-11-12 15:45:37,635 INFO  [cloud.agent.Agent] (Agent Timer:null) The
startup command is now cancelled
2012-11-12 15:45:42,636 INFO  [cloud.agent.Agent] (Agent Timer:null) Lost
connection to the server. Dealing with the remaining commands...

Here the agent connects to the management server but gets
disconnected for an unknown reason.
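
To double-check plain TCP reachability from an agent host, something like
the following could be used. This is only a minimal sketch: the address and
port are the ones from the agent log above, PortCheck and canConnect are
hypothetical names, and a successful connect only shows that the port
accepts connections, not that the agent startup completes.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortCheck {
    // Hypothetical helper: true if a TCP connection to host:port can be
    // opened within timeoutMillis, false on refusal, timeout or other I/O error.
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Address and port taken from the agent log (172.16.11.10:8250).
        System.out.println(canConnect("172.16.11.10", 8250, 3000));
    }
}
```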

The only error in the log that catches my eye is:

2012-11-12 15:17:48,945 ERROR [cloud.servlet.CloudStartupServlet] (main:null) Exception starting management server
java.lang.NumberFormatException: For input string: "false"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Integer.parseInt(Integer.java:481)
        at java.lang.Integer.parseInt(Integer.java:514)
        at com.cloud.api.ApiServer.init(ApiServer.java:282)
        at com.cloud.api.ApiServer.initApiServer(ApiServer.java:159)
        at com.cloud.servlet.CloudStartupServlet.init(CloudStartupServlet.java:46)
        at javax.servlet.GenericServlet.init(GenericServlet.java:212)
        at org.apache.catalina.core.StandardWrapper.loadServlet(StandardWrapper.java:1173)
        at org.apache.catalina.core.StandardWrapper.load(StandardWrapper.java:993)
        at org.apache.catalina.core.StandardContext.loadOnStartup(StandardContext.java:4187)
        at org.apache.catalina.core.StandardContext.start(StandardContext.java:4496)
        at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:791)
        at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:771)
        at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:526)
        at org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:1041)
        at org.apache.catalina.startup.HostConfig.deployDirectories(HostConfig.java:964)
        at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:502)
        at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1277)
        at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:321)
        at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:119)
        at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1053)
        at org.apache.catalina.core.StandardHost.start(StandardHost.java:722)
        at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045)
        at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:443)
        at org.apache.catalina.core.StandardService.start(StandardService.java:516)
        at org.apache.catalina.core.StandardServer.start(StandardServer.java:710)
        at org.apache.catalina.startup.Catalina.start(Catalina.java:593)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:289)
        at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:414)
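
For reference, the exception message in the trace above can be reproduced in
isolation. This is a minimal sketch, assuming that some value which
ApiServer.init expects to be an integer currently holds the string "false";
which setting that is cannot be seen from the trace. ParseIntCheck and its
helper methods are hypothetical names, not part of the management server.

```java
public class ParseIntCheck {
    // Returns the message of the NumberFormatException thrown by
    // Integer.parseInt, or null if the value parses cleanly.
    static String parseError(String raw) {
        try {
            Integer.parseInt(raw);
            return null;
        } catch (NumberFormatException e) {
            return e.getMessage();
        }
    }

    // Hypothetical defensive variant: falls back to a default instead of
    // letting the exception abort startup.
    static int parseOrDefault(String raw, int fallback) {
        try {
            return Integer.parseInt(raw);
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        // Same message as in the management server log.
        System.out.println(parseError("false"));
        // A well-formed integer string parses normally.
        System.out.println(parseOrDefault("42", -1));
        // A non-numeric string yields the fallback value.
        System.out.println(parseOrDefault("false", -1));
    }
}
```

If that assumption holds, the thing to look for would be a value that should
be numeric but was saved as "false".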

We are clueless about what is causing this issue: all the nodes (from 3
different zones) are seen as down, but they are in fact running fine.

As this is creating a big mess in our production environment, any idea
would be useful!

Thanks,
