Re: HA getting crazy after management server restart (3.0.2)

Bryan Whitehead Mon, 12 Nov 2012 16:13:53 -0800

What are the specific values you upped? I'd like to preemptively take
a look at these myself - would prefer HA not going crazy. :)


On Mon, Nov 12, 2012 at 3:12 PM, Arnaud Gaillard
<[email protected]> wrote:
> Thanks, we increased the timeout value, and everything seems to be back in
> order.
>
>
>
>
>
> On Mon, Nov 12, 2012 at 8:21 PM, Caleb Call <[email protected]> wrote:
>
>> I had the same thing happen to my environment.  I thought it was just my
>> older hardware.  I ended up upping the check values in the global settings
>> and it hasn't recurred since (it happened three times before making these
>> changes).  One thing that helped slightly was adding hosts entries for my
>> hypervisors on the management server.
>>
>>
>> On Nov 12, 2012, at 12:07 PM, Arnaud Gaillard <
>> [email protected]> wrote:
>>
>> > Hello,
>> >
>> > We rebooted the management server to check whether it had an impact on a
>> > little display bug we spotted. Since then, all our infra has been going crazy.
>> >
>> > After the reboot, all our nodes (17) went to the
>> > disconnected/alert/connecting state, and the HA-Worker is complaining that
>> > the various hosts are unreachable. (Please note that no other changes were
>> > made, the network is fine, and no iptables/firewall rules are preventing
>> > the communication.)
>> >
>> > For instance:
>> > Unable to reach the agent for VM[ConsoleProxy|v-189-VM]: Resource [Host:78]
>> > is unreachable: Host 78: Host is not in the right state: Disconnected
>> >
>> > and
>> >
>> > 2012-11-12 15:22:47,849 INFO  [agent.manager.AgentMonitor] (AgentMonitor:null) Found the following agents behind on ping: [75, 52, 4]
>> > 2012-11-12 15:22:47,851 DEBUG [cloud.host.Status] (AgentMonitor:null) Ping timeout for host 75, do invstigation
>> > 2012-11-12 15:22:47,853 DEBUG [cloud.host.Status] (AgentMonitor:null) Ping timeout for host 52, do invstigation
>> > 2012-11-12 15:22:47,853 INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-5:null) Investigating why host 75 has disconnected with event PingTimeout
>> > 2012-11-12 15:22:47,854 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-5:null) checking if agent (75) is alive
>> > 2012-11-12 15:22:47,855 INFO  [agent.manager.AgentManagerImpl] (AgentTaskPool-6:null) Investigating why host 52 has disconnected with event PingTimeout
>> > 2012-11-12 15:22:47,855 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-6:null) checking if agent (52) is alive
>> > 2012-11-12 15:22:47,856 DEBUG [cloud.host.Status] (AgentMonitor:null) Ping timeout for host 4, do invstigation
>> >
>> > It seems the ping failed because all the hosts are seen as down.
>> > All the nodes are running fine and are connected (TCP status connected) to
>> > the server; however, the management server does not seem to see them.
>> >
>> > The interface shows the status connecting, but the nodes never go back
>> > to the connected state. The client is saying:
>> >
>> >
>> > 2012-11-12 15:39:36,202 INFO  [utils.nio.NioClient] (Agent-Selector:null) Connecting to 172.16.11.10:8250
>> > 2012-11-12 15:39:36,295 INFO  [utils.nio.NioClient] (Agent-Selector:null) SSL: Handshake done
>> > 2012-11-12 15:39:41,296 INFO  [cloud.agent.Agent] (Agent Timer:null) Connected to the server
>> > 2012-11-12 15:45:37,635 INFO  [cloud.agent.Agent] (Agent Timer:null) The startup command is now cancelled
>> > 2012-11-12 15:45:42,636 INFO  [cloud.agent.Agent] (Agent Timer:null) Lost connection to the server. Dealing with the remaining commands...
>> >
>> > Here the agent tries to connect to the management server but gets
>> > disconnected for an unknown reason.
>> >
>> > The only error in the log that catches my eye is:
>> >
>> > 2012-11-12 15:17:48,945 ERROR [cloud.servlet.CloudStartupServlet] (main:null) Exception starting management server
>> > java.lang.NumberFormatException: For input string: "false"
>> >         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>> >         at java.lang.Integer.parseInt(Integer.java:481)
>> >         at java.lang.Integer.parseInt(Integer.java:514)
>> >         at com.cloud.api.ApiServer.init(ApiServer.java:282)
>> >         at com.cloud.api.ApiServer.initApiServer(ApiServer.java:159)
>> >         at com.cloud.servlet.CloudStartupServlet.init(CloudStartupServlet.java:46)
>> >         at javax.servlet.GenericServlet.init(GenericServlet.java:212)
>> >         at org.apache.catalina.core.StandardWrapper.loadServlet(StandardWrapper.java:1173)
>> >         at org.apache.catalina.core.StandardWrapper.load(StandardWrapper.java:993)
>> >         at org.apache.catalina.core.StandardContext.loadOnStartup(StandardContext.java:4187)
>> >         at org.apache.catalina.core.StandardContext.start(StandardContext.java:4496)
>> >         at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:791)
>> >         at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:771)
>> >         at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:526)
>> >         at org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:1041)
>> >         at org.apache.catalina.startup.HostConfig.deployDirectories(HostConfig.java:964)
>> >         at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:502)
>> >         at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1277)
>> >         at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:321)
>> >         at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:119)
>> >         at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1053)
>> >         at org.apache.catalina.core.StandardHost.start(StandardHost.java:722)
>> >         at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045)
>> >         at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:443)
>> >         at org.apache.catalina.core.StandardService.start(StandardService.java:516)
>> >         at org.apache.catalina.core.StandardServer.start(StandardServer.java:710)
>> >         at org.apache.catalina.startup.Catalina.start(Catalina.java:593)
>> >         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> >         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> >         at java.lang.reflect.Method.invoke(Method.java:616)
>> >         at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:289)
>> >         at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:414)
>> >
>> > We are clueless about what is causing this issue, as all the nodes (from 3
>> > different zones) are seen as down but are in fact running fine...
>> >
>> > As this is creating a big mess in our production, any idea would be useful!
>> >
>> > Thanks,
>>
>>
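
A note on the stack trace quoted above: it shows Integer.parseInt being called from ApiServer.init on the string "false", which suggests some value that the code expects to be a number currently holds a boolean-looking string. The thread does not say which setting that is, nor whether this error is related to the disconnected hosts. The sketch below only reproduces the exception in isolation and shows one defensive way to parse such a value. The class name SettingParser and the method parseIntOrDefault are hypothetical and are not part of CloudStack.

```java
public class SettingParser {

    // Hypothetical helper (not CloudStack code): parse an integer setting,
    // returning the supplied fallback when the value is missing or is not
    // a valid integer.
    static int parseIntOrDefault(String raw, int fallback) {
        if (raw == null) {
            return fallback;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        // Reproduces the failure seen in the trace: "false" is not an integer.
        try {
            Integer.parseInt("false");
        } catch (NumberFormatException e) {
            System.out.println("Caught: " + e.getMessage());
        }

        // The defensive variant falls back instead of throwing.
        System.out.println(parseIntOrDefault("false", -1));
        System.out.println(parseIntOrDefault(" 42 ", -1));
    }
}
```

Falling back silently can hide a misconfiguration, so a real fix would more likely be to correct the stored value itself; the helper is only meant to illustrate why the startup servlet failed on that input.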
