[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14204744#comment-14204744
 ] 

Joris van Lieshout commented on CLOUDSTACK-7853:
------------------------------------------------

What I just saw in our management log is that 3 minutes before the management 
server found the host behind on ping the cluster was put in Unmanage mode 
(XenServer patching maintenance).

I also noticed that the AgentTaskPool thread that would do the investigation 
you mention was not triggered for this host. I don't know whether this is 
because it was busy or because the agent thread was destroyed after the 
cluster was put in Unmanage. 

This is how I now believe it went:
1. Cluster was set to Unmanage
2. Host rebooted (the brand of physical boxes we use needs at least 10 minutes 
to reboot)
3. Host got behind on ping in the meantime
4. Host state transitioned from Disconnected to Alert via PingTimeout
5. On the next AgentMonitor cycle a transition was attempted from Alert via 
PingTimeout. This is an unknown transition, so an exception was thrown.
6. Host returned from reboot and cluster was set to Manage again
7. Due to this invalid state transition the host never transitioned from Alert 
to something else.
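
To make steps 4 and 5 concrete, here is a minimal sketch of a table-driven 
state machine that fails the same way. It is not the CloudStack code: the 
class HostStatusFsmSketch, its transit() helper and the use of 
IllegalStateException are hypothetical stand-ins, and only the two transitions 
needed for the illustration are registered.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical, simplified model of the host status FSM (not the real Status.java).
public class HostStatusFsmSketch {
    enum Status { Up, Disconnected, Alert }
    enum Event { Ping, PingTimeout }

    // Transition table: current status -> (event -> next status).
    private static final Map<Status, Map<Event, Status>> TRANSITIONS =
            new EnumMap<Status, Map<Event, Status>>(Status.class);

    static void addTransition(Status from, Event event, Status to) {
        Map<Event, Status> byEvent = TRANSITIONS.get(from);
        if (byEvent == null) {
            byEvent = new EnumMap<Event, Status>(Event.class);
            TRANSITIONS.put(from, byEvent);
        }
        byEvent.put(event, to);
    }

    static {
        // Seen in the management log: Disconnected -> Alert via PingTimeout.
        addTransition(Status.Disconnected, Event.PingTimeout, Status.Alert);
        // Listed in the Status.java excerpt of the issue: Alert -> Up via Ping.
        addTransition(Status.Alert, Event.Ping, Status.Up);
        // There is deliberately no (Alert, PingTimeout) entry.
    }

    // Looks up the next status; an unregistered pair is treated as an error.
    static Status transit(Status current, Event event) {
        Map<Event, Status> byEvent = TRANSITIONS.get(current);
        Status next = (byEvent == null) ? null : byEvent.get(event);
        if (next == null) {
            throw new IllegalStateException(
                    "Unable to transition to a new state from " + current + " via " + event);
        }
        return next;
    }

    public static void main(String[] args) {
        Status status = Status.Disconnected;
        status = transit(status, Event.PingTimeout);   // step 4
        System.out.println("After first PingTimeout: " + status);
        try {
            transit(status, Event.PingTimeout);        // step 5
        } catch (IllegalStateException e) {
            System.out.println("Second PingTimeout rejected: " + e.getMessage());
        }
    }
}
```

In this sketch the second PingTimeout leaves the status unchanged and raises 
an exception, which matches what the log shows on the next AgentMonitor cycle.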

> Hosts that are temporary Disconnected and get behind on ping (PingTimeout) 
> turn up in permanent state Alert
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-7853
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7853
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the 
> default.) 
>    Affects Versions: Future, 4.3.0, 4.4.0, 4.5.0, 4.3.1, 4.4.1, 4.6.0
>            Reporter: Joris van Lieshout
>            Priority: Critical
>
> If for some reason (I've been unable to determine why, but my suspicion is 
> that the management server is busy processing other agent requests and/or 
> xapi is temporarily unavailable) a host that is Disconnected gets behind on 
> ping (PingTimeout), it is transitioned to a permanent state of Alert.
> INFO  [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-9551e174) Found the following agents behind on ping: [421, 427, 425]
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-9551e174) Ping timeout for host 421, do invstigation
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-9551e174) Transition:[Resource state = Enabled, Agent event = PingTimeout, Host id = 421, name = xxxxxx1]
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-9551e174) Agent status update: [id = 421; name = xxxxxx1; old status = Disconnected; event = PingTimeout; new status = Alert; old update count = 111; new update count = 112]
> ----/ next cycle / -----
> INFO  [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-2a81b9f7) Found the following agents behind on ping: [421, 427, 425]
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-2a81b9f7) Ping timeout for host 421, do invstigation
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-2a81b9f7) Transition:[Resource state = Enabled, Agent event = PingTimeout, Host id = 421, name = xxxxxx1]
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-2a81b9f7) Cannot transit agent status with event PingTimeout for host 421, name=xxxxxx1, mangement server id is 345052370017
> ERROR [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-2a81b9f7) Caught the following exception: 
> com.cloud.utils.exception.CloudRuntimeException: Cannot transit agent status with event PingTimeout for host 421, mangement server id is 345052370017,Unable to transition to a new state from Alert via PingTimeout
>         at com.cloud.agent.manager.AgentManagerImpl.agentStatusTransitTo(AgentManagerImpl.java:1334)
>         at com.cloud.agent.manager.AgentManagerImpl.disconnectAgent(AgentManagerImpl.java:1349)
>         at com.cloud.agent.manager.AgentManagerImpl.disconnectInternal(AgentManagerImpl.java:1378)
>         at com.cloud.agent.manager.AgentManagerImpl.disconnectWithInvestigation(AgentManagerImpl.java:1384)
>         at com.cloud.agent.manager.AgentManagerImpl$MonitorTask.runInContext(AgentManagerImpl.java:1466)
>         at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
>         at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
>         at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
>         at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
>         at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:701)
> I think the bug occurs because there is no valid state transition from Alert 
> via PingTimeout to something recoverable.
> Status.java:
>         s_fsm.addTransition(Status.Alert, Event.AgentConnected, Status.Connecting);
>         s_fsm.addTransition(Status.Alert, Event.Ping, Status.Up);
>         s_fsm.addTransition(Status.Alert, Event.Remove, Status.Removed);
>         s_fsm.addTransition(Status.Alert, Event.ManagementServerDown, Status.Alert);
>         s_fsm.addTransition(Status.Alert, Event.AgentDisconnected, Status.Alert);
>         s_fsm.addTransition(Status.Alert, Event.ShutdownRequested, Status.Disconnected);
> As a workaround to get out of this situation we put the cluster in Unmanage, 
> wait 10 minutes and put the cluster back in Manage.
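
One direction that might avoid the exception, shown purely as a sketch and not 
as a confirmed fix, is to register a self-transition so that a repeated 
PingTimeout keeps the host in Alert instead of failing. The class 
AlertSelfLoopSketch and its helpers below are hypothetical; the Alert 
transitions are copied from the Status.java excerpt above, the Disconnected 
one is taken from the log, and the last entry is the assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the FSM in Status.java; not the CloudStack API.
public class AlertSelfLoopSketch {
    enum Status { Up, Connecting, Disconnected, Removed, Alert }
    enum Event { AgentConnected, Ping, Remove, ManagementServerDown,
                 AgentDisconnected, ShutdownRequested, PingTimeout }

    // Transition table keyed by "<status>/<event>".
    private static final Map<String, Status> TABLE = new HashMap<String, Status>();

    static void addTransition(Status from, Event event, Status to) {
        TABLE.put(from + "/" + event, to);
    }

    static {
        // Observed in the log of the issue.
        addTransition(Status.Disconnected, Event.PingTimeout, Status.Alert);
        // Transitions out of Alert as quoted from Status.java.
        addTransition(Status.Alert, Event.AgentConnected, Status.Connecting);
        addTransition(Status.Alert, Event.Ping, Status.Up);
        addTransition(Status.Alert, Event.Remove, Status.Removed);
        addTransition(Status.Alert, Event.ManagementServerDown, Status.Alert);
        addTransition(Status.Alert, Event.AgentDisconnected, Status.Alert);
        addTransition(Status.Alert, Event.ShutdownRequested, Status.Disconnected);
        // ASSUMPTION (not in the source): tolerate a repeated PingTimeout in Alert.
        addTransition(Status.Alert, Event.PingTimeout, Status.Alert);
    }

    // Returns the next status, or throws if the pair is not registered.
    static Status transit(Status current, Event event) {
        Status next = TABLE.get(current + "/" + event);
        if (next == null) {
            throw new IllegalStateException(
                    "Unable to transition to a new state from " + current + " via " + event);
        }
        return next;
    }

    public static void main(String[] args) {
        Status status = Status.Disconnected;
        status = transit(status, Event.PingTimeout); // first monitor cycle
        status = transit(status, Event.PingTimeout); // next cycle, no exception here
        System.out.println("While the host is still rebooting: " + status);
        status = transit(status, Event.Ping);        // host is back and pings again
        System.out.println("After the host returns: " + status);
    }
}
```

Whether a self-transition is the right behaviour for the real AgentMonitor, or 
whether the host should instead go to another state, is a design decision this 
sketch does not settle.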



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
