Likitha Shetty created CLOUDSTACK-7415:
------------------------------------------

             Summary: Host remains in Alert after vCenter restart
                 Key: CLOUDSTACK-7415
                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7415
             Project: CloudStack
          Issue Type: Bug
      Security Level: Public (Anyone can view this level - this is the default.)
          Components: Management Server
    Affects Versions: 4.0.0
            Reporter: Likitha Shetty
            Assignee: Likitha Shetty
            Priority: Critical
             Fix For: 4.5.0


In a clustered management server environment, some hosts repeatedly return to 
the Alert state after a vCenter restart, even once vCenter is back up.

The root cause is the following race condition:

A scheduled PingTask runs for every host at a configurable interval (global 
config - ping.interval). When vCenter is restarted, PingTask cannot get the 
host status, so it schedules another task to handle the disconnect for the 
host agent.
This disconnect task determines the host status by sending CheckHealthCommand 
to the agent. When the command returns an answer saying the resource is not 
alive, CS investigates further, and in this case the VMware investigator 
confirms that the host is disconnected. The disconnect is then processed, 
which involves the following:
1. Cancel all scheduled tasks for that agent, including PingTask
2. Send disconnect to all listeners, including AgentMonitor, which clears the 
agent from the MS's PingMap
If the above disconnect takes a while to get scheduled and spills over into 
the next PingTask interval, the next PingTask still runs. If vCenter is up and 
the host is connected by then, the ping succeeds and an entry for the agent is 
added to the PingMap.
Once an entry is made in the PingMap after a disconnect, the AgentMonitor 
task, which runs every minute, finds the agent behind on ping, disconnects the 
host agent without investigation because the attache is no longer connected, 
and puts the host back into the Alert state.
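
The sequence above can be replayed with a small, deterministic Java sketch. 
This is only an illustration under stated assumptions, not the management 
server's code: the class PingRaceSketch, its methods, the 90 second ping 
timeout and the timestamps are all hypothetical, and the real AgentMonitor 
and PingMap are reduced to a map and a set. It assumes the stale PingMap 
entry stays in place between monitor runs, which is what makes the Alert 
transition repeat.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified, hypothetical model of the race; not CloudStack's actual classes.
class PingRaceSketch {
    // hostId -> time of last successful ping, in seconds (stand-in for the PingMap)
    private final Map<Long, Long> pingMap = new HashMap<>();
    // hosts whose agent attache is currently connected
    private final Set<Long> connectedAttaches = new HashSet<>();
    private final long pingTimeoutSeconds; // assumed timeout, not a real default
    private int alertsWithoutInvestigation = 0;

    PingRaceSketch(long pingTimeoutSeconds) {
        this.pingTimeoutSeconds = pingTimeoutSeconds;
    }

    void connect(long hostId, long now) {
        connectedAttaches.add(hostId);
        pingMap.put(hostId, now);
    }

    // Disconnect processing: the attache goes away and the listener
    // clears the agent from the PingMap.
    void processDisconnect(long hostId) {
        connectedAttaches.remove(hostId);
        pingMap.remove(hostId);
    }

    // A PingTask that was already due runs after the disconnect and succeeds,
    // re-creating the PingMap entry for a disconnected agent.
    void recordPing(long hostId, long now) {
        pingMap.put(hostId, now);
    }

    // Periodic monitor run: an agent behind on ping whose attache is not
    // connected is moved to Alert without investigation (counted here).
    void runAgentMonitor(long now) {
        for (Map.Entry<Long, Long> e : pingMap.entrySet()) {
            boolean behindOnPing = now - e.getValue() > pingTimeoutSeconds;
            if (behindOnPing && !connectedAttaches.contains(e.getKey())) {
                alertsWithoutInvestigation++;
            }
        }
    }

    // Replays the scenario and returns how many monitor runs forced Alert.
    static int simulate(boolean lateSuccessfulPing) {
        PingRaceSketch ms = new PingRaceSketch(90);
        long hostId = 1L;
        ms.connect(hostId, 0);
        ms.processDisconnect(hostId);        // disconnect is processed late
        if (lateSuccessfulPing) {
            ms.recordPing(hostId, 61);       // next PingTask succeeds afterwards
        }
        for (long now = 200; now <= 320; now += 60) {
            ms.runAgentMonitor(now);         // three monitor runs, one per minute
        }
        return ms.alertsWithoutInvestigation;
    }

    public static void main(String[] args) {
        System.out.println("with late ping: " + simulate(true));
        System.out.println("without late ping: " + simulate(false));
    }
}
```

In this model the host is forced into Alert on every monitor run only when 
the late ping re-creates the PingMap entry; without it the monitor finds 
nothing to act on.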



--
This message was sent by Atlassian JIRA
(v6.2#6252)
