[
https://issues.apache.org/jira/browse/CLOUDSTACK-7415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220769#comment-14220769
]
ASF subversion and git services commented on CLOUDSTACK-7415:
-------------------------------------------------------------
Commit 59ce63918e227f93d64642e881551c25738de3b3 in cloudstack's branch
refs/heads/4.3 from [~likithas]
[ https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;h=59ce639 ]
CLOUDSTACK-7415. Host remains in Alert after vCenter restart.
Management server PingTask should update PingMap entry for an agent only if it
is already present in the Management Server's PingMap.
(cherry picked from commit 8ce6eba549bcd3fa007aaf10a29c3a2fef9ffaaa)
Signed-off-by: Rohit Yadav <[email protected]>
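
For illustration only, the "update only if already present" rule from the commit message can be sketched in Java as below. This is not the code from the commit; PingMapSketch and its method names are hypothetical stand-ins, and the map is assumed to hold agent id -> last ping time in seconds.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical stand-in for the management server's PingMap (agent id -> last ping, in seconds). */
final class PingMapSketch {
    private final ConcurrentMap<Long, Long> lastPing = new ConcurrentHashMap<>();

    /** Assumed to be called when an agent connects: the only place an entry is created. */
    void onConnect(long agentId, long nowSeconds) {
        lastPing.put(agentId, nowSeconds);
    }

    /** Assumed to be called when a disconnect is processed: the entry is cleared. */
    void onDisconnect(long agentId) {
        lastPing.remove(agentId);
    }

    /**
     * Called on a successful ping. ConcurrentMap.replace(key, value) writes only
     * when the key is already mapped, so a late ping cannot re-create an entry
     * that a disconnect has removed.
     *
     * @return true if the entry existed and was refreshed
     */
    boolean onPing(long agentId, long nowSeconds) {
        return lastPing.replace(agentId, nowSeconds) != null;
    }

    /** True while the agent has an entry in the map. */
    boolean isTracked(long agentId) {
        return lastPing.containsKey(agentId);
    }
}
```

With this shape, a ping that arrives after onDisconnect() returns false and leaves the map untouched, so there is no entry left to go stale.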
> Host remains in Alert after vCenter restart
> -------------------------------------------
>
> Key: CLOUDSTACK-7415
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7415
> Project: CloudStack
> Issue Type: Bug
> Security Level: Public(Anyone can view this level - this is the
> default.)
> Components: Management Server
> Affects Versions: 4.0.0
> Reporter: Likitha Shetty
> Assignee: Likitha Shetty
> Priority: Critical
> Fix For: 4.5.0
>
>
> In a clustered management server environment, some hosts repeatedly go back
> into Alert state after a vCenter restart, even once vCenter is back up.
> The issue was root-caused to the following race condition:
> A scheduled PingTask runs for every host at a configurable interval (global
> config - ping.interval). When vCenter is restarted, PingTask is unable to get
> the host status, so it schedules another task to handle the disconnect for
> the host agent.
> This disconnect task determines the host status by sending CheckHealthCommand
> to the agent. When the command returns an answer saying the resource is not
> alive, CS investigates further, and in this case the VMware investigator
> confirms that the host is disconnected. The disconnect is then processed,
> which involves the following:
> 1. Cancel all scheduled tasks for that agent which includes PingTask
> 2. Send disconnect to all listeners including AgentMonitor which clears the
> agent from MS's PingMap
> If the above disconnect takes a while to get scheduled and spills over into
> the next PingTask interval, the next PingTask still runs. If vCenter is up
> and the host is connected by then, the ping succeeds and an entry for the
> agent is made in the PingMap.
> Once an entry is made in the PingMap after a disconnect, the AgentMonitor
> task, which runs every minute, finds the agent behind on ping, disconnects
> the host agent without investigation because the attache is no longer
> connected, and puts the host back into Alert state.
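
As a rough sketch of the last step in the description above, the check for agents "behind on Ping" might look like the Java below. BehindOnPingSketch, findBehindOnPing and the timeoutSeconds threshold are assumptions made for illustration; the report does not say how AgentMonitor actually decides that an agent is behind.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper, not CloudStack code. */
final class BehindOnPingSketch {
    /**
     * Returns the ids of agents whose last recorded ping is older than
     * timeoutSeconds, in ascending id order.
     */
    static List<Long> findBehindOnPing(Map<Long, Long> lastPing, long nowSeconds, long timeoutSeconds) {
        List<Long> behind = new ArrayList<>();
        // Copy into a TreeMap so the result order does not depend on the input map type.
        for (Map.Entry<Long, Long> e : new TreeMap<>(lastPing).entrySet()) {
            if (nowSeconds - e.getValue() > timeoutSeconds) {
                behind.add(e.getKey());
            }
        }
        return behind;
    }
}
```

An entry written by a late ping after the disconnect is never refreshed, because the agent's PingTask has been cancelled. Under any such threshold it eventually shows up in this list, which matches the repeated return to Alert state described in the report.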
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)