[
https://issues.apache.org/jira/browse/CLOUDSTACK-5859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dave Garbus updated CLOUDSTACK-5859:
------------------------------------
Description:
We have a group of 13 KVM servers added to a single cluster within CloudStack.
All VMs use local hypervisor storage, with the exception of one that was
configured to use NFS-based primary storage with an HA service offering.
An issue occurred with the SAN responsible for serving the NFS mount (primary
storage for the HA VM) and the mount was put into a read-only state. Shortly after,
each host in the cluster rebooted and remained in a reboot loop until
I put the primary storage into maintenance. The following messages appeared in
agent.log on each of the KVM hosts:
2014-01-12 02:40:20,953 WARN [kvm.resource.KVMHAMonitor] (Thread-137180:null)
write heartbeat failed: timeout, retry: 4
2014-01-12 02:40:20,953 WARN [kvm.resource.KVMHAMonitor] (Thread-137180:null)
write heartbeat failed: timeout; reboot the host
In essence, a single HA-enabled VM was able to bring down an entire KVM cluster
that was hosting a number of VMs with local storage. It would seem that the
fencing script needs to be improved to account for cases where both local and
shared storage are used.
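To make the failure mode and the suggested improvement concrete, here is a rough
sketch of the logic involved. This is not the actual KVMHAMonitor /
kvmheartbeat.sh code; the class, method, and VM names (HeartbeatMonitor,
rebootHost, stopVmsOnPool, and so on) are hypothetical stand-ins, and the
guarded variant is only one possible shape an improved fencing policy could
take: reboot the host only when no local-storage VMs would be lost, and
otherwise fence just the VMs that depend on the failed shared pool.
// Hedged illustration only; names and structure are hypothetical, not CloudStack's code.
import java.util.List;
public class HeartbeatMonitor {
    static final int MAX_RETRIES = 5;
    static class StoragePool {
        final String uuid;
        StoragePool(String uuid) { this.uuid = uuid; }
    }
    // Stand-in for the heartbeat write to the NFS pool; returns false on timeout,
    // which is what happens once the mount goes read-only.
    boolean writeHeartbeat(StoragePool pool) {
        return false;
    }
    // Behaviour as reported: after the retries are exhausted the whole host reboots,
    // taking down every VM on it, including those on local storage.
    void checkPoolAsReported(StoragePool pool) {
        for (int retry = 0; retry < MAX_RETRIES; retry++) {
            if (writeHeartbeat(pool)) {
                return;
            }
            System.out.printf("write heartbeat failed: timeout, retry: %d%n", retry);
        }
        System.out.println("write heartbeat failed: timeout; reboot the host");
        rebootHost();
    }
    // One possible guard: only reboot when nothing but shared-storage VMs would be
    // affected; otherwise fence just the VMs that use the failed pool.
    void checkPoolGuarded(StoragePool pool, List<String> vmsOnPool,
                          List<String> vmsOnLocalStorage) {
        for (int retry = 0; retry < MAX_RETRIES; retry++) {
            if (writeHeartbeat(pool)) {
                return;
            }
        }
        if (vmsOnLocalStorage.isEmpty()) {
            rebootHost();
        } else {
            stopVmsOnPool(pool, vmsOnPool);
        }
    }
    void rebootHost() {
        System.out.println("rebooting host");
    }
    void stopVmsOnPool(StoragePool pool, List<String> vms) {
        System.out.println("fencing VMs on pool " + pool.uuid + ": " + vms);
    }
    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor();
        StoragePool nfsPool = new StoragePool("nfs-primary-1");
        // The host also runs local-storage VMs, so the guarded check fences only the HA VM.
        monitor.checkPoolGuarded(nfsPool,
                List.of("ha-vm-01"),
                List.of("local-vm-01", "local-vm-02"));
    }
}
With a guard along these lines, the read-only NFS pool would have cost only the
single HA VM instead of sending every host in the cluster through a reboot loop.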
was:
We have a group of 13 KVM servers added to a single cluster within CloudStack.
All VMs use local hypervisor storage, with the exception of one that was
configured to use NFS-based primary storage with an HA service offering.
An issue occurred with the disk responsible for serving the NFS mount (primary
storage for the HA VM) and the mount was put into a read-only state. Shortly after,
each host in the cluster rebooted and remained in a reboot loop until
I put the primary storage into maintenance. The following messages appeared in
agent.log on each of the KVM hosts:
2014-01-12 02:40:20,953 WARN [kvm.resource.KVMHAMonitor] (Thread-137180:null)
write heartbeat failed: timeout, retry: 4
2014-01-12 02:40:20,953 WARN [kvm.resource.KVMHAMonitor] (Thread-137180:null)
write heartbeat failed: timeout; reboot the host
In essence, a single HA-enabled VM was able to bring down an entire KVM cluster
that was hosting a number of VMs with local storage. It would seem that the
fencing script needs to be improved to account for cases where both local and
shared storage are used.
> [HA] Shared storage failure results in reboot loop; VMs with Local storage
> brought offline
> ------------------------------------------------------------------------------------------
>
> Key: CLOUDSTACK-5859
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-5859
> Project: CloudStack
> Issue Type: Bug
> Security Level: Public (Anyone can view this level - this is the default.)
> Components: KVM
> Affects Versions: 4.2.0
> Environment: RHEL/CentOS 6.4 with KVM
> Reporter: Dave Garbus
> Priority: Critical
>
> We have a group of 13 KVM servers added to a single cluster within
> CloudStack. All VMs use local hypervisor storage, with the exception of one
> that was configured to use NFS-based primary storage with an HA service
> offering.
> An issue occurred with the SAN responsible for serving the NFS mount (primary
> storage for the HA VM) and the mount was put into a read-only state. Shortly
> after, each host in the cluster rebooted and remained in a reboot loop until
> I put the primary storage into maintenance. The following messages appeared
> in agent.log on each of the KVM hosts:
> 2014-01-12 02:40:20,953 WARN [kvm.resource.KVMHAMonitor]
> (Thread-137180:null) write heartbeat failed: timeout, retry: 4
> 2014-01-12 02:40:20,953 WARN [kvm.resource.KVMHAMonitor]
> (Thread-137180:null) write heartbeat failed: timeout; reboot the host
> In essence, a single HA-enabled VM was able to bring down an entire KVM
> cluster that was hosting a number of VMs with local storage. It would seem
> that the fencing script needs to be improved to account for cases where both
> local and shared storage are used.
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)