[jira] [Commented] (CLOUDSTACK-5859) [HA] Shared storage failure results in reboot loop; VMs with Local storage brought offline

Bjoern Teipel (JIRA) Wed, 26 Mar 2014 22:41:25 -0700

    [ 
https://issues.apache.org/jira/browse/CLOUDSTACK-5859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13948922#comment-13948922
 ]


Bjoern Teipel commented on CLOUDSTACK-5859:
-------------------------------------------

I personally don't see any reason to reboot a hypervisor when NFS is 
unavailable or timing out due to I/O or network issues, especially if you have 
VMs on local or CLVM storage.
I'll patch our installation so that it does not reboot the hypervisor: I had a 
pool of 10 servers, which also ran CLVM with fencing on top, happily rebooting 
after a VLAN configuration error. It was not fun to fix. To my knowledge, this 
behavior doesn't exist on XenServer.
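
As a rough illustration of the kind of check discussed in this issue, the 
decision could take into account which storage pool each running VM lives on 
before choosing to reboot. This is only a sketch under stated assumptions, not 
the actual CloudStack fencing script: the names Vm, fence_action and the 
action strings are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Vm:
    name: str
    pool: str  # hypothetical: name of the storage pool backing the VM


def fence_action(failed_pool: str, vms: List[Vm]) -> Tuple[str, List[str]]:
    """Pick a fencing action after heartbeat writes to failed_pool time out.

    Returns (action, affected_vm_names). The action strings are assumptions
    made for this sketch, not values used by CloudStack.
    """
    affected = [vm.name for vm in vms if vm.pool == failed_pool]
    unaffected = [vm.name for vm in vms if vm.pool != failed_pool]

    if not affected:
        # No VM depends on the failed pool, so there is nothing to fence.
        return ("none", [])
    if unaffected:
        # Mixed storage: stop only the VMs on the failed pool and leave
        # the VMs on local or other storage running.
        return ("stop_vms", affected)
    # Every running VM depends on the failed pool.
    return ("reboot", affected)


if __name__ == "__main__":
    vms = [Vm("ha-vm", "nfs-primary"), Vm("web-1", "local"), Vm("db-1", "local")]
    print(fence_action("nfs-primary", vms))
```

With one HA VM on NFS and the rest on local storage, as in the reported 
setup, this sketch would stop only the NFS-backed VM instead of rebooting the 
whole host.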

> [HA] Shared storage failure results in reboot loop; VMs with Local storage 
> brought offline
> ------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-5859
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-5859
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the 
> default.) 
>          Components: KVM
>    Affects Versions: 4.2.0
>         Environment: RHEL/CentOS 6.4 with KVM
>            Reporter: Dave Garbus
>            Priority: Critical
>
> We have a group of 13 KVM servers added to a single cluster within 
> CloudStack. All VMs use local hypervisor storage, with the exception of one 
> that was configured to use NFS-based primary storage with an HA service 
> offering.
> An issue occurred with the SAN responsible for serving the NFS mount (primary 
> storage for HA VM) and the mount was put into a read-only state. Shortly 
> after, each host in the cluster rebooted and continued to stay in a reboot 
> loop until I put the primary storage into maintenance. These messages were in 
> the agent.log on each of the KVM hosts:
> 2014-01-12 02:40:20,953 WARN  [kvm.resource.KVMHAMonitor] (Thread-137180:null) write heartbeat failed: timeout, retry: 4
> 2014-01-12 02:40:20,953 WARN  [kvm.resource.KVMHAMonitor] (Thread-137180:null) write heartbeat failed: timeout; reboot the host
> In essence, a single HA-enabled VM was able to bring down an entire KVM 
> cluster that was hosting a number of VMs with local storage. It would seem 
> that the fencing script needs to be improved to account for cases where both 
> local and shared storage is used.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
