[jira] [Commented] (CLOUDSTACK-5859) [HA] Shared storage failure results in reboot loop; VMs with Local storage brought offline

haijiao (JIRA) Thu, 09 Apr 2015 01:45:05 -0700

    [ 
https://issues.apache.org/jira/browse/CLOUDSTACK-5859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486983#comment-14486983
 ]


haijiao commented on CLOUDSTACK-5859:
-------------------------------------

We hit a similar issue.

KVM VMs configured as 'HA' within one cluster have access to two NFS primary 
storages (#1 and #2).

When storage #2 accidentally became inaccessible (due to an incorrect 
permission setting), all the hosts within that cluster kept rebooting, logging 
the messages below, until we corrected the setting.

It seems the design here could be improved. Before fencing, CloudStack should 
check whether any other shared storage attached to these VMs is still 
accessible. The script 'kvmheartbeat.sh' should NOT reboot a host as long as 
at least one shared storage is still working, since in that case the root 
cause is most likely not the host but something else.
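
A minimal sketch of the proposed check, written in Python only for brevity. 
The function names heartbeat_ok and should_reboot are hypothetical and are not 
taken from 'kvmheartbeat.sh' or KVMHAMonitor; the KVMHA/hb-<host ip> layout 
follows the paths in the log, and the file content is a placeholder.

```python
import os
from typing import Iterable


def heartbeat_ok(mount_point: str, host_ip: str) -> bool:
    """Try to write a heartbeat file under <mount_point>/KVMHA/.

    Returns True if the write succeeds, False on any OS-level error.
    Hypothetical helper; the file content is a placeholder.
    """
    hb_dir = os.path.join(mount_point, "KVMHA")
    hb_file = os.path.join(hb_dir, "hb-" + host_ip)
    try:
        os.makedirs(hb_dir, exist_ok=True)
        with open(hb_file, "w") as f:
            f.write("alive\n")
        return True
    except OSError:
        return False


def should_reboot(mount_points: Iterable[str], host_ip: str) -> bool:
    """Fence the host only if every shared storage pool fails its heartbeat.

    With no pools at all there is nothing to judge by, so do not reboot.
    """
    results = [heartbeat_ok(m, host_ip) for m in mount_points]
    return bool(results) and not any(results)
```

With this policy, a host that can still write to storage #1 would not be 
rebooted when only storage #2 fails. The sketch does not handle a hard-mounted 
NFS share that hangs instead of returning an error; a real check would also 
need a timeout around each write.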

-------------------------------------
2015-04-03 14:01:29,555 WARN [kvm.resource.KVMHAMonitor] (Thread-1330:null) 
write heartbeat failed: Failed to create 
/mnt/5e41f790-8da9-36d2-938e-f3ea767bfadb/KVMHA//hb-10.226.31.11, retry: 0
2015-04-03 14:01:29,575 WARN [kvm.resource.KVMHAMonitor] (Thread-1330:null) 
write heartbeat failed: Failed to create 
/mnt/5e41f790-8da9-36d2-938e-f3ea767bfadb/KVMHA//hb-10.226.31.11, retry: 1
2015-04-03 14:01:29,595 WARN [kvm.resource.KVMHAMonitor] (Thread-1330:null) 
write heartbeat failed: Failed to create 
/mnt/5e41f790-8da9-36d2-938e-f3ea767bfadb/KVMHA//hb-10.226.31.11, retry: 2
2015-04-03 14:01:29,614 WARN [kvm.resource.KVMHAMonitor] (Thread-1330:null) 
write heartbeat failed: Failed to create 
/mnt/5e41f790-8da9-36d2-938e-f3ea767bfadb/KVMHA//hb-10.226.31.11, retry: 3
2015-04-03 14:01:29,635 WARN [kvm.resource.KVMHAMonitor] (Thread-1330:null) 
write heartbeat failed: Failed to create 
/mnt/5e41f790-8da9-36d2-938e-f3ea767bfadb/KVMHA//hb-10.226.31.11, retry: 4
2015-04-03 14:01:29,635 WARN [kvm.resource.KVMHAMonitor] (Thread-1330:null) 
write heartbeat failed: Failed to create 
/mnt/5e41f790-8da9-36d2-938e-f3ea767bfadb/KVMHA//hb-10.226.31.11; reboot the 
host
2015-04-03 14:02:01,246 INFO [cloud.agent.Agent] (AgentShutdownThread:null) 
Stopping the agent: Reason = sig.kill

> [HA] Shared storage failure results in reboot loop; VMs with Local storage 
> brought offline
> ------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-5859
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-5859
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the 
> default.) 
>          Components: KVM
>    Affects Versions: 4.2.0
>         Environment: RHEL/CentOS 6.4 with KVM
>            Reporter: Dave Garbus
>            Priority: Critical
>
> We have a group of 13 KVM servers added to a single cluster within 
> CloudStack. All VMs use local hypervisor storage, with the exception of one 
> that was configured to use NFS-based primary storage with a HA service 
> offering.
> An issue occurred with the SAN responsible for serving the NFS mount (primary 
> storage for HA VM) and the mount was put into a read-only state. Shortly 
> after, each host in the cluster rebooted and continued to stay in a reboot 
> loop until I put the primary storage into maintenance. These messages were in 
> the agent.log on each of the KVM hosts:
> 2014-01-12 02:40:20,953 WARN  [kvm.resource.KVMHAMonitor] 
> (Thread-137180:null) write heartbeat failed: timeout, retry: 4
> 2014-01-12 02:40:20,953 WARN  [kvm.resource.KVMHAMonitor] 
> (Thread-137180:null) write heartbeat failed: timeout; reboot the host
> In essence, a single HA-enabled VM was able to bring down an entire KVM 
> cluster that was hosting a number of VMs with local storage. It would seem 
> that the fencing script needs to be improved to account for cases where both 
> local and shared storage is used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
