Jean-Francois Nadeau created CLOUDSTACK-10397:
-------------------------------------------------
Summary: Transient NFS access issues should not result in
duplicate VMs or KVM host resets
Key: CLOUDSTACK-10397
URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10397
Project: CloudStack
Issue Type: Bug
Security Level: Public (Anyone can view this level - this is the default.)
Components: cloudstack-agent, Hypervisor Controller
Affects Versions: 4.11.1.1
Reporter: Jean-Francois Nadeau
Under CentOS 7.x with KVM and NFS as primary storage, we expect to tolerate
and recover from temporary disconnections from primary storage. We simulate
this from the KVM host with iptables, using DROP rules in the INPUT and
OUTPUT chains matching the NFS server's IP.
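For reference, the simulation amounts to commands along these lines on the
KVM host (10.0.0.50 stands in for the NFS server's IP):

    iptables -I INPUT  -s 10.0.0.50 -j DROP
    iptables -I OUTPUT -d 10.0.0.50 -j DROP

and the matching -D invocations to remove the rules and end the outage.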
The observation under 4.11.2 is that an NFS disconnection of more than 5
minutes leads to the following, depending on the HA configuration:
With VM HA enabled and host HA disabled: the CloudStack agent will often
block while refreshing primary storage and go into the Down state from the
controller's perspective. The controller will then restart the VMs on other
hosts, creating duplicate VMs on the network and possibly corrupting VM root
disks once the transient issue goes away.
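To illustrate the kind of resilience we are after, here is a minimal Java
sketch (not the actual agent code; the mount point and helper name are
hypothetical) of bounding a storage liveness probe with a timeout, so that a
wedged NFS mount reads as "storage unresponsive" rather than dragging the
whole agent, and hence the host, into the Down state:

    import java.nio.file.*;
    import java.util.concurrent.*;

    public class StorageProbeSketch {
        private static final ExecutorService POOL =
                Executors.newSingleThreadExecutor();

        // Returns true if the primary storage path answered in time,
        // false if the call wedged or failed. On a hard NFS mount a
        // stat()-style call can block for as long as the server is
        // unreachable, which is exactly what needs to be bounded.
        static boolean storageResponsive(Path primary, long timeoutSec) {
            Future<Boolean> probe =
                    POOL.submit(() -> Files.isDirectory(primary));
            try {
                return probe.get(timeoutSec, TimeUnit.SECONDS);
            } catch (TimeoutException e) {
                probe.cancel(true); // wedged, not necessarily dead
                return false;
            } catch (InterruptedException | ExecutionException e) {
                return false;
            }
        }

        public static void main(String[] args) {
            Path nfsMount = Paths.get("/mnt/primary"); // hypothetical
            System.out.println(storageResponsive(nfsMount, 10)
                    ? "storage responsive" : "storage probe timed out");
            POOL.shutdownNow();
        }
    }

A timed-out probe distinguishes "storage temporarily unreachable" from "host
dead", which is the distinction the controller loses when the agent blocks.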
With VM HA and host HA enabled: the same agent issue can cause it to block
and end in either the Disconnected or Down state. The host HA framework will
then reset the KVM hosts after kvm.ha.degraded.max.period. The problem here
is that, while host HA does ensure we don't get duplicate VMs, at scale it
would also provoke a large number of KVM host resets (if not all of them).
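For context, kvm.ha.degraded.max.period is a regular CloudStack
configuration setting, so the reset window can be inspected or widened, for
example via CloudMonkey (the value shown is illustrative):

    list configurations name=kvm.ha.degraded.max.period
    update configuration name=kvm.ha.degraded.max.period value=600

Widening the window only delays the resets, though; it does not remove the
large-scale fencing risk described above.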
On 4.9.3 the CloudStack agent will simply "hang" in there and the controller
does not see the KVM host as down (at least not for 60 minutes). When the
network issue blocking NFS access is resolved, all KVM hosts and VMs just
resume working, with no large-scale fencing.
The same resilience is expected on 4.11.x. This is a blocker for an upgrade
from 4.9, considering we are more at risk on 4.11 with VM HA enabled,
regardless of whether host HA is enabled.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)