This PR doesn't seem to completely fix the problem (or this may be an
entirely new problem). We installed the RC release containing this PR on a test
system and can still get the KVM host marked as `Down` by using iptables to
drop outgoing requests to NFS. My investigation shows that the line
[`storage = conn.storagePoolLookupByUUIDString(uuid);`](https://github.com/apache/cloudstack/blob/4.11/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/resource/KVMHAMonitor.java#L95)
blocks indefinitely. As a result, `kvmheartbeat.sh` is never executed, a host
investigation is started, and the host with blocked NFS is marked as `Down`.
Finally, all VMs on that host are rescheduled, which results in duplicate VMs.

I pulled a thread dump and found that the KVMHAMonitor thread hangs here until
NFS is unblocked. I haven't dug any deeper yet.

```
"Thread-20" - Thread t@135
   java.lang.Thread.State: RUNNABLE
        at com.sun.jna.Native.invokePointer(Native Method)
        at com.sun.jna.Function.invokePointer(Function.java:470)
        at com.sun.jna.Function.invoke(Function.java:404)
        at com.sun.jna.Function.invoke(Function.java:315)
        at com.sun.jna.Library$Handler.invoke(Library.java:212)
        at com.sun.proxy.$Proxy3.virStoragePoolLookupByUUIDString(Unknown Source)
        at org.libvirt.Connect.storagePoolLookupByUUIDString(Unknown Source)
        at com.cloud.hypervisor.kvm.resource.KVMHAMonitor$Monitor.runInContext(KVMHAMonitor.java:95)
        - locked <1afb3370> (a java.util.concurrent.ConcurrentHashMap)
        at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
        at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None
```
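
To illustrate the failure mode in the trace above, here is a minimal sketch of how a blocking lookup could be bounded with a timeout so the monitor loop keeps running. This is not what the PR or `KVMHAMonitor` currently does. `BoundedLookup`, `callWithTimeout`, and the timeout values are hypothetical names of mine, and the blocked libvirt call is simulated with a sleep rather than a real `storagePoolLookupByUUIDString` call.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of CloudStack.
class BoundedLookup {

    // Runs a potentially blocking call on a worker thread and returns
    // `fallback` if it fails or does not finish within timeoutMs.
    static <T> T callWithTimeout(Callable<T> call, long timeoutMs, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-lookup");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(call);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Best effort only: a thread stuck inside native code (as in the
            // trace above) may ignore the interrupt and stay blocked.
            future.cancel(true);
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the call itself threw
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Simulated blocked lookup: sleeps far longer than the timeout.
        String pool = callWithTimeout(() -> {
            Thread.sleep(5000);
            return "pool";
        }, 200, null);
        System.out.println(pool == null ? "lookup timed out" : "lookup returned " + pool);
    }
}
```

The caller would treat the fallback value as "storage pool not reachable" and could still go on to run the heartbeat check. Note the caveat in the comments: the timeout only frees the calling thread, so a worker stuck in the native call would linger until NFS comes back.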

[ Full content available at: https://github.com/apache/cloudstack/pull/2722 ]