TimServers opened a new issue, #12920:
URL: https://github.com/apache/cloudstack/issues/12920

   ### problem
   
   ##### ISSUE TYPE
   
   * Bug Report
   
   ##### COMPONENT NAME
   
   KVM, Orchestration, HA
   
   ##### CLOUDSTACK VERSION
   
   4.22.x
   
   ##### CONFIGURATION
   
   - KVM hypervisor
   - Shared primary storage on NFS
   - HA-enabled user VM
   - sync.interval = 60
   - no ha.tag configured
   - tested with an HA-enabled VM deployed on a healthy KVM host
   - multiple management servers in the zone/cluster
   
   ##### OS / ENVIRONMENT
   
   - CloudStack management servers on Ubuntu 24.04
   - MySQL 8
   - KVM hosts on Linux/libvirt
   - Primary storage: NFS
   
   ##### SUMMARY
   
   On CloudStack 4.22.x, if a KVM VM is stopped unexpectedly on the hypervisor 
using `virsh destroy`, CloudStack detects `PowerReportMissing`, waits for the 
grace period, and schedules HA restart work. However, the HA worker then fails 
to restart the VM because `KVMInvestigator` reports the VM as alive (`alive? 
true`) while the host is still up.
   
   As a result:
   - the VM remains in `Running` state in CloudStack/UI
   - the VM is not transitioned to `Stopped`
   - HA does not restart it
   - the same HA scheduling/investigation loop repeats on subsequent sync cycles
   
   This appears related to #10406 / #10407, which were intended to fix cases 
where VMs were not moving to `Stopped` when `PowerReportMissing` is processed.
   
   ##### EXPECTED RESULTS
   
   After the grace period passes, CloudStack should process 
`PowerReportMissing`, transition the VM to `Stopped`, and, because HA is 
enabled, restart the VM automatically.
   
   Expected behavior for this test case:
   
   1. `virsh destroy <domain>` removes the libvirt domain.
   2. CloudStack detects the VM as missing.
   3. After the graceful period expires, CloudStack updates the VM power report 
to `PowerReportMissing`.
   4. CloudStack transitions the VM state from `Running` to `Stopped`.
   5. HA schedules a restart for the VM.
   6. The VM is restarted automatically on an eligible host.
   7. The CloudStack UI/API reflects the VM state correctly and does not 
continue to show the VM as `Running`.
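
The expected sequence above can be sketched as a small model. This is only an illustration of the flow this report expects, not CloudStack's code: the `Vm` class, `expected_handling`, and the action strings are hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class Vm:
    # Hypothetical stand-in for a CloudStack VM record.
    state: str = "Running"
    ha_enabled: bool = True


def expected_handling(vm: Vm, domain_exists: bool, grace_passed: bool) -> list[str]:
    """Return the actions this report expects, in order (steps 3 to 6)."""
    actions: list[str] = []
    if domain_exists or not grace_passed:
        # Domain still present, or still inside the graceful period: do nothing.
        return actions
    actions.append("report PowerReportMissing")
    vm.state = "Stopped"
    actions.append("transition Running -> Stopped")
    if vm.ha_enabled:
        actions.append("schedule HA restart")
        vm.state = "Running"  # restarted on an eligible host
        actions.append("restart on eligible host")
    return actions
```

With `domain_exists=False` and `grace_passed=True`, an HA-enabled VM passes through `Stopped` before being restarted; a VM without HA stays in `Stopped`.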
   
   ##### ACTUAL RESULTS
   
   CloudStack detects the VM as missing and the graceful period is working 
correctly:
   
   ```text
   2026-03-31 02:28:43,791 DEBUG ... Detected missing VM. host: 6, vm id: 
91(...), power state: PowerReportMissing, last state update: 
2026-03-31T02:27:43+0000
   2026-03-31 02:28:43,791 DEBUG ... vm id: 91 - time since last state 
update(60791 ms) has not passed graceful period yet
   
   2026-03-31 02:29:43,722 DEBUG ... Detected missing VM. host: 6, vm id: 
91(...), power state: PowerReportMissing, last state update: 
2026-03-31T02:27:43+0000
   2026-03-31 02:29:43,722 DEBUG ... vm id: 91 - time since last state 
update(120722 ms) has passed graceful period
   ```
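
The elapsed times in these log lines can be reproduced from the timestamps. The sketch below is a minimal check, assuming a graceful period of 120 000 ms (twice the 60 s interval); that threshold is an assumption that fits both log lines and is not stated in the logs. `elapsed_ms` and `ASSUMED_GRACE_MS` are hypothetical names.

```python
from datetime import datetime, timedelta

# Assumption: graceful period of 2 x 60 s. The logs only show that
# 60791 ms is below the threshold and 120722 ms is above it.
ASSUMED_GRACE_MS = 2 * 60 * 1000

LOG_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed_ms(last_update: str, now: str) -> int:
    """Milliseconds between two log-style timestamps."""
    delta = datetime.strptime(now, LOG_FORMAT) - datetime.strptime(last_update, LOG_FORMAT)
    return delta // timedelta(milliseconds=1)


def grace_passed(last_update: str, now: str) -> bool:
    return elapsed_ms(last_update, now) > ASSUMED_GRACE_MS


last = "2026-03-31 02:27:43,000"
print(elapsed_ms(last, "2026-03-31 02:28:43,791"), grace_passed(last, "2026-03-31 02:28:43,791"))
print(elapsed_ms(last, "2026-03-31 02:29:43,722"), grace_passed(last, "2026-03-31 02:29:43,722"))
```

The first sync lands about 60.8 s after the last state update and the second about 120.7 s after it, which matches the "has not passed" and "has passed" messages.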
   After the graceful period passes, CloudStack updates the VM power report and 
schedules HA restart work:
   ```
   2026-03-31 02:29:43,742 DEBUG ... VM state report is updated. Host {...}, VM 
instance {"id":91,"instanceName":"i-2-91-VM","state":"Running"...}, power 
state: PowerReportMissing
   2026-03-31 02:29:43,775 INFO  ... Detected out-of-band stop of a HA enabled 
VM ... will schedule restart.
   2026-03-31 02:29:43,798 INFO  ... Schedule vm for HA: VM instance 
{"id":91,"instanceName":"i-2-91-VM","state":"Running"...}
   2026-03-31 02:29:43,820 INFO  ... HA on VM instance 
{"id":91,"instanceName":"i-2-91-VM","state":"Running"...}
   ```
   The HA worker checks the VM, and the host-side agent confirms that the 
libvirt domain no longer exists:
   ```
   2026-03-31 02:29:43,855 DEBUG ... Unable to get vm state on VM instance 
{"id":91,"instanceName":"i-2-91-VM","state":"Running"...}
   ```
   KVM host agent log:
   ```
   2026-03-31 02:29:43,928 ERROR ... Could not get state for VM [i-2-91-VM] 
(retry=0) due to: org.libvirt.LibvirtException: Domain not found: no domain 
with matching name 'i-2-91-VM'
   ```
   However, KVMInvestigator then reports the VM as alive, and the HA restart is 
cancelled:
   ```
   2026-03-31 02:29:43,859 INFO  ... KVMInvestigator found VM instance 
{"id":91,"instanceName":"i-2-91-VM","state":"Running"...} to be alive? true
   2026-03-31 02:29:43,860 INFO  ... VM instance 
{"id":91,"instanceName":"i-2-91-VM","state":"Running"...} is alive and host is 
up. No need to restart it.
   ```
   This same pattern repeats on later sync cycles, including 02:31:43, 
02:33:43, and 02:36:43.
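
Why the pattern repeats can be illustrated with a toy model of one sync cycle. This is a sketch of the behavior observed in the logs, under the assumption that the investigator result alone decides whether the restart proceeds. `ha_cycle` is a hypothetical name and does not correspond to a CloudStack method.

```python
def ha_cycle(vm_state: str, domain_exists: bool, investigator_alive: bool) -> tuple[str, bool]:
    """Model one sync cycle; returns (new_vm_state, restarted)."""
    if domain_exists:
        return vm_state, False
    # Domain is missing: HA work is scheduled and the investigator is consulted.
    if investigator_alive:
        # Observed: "is alive and host is up. No need to restart it."
        # The recorded state is left untouched, so the next cycle starts the same way.
        return vm_state, False
    # Otherwise the restart would proceed (the expected path).
    return "Running", True


state = "Running"
restarts = 0
for _ in range(4):  # four cycles, like the four timestamps in the logs
    state, restarted = ha_cycle(state, domain_exists=False, investigator_alive=True)
    restarts += int(restarted)
print(state, restarts)
```

In this model the recorded state never leaves `Running` and no restart happens, so every cycle begins from the same inputs as the previous one.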
   
   Final observed behavior:
   
   - the VM remains in `Running` state in CloudStack/UI
   - the VM is not transitioned to `Stopped`
   - HA does not restart the VM
   - the missing-domain / HA-scheduled / `KVMInvestigator` `alive? true` loop 
repeats continuously
   
   
   ### versions
   
   cloudstack-management  4.22.0.0
   cloudstack-agent 4.22.0.0 
   libvirt 10.0.0-2ubuntu8.11
   ubuntu 24.04 LTS
   
   ### The steps to reproduce the bug
   
   1. Deploy a user VM on a KVM host with HA enabled.
   2. Confirm the VM is in `Running` state in CloudStack.
   3. On the KVM host, destroy the domain unexpectedly:
   
      ```bash
      virsh destroy <domain-name>
      ```
   
   ### What to do about it?
   
   _No response_


