jmsperu opened a new pull request, #12822: URL: https://github.com/apache/cloudstack/pull/12822
## Summary

Fixes #12821: KVM VMs remain indefinitely paused when a NAS backup job fails.

When `virsh backup-begin` executes a push backup, QEMU pauses the domain for a consistent snapshot. If the backup write fails (e.g. NFS storage full), `nasbackup.sh` calls `cleanup()` but:

1. **Never resumes the paused VM**: `cleanup()` only removes files and unmounts
2. **Never exits the monitoring loop**: a missing `exit` after `cleanup()` in the `Failed` case causes an infinite loop
3. **Same missing exit in `backup_stopped_vm()`**: a `qemu-img convert` failure calls `cleanup()` but processing continues

### Changes

- **`cleanup()`**: Added VM state detection via `virsh domstate` and an automatic `virsh resume` if the VM is found paused, ensuring the VM is always resumed during error handling
- **`backup_running_vm()`**: Added `exit 1` after `cleanup()` in the `Failed` backup job case to terminate the infinite monitoring loop
- **`backup_stopped_vm()`**: Added `exit 1` after `cleanup()` on `qemu-img convert` failure

### Evidence

In production, NFS backup storage filling to 100% caused 8 VMs to become paused simultaneously across 3 KVM hosts. Some VMs remained paused for over 6 hours. The CloudStack UI showed them as "Running" while they were actually paused at the KVM level, requiring a manual `virsh resume` on each host.

### Note

The pattern of checking and resuming paused VMs already exists in the Java layer (see `LibvirtBackupSnapshotCommandWrapper.java:186-188` and `KVMStorageProcessor.java:2268-2272`) but was missing from the shell script that actually manages the backup lifecycle.
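For illustration, the resume-on-failure check described under Changes could look roughly like the sketch below. This is not the code from the PR: the function name `resume_if_paused` and its argument are hypothetical, and only the `virsh domstate` and `virsh resume` calls come from the description above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the actual nasbackup.sh change.

# Resume a libvirt domain only if it is currently paused.
# $1: domain name (the parameter is an assumption for this example).
resume_if_paused() {
  local vm="$1" state
  # `virsh domstate` prints the state, e.g. "running", "paused", "shut off".
  # If the query itself fails (domain gone, libvirtd down), do nothing.
  state="$(virsh domstate "$vm" 2>/dev/null)" || return 0
  if [[ "$state" == "paused" ]]; then
    virsh resume "$vm" >/dev/null
  fi
}
```

A failure branch would then call `resume_if_paused "$vm"` before removing temp files and unmounting, and finish with `exit 1` so the caller's monitoring loop cannot continue.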
## Test plan

- [ ] Trigger NAS backup on a running VM with sufficient storage: verify the backup completes and the VM stays running
- [ ] Trigger NAS backup with NFS storage at 100%: verify the backup fails but the VM is resumed automatically
- [ ] Trigger NAS backup on a stopped VM with a bad disk path: verify cleanup exits properly
- [ ] Verify `cleanup()` correctly resumes the VM before removing temp files and unmounting

🤖 Generated with [Claude Code](https://claude.com/claude-code)
