jmsperu commented on PR #13074: URL: https://github.com/apache/cloudstack/pull/13074#issuecomment-4684101637
Pushed updates addressing the review, plus a correctness fix found while testing the incremental path on a real **libvirt 10.0.0 / qemu 8.2** host. ### Commits 1. **`3456ba7d`** — review cleanups: removed the unused `RebaseBackupCommand` + its wrapper, and stripped GitHub-username / review-thread references from code comments (kept the technical explanation in each). 2. **`ba74778d`** — `LibvirtTakeBackupCommandWrapper` now uses the `--bitmap-new` value it already passes instead of re-parsing the `BITMAP_CREATED=` line. The marker stays only as a *success* signal (a stopped-VM on RBD/LINSTOR genuinely doesn't create a bitmap). 3. **`8adde4da`** — **bug fix: parent-checkpoint recreation for an incremental taken after a VM restart.** ### The bug (caught in testing) An incremental taken after the VM has been (re)started since the last backup **failed**. CloudStack rebuilds the domain XML on every start, wiping libvirt's checkpoint registry, while the dirty bitmap persists on the qcow2 (QEMU reloads it). The old code rebuilt a *minimal* checkpoint XML and did a **fresh** `checkpoint-create`, which QEMU rejects: ``` error: internal error: unable to execute QEMU command 'transaction': Bitmap already exists: <name> ``` and the `qemu-img bitmap --remove` fallback can't run on a live VM: ``` qemu-img: Could not open '...qcow2': Failed to get "write" lock ``` ### The fix Re-register the parent with **`checkpoint-create --redefine`** using the **full** `checkpoint-dumpxml` output (a minimal/synthesized XML is rejected by libvirt's checkpoint RNG schema). So: persist `<bitmap>.checkpoint.xml` next to each backup on the NAS, and on recreate `--redefine` from the parent backup's saved XML; if it's missing (a pre-fix backup) or redefine fails, fall back to a full so the chain restarts cleanly. Verified end-to-end on libvirt 10.0.0: `full → dirty → incremental`, then `restart → redefine → incremental`, all succeed. ### On the remaining threads - **stderr (INCREMENTAL_FALLBACK / BITMAP_CREATED)** — already addressed; those markers are echoed to **stdout**. - **`BITMAP_RECREATED`, `findLiveChildren(parent).isEmpty()`, unused `NASBackupChainKeys` strings** — already removed in earlier pushes (the two strings flagged are now the in-use `ChainDecision.mode` sentinels `TYPE_FULL`/`TYPE_INCREMENTAL`). - **`INCREMENTAL_FALLBACK`** — keeping it: the running-vs-stopped decision is made at the last moment in the script (the `virsh list` immediately before the backup), and the marker is how the stopped path tells the orchestrator to record a FULL. Pulling it earlier into Java would *widen* the stop-mid-flight race. - **`getVolumePoolsAndPaths(parentVols)`** — that block computes parent *file paths* from `Backup.VolumeInfo` (historical backup metadata); the method takes live `VolumeVO`, so it's a different input/purpose, not a direct swap. On the larger "move the checks into Java" suggestion: I started it, but testing showed the recreate needs `checkpoint-dumpxml` + `--redefine` against NAS-side XML, which is cohesive in the script — I've kept it there for now and can revisit the Java move as a follow-up. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
