jmsperu commented on PR #13074:
URL: https://github.com/apache/cloudstack/pull/13074#issuecomment-4684101637

   Pushed updates addressing the review, plus a correctness fix found while 
testing the incremental path on a real **libvirt 10.0.0 / qemu 8.2** host.
   
   ### Commits
   1. **`3456ba7d`** — review cleanups: removed the unused 
`RebaseBackupCommand` + its wrapper, and stripped GitHub-username / 
review-thread references from code comments (kept the technical explanation in 
each).
   2. **`ba74778d`** — `LibvirtTakeBackupCommandWrapper` now uses the 
`--bitmap-new` value it already passes instead of re-parsing the 
`BITMAP_CREATED=` line. The marker stays only as a *success* signal (a 
stopped-VM on RBD/LINSTOR genuinely doesn't create a bitmap).
   3. **`8adde4da`** — **bug fix: parent-checkpoint recreation for an 
incremental taken after a VM restart.**
   
   ### The bug (caught in testing)
   An incremental taken after the VM has been (re)started since the last backup 
**failed**. CloudStack rebuilds the domain XML on every start, wiping libvirt's 
checkpoint registry, while the dirty bitmap persists on the qcow2 (QEMU reloads 
it). The old code rebuilt a *minimal* checkpoint XML and did a **fresh** 
`checkpoint-create`, which QEMU rejects:
   ```
   error: internal error: unable to execute QEMU command 'transaction': Bitmap 
already exists: <name>
   ```
   and the `qemu-img bitmap --remove` fallback can't run on a live VM:
   ```
   qemu-img: Could not open '...qcow2': Failed to get "write" lock
   ```
   
   ### The fix
   Re-register the parent with **`checkpoint-create --redefine`** using the 
**full** `checkpoint-dumpxml` output (a minimal/synthesized XML is rejected by 
libvirt's checkpoint RNG schema). So: persist `<bitmap>.checkpoint.xml` next to 
each backup on the NAS, and on recreate `--redefine` from the parent backup's 
saved XML; if it's missing (a pre-fix backup) or redefine fails, fall back to a 
full so the chain restarts cleanly. Verified end-to-end on libvirt 10.0.0: 
`full → dirty → incremental`, then `restart → redefine → incremental`, all 
succeed.
   
   ### On the remaining threads
   - **stderr (INCREMENTAL_FALLBACK / BITMAP_CREATED)** — already addressed; 
those markers are echoed to **stdout**.
   - **`BITMAP_RECREATED`, `findLiveChildren(parent).isEmpty()`, unused 
`NASBackupChainKeys` strings** — already removed in earlier pushes (the two 
strings flagged are now the in-use `ChainDecision.mode` sentinels 
`TYPE_FULL`/`TYPE_INCREMENTAL`).
   - **`INCREMENTAL_FALLBACK`** — keeping it: the running-vs-stopped decision 
is made at the last moment in the script (the `virsh list` immediately before 
the backup), and the marker is how the stopped path tells the orchestrator to 
record a FULL. Pulling it earlier into Java would *widen* the stop-mid-flight 
race.
   - **`getVolumePoolsAndPaths(parentVols)`** — that block computes parent 
*file paths* from `Backup.VolumeInfo` (historical backup metadata); the method 
takes live `VolumeVO`, so it's a different input/purpose, not a direct swap.
   
   On the larger "move the checks into Java" suggestion: I started it, but 
testing showed the recreate needs `checkpoint-dumpxml` + `--redefine` against 
NAS-side XML, which is cohesive in the script — I've kept it there for now and 
can revisit the Java move as a follow-up.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to