abh1sar commented on code in PR #13074:
URL: https://github.com/apache/cloudstack/pull/13074#discussion_r3467102369


##########
scripts/vm/hypervisor/kvm/nasbackup.sh:
##########
@@ -113,20 +122,97 @@ backup_running_vm() {
   mount_operation
   mkdir -p "$dest" || { echo "Failed to create backup directory $dest"; exit 
1; }
 
+  # Determine effective mode for this run.
+  # Legacy callers (no -M argument) get the original full-only behavior with 
no checkpoint.
+  # The Java wrapper (LibvirtTakeBackupCommandWrapper) pre-validates required 
args before
+  # invoking the script; the case below is a defensive fallback for direct 
invocations.
+  local effective_mode="${MODE:-legacy-full}"
+  local make_checkpoint=0
+  case "$effective_mode" in
+    incremental|full)
+      make_checkpoint=1
+      ;;
+    legacy-full)
+      make_checkpoint=0
+      ;;
+    *)
+      echo "Unknown mode: $effective_mode"
+      cleanup
+      exit 1
+      ;;
+  esac
+
+  # When incremental, make sure the parent checkpoint is registered with 
libvirt. CloudStack
+  # rebuilds the domain XML on every VM start, which wipes libvirt's in-memory 
checkpoint
+  # registry, while the dirty bitmap persists on the qcow2 (QEMU re-loads it 
on start). A
+  # fresh checkpoint-create cannot be used then — QEMU reports "Bitmap already 
exists" — so the
+  # parent must be re-registered with --redefine. libvirt only needs the 
checkpoint name and a
+  # creationTime for a redefine (the value need not be accurate — checkpoints 
are ephemeral),
+  # so we synthesize a minimal XML on the fly instead of persisting the full 
checkpoint dump.
+  #
+  # First verify the parent bitmap actually exists on the running qcow2 — it 
can be absent after
+  # a migration even though the orchestrator's active_checkpoint says it 
should be there. If it
+  # is gone, fall back to a full backup rather than letting backup-begin fail 
below.
+  if [[ "$effective_mode" == "incremental" ]]; then
+    if ! virsh -c qemu:///system qemu-monitor-command "$VM" 
'{"execute":"query-block"}' 2>/dev/null | grep -q "\"$BITMAP_PARENT\""; then
+      log -e "incremental: parent bitmap $BITMAP_PARENT not present on the 
qcow2 — falling back to full"
+      echo "INCREMENTAL_FALLBACK=true"
+      effective_mode="full"
+    fi
+  fi

Review Comment:
   This doesn't work if only one of the disks is missing the bitmap. This 
checks if bitmap is present on any of the disks. We have to check all disks.
   
   This check and decision can be moved to Java.
   `LibvirtStartBackupCommandWrapper.getVmDiskPathHasFromCheckpointMap()` 
already does this. It can be moved to a common file like 
LibvirtComputingResource and both LibvirtStartBackupCommandWrapper and 
LibvirtTakeBackupCommandWrapper using it.
   



##########
plugins/backup/nas/src/main/java/org/apache/cloudstack/backup/NASBackupProvider.java:
##########
@@ -495,9 +863,42 @@ public boolean deleteBackup(Backup backup, boolean forced) 
{
             throw new CloudRuntimeException(String.format("Unable to find a 
running KVM host in zone %d to delete backup %s", backup.getZoneId(), 
backup.getUuid()));
         }
 
-        DeleteBackupCommand command = new 
DeleteBackupCommand(backup.getExternalId(), backupRepository.getType(),
-                backupRepository.getAddress(), 
backupRepository.getMountOptions());
+        // Backups outside any tracked chain (legacy or non-incremental 
providers) are
+        // deleted straight away — no children semantics apply.
+        if (readDetail(backup, NASBackupChainKeys.CHAIN_ID) == null) {
+            return deleteBackupFileAndRow(backup, backupRepository, host);
+        }
+
+        // Snapshot-style cascade: defer the on-NAS rm + DB row while there 
are live children,
+        // mark this backup as delete-pending, and let the leaf's deletion 
sweep it up later.
+        // See DefaultSnapshotStrategy#deleteSnapshotChain for the same 
pattern on incremental
+        // snapshots. forced=true means the caller wants the entire subtree 
gone right now.
+        if (forced) {
+            return cascadeDeleteSubtree(backup, backupRepository, host);
+        }
+

Review Comment:
   Or better yet, can we set it to the parent backup's checkpoint_id so that 
the next backup is again incremental instead of full. Up to you, if it will 
complicate things, we can document and defer it to later improvements



##########
plugins/backup/nas/src/main/java/org/apache/cloudstack/backup/NASBackupProvider.java:
##########
@@ -168,6 +205,287 @@ protected Host getVMHypervisorHost(VirtualMachine vm) {
         return 
resourceManager.findOneRandomRunningHostByHypervisor(Hypervisor.HypervisorType.KVM,
 vm.getDataCenterId());
     }
 
+    /**
+     * Returned by {@link #decideChain(VirtualMachine)} to describe the next 
backup's place in
+     * the chain: full vs incremental, the bitmap name to create, and (for 
incrementals) the
+     * parent bitmap and parent file path.
+     */
+    static final class ChainDecision {
+        final String mode;            // "full" or "incremental"
+        final String bitmapNew;
+        final String bitmapParent;    // null for full
+        // Per-volume parent backup file paths, one per current VM volume in 
deviceId order.
+        // null/empty for full. Each volume needs its own parent file because 
backup files
+        // are named after each volume's own UUID (root.<uuid>.qcow2 / 
datadisk.<uuid>.qcow2).
+        final List<String> parentPaths;
+        final String chainId;         // chain identifier this backup belongs 
to
+        final int chainPosition;      // 0 for full, N for the Nth incremental 
in the chain
+
+        private ChainDecision(String mode, String bitmapNew, String 
bitmapParent, List<String> parentPaths,
+                              String chainId, int chainPosition) {
+            this.mode = mode;
+            this.bitmapNew = bitmapNew;
+            this.bitmapParent = bitmapParent;
+            this.parentPaths = parentPaths;
+            this.chainId = chainId;
+            this.chainPosition = chainPosition;
+        }
+
+        static ChainDecision fullStart(String bitmapName) {
+            return new ChainDecision(NASBackupChainKeys.TYPE_FULL, bitmapName, 
null, null,
+                    UUID.randomUUID().toString(), 0);
+        }
+
+        /**
+         * Decision used when the incremental feature is disabled: a plain 
full backup that
+         * creates no bitmap and carries no chain identity, so nothing 
chain/checkpoint-related
+         * is sent to the agent or persisted. Keeps the feature-off path 
byte-for-byte legacy.
+         */
+        static ChainDecision legacyFull() {
+            return new ChainDecision(NASBackupChainKeys.TYPE_LEGACY_FULL, 
null, null, null, null, 0);
+        }
+
+        static ChainDecision incremental(String bitmapNew, String 
bitmapParent, List<String> parentPaths,
+                                         String chainId, int chainPosition) {
+            return new ChainDecision(NASBackupChainKeys.TYPE_INCREMENTAL, 
bitmapNew, bitmapParent,
+                    parentPaths, chainId, chainPosition);
+        }
+
+        boolean isIncremental() {
+            return NASBackupChainKeys.TYPE_INCREMENTAL.equals(mode);
+        }
+
+        boolean isLegacyFull() {
+            return NASBackupChainKeys.TYPE_LEGACY_FULL.equals(mode);
+        }
+    }
+
+    /**
+     * Decides whether the next backup for {@code vm} should be a fresh full 
or an incremental
+     * appended to the existing chain. Stopped VMs are always full (libvirt 
{@code backup-begin}
+     * requires a running QEMU process). The {@code nas.backup.full.every} 
ConfigKey controls
+     * how many backups (full + incrementals) form one chain before a new full 
is forced.
+     *
+     * <p>The decision is anchored on the VM's {@code 
nas.active_checkpoint_id} detail, which
+     * records the bitmap that currently exists on the running QEMU. After a 
restore that
+     * detail is cleared, so the next backup is automatically full — even 
though there may be
+     * a more recent "last backup taken" row in the database. The decision 
deliberately avoids
+     * relying on "last backup taken", because that row is misleading after a 
restore.</p>
+     */
+    protected ChainDecision decideChain(VirtualMachine vm) {
+        // Master switch — when the operator disables incrementals at the zone 
level the backup
+        // behaves exactly like the pre-incremental full-only path: no bitmap 
is generated and no
+        // chain/checkpoint metadata is created, sent to the agent, or 
persisted (legacy-full).
+        Boolean incrementalEnabled = 
NASBackupIncrementalEnabled.valueIn(vm.getDataCenterId());
+        if (incrementalEnabled == null || !incrementalEnabled) {
+            return ChainDecision.legacyFull();
+        }
+
+        final String newBitmap = "backup-" + System.currentTimeMillis() / 
1000L;
+
+        // Stopped VMs cannot do incrementals — script will also fall back, 
but we make the
+        // decision here so we register the right type up-front.
+        if (VirtualMachine.State.Stopped.equals(vm.getState())) {
+            return ChainDecision.fullStart(newBitmap);
+        }
+
+        Integer fullEvery = NASBackupFullEvery.valueIn(vm.getDataCenterId());
+        if (fullEvery == null || fullEvery <= 1) {
+            // Disabled or every-backup-is-full mode.
+            return ChainDecision.fullStart(newBitmap);
+        }
+
+        // 1. If the VM has no active_checkpoint_id, there is no bitmap on the 
host to use as
+        //    a parent. This is the case after restore (we clear it), after VM 
was just assigned
+        //    to the offering, or on the very first backup.
+        String activeCheckpoint = readVmActiveCheckpoint(vm.getId());
+        if (activeCheckpoint == null) {
+            return ChainDecision.fullStart(newBitmap);
+        }
+
+        // 2. The most-recent BackedUp backup is the only safe parent — after 
restore the
+        //    next backup is always a fresh full, so anything older has a 
rotated-out bitmap.
+        //    If the latest backup's bitmap doesn't match the VM's 
active_checkpoint_id, the
+        //    chain is broken: force a full.
+        Backup parent = findLatestBackedUpBackup(vm.getId());
+        if (parent == null || !activeCheckpoint.equals(readDetail(parent, 
NASBackupChainKeys.BITMAP_NAME))) {
+            LOG.debug("VM {} latest backup does not match 
active_checkpoint_id={} — forcing full",
+                    vm.getInstanceName(), activeCheckpoint);
+            return ChainDecision.fullStart(newBitmap);
+        }
+
+        String parentChainId = readDetail(parent, NASBackupChainKeys.CHAIN_ID);
+        int parentChainPosition = chainPosition(parent);
+        if (parentChainId == null || parentChainPosition == Integer.MAX_VALUE) 
{
+            return ChainDecision.fullStart(newBitmap);
+        }
+
+        // Force a fresh full when the chain has reached the configured length.
+        if (parentChainPosition + 1 >= fullEvery) {
+            return ChainDecision.fullStart(newBitmap);
+        }
+
+        // The script needs the parent backup's on-NAS file path PER VOLUME so 
it can rebase
+        // each new qcow2 onto the matching parent. The paths are stored 
relative to the NAS
+        // mount root — the script resolves them inside its mount session. 
When alignment
+        // fails (volume count changed, etc.) compose returns null and we fall 
back to full
+        // so we don't risk corrupting the chain.
+        List<String> parentPaths = composeParentBackupPaths(parent, 
vm.getId());
+        if (parentPaths == null) {
+            LOG.debug("VM {} parent backup {} volume layout no longer matches 
current VM — forcing full",
+                    vm.getInstanceName(), parent.getUuid());
+            return ChainDecision.fullStart(newBitmap);
+        }
+        return ChainDecision.incremental(newBitmap, activeCheckpoint, 
parentPaths,
+                parentChainId, parentChainPosition + 1);
+    }
+
+    /**
+     * Read the {@code nas.active_checkpoint_id} VM detail. Returns {@code 
null} when no detail
+     * exists (post-restore, first backup, or after explicit reset).
+     */
+    private String readVmActiveCheckpoint(long vmId) {
+        VMInstanceDetailVO d = vmInstanceDetailsDao.findDetail(vmId, 
NASBackupChainKeys.VM_ACTIVE_CHECKPOINT_ID);
+        if (d == null) {
+            return null;
+        }
+        String v = d.getValue();
+        return (v == null || v.isEmpty()) ? null : v;
+    }
+
+    /**
+     * Locate the most-recent {@code BackedUp} backup for {@code vmId}. The 
chain invariant
+     * guarantees the latest backup is the only valid incremental parent — 
after restore the
+     * next backup is always a fresh full, and {@link #decideChain} checks the 
bitmap matches.
+     */
+    private Backup findLatestBackedUpBackup(long vmId) {
+        List<Backup> history = backupDao.listByVmId(null, vmId);
+        if (history == null || history.isEmpty()) {
+            return null;
+        }
+        return history.stream()
+                .filter(b -> Backup.Status.BackedUp.equals(b.getStatus()))
+                .max(Comparator.comparing(Backup::getDate))
+                .orElse(null);
+    }
+
+    private String readDetail(Backup backup, String key) {
+        BackupDetailVO d = backupDetailsDao.findDetail(backup.getId(), key);
+        return d == null ? null : d.getValue();
+    }
+
+    /**
+     * Compose the on-NAS path of EVERY parent backup file (one per VM volume) 
in the same
+     * order the script will iterate the current VM's disks (deviceId asc). 
Relative to the
+     * NAS mount, matches the layout written by {@code nasbackup.sh}:
+     *   first  disk -> {@code <backupPath>/root.<volUuid>.qcow2}
+     *   others      -> {@code <backupPath>/datadisk.<volUuid>.qcow2}
+     *
+     * Returns {@code null} if the parent's stored volume count doesn't match 
the current VM's
+     * volume count. Volume attach/detach is blocked while a VM is assigned to 
a backup offering;
+     * if the offering was removed and re-assigned the active checkpoint is 
cleared in
+     * {@link #removeVMFromBackupOffering}, so this method doesn't need to 
revalidate volume
+     * identities — a count mismatch is the only way to reach this branch with 
a non-null
+     * active_checkpoint_id.
+     */
+    private List<String> composeParentBackupPaths(Backup parent, long vmId) {
+        // backupPath is stored as externalId by createBackupObject — e.g.
+        // "i-2-1234-VM/2026.04.27.13.45.00".
+        String dir = parent.getExternalId();
+        if (dir == null || dir.isEmpty()) {
+            return null;
+        }
+
+        List<Backup.VolumeInfo> parentVols = parent.getBackedUpVolumes();
+        if (parentVols == null || parentVols.isEmpty()) {
+            return null;
+        }
+
+        List<VolumeVO> currentVols = volumeDao.findByInstance(vmId);
+        if (currentVols == null || currentVols.size() != parentVols.size()) {
+            return null;
+        }
+
+        // parentVols is in deviceId order at the time the parent was taken. 
The script names
+        // files as root.<uuid>.qcow2 for the first volume and 
datadisk.<uuid>.qcow2 for the rest.
+        List<String> paths = new ArrayList<>(parentVols.size());
+        for (int i = 0; i < parentVols.size(); i++) {
+            String volUuid = parentVols.get(i).getUuid();
+            String prefix = (i == 0) ? "root" : "datadisk";
+            paths.add(dir + "/" + prefix + "." + volUuid + ".qcow2");

Review Comment:
   Instead of creating the path based on uuid get the path directly using 
getPath in Backup.VolumeInfo.
   UUID and the actual path can differ after volume migration.  path is used to 
name the backup file not uuid.
   you can keep rest of the code as is. Just replace uuid with path
   
   example Backup.VolumeInfo after migration:
   
`[{"uuid":"17ce7a22-eaea-4391-b4f5-dc263379b05a","type":"ROOT","size":10737418240,"path":"1d7386d4-cc84-40ba-ae4d-03b056c07725.qcow2","deviceId":0,"diskOfferingId":"b55fb6dc-2dc1-4781-8e7c-01ba7931e84e"}`



##########
scripts/vm/hypervisor/kvm/nasbackup.sh:
##########
@@ -113,20 +122,97 @@ backup_running_vm() {
   mount_operation
   mkdir -p "$dest" || { echo "Failed to create backup directory $dest"; exit 
1; }
 
+  # Determine effective mode for this run.
+  # Legacy callers (no -M argument) get the original full-only behavior with 
no checkpoint.
+  # The Java wrapper (LibvirtTakeBackupCommandWrapper) pre-validates required 
args before
+  # invoking the script; the case below is a defensive fallback for direct 
invocations.
+  local effective_mode="${MODE:-legacy-full}"
+  local make_checkpoint=0
+  case "$effective_mode" in
+    incremental|full)
+      make_checkpoint=1
+      ;;
+    legacy-full)
+      make_checkpoint=0
+      ;;
+    *)
+      echo "Unknown mode: $effective_mode"
+      cleanup
+      exit 1
+      ;;
+  esac
+
+  # When incremental, make sure the parent checkpoint is registered with 
libvirt. CloudStack
+  # rebuilds the domain XML on every VM start, which wipes libvirt's in-memory 
checkpoint
+  # registry, while the dirty bitmap persists on the qcow2 (QEMU re-loads it 
on start). A
+  # fresh checkpoint-create cannot be used then — QEMU reports "Bitmap already 
exists" — so the
+  # parent must be re-registered with --redefine. libvirt only needs the 
checkpoint name and a
+  # creationTime for a redefine (the value need not be accurate — checkpoints 
are ephemeral),
+  # so we synthesize a minimal XML on the fly instead of persisting the full 
checkpoint dump.
+  #
+  # First verify the parent bitmap actually exists on the running qcow2 — it 
can be absent after
+  # a migration even though the orchestrator's active_checkpoint says it 
should be there. If it
+  # is gone, fall back to a full backup rather than letting backup-begin fail 
below.
+  if [[ "$effective_mode" == "incremental" ]]; then
+    if ! virsh -c qemu:///system qemu-monitor-command "$VM" 
'{"execute":"query-block"}' 2>/dev/null | grep -q "\"$BITMAP_PARENT\""; then
+      log -e "incremental: parent bitmap $BITMAP_PARENT not present on the 
qcow2 — falling back to full"
+      echo "INCREMENTAL_FALLBACK=true"
+      effective_mode="full"
+    fi
+  fi
+
+  if [[ "$effective_mode" == "incremental" ]]; then
+    if ! virsh -c qemu:///system checkpoint-list "$VM" --name 2>/dev/null | 
grep -qx "$BITMAP_PARENT"; then
+      redefine_xml=$(mktemp)
+      printf 
'<domaincheckpoint><name>%s</name><creationTime>%s</creationTime></domaincheckpoint>'
 \
+        "$BITMAP_PARENT" "$(date +%s)" > "$redefine_xml"
+      if virsh -c qemu:///system checkpoint-create "$VM" --xmlfile 
"$redefine_xml" --redefine > /dev/null 2>&1; then
+        rm -f "$redefine_xml" # parent checkpoint re-registered; the 
incremental can proceed against it
+      else
+        rm -f "$redefine_xml"
+        # Parent checkpoint could not be re-registered — fall back to a full 
backup in place so
+        # the chain restarts cleanly instead of failing. Emit a stdout marker 
so the wrapper
+        # records this backup as a full (incrementalFallback=true).
+        log -e "incremental: parent checkpoint $BITMAP_PARENT could not be 
re-registered — falling back to full"
+        echo "INCREMENTAL_FALLBACK=true"

Review Comment:
   Since the checkpoint xml is no longer being saved in the backup folder, 
can't we move the logic of redefining checkpoint to Java.
   
   same as the cpomment about checking bitmaps, you can reuse the code that 
does exactly this in 
`LibvirtStartBackupCommandWrapper.ensureFromCheckpointExists()`
   
   
   We can remove the INCREMENTAL_FALLBACK handling completely from the wrapper 
then.
   ```
   
           // The script self-heals to a full backup when an incremental can't 
proceed (e.g. the
           // parent checkpoint can't be re-registered) and signals it with 
INCREMENTAL_FALLBACK
           // on stdout. Detect it, then strip the marker line before parsing 
the backup size.
           String rawStdout = result.second();
           boolean incrementalFallback = 
rawStdout.contains(INCREMENTAL_FALLBACK_MARKER);
           String stdout = stripMarkerLines(rawStdout).trim();
           long backupSize = parseBackupSize(stdout, diskPaths);
   
           BackupAnswer answer = new BackupAnswer(command, true, stdout);
           answer.setSize(backupSize);
           // A successful run always created command.getBitmapNew() (full and 
incremental both do;
           // it is null for legacy-full, which the orchestrator treats as "no 
bitmap").
           answer.setBitmapCreated(command.getBitmapNew());
           answer.setIncrementalFallback(incrementalFallback);
           return answer;
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to