JoaoJandre opened a new issue, #8907:
URL: https://github.com/apache/cloudstack/issues/8907

   ##### ISSUE TYPE
    * Feature Idea
   
   
   ##### COMPONENT NAME
   ~~~
   Volume Snapshot
   ~~~
   
   ##### CLOUDSTACK VERSION
   
   ~~~
   4.20/main
   ~~~
   
   ##### CONFIGURATION
   
   ##### OS / ENVIRONMENT
   
   KVM, file storage (NFS, Shared mountpoint, local storage)
   
   ##### SUMMARY
   
   This spec addresses a new feature to allow users to create differential
volume snapshots/backups on KVM.
   
   # 1. Problem description
   
   Currently, when taking a volume snapshot/backup with KVM as the hypervisor,
ACS creates a temporary delta and makes it the VM's source file, with the
original volume as a backing store. After that, the original volume is copied
to another directory (with `qemu-img convert`); if the
`snapshot.backup.to.secondary` configuration is set to `true`, the snapshot is
copied to the secondary storage, turning it into a backup. The delta is then
merged back into the original volume. With this approach, every volume snapshot
is a full snapshot/backup. However, in many situations, always taking full
snapshots of volumes is costly for both the storage network and the storage
systems. ACS already supports differential snapshots for XenServer volumes.
Therefore, the goal of this proposal is to extend the current KVM workflow to
offer a feature set similar to the one available with XenServer.
   
   For the sake of clarity, in this document, we will use the following 
definitions of snapshots and backups:
   
   * Snapshot: A snapshot is a full or incremental copy of a VM's volume in the 
primary storage;
   * Backup: A backup is a volume snapshot that is stored on secondary storage.
   
   # 2. Proposed changes
   
   To address the described problems, we propose to extend the volume snapshot 
feature on KVM that was normalized by #5297, allowing users to create 
differential volume snapshots on KVM. To give operators fine control over which 
type of snapshot is being taken, we propose to add a new global configuration 
`kvm.incremental.snapshot`, which can be overridden on the zone and cluster 
configuration levels; this configuration will be `false` by default.
   
   Using XenServer as the hypervisor, the `snapshot.delta.max` configuration is 
used to determine the number of volume deltas that will be kept simultaneously 
in the primary storage. We propose to use the same configuration for the 
incremental snapshot feature on KVM, and use it to limit the size of the 
snapshot backing chain on the primary/secondary storage. We will also update 
the configuration description to specify that this configuration is only used 
with XenServer and KVM. The implications of the `snapshot.delta.max` 
configuration will be explained in the <a href="#snap-creation" 
class="internal-link">snapshot/backup creation section</a>. 
   
   Also, it is important to note that, while the `snapshot.delta.max`
configuration defines the maximum number of deltas for a backing chain on the
primary/secondary storage, the maximum number of snapshots available to the
user is defined by the account's snapshot limit. The <a
href="#snap-interactions" class="internal-link">interactions between recurring
snapshots, configurations and account limits section</a> addresses the
relationship between account limits and configurations.
   
   ### 2.0.1. The DomainBackupBegin API
   
   To allow incremental snapshots on KVM, we propose to use Libvirt's 
`domainBackupBegin` API. This API allows the creation of either full snapshots 
or incremental snapshots; it also allows the creation of checkpoints, which 
Libvirt uses to create incremental snapshots. A checkpoint represents a point 
in time after which blocks changed by the hypervisor are tracked. Checkpoints
are Libvirt's abstraction of bitmaps; that is, a checkpoint always corresponds
to a bitmap on the VM's volume.
   
   The `domainBackupBegin` API has two main parameters that interest us: 
   
   * `backupXML`: this parameter contains details about the snapshot, including
which snapshot mode to use, whether the snapshot is incremental from a previous
checkpoint, which disks participate in the snapshot, and the snapshot
destination.
   * `checkpointXML`: when this parameter is provided, Libvirt atomically
creates a checkpoint covering the same point in time as the backup.
   
   When using Libvirt's `domainBackupBegin` API, if the `backupXML` has an
`<incremental>` tag specifying the name of a valid checkpoint, an incremental
snapshot is created based on that checkpoint. Furthermore, the API requires the
volume to be attached to a running or paused VM, as it uses the VM's process
(the QEMU process in the hypervisor operating system) to execute the volume
snapshot.
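
   For illustration, a push-mode `backupXML` that references a previous
checkpoint, and the `checkpointXML` created atomically with it, could look
roughly like the sketch below (disk target, checkpoint names and paths are
hypothetical):

   ~~~
   <!-- backup.xml: push-mode incremental backup of disk vda -->
   <domainbackup mode="push">
     <incremental>checkpoint-2</incremental>
     <disks>
       <disk name="vda" backup="yes" type="file">
         <driver type="qcow2"/>
         <target file="/mnt/0a1b2c3d/snapshots/snap-3.qcow2"/>
       </disk>
     </disks>
   </domainbackup>

   <!-- checkpoint.xml: checkpoint covering the same point in time as the backup -->
   <domaincheckpoint>
     <name>checkpoint-3</name>
     <disks>
       <disk name="vda" checkpoint="bitmap"/>
     </disks>
   </domaincheckpoint>
   ~~~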
   
   Libvirt's checkpoints are always linked to a VM; this means that if we
undefine or migrate the VM, they will be lost. However, the bitmap on the
volume does not depend on the VM; thus, if we save the checkpoint metadata by
using the `checkpointDumpXml` API, we can later use this XML to recreate the
checkpoint on the VM after it is migrated or stopped/started on
ACS[^stop-start-vm]. Therefore, even if the VM is migrated or recreated, we can
continue to take incremental snapshots.
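
   For example, with `virsh` (the VM and checkpoint names are hypothetical),
the dump-and-redefine cycle would look roughly like this:

   ~~~
   # Persist the checkpoint metadata while the VM is still defined
   virsh checkpoint-dumpxml <vm-name> checkpoint-3 > checkpoint-3.xml

   # Later, after the VM is redefined/migrated, recreate only the metadata;
   # the bitmap itself already lives in the volume's qcow2 file
   virsh checkpoint-create <vm-name> checkpoint-3.xml --redefine
   ~~~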
   
   More information on the `domainBackupBegin` API can be found in the 
[official 
documentation](https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainBackupBegin);
also, more information on Libvirt's checkpoints can be found in the
[official 
documentation](https://www.libvirt.org/html/libvirt-libvirt-domain-checkpoint.html#virDomainCheckpointCreateXML).
   
   ### 2.0.2. Limitations
   
   This feature will use the Libvirt `domainBackupBegin` API that was 
introduced in version 7.2.0 and extended to allow incremental snapshots in
version 7.6.0; furthermore, the incremental snapshot API needs qemu 6.1. Thus, 
this feature will only be available in environments with Libvirt 7.6.0+ and 
qemu 6.1+. If the `kvm.incremental.snapshot` configuration is `true`, but the 
hosts do not have the necessary Libvirt and qemu versions, an error will be 
raised when creating a snapshot.
   
   As the snapshots do not contain the bitmaps that were used to create them, 
after reverting a volume using a snapshot, the volume will have no bitmaps; 
thus, we will need to start a new snapshot chain.
   
   Furthermore, this feature will only be available when using file-based 
storage, such as shared mount point (iSCSI and FC), NFS and local storage. 
Other storage types for KVM, such as CLVM and RBD, need different approaches to 
allow incremental backups; therefore, they will not be covered by this spec.
   
   # 2.1. Snapshot/Backup creation
   <section id="snap-creation"></section>
   
   The current snapshot creation process is summarized in the diagram below. 
Its main flaw is that we always copy the snapshot to a directory on the primary 
storage, even if we will later copy it to the secondary storage, doubling the 
strain on the storage systems.
   
   <img src="https://res.cloudinary.com/sc-clouds/image/upload/v1706624986/specs/cloudstack/kvm-incremental-snapshots/Old_snapshot_creation_s1o0gr.png"
        alt="create-snapshot-old"
        style="width: 100%; height: auto;">
   
   1. If the volume is attached to a running VM, take a disk-only snapshot by 
calling Libvirt's `DomainSnapshotCreateXML` API;
   2. Use qemu-img convert to copy the volume source file to a directory in the 
primary storage;
   3. If the volume is attached to a running VM, merge the temporary delta back
into the old volume source file;
   4. If snapshot backup is enabled, we create a backup by copying the snapshot 
to the secondary storage and then delete it from the primary storage.
   
   The proposed incremental snapshot creation workflow is summarized in the 
following diagram. We propose to optimize the current workflow, as well as add 
a new one that allows the creation of incremental snapshots.
   
   <img src="https://res.cloudinary.com/sc-clouds/image/upload/v1708696515/specs/cloudstack/kvm-incremental-snapshots/k7yfghgwuualip0unhzu.png"
        alt="create-snapshot"
        style="width: 100%; height: auto;">
   
   * If `kvm.incremental.snapshot` is false, we keep the old API usage and main 
logic, but we propose to copy the snapshot directly to the final destination, 
instead of always copying to the primary storage first.
   
   * If `kvm.incremental.snapshot` is true and the host does not have the 
minimum Libvirt and qemu versions, an exception will be thrown.
   
   * If `kvm.incremental.snapshot` is true and the host has the minimum Libvirt 
and qemu versions, the following workflow will be executed:
   
       1. If the volume is not attached to a running VM on ACS, we have to 
create a paused transient dummy VM to be able to call the `domainBackupBegin` 
API. Also, to use the previous checkpoints, we must recreate them on the dummy 
VM. After the snapshot process is done, we will destroy the dummy VM.
       2. If we are starting a snapshot chain, create a full snapshot by 
calling the `domainBackupBegin` API and not referencing any old checkpoints; 
else, we call the same API, but referencing the last checkpoint. Either way, we 
make the API create the snapshot directly on the correct storage based on the 
`snapshot.backup.to.secondary` configuration;
        3. If it isn't the first snapshot in the chain, we rebase it so that it
points to the absolute path of the previous incremental
snapshot[^absolute-path]. The path of the previous snapshot depends on the
`snapshot.backup.to.secondary` configuration: if it is `false`, the primary
storage path is always used, as the snapshots will be on the primary storage;
if it is `true`, the same logic applies, using the secondary storage path. A
sketch of steps 2 and 3 in terms of the underlying calls follows this list.
       4. Dump the created checkpoint, and save it to the correct storage 
depending on the `snapshot.backup.to.secondary` configuration.
       5. Edit the checkpoint dump, remove everything but the checkpoint name, 
creation time and the disk that was snapshotted; also, change the parent 
checkpoint to the correct one.
       6. Redefine the checkpoint on the VM.
       7. If a dummy VM was created, we destroy it.
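
   A rough sketch of steps 2 and 3 in terms of the underlying calls, assuming
the hypothetical `backup.xml`/`checkpoint.xml` illustrated earlier and
hypothetical snapshot paths:

   ~~~
   # Step 2: take an incremental snapshot based on the last checkpoint
   # (backup.xml references <incremental>checkpoint-2</incremental>)
   virsh backup-begin <vm-name> backup.xml checkpoint.xml

   # Step 3: rebase the new delta onto the absolute path of the previous
   # incremental snapshot without rewriting data (-u changes metadata only)
   qemu-img rebase -u -F qcow2 -b /mnt/0a1b2c3d/snapshots/snap-2.qcow2 \
       /mnt/0a1b2c3d/snapshots/snap-3.qcow2
   ~~~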
   
   The process of editing the checkpoint dump and then redefining it on the VM
is needed because, even though this metadata is not important to the backup,
Libvirt will validate it, if present, when recreating the checkpoints on other
VMs. Also, we must manually edit the checkpoint parent because Libvirt always
assumes that a new checkpoint is a child of the latest one, even if the
checkpoints are not connected.
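
   As an illustration, after trimming, the redefined checkpoint dump would be
reduced to something like the following (names and timestamp are hypothetical):

   ~~~
   <domaincheckpoint>
     <name>checkpoint-3</name>
     <parent>
       <name>checkpoint-2</name>
     </parent>
     <creationTime>1710334355</creationTime>
     <disks>
       <disk name="vda" checkpoint="bitmap" bitmap="checkpoint-3"/>
     </disks>
   </domaincheckpoint>
   ~~~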
   
   During the snapshot process, the volume will be in the `Snapshotting` state;
while the volume is in this state, no other operations, such as volume
attach/detach, can be performed on it. Also, if the volume is attached to a VM,
the snapshot job is queued alongside the other VM jobs; therefore, we do not
have to worry about the VM being stopped/started during the volume snapshot, as
jobs are processed sequentially for each given VM.
   
   At runtime, if the `kvm.incremental.snapshot` configuration is changed from
`false` to `true`, the next snapshot taken of a volume will begin a new
snapshot chain, that is, it will be a full snapshot and the later ones will be
incremental. If the configuration is changed from `true` to `false`, the
current snapshot chains of the volumes will not be continued, and future
snapshots will be full snapshots.
   
   As different clusters might have different Libvirt versions, the
`kvm.incremental.snapshot` configuration can be overridden at the cluster
level. If the `kvm.incremental.snapshot` configuration is true for a cluster
that does not have the needed Libvirt version, an error will be raised
informing the operator that the configuration should be set to false on this
cluster, as it does not support the feature. In any case, snapshot reversion
will be the same for any cluster, as it will still use the same APIs that are
used today.
   
   We propose to save the checkpoint as a file in the primary/secondary storage
instead of directly in the database because the `domainBackupBegin` API needs a
file as input; if we kept the checkpoint XML in the database, we would need to
create a temporary file anyway. Furthermore, the checkpoints are only useful if
their corresponding snapshot exists; if we lose the storage where the snapshot
is, the checkpoint becomes useless. Therefore, keeping them together in the
primary/secondary storage seems to be the best approach in the current scenario.
   
   To persist the checkpoint location in the database, a new column will be 
added to the `snapshot_store_ref` table: `kvm_checkpoint_path`. This column 
will have the `varchar(255)` type (same as the `install_path` column) and will 
be `null` by default. When an incremental snapshot is taken in KVM, its 
corresponding checkpoint path will be saved in this column. Also, another 
column will be added to the same table: `end_of_chain`; this column will have 
the `int(1) unsigned` type and will be 1 by default. This column will be used 
mainly when a process causes a chain to be severed and the next snapshot must 
be a full one, such as when restoring a snapshot.
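
   A sketch of the corresponding schema change (the exact upgrade script is an
implementation detail) would be:

   ~~~
   ALTER TABLE `cloud`.`snapshot_store_ref`
       ADD COLUMN `kvm_checkpoint_path` varchar(255) DEFAULT NULL,
       ADD COLUMN `end_of_chain` int(1) unsigned DEFAULT 1;
   ~~~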
   
   When working with incremental snapshots, the first snapshot in the snapshot 
chain will always be a full snapshot: this is needed as we must have something 
to start "incrementing" from. The maximum size of the snapshot chain on the 
primary/secondary storage will be limited by the `snapshot.delta.max` 
configuration; after this limit is reached, a new snapshot chain will be 
started, that is, the next snapshot will be a full snapshot. Also, to avoid 
having too many checkpoints on a VM, we will delete the old checkpoints when 
creating a new snapshot chain.
   
   The reason we propose a limited snapshot chain instead of an unlimited one
is that, while taking full snapshots is costly, the risk of eventually losing
the base snapshot, and therefore all the snapshots in the chain, increases with
the size of the chain. This approach has been tested and validated by the
industry; it is used in XenServer and in VMware with Veeam as the backup
provider, for example. Nonetheless, taking a full snapshot from time to time is
still far cheaper than always taking full snapshots.
   
   Let us take the following example: the user has a volume with 500 GB
allocated that grows 20 GB per day; they have a recurring snapshot policy set
to create snapshots every day and keep the last 7 days stored. Using full
snapshots, by the end of the week they will be using 3.92 TB of storage; using
incremental snapshots with `snapshot.delta.max` greater than 6, only 620 GB of
storage would be used.
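
   Breaking the numbers down:

   ~~~
   Full snapshots:        500 + 520 + 540 + 560 + 580 + 600 + 620 = 3920 GB ≈ 3.92 TB
   Incremental snapshots: 500 (full) + 6 × 20 (daily deltas)      =  620 GB
   ~~~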
   
   ### 2.1.1. Interactions between recurring snapshots/backups, configurations 
and account limits
   <section id="snap-interactions"></section>
   
   This section will give some examples to illustrate the interactions between 
the user's snapshot/backup limit, the `snapshot.delta.max` configuration and 
the `maxSnaps` parameter of the `createSnapshotPolicy` API. The examples are 
described in the table below. When "removing" incremental backups, they might
stay for a while in the primary/secondary storage; more details on incremental
backup deletion can be found in the <a href="#snap-deletion"
class="internal-link">snapshot deletion section</a>.
   
   
   | Use-case | `snapshot.delta.max` | `maxSnaps` | Account/Domain Limit | Behavior |
   | ------ | ------ | ------ | ------ | ------ |
   | 1 | 7 | 7 | 7 | When taking the 8th backup, as the `maxSnaps` limit was reached, the 1st one will be logically removed; as `snapshot.delta.max` was reached, the new backup will be the start of a new backing chain |
   | 2 | 7 | 5 | 7 | When taking the 6th backup, as the `maxSnaps` limit was reached, the 1st one will be logically removed; however, the backup will still be an incremental backup, and a new chain will only be started on the 8th backup, when `snapshot.delta.max` is reached |
   | 3 | 5 | 7 | 7 | When taking the 6th backup, as `snapshot.delta.max` was reached, the new backup will be the start of a new backing chain; however, the 1st backup will only be removed after `maxSnaps` is reached |
   | 4 | 7 | 7 | 5 | When taking the 6th backup, an error will be thrown, as the user has reached the account/domain snapshot limit |
   
   
   # 2.2. Snapshot/Backup Reversion
   
   The proposed new snapshot restore process is summarized in the diagram below.
   
   <img src="https://res.cloudinary.com/sc-clouds/image/upload/v1704718061/specs/cloudstack/kvm-incremental-snapshots/Backup_reversion1_3_nwwuwt.png"
        alt="snapshot-reversion"
        style="width: 100%; height: auto;">
   
   
   There are two possibilities when restoring snapshots:
   
   1. The current full snapshot restore process will stay the same: we copy the
snapshot from the correct storage and replace the VM's source file.
   2. Restoring an incremental snapshot is simple, as long as we keep the
backing file chain consistent, as described in the <a href="#snap-creation"
class="internal-link">snapshot/backup creation section</a>. We can use a single
`qemu-img convert` command to create a consolidated copy of the snapshot, that
is, the command will consolidate all the snapshot backing files and copy the
result to where we need it (see the sketch after this list).
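
   A minimal sketch of that consolidation, assuming hypothetical paths for the
snapshot chain and for the destination volume:

   ~~~
   # snap-3.qcow2 has snap-2.qcow2 (and, transitively, the full snapshot) as
   # backing files; convert flattens the whole chain into a standalone image
   qemu-img convert -O qcow2 /mnt/0a1b2c3d/snapshots/snap-3.qcow2 \
       /mnt/9f8e7d6c/<volume-uuid>
   ~~~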
   
   After restoring a snapshot, we cannot continue the incremental snapshot
chain; therefore, `end_of_chain` has to be marked as true on the latest
snapshot created. This way, when creating the next snapshot, we will know that
it must start a new chain.
   
   
   # 2.3. Template/Volume creation from snapshot/backup
   
   The current process of creating a template/volume from a snapshot can be 
adapted to allow creating from an incremental snapshot. The only difference is 
that we need to use `qemu-img convert` on the snapshot before sending the 
command to the SSVM, similar to what is currently done when the snapshot is not 
on the secondary storage. After the template/volume is created, we can remove 
the converted image from the secondary storage. The diagram below (taken from 
#5124) summarizes the current template/volume creation from snapshot/backup 
process. 
   
   <img src="https://res.cloudinary.com/sc-clouds/image/upload/v1702059856/specs/cloudstack/kvm-incremental-snapshots/122423092-0b685c00-cf64-11eb-8a73-f86032a09412_f9xkkd.jpg"
        alt="create-template-from-snapshot"
        style="width: 100%; height: auto;">
   
   
   # 2.4. Snapshot/Backup deletion
   <section id="snap-deletion"></section>
   
   The diagram below summarizes the snapshot deletion process:
   
   <img src="https://res.cloudinary.com/sc-clouds/image/upload/v1703772199/specs/cloudstack/kvm-incremental-snapshots/New_Incremental_Backup_Deletion_1_o3f57h.png"
        alt="snapshot-deletion"
        style="width: 100%; height: auto;">
   
   When deleting incremental snapshots, we have to check two things: whether
the snapshot has any ancestors and whether it has any descendants.
   
   After marking the snapshot as removed in the database, if it has any active
descendants, we will keep it in the primary/secondary storage until those
descendants are removed; only then will we delete the snapshot from the
storage. If the snapshot does not have any descendants, we delete it
immediately. We do this to preserve the descendants' ability to be restored;
otherwise, the backing chain would be broken and all the descendants would be
useless.
   
   After checking for descendants, we check whether the snapshot has any
ancestors; if it does, we delete any ancestors that were removed in the
database but were kept in storage.
   
   Checkpoint deletion is directly linked to the corresponding incremental
snapshot: we must keep the checkpoint until the snapshot is deleted; otherwise,
we will not be able to continue taking incremental snapshots after a VM
migration or volume detach, for example.
   
   # 2.5. Volume migration
   
   ### 2.5.1 Live volume migration
   
   When a volume is created via linked-clone, it has a source file and a
backing file; currently, when live migrating from NFS to NFS, the same
structure will be maintained on the new storage; otherwise, ACS consolidates
the source and backing files while migrating. Also, before migrating the
volume, ACS will check if the template is on the destination storage; if it is
not, it will be copied there, even though for consolidated volumes this is
unnecessary. Furthermore, the current live volume migration always migrates the
VM's root volume, even if the user only requested a data disk migration,
putting unnecessary strain on storage and network systems.
   
   Moreover, when live migrating a volume of a VM that has any volume on NFS
storage, if the volumes on NFS are not migrated, the migration will fail. This
happens because the current migration command does not set the
`VIR_MIGRATE_PARAM_MIGRATE_DISKS` parameter, which specifies which volumes
should be migrated alongside the VM migration; without this parameter, Libvirt
assumes that all the volumes should be migrated, raising an error when it tries
to overwrite the NFS volumes with themselves.
   
   Furthermore, when migrating to an NFS storage, ACS will validate if the 
destination host has access to the source storage. This causes an issue when 
migrating from local storage to NFS storage, as the destination host will never 
have direct access to the source host's local storage. As the current volume 
migration process has several inconsistencies, it will be normalized alongside 
this feature. The current volume migration workflow is summarized in the 
diagram below.
   
   <img src="https://res.cloudinary.com/sc-clouds/image/upload/v1703772418/specs/cloudstack/kvm-incremental-snapshots/Old_Volume_Migration_4_i6fzz1.png"
        alt="migrate-volume-old"
        style="width: 100%; height: auto;">
   
   We propose to normalize the migration behavior when migrating from 
file-based storage to file-based storage. We will always consolidate the volume 
with its backing file; thus, copying the template to the new storage will be 
unnecessary. This way the live volume migration will always have the same 
behavior. Furthermore, we will only migrate the VM's root volume when the user 
asks for it. Also, we will remove the special case of checking if the 
destination host has access to the source storage when the destination storage 
is NFS.   
   
   Moreover, we will change the migration API used from `virDomainMigrate2` to
`virDomainMigrate3`; this API allows us to set the
`VIR_MIGRATE_PARAM_MIGRATE_DISKS` parameter to tell Libvirt to only migrate the
volumes we want, therefore avoiding the aforementioned error with volumes on
NFS storage.
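
   ACS drives the migration through Libvirt's bindings rather than virsh, but
the equivalent virsh invocation illustrates the parameter; the host URI and
disk target below are hypothetical:

   ~~~
   # Only the data disk vdb is copied to the destination storage; volumes on
   # shared NFS are left in place instead of being "migrated" onto themselves
   virsh migrate <vm-name> qemu+tcp://<destination-host>/system \
       --live --persistent --copy-storage-all --migrate-disks vdb
   ~~~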
   
   As ACS's live volume migration also requires a VM migration on KVM, and
Libvirt's migrate command does not guarantee that the volume bitmaps will be
copied, after live migrating a volume we will have to start a new snapshot
chain. The new migration workflow is summarized in the diagram below.
   
   <img src="https://res.cloudinary.com/sc-clouds/image/upload/v1710334355/specs/cloudstack/kvm-incremental-snapshots/New_Live_Volume_Migration_plpjks.png"
        alt="migrate-volume-new"
        style="width: 100%; height: auto;">
   
   We will not allow volume migration when the volume has snapshots on the 
primary storage, as there are a few cases where this could bring 
inconsistencies. For example, if we live-migrate the VM and migrate the volume 
from local storage to a zone/cluster scope storage, the VM's destination host 
will not have access to the old snapshots, making them useless. This limitation
is already present in the current implementation, where all the snapshots of a
volume being migrated are listed from the database and, if any of them are not
located on the secondary storage, an exception is raised.
   
   
   ### 2.5.2 Cold volume migration
   
   When performing cold migration on a volume using KVM as a hypervisor, ACS 
will first use `qemu-img convert` to copy the volume to secondary storage. 
Then, the volume will be copied to the destination primary storage. The diagram 
below summarizes the cold migration workflow.
   
   <img src="https://res.cloudinary.com/sc-clouds/image/upload/v1703779206/specs/cloudstack/kvm-incremental-snapshots/cold_volume_migration_zzhm7b.png"
        alt="cold-volume-migration-old"
        style="width: 100%; height: auto;">
   
   The only change that we need is to add the `--bitmaps` parameter to the
`qemu-img convert` command used, so that the volume keeps its existing bitmaps;
otherwise, we would need to create a new backup chain for the next backup.
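
   For instance, with hypothetical paths:

   ~~~
   # Copy the volume to secondary storage while preserving its dirty bitmaps
   qemu-img convert -O qcow2 --bitmaps /mnt/0a1b2c3d/<volume-uuid> \
       /mnt/secondary/volumes/<volume-uuid>
   ~~~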
   
   # 2.6. Checkpoint management
   
   There are a few processes that will be tweaked to keep checkpoints
consistent (a sketch of the corresponding virsh operations follows the list):
   
   1. VM start: If the VM is being started with an attached volume that has
incremental snapshots, the volume's checkpoints must be recreated after the VM
is started; otherwise, we won't be able to continue the incremental snapshot
chain.
   2. VM migration: After the VM is migrated, the volume's checkpoints must 
also be recreated, for the same reason as 1.
   3. Volume attach: After attaching a volume to a VM, the volume's checkpoints 
must be recreated in the VM, for the same reason as 1.
   4. Volume detach: After detaching a volume from a VM, to avoid leaving 
useless metadata on the VM, the volume's checkpoints must be removed from the 
VM.
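
   In virsh terms, the recreation in cases 1-3 and the removal in case 4
roughly map to the calls below; using `--metadata` on deletion, so that the
bitmap stays on the volume, is an assumption about how the detach case would be
handled:

   ~~~
   # Cases 1-3: recreate the checkpoint metadata from the saved dump
   virsh checkpoint-create <vm-name> checkpoint-3.xml --redefine

   # Case 4: drop the checkpoint metadata from the VM on detach; --metadata
   # leaves the bitmap in the volume so the chain can continue after re-attach
   virsh checkpoint-delete <vm-name> checkpoint-3 --metadata
   ~~~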
   
   
   [^absolute-path]: We can do this because NFS storage is always mounted using
the same path across all hosts; the path is always `/mnt/<uuid>`, where
`<uuid>` is derived from the NFS host and path. For SharedMountPoint storages,
the path must also be the same across hosts.
   [^stop-start-vm]: When a VM is stopped on ACS with KVM as the hypervisor,
the VM actually gets undefined in Libvirt; later, when the VM is started, it
gets recreated.
   

