JoaoJandre opened a new issue, #8907:
URL: https://github.com/apache/cloudstack/issues/8907

   ##### ISSUE TYPE
    * Feature Idea
   
   
   ##### COMPONENT NAME
   ~~~
   Volume Snapshot
   ~~~
   
   ##### CLOUDSTACK VERSION
   
   ~~~
   4.20/main
   ~~~
   
   ##### CONFIGURATION
   
   ##### OS / ENVIRONMENT
   
   KVM, file storage (NFS, Shared mountpoint, local storage)
   
   ##### SUMMARY
   
   This spec addresses a new feature to allow users to create differential
volume snapshots/backups on KVM.
   
   # 1. Problem description
   
   Currently, when taking a volume snapshot/backup with KVM as the hypervisor,
ACS creates a temporary delta and makes it the VM's source file, with the
original volume as a backing store. After that, the original volume is copied
to another directory (with `qemu-img convert`); if the
`snapshot.backup.to.secondary` configuration is set to `true`, the snapshot is
copied to the secondary storage, turning it into a backup. The delta is then
merged back into the original volume. With this approach, every volume snapshot
is a full snapshot/backup. However, in many situations, always taking full
snapshots of volumes is costly for both the storage network and the storage
systems. ACS already supports differential snapshots for XenServer volumes.
Therefore, the goal of this proposal is to extend the current KVM workflow to
offer a feature set similar to the one available with XenServer.
   
   For the sake of clarity, in this document, we will use the following 
definitions of snapshots and backups:
   
   * Snapshot: A snapshot is a full or incremental copy of a VM's volume in the 
primary storage;
   * Backup: A backup is a volume snapshot that is stored on secondary storage.
   
   # 2. Proposed changes
   
   To address the described problems, we propose to extend the volume snapshot 
feature on KVM that was normalized by #5297, allowing users to create 
differential volume snapshots on KVM. To give operators fine control over which 
type of snapshot is being taken, we propose to add a new global configuration 
`kvm.incremental.snapshot`, which can be overridden on the zone and cluster 
configuration levels; this configuration will be `false` by default.
   
   Using XenServer as the hypervisor, the `snapshot.delta.max` configuration is 
used to determine the number of volume deltas that will be kept simultaneously 
in the primary storage. We propose to use the same configuration for the 
incremental snapshot feature on KVM, and use it to limit the size of the 
snapshot backing chain on the primary/secondary storage. We will also update 
the configuration description to specify that this configuration is only used 
with XenServer and KVM. The implications of the `snapshot.delta.max` 
configuration will be explained in the <a href="#snap-creation" 
class="internal-link">snapshot/backup creation section</a>. 
   
   Also, it is important to note that, while the `snapshot.delta.max`
configuration defines the maximum number of deltas for a backing chain on the
primary/secondary storage, the maximum number of snapshots available to the
user is defined by the account's snapshot limit. The <a
href="#snap-interactions" class="internal-link">interactions between recurring
snapshots, configurations and account limits section</a> addresses the
relationship between account limits and configurations.
   
   ### 2.0.1. The DomainBackupBegin API
   
   To allow incremental snapshots on KVM, we propose to use Libvirt's 
`domainBackupBegin` API. This API allows the creation of either full snapshots 
or incremental snapshots; it also allows the creation of checkpoints, which 
Libvirt uses to create incremental snapshots. A checkpoint represents a point 
in time after which blocks changed by the hypervisor are tracked. Checkpoints
are Libvirt's abstraction of bitmaps; that is, a checkpoint always corresponds
to a bitmap on the VM's volume.
   
   The `domainBackupBegin` API has two main parameters that interest us: 
   
   * `backupXML`: this parameter contains details about the snapshot, including
which snapshot mode to use, whether the snapshot is incremental from a previous
checkpoint, which disks participate in the snapshot, and the snapshot
destination.
   * `checkpointXML`: when this parameter is provided, Libvirt atomically
creates a checkpoint covering the same point in time as the backup.
   
   When using Libvirt's `domainBackupBegin` API, if the `backupXML` has an
`<incremental>` tag specifying the name of a valid checkpoint, an incremental
snapshot is created based on that checkpoint. Furthermore, the API requires the
volume to be attached to a running or paused VM, as it uses the VM's process
(the QEMU process in the hypervisor operating system) to execute the volume
snapshot.
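
   For illustration, a push-mode `backupXML` that references a previous
checkpoint, and the `checkpointXML` created atomically with it, could look
roughly like the sketch below (disk target, checkpoint names and paths are
hypothetical):

   ~~~
   <!-- backup.xml: push-mode incremental backup of disk vda -->
   <domainbackup mode="push">
     <incremental>checkpoint-2</incremental>
     <disks>
       <disk name="vda" backup="yes" type="file">
         <driver type="qcow2"/>
         <target file="/mnt/0a1b2c3d/snapshots/snap-3.qcow2"/>
       </disk>
     </disks>
   </domainbackup>

   <!-- checkpoint.xml: checkpoint covering the same point in time as the backup -->
   <domaincheckpoint>
     <name>checkpoint-3</name>
     <disks>
       <disk name="vda" checkpoint="bitmap"/>
     </disks>
   </domaincheckpoint>
   ~~~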
   
   Libvirt's checkpoints are always linked to a VM; this means that if we
undefine or migrate the VM, they will be lost. However, the bitmap on the
volume does not depend on the VM; thus, if we save the checkpoint metadata by
using the `checkpointDumpXml` API, we can later use this XML to recreate the
checkpoint on the VM after it is migrated or stopped/started on
ACS[^stop-start-vm]. Therefore, even if the VM is migrated or recreated, we can
continue to take incremental snapshots.
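
   For example, with `virsh` (the VM and checkpoint names are hypothetical),
the dump-and-redefine cycle would look roughly like this:

   ~~~
   # Persist the checkpoint metadata while the VM is still defined
   virsh checkpoint-dumpxml <vm-name> checkpoint-3 > checkpoint-3.xml

   # Later, after the VM is redefined/migrated, recreate only the metadata;
   # the bitmap itself already lives in the volume's qcow2 file
   virsh checkpoint-create <vm-name> checkpoint-3.xml --redefine
   ~~~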
   
   More information on the `domainBackupBegin` API can be found in the 
[official 
documentation](https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainBackupBegin);
also, more information on Libvirt's checkpoints can be found in the
[official 
documentation](https://www.libvirt.org/html/libvirt-libvirt-domain-checkpoint.html#virDomainCheckpointCreateXML).
   
   ### 2.0.2. Limitations
   
   This feature will use the Libvirt `domainBackupBegin` API that was 
introduced in version 7.2.0 and extended to allow incremental snapshots in
version 7.6.0; furthermore, the incremental snapshot API needs qemu 6.1. Thus, 
this feature will only be available in environments with Libvirt 7.6.0+ and 
qemu 6.1+. If the `kvm.incremental.snapshot` configuration is `true`, but the 
hosts do not have the necessary Libvirt and qemu versions, an error will be 
raised when creating a snapshot.
   
   As the snapshots do not contain the bitmaps that were used to create them, 
after reverting a volume using a snapshot, the volume will have no bitmaps; 
thus, we will need to start a new snapshot chain.
   
   Furthermore, this feature will only be available when using file-based 
storage, such as shared mount point (iSCSI and FC), NFS and local storage. 
Other storage types for KVM, such as CLVM and RBD, need different approaches to 
allow incremental backups; therefore, they will not be covered by this spec.
   
   # 2.1. Snapshot/Backup creation
   <section id="snap-creation"></section>
   
   The current snapshot creation process is summarized in the diagram below. 
Its main flaw is that we always copy the snapshot to a directory on the primary 
storage, even if we will later copy it to the secondary storage, doubling the 
strain on the storage systems.
   
   <img src="https://res.cloudinary.com/sc-clouds/image/upload/v1706624986/specs/cloudstack/kvm-incremental-snapshots/Old_snapshot_creation_s1o0gr.png"
        alt="create-snapshot-old"
        style="width: 100%; height: auto;">
   
   1. If the volume is attached to a running VM, take a disk-only snapshot by 
calling Libvirt's `DomainSnapshotCreateXML` API;
   2. Use qemu-img convert to copy the volume source file to a directory in the 
primary storage;
   3. If the volume is attached to a running VM, merge the temporary delta back
into the old volume source file;
   4. If snapshot backup is enabled, we create a backup by copying the snapshot 
to the secondary storage and then delete it from the primary storage.
   
   The proposed incremental snapshot creation workflow is summarized in the 
following diagram. We propose to optimize the current workflow, as well as add 
a new one that allows the creation of incremental snapshots.
   
   <img src="https://res.cloudinary.com/sc-clouds/image/upload/v1708696515/specs/cloudstack/kvm-incremental-snapshots/k7yfghgwuualip0unhzu.png"
        alt="create-snapshot"
        style="width: 100%; height: auto;">
   
   * If `kvm.incremental.snapshot` is false, we keep the old API usage and main 
logic, but we propose to copy the snapshot directly to the final destination, 
instead of always copying to the primary storage first.
   
   * If `kvm.incremental.snapshot` is true and the host does not have the 
minimum Libvirt and qemu versions, an exception will be thrown.
   
   * If `kvm.incremental.snapshot` is true and the host has the minimum Libvirt 
and qemu versions, the following workflow will be executed:
   
       1. If the volume is not attached to a running VM on ACS, we have to 
create a paused transient dummy VM to be able to call the `domainBackupBegin` 
API. Also, to use the previous checkpoints, we must recreate them on the dummy 
VM. After the snapshot process is done, we will destroy the dummy VM.
       2. If we are starting a snapshot chain, create a full snapshot by 
calling the `domainBackupBegin` API and not referencing any old checkpoints; 
else, we call the same API, but referencing the last checkpoint. Either way, we 
make the API create the snapshot directly on the correct storage based on the 
`snapshot.backup.to.secondary` configuration;
        3. If it isn't the first snapshot in the chain, we rebase it so that it
points to the absolute path of the previous incremental
snapshot[^absolute-path]. The path of the previous snapshot depends on the
`snapshot.backup.to.secondary` configuration: if it is `false`, the primary
storage path is always used, as the snapshots will be on the primary storage;
if it is `true`, the same logic applies, using the secondary storage path. A
sketch of steps 2 and 3 in terms of the underlying calls follows this list.
       4. Dump the created checkpoint, and save it to the correct storage 
depending on the `snapshot.backup.to.secondary` configuration.
       5. Edit the checkpoint dump, remove everything but the checkpoint name, 
creation time and the disk that was snapshotted; also, change the parent 
checkpoint to the correct one.
       6. Redefine the checkpoint on the VM.
       7. If a dummy VM was created, we destroy it.
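
   A rough sketch of steps 2 and 3 in terms of the underlying calls, assuming
the hypothetical `backup.xml`/`checkpoint.xml` illustrated earlier and
hypothetical snapshot paths:

   ~~~
   # Step 2: take an incremental snapshot based on the last checkpoint
   # (backup.xml references <incremental>checkpoint-2</incremental>)
   virsh backup-begin <vm-name> backup.xml checkpoint.xml

   # Step 3: rebase the new delta onto the absolute path of the previous
   # incremental snapshot without rewriting data (-u changes metadata only)
   qemu-img rebase -u -F qcow2 -b /mnt/0a1b2c3d/snapshots/snap-2.qcow2 \
       /mnt/0a1b2c3d/snapshots/snap-3.qcow2
   ~~~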
   
   The process of editing the checkpoint dump and then redefining it on the VM
is needed because, even though this metadata is not important to the backup,
Libvirt will validate it, if present, when recreating the checkpoints on other
VMs. Also, we must manually edit the checkpoint parent because Libvirt always
assumes that a new checkpoint is a child of the latest one, even if the
checkpoints are not connected.
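
   As an illustration, after trimming, the redefined checkpoint dump would be
reduced to something like the following (names and timestamp are hypothetical):

   ~~~
   <domaincheckpoint>
     <name>checkpoint-3</name>
     <parent>
       <name>checkpoint-2</name>
     </parent>
     <creationTime>1710334355</creationTime>
     <disks>
       <disk name="vda" checkpoint="bitmap" bitmap="checkpoint-3"/>
     </disks>
   </domaincheckpoint>
   ~~~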
   
   During the snapshot process, the volume will be in the `Snapshotting` state;
while the volume is in this state, no other operations, such as volume
attach/detach, can be performed on it. Also, if the volume is attached to a VM,
the snapshot job is queued alongside the other VM jobs; therefore, we do not
have to worry about the VM being stopped/started during the volume snapshot, as
jobs are processed sequentially for each given VM.
   
   At runtime, if the `kvm.incremental.snapshot` configuration is changed from
`false` to `true`, the next snapshot taken of a volume will begin a new
snapshot chain, that is, it will be a full snapshot and the later ones will be
incremental. If the configuration is changed from `true` to `false`, the
current snapshot chains of the volumes will not be continued, and future
snapshots will be full snapshots.
   
   As different clusters might have different Libvirt versions, the
`kvm.incremental.snapshot` configuration can be overridden at the cluster
level. If the `kvm.incremental.snapshot` configuration is true for a cluster
that does not have the needed Libvirt version, an error will be raised
informing the operator that the configuration should be set to false on this
cluster, as it does not support the feature. In any case, snapshot reversion
will be the same for any cluster, as it will still use the same APIs that are
used today.
   
   We propose to save the checkpoint as a file in the primary/secondary storage
instead of directly in the database because the `domainBackupBegin` API needs a
file as input; if we kept the checkpoint XML in the database, we would need to
create a temporary file anyway. Furthermore, the checkpoints are only useful if
their corresponding snapshot exists; if we lose the storage where the snapshot
is, the checkpoint becomes useless. Therefore, keeping them together in the
primary/secondary storage seems to be the best approach in the current scenario.
   
   To persist the checkpoint location in the database, a new column will be 
added to the `snapshot_store_ref` table: `kvm_checkpoint_path`. This column 
will have the `varchar(255)` type (same as the `install_path` column) and will 
be `null` by default. When an incremental snapshot is taken in KVM, its 
corresponding checkpoint path will be saved in this column. Also, another 
column will be added to the same table: `end_of_chain`; this column will have 
the `int(1) unsigned` type and will be 1 by default. This column will be used 
mainly when a process causes a chain to be severed and the next snapshot must 
be a full one, such as when restoring a snapshot.
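
   A sketch of the corresponding schema change (the exact upgrade script is an
implementation detail) would be:

   ~~~
   ALTER TABLE `cloud`.`snapshot_store_ref`
       ADD COLUMN `kvm_checkpoint_path` varchar(255) DEFAULT NULL,
       ADD COLUMN `end_of_chain` int(1) unsigned DEFAULT 1;
   ~~~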
   
   When working with incremental snapshots, the first snapshot in the snapshot 
chain will always be a full snapshot: this is needed as we must have something 
to start "incrementing" from. The maximum size of the snapshot chain on the 
primary/secondary storage will be limited by the `snapshot.delta.max` 
configuration; after this limit is reached, a new snapshot chain will be 
started, that is, the next snapshot will be a full snapshot. Also, to avoid 
having too many checkpoints on a VM, we will delete the old checkpoints when 
creating a new snapshot chain.
   
   The reason we propose a limited snapshot chain instead of an unlimited one
is that, while taking full snapshots is costly, the risk of eventually losing
the base snapshot, and therefore all the snapshots in the chain, increases with
the size of the chain. This approach has been tested and validated by the
industry; it is used in XenServer and in VMware with Veeam as the backup
provider, for example. Nonetheless, taking a full snapshot from time to time is
still far cheaper than always taking full snapshots.
   
   Let us take the following example: the user has a volume with 500 GB
allocated that grows 20 GB per day; they have a recurring snapshot policy set
to create snapshots every day and keep the last 7 days stored. Using full
snapshots, by the end of the week they will be using 3.92 TB of storage; using
incremental snapshots with `snapshot.delta.max` greater than 6, only 620 GB of
storage would be used.
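
   Breaking the numbers down:

   ~~~
   Full snapshots:        500 + 520 + 540 + 560 + 580 + 600 + 620 = 3920 GB ≈ 3.92 TB
   Incremental snapshots: 500 (full) + 6 × 20 (daily deltas)      =  620 GB
   ~~~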
   
   ### 2.1.1. Interactions between recurring snapshots/backups, configurations 
and account limits
   <section id="snap-interactions"></section>
   
   This section will give some examples to illustrate the interactions between 
the user's snapshot/backup limit, the `snapshot.delta.max` configuration and 
the `maxSnaps` parameter of the `createSnapshotPolicy` API. The examples are 
described in the table below. When "removing" incremental backups, they might
stay for a while in the primary/secondary storage; more details on incremental
backup deletion can be found in the <a href="#snap-deletion"
class="internal-link">snapshot deletion section</a>.
   
   
   | Use-case | `snapshot.delta.max` | `maxSnaps` | Account/Domain Limit | Behavior |
   | ------ | ------ | ------ | ------ | ------ |
   | 1 | 7 | 7 | 7 | When taking the 8th backup, as the `maxSnaps` limit was reached, the 1st one will be logically removed; as `snapshot.delta.max` was reached, the new backup will be the start of a new backing chain |
   | 2 | 7 | 5 | 7 | When taking the 6th backup, as the `maxSnaps` limit was reached, the 1st one will be logically removed; however, the backup will still be an incremental backup, and a new chain will only be started on the 8th backup, when `snapshot.delta.max` is reached |
   | 3 | 5 | 7 | 7 | When taking the 6th backup, as `snapshot.delta.max` was reached, the new backup will be the start of a new backing chain; however, the 1st backup will only be removed after `maxSnaps` is reached |
   | 4 | 7 | 7 | 5 | When taking the 6th backup, an error will be thrown, as the user has reached the account/domain snapshot limit |
   
   
   # 2.2. Snapshot/Backup Reversion
   
   The proposed new snapshot restore process is summarized in the diagram below.
   
   <img src="https://res.cloudinary.com/sc-clouds/image/upload/v1704718061/specs/cloudstack/kvm-incremental-snapshots/Backup_reversion1_3_nwwuwt.png"
        alt="snapshot-reversion"
        style="width: 100%; height: auto;">
   
   
   There are two possibilities when restoring snapshots:
   
   1. The current full snapshot restore process will stay the same: we copy the
snapshot from the correct storage and replace the VM's source file.
   2. Restoring an incremental snapshot is simple, as long as we keep the
backing file chain consistent, as described in the <a href="#snap-creation"
class="internal-link">snapshot/backup creation section</a>. We can use a single
`qemu-img convert` command to create a consolidated copy of the snapshot, that
is, the command will consolidate all the snapshot backing files and copy the
result to where we need it (see the sketch after this list).
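
   A minimal sketch of that consolidation, assuming hypothetical paths for the
snapshot chain and for the destination volume:

   ~~~
   # snap-3.qcow2 has snap-2.qcow2 (and, transitively, the full snapshot) as
   # backing files; convert flattens the whole chain into a standalone image
   qemu-img convert -O qcow2 /mnt/0a1b2c3d/snapshots/snap-3.qcow2 \
       /mnt/9f8e7d6c/<volume-uuid>
   ~~~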
   
   After restoring a snapshot, we cannot continue the incremental snapshot
chain; therefore, `end_of_chain` has to be marked as true on the latest
snapshot created. This way, when creating the next snapshot, we will know that
it must start a new chain.
   
   
   # 2.3. Template/Volume creation from snapshot/backup
   
   The current process of creating a template/volume from a snapshot can be 
adapted to allow creating from an incremental snapshot. The only difference is 
that we need to use `qemu-img convert` on the snapshot before sending the 
command to the SSVM, similar to what is currently done when the snapshot is not 
on the secondary storage. After the template/volume is created, we can remove 
the converted image from the secondary storage. The diagram below (taken from 
#5124) summarizes the current template/volume creation from snapshot/backup 
process. 
   
   <img src="https://res.cloudinary.com/sc-clouds/image/upload/v1702059856/specs/cloudstack/kvm-incremental-snapshots/122423092-0b685c00-cf64-11eb-8a73-f86032a09412_f9xkkd.jpg"
        alt="create-template-from-snapshot"
        style="width: 100%; height: auto;">
   
   
   # 2.4. Snapshot/Backup deletion
   <section id="snap-deletion"></section>
   
   The diagram below summarizes the snapshot deletion process:
   
   <img src="https://res.cloudinary.com/sc-clouds/image/upload/v1703772199/specs/cloudstack/kvm-incremental-snapshots/New_Incremental_Backup_Deletion_1_o3f57h.png"
        alt="snapshot-deletion"
        style="width: 100%; height: auto;">
   
   When deleting incremental snapshots, we have to check two things: whether
the snapshot has any ancestors and whether it has any descendants.
   
   After marking the snapshot as removed in the database, if it has any active
descendants, we will keep it in the primary/secondary storage until those
descendants are removed; only then will we delete the snapshot from the
storage. If the snapshot does not have any descendants, we delete it
immediately. We do this to preserve the descendants' ability to be restored;
otherwise, the backing chain would be broken and all the descendants would be
useless.
   
   After checking for descendants, we check whether the snapshot has any
ancestors; if it does, we delete any ancestors that were removed in the
database but were kept in storage.
   
   Checkpoint deletion is directly linked to the corresponding incremental
snapshot: we must keep the checkpoint until the snapshot is deleted; otherwise,
we will not be able to continue taking incremental snapshots after a VM
migration or volume detach, for example.
   
   # 2.5. Volume migration
   
   ### 2.5.1 Live volume migration
   
   When a volume is created via linked-clone, it has a source file and a
backing file; currently, when live migrating from NFS to NFS, the same
structure will be maintained on the new storage; otherwise, ACS consolidates
the source and backing files while migrating. Also, before migrating the
volume, ACS will check if the template is on the destination storage; if it is
not, it will be copied there, even though for consolidated volumes this is
unnecessary. Furthermore, the current live volume migration always migrates the
VM's root volume, even if the user only requested a data disk migration,
putting unnecessary strain on storage and network systems.
   
   Moreover, when live migrating a volume of a VM that has any volume on NFS
storage, if the volumes on NFS are not migrated, the migration will fail. This
happens because the current migration command does not set the
`VIR_MIGRATE_PARAM_MIGRATE_DISKS` parameter, which specifies which volumes
should be migrated alongside the VM migration; without this parameter, Libvirt
assumes that all the volumes should be migrated, raising an error when it tries
to overwrite the NFS volumes with themselves.
   
   Furthermore, when migrating to an NFS storage, ACS will validate if the 
destination host has access to the source storage. This causes an issue when 
migrating from local storage to NFS storage, as the destination host will never 
have direct access to the source host's local storage. As the current volume 
migration process has several inconsistencies, it will be normalized alongside 
this feature. The current volume migration workflow is summarized in the 
diagram below.
   
   <img src="https://res.cloudinary.com/sc-clouds/image/upload/v1703772418/specs/cloudstack/kvm-incremental-snapshots/Old_Volume_Migration_4_i6fzz1.png"
        alt="migrate-volume-old"
        style="width: 100%; height: auto;">
   
   We propose to normalize the migration behavior when migrating from 
file-based storage to file-based storage. We will always consolidate the volume 
with its backing file; thus, copying the template to the new storage will be 
unnecessary. This way the live volume migration will always have the same 
behavior. Furthermore, we will only migrate the VM's root volume when the user 
asks for it. Also, we will remove the special case of checking if the 
destination host has access to the source storage when the destination storage 
is NFS.   
   
   Moreover, we will change the migration API used from `virDomainMigrate2` to
`virDomainMigrate3`; this API allows us to set the
`VIR_MIGRATE_PARAM_MIGRATE_DISKS` parameter to tell Libvirt to only migrate the
volumes we want, therefore avoiding the aforementioned error with volumes on
NFS storage.
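
   ACS drives the migration through Libvirt's bindings rather than virsh, but
the equivalent virsh invocation illustrates the parameter; the host URI and
disk target below are hypothetical:

   ~~~
   # Only the data disk vdb is copied to the destination storage; volumes on
   # shared NFS are left in place instead of being "migrated" onto themselves
   virsh migrate <vm-name> qemu+tcp://<destination-host>/system \
       --live --persistent --copy-storage-all --migrate-disks vdb
   ~~~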
   
   As ACS's live volume migration also requires a VM migration on KVM, and
Libvirt's migrate command does not guarantee that the volume bitmaps will be
copied, after live migrating a volume we will have to start a new snapshot
chain. The new migration workflow is summarized in the diagram below.
   
   <img src="https://res.cloudinary.com/sc-clouds/image/upload/v1710334355/specs/cloudstack/kvm-incremental-snapshots/New_Live_Volume_Migration_plpjks.png"
        alt="migrate-volume-new"
        style="width: 100%; height: auto;">
   
   We will not allow volume migration when the volume has snapshots on the 
primary storage, as there are a few cases where this could bring 
inconsistencies. For example, if we live-migrate the VM and migrate the volume 
from local storage to a zone/cluster scope storage, the VM's destination host 
will not have access to the old snapshots, making them useless. This limitation
is already present in the current implementation, where all the snapshots of a
volume being migrated are listed from the database and, if any of them are not
located on the secondary storage, an exception is raised.
   
   
   ### 2.5.2 Cold volume migration
   
   When performing cold migration on a volume using KVM as a hypervisor, ACS 
will first use `qemu-img convert` to copy the volume to secondary storage. 
Then, the volume will be copied to the destination primary storage. The diagram 
below summarizes the cold migration workflow.
   
   <img src="https://res.cloudinary.com/sc-clouds/image/upload/v1703779206/specs/cloudstack/kvm-incremental-snapshots/cold_volume_migration_zzhm7b.png"
        alt="cold-volume-migration-old"
        style="width: 100%; height: auto;">
   
   The only change that we need is to add the `--bitmaps` parameter to the
`qemu-img convert` command used, so that the volume keeps its existing bitmaps;
otherwise, we would need to create a new backup chain for the next backup.
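
   For instance, with hypothetical paths:

   ~~~
   # Copy the volume to secondary storage while preserving its dirty bitmaps
   qemu-img convert -O qcow2 --bitmaps /mnt/0a1b2c3d/<volume-uuid> \
       /mnt/secondary/volumes/<volume-uuid>
   ~~~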
   
   # 2.6. Checkpoint management
   
   There are a few processes that will be tweaked to keep checkpoints
consistent (a sketch of the corresponding virsh operations follows the list):
   
   1. VM start: If the VM is being started with an attached volume that has
incremental snapshots, the volume's checkpoints must be recreated after the VM
is started; otherwise, we won't be able to continue the incremental snapshot
chain.
   2. VM migration: After the VM is migrated, the volume's checkpoints must 
also be recreated, for the same reason as 1.
   3. Volume attach: After attaching a volume to a VM, the volume's checkpoints 
must be recreated in the VM, for the same reason as 1.
   4. Volume detach: After detaching a volume from a VM, to avoid leaving 
useless metadata on the VM, the volume's checkpoints must be removed from the 
VM.
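
   In virsh terms, the recreation in cases 1-3 and the removal in case 4
roughly map to the calls below; using `--metadata` on deletion, so that the
bitmap stays on the volume, is an assumption about how the detach case would be
handled:

   ~~~
   # Cases 1-3: recreate the checkpoint metadata from the saved dump
   virsh checkpoint-create <vm-name> checkpoint-3.xml --redefine

   # Case 4: drop the checkpoint metadata from the VM on detach; --metadata
   # leaves the bitmap in the volume so the chain can continue after re-attach
   virsh checkpoint-delete <vm-name> checkpoint-3 --metadata
   ~~~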
   
   
   [^absolute-path]: We can do this because NFS storage is always mounted using
the same path across all hosts; the path is always `/mnt/<uuid>`, where
`<uuid>` is derived from the NFS host and path. For SharedMountPoint storages,
the path must also be the same across hosts.
   [^stop-start-vm]: When a VM is stopped on ACS with KVM as the hypervisor,
the VM actually gets undefined in Libvirt; later, when the VM is started, it
gets recreated.
   

