Here's my updated counterproposal for a backup API.

In comparison to v2 posted by Nikolay:
- changed terminology a bit: Nikolay's "BlockSnapshot" is now called a "Checkpoint", and "BlockExportStart/Stop" is now "BackupBegin/End"
- fleshed out more API descriptions
- better documentation of proposed XML, for both checkpoints and backup

Barring any major issues turned up during review, I've already started coding this into libvirt, with a goal of getting an implementation ready for review this month.

Each domain will gain the ability to track a tree of Checkpoint
objects (we've previously mentioned the term "system checkpoint" in
the <domainsnapshot> XML as the combination of disk and RAM state; so
I'll use the term "disk checkpoint" in prose as needed, to make it
obvious that the checkpoints described here do not include RAM state).
I will use the virDomainSnapshot API as a guide, meaning that we will
track a tree of checkpoints where each checkpoint can have 0 or 1
parent checkpoints, in part because I plan to reuse a lot of the
snapshot code as a starting point for implementing checkpoints.

Qemu does NOT track a relationship between internal snapshots, so
libvirt has to manage the backing tree all by itself; by the same
argument, if qemu does not add a parent relationship to dirty bitmaps,
libvirt can probably manage everything itself by copying how it
manages parent relationships between internal snapshots.  However, I
think it will be far easier for libvirt to exploit qemu dirty bitmaps
if qemu DOES add bitmap tracking; particularly if qemu adds ways to
easily compose a temporary bitmap that is the union of one bitmap plus
a fixed number of its parents.

Design-wise, libvirt will manage things so that there is only one
enabled dirty-bitmap per qcow2 image at a time, when no backup
operation is in effect.  There is a notion of a current (or most
recent) checkpoint; when a new checkpoint is created, that becomes the
current one and the former checkpoint becomes the parent of the new
one.  If there is no current checkpoint, then there is no active dirty
bitmap managed by libvirt.

Representing things on a timeline, when a guest is first created,
there is no dirty bitmap; later, the checkpoint "check1" is created,
which in turn creates "bitmap1" in the qcow2 image for all changes
past that point; when a second checkpoint "check2" is created, a qemu
transaction is used to create and enable the new "bitmap2" bitmap at
the same time as disabling "bitmap1" bitmap.  (Actually, it's probably
easier to name the bitmap in the qcow2 file with the same name as the
Checkpoint object being tracked in libvirt, but for discussion
purposes, it's less confusing if I use separate names for now.)

creation ....... check1 ....... check2 ....... active
        no bitmap       bitmap1        bitmap2
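
The bookkeeping on that timeline can be sketched in a few lines of C. This is only an illustration of the rule "the former current checkpoint becomes the parent, and exactly one bitmap stays enabled"; the struct and function names (checkpoint_tree, tree_create, and so on) are made up for this sketch and are not proposed libvirt internals, and the single assignment pair stands in for what would really be a qemu transaction.

```c
#include <stddef.h>
#include <stdio.h>

#define MAX_CHECKPOINTS 16
#define NAME_LEN 32

/* One libvirt-side checkpoint and the state of its qcow2 bitmap. */
struct checkpoint {
    char name[NAME_LEN];
    int parent;          /* index of parent checkpoint, or -1 for a root */
    int bitmap_enabled;  /* nonzero while the bitmap records guest writes */
};

struct checkpoint_tree {
    struct checkpoint items[MAX_CHECKPOINTS];
    size_t count;
    int current;         /* index of current checkpoint, or -1 if none */
};

void tree_init(struct checkpoint_tree *t)
{
    t->count = 0;
    t->current = -1;
}

/*
 * Create a checkpoint: the new bitmap is enabled and the old current
 * bitmap disabled together (standing in for the qemu transaction), and
 * the old current checkpoint becomes the parent of the new one.
 * Returns the new index, or -1 if the table is full.
 */
int tree_create(struct checkpoint_tree *t, const char *name)
{
    struct checkpoint *c;

    if (t->count >= MAX_CHECKPOINTS)
        return -1;
    c = &t->items[t->count];
    snprintf(c->name, sizeof(c->name), "%s", name);
    c->parent = t->current;
    c->bitmap_enabled = 1;
    if (t->current >= 0)
        t->items[t->current].bitmap_enabled = 0;
    t->current = (int)t->count;
    t->count++;
    return t->current;
}

/* Count enabled bitmaps; the design keeps this at 0 or 1. */
size_t tree_enabled_bitmaps(const struct checkpoint_tree *t)
{
    size_t i, n = 0;

    for (i = 0; i < t->count; i++)
        n += t->items[i].bitmap_enabled ? 1 : 0;
    return n;
}
```

Creating "check1" and then "check2" in this model leaves check2 as current with check1 as its parent, and only check2's bitmap enabled.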

When a user wants to create a backup, they select which point in time
the backup starts from; the default value NULL represents a full
backup (all content since disk creation to the point in time of the
backup call, no bitmap is needed, use sync=full for push model or
sync=none for the pull model); any other value represents the name of
a checkpoint to use as an incremental backup (all content from the
checkpoint to the point in time of the backup call; libvirt forms a
temporary bitmap as needed, then uses sync=incremental for push model
or sync=none plus exporting the bitmap for the pull model).  For
example, requesting an incremental backup from "check2" can just reuse
"bitmap2", but requesting an incremental backup from "check1" requires
the computation of the bitmap containing the union of "bitmap1" and
"bitmap2".
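
To make the "temporary bitmap" idea concrete, here is a rough C sketch of the union step. It assumes a linear chain of checkpoints (as in the timeline above) and models each dirty bitmap as a small fixed array of 64-bit words; the names dirty_bitmap and bitmap_union_since are invented for the example, and real qemu bitmaps have their own granularity and storage.

```c
#include <stddef.h>
#include <stdint.h>

#define BITMAP_WORDS 4   /* 256 tracked clusters; an arbitrary demo size */

struct dirty_bitmap {
    uint64_t words[BITMAP_WORDS];
};

/* Mark one cluster dirty; out-of-range clusters are ignored. */
void bitmap_set(struct dirty_bitmap *b, size_t cluster)
{
    if (cluster < BITMAP_WORDS * 64)
        b->words[cluster / 64] |= UINT64_C(1) << (cluster % 64);
}

/* Return 1 if the cluster is marked dirty, 0 otherwise. */
int bitmap_test(const struct dirty_bitmap *b, size_t cluster)
{
    if (cluster >= BITMAP_WORDS * 64)
        return 0;
    return (int)((b->words[cluster / 64] >> (cluster % 64)) & 1);
}

/*
 * Build the temporary bitmap for an incremental backup that starts at
 * checkpoint index `from`.  chain[] holds one bitmap per checkpoint,
 * oldest first, so chain[count - 1] is the currently enabled bitmap.
 * Returns 0 on success, -1 if `from` is out of range.
 */
int bitmap_union_since(const struct dirty_bitmap *chain, size_t count,
                       size_t from, struct dirty_bitmap *out)
{
    size_t i, w;

    if (from >= count)
        return -1;
    for (w = 0; w < BITMAP_WORDS; w++)
        out->words[w] = 0;
    for (i = from; i < count; i++)
        for (w = 0; w < BITMAP_WORDS; w++)
            out->words[w] |= chain[i].words[w];
    return 0;
}
```

With chain[0] as "bitmap1" and chain[1] as "bitmap2", a backup from "check2" is from = 1 (just bitmap2), while a backup from "check1" is from = 0 (the union of both).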

Libvirt will always create a new bitmap when starting a backup
operation, whether or not the user requests that a checkpoint be
created.  Most users that want incremental backup sequences will
create a new checkpoint every time they do a backup; the new bitmap
that libvirt creates is then associated with that new checkpoint, and
even after the backup operation completes, the new bitmap remains in
the qcow2 file.  But it is also possible to request a backup without a
new checkpoint (it merely means that it is not possible to create a
subsequent incremental backup from the backup just started); in that
case, libvirt will have to take care of merging the new bitmap back
into the previous one at the end of the backup operation.

I think that it should be possible to run multiple backup operations
in parallel in the long run.  But in the interest of getting a proof
of concept implementation out quickly, it's easier to state that for
the initial implementation, libvirt supports at most one backup
operation at a time (to do another backup, you have to wait for the
current one to complete, or else abort and abandon the current
one). As there is only one backup job running at a time, the existing
virDomainGetJobInfo()/virDomainGetJobStats() will be able to report
statistics about the job (insofar as such statistics are available).
But in preparation for the future, when libvirt does add parallel job
support, starting a backup job will return a job id; and presumably
we'd add a new virDomainGetJobStatsByID() for grabbing statistics of
an arbitrary (rather than the most-recently-started) job.

Since live migration also acts as a job visible through
virDomainGetJobStats(), I'm going to treat an active backup job and
live migration as mutually exclusive.  This is particularly true when
we have a pull model backup ongoing: if qemu on the source is acting
as an NBD server, you can't migrate away from that qemu and tell the
NBD client to reconnect to the NBD server on the migration
destination.  So, to perform a migration, you have to cancel any
pending backup operations.  Conversely, if a migration job is
underway, it will not be possible to start a new backup job until
migration completes.  However, we DO need to modify migration to
ensure that any persistent bitmaps are migrated.
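
The job rules from the last two paragraphs (one backup at a time, backup and migration mutually exclusive, a job id handed back for future parallel support) could be modeled roughly like this. The domain_jobs struct and the jobs_* helpers are hypothetical names for this sketch only, not libvirt's actual job tracking.

```c
/* Which kind of long-running job, if any, currently owns the domain. */
enum domain_job { JOB_NONE = 0, JOB_BACKUP, JOB_MIGRATION };

struct domain_jobs {
    enum domain_job active;
    int next_id;   /* id handed to the next backup job */
};

void jobs_init(struct domain_jobs *j)
{
    j->active = JOB_NONE;
    j->next_id = 1;
}

/*
 * Start a backup job.  Returns a non-negative job id on success, or -1
 * if a backup or a migration is already running.
 */
int jobs_backup_begin(struct domain_jobs *j)
{
    if (j->active != JOB_NONE)
        return -1;
    j->active = JOB_BACKUP;
    return j->next_id++;
}

/* Start a migration.  Returns 0 on success, -1 if any job is active. */
int jobs_migration_begin(struct domain_jobs *j)
{
    if (j->active != JOB_NONE)
        return -1;
    j->active = JOB_MIGRATION;
    return 0;
}

/* End (or abort) whichever job is active. */
void jobs_end(struct domain_jobs *j)
{
    j->active = JOB_NONE;
}
```

Lifting the one-job restriction later would mostly mean replacing the single `active` field with a table keyed by job id, which is why the id is returned from the start.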

I also think that in the long run, it should be possible to start a
backup operation, and while it is still ongoing, create a new external
snapshot, and still be able to coordinate the transfer of bitmaps from
the old image to the new overlay.  But for the first implementation,
it's probably easiest to state that an ongoing backup prevents
creation of a new snapshot.  However, a current checkpoint (which
means we DO have an active bitmap, even if there is no active backup)
DOES need to be transferred to the new overlay, and conversely, a block
commit job needs to merge all bitmaps from the old overlay to the
backing file that is now becoming the active layer again.  I don't
know if qemu has primitives for this in place yet; and if it does not,
the only conservative thing we can do in the initial implementation is
to state that the use of checkpoints is exclusive from the use of
snapshots (using one prevents the use of the other).  Hopefully we
don't have to stay in that state for long.

For now, a user wanting guest I/O to be at a safe point can manually
use virDomainFSFreeze()/virDomainBackupBegin()/virDomainFSThaw(); we
may decide down the road to use the flags argument of
virDomainBackupBegin() to provide automatic guest quiescing through one
API (I'm not doing it right away, because we have to worry about
undoing effects if we fail to thaw after starting the backup).
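
As a sketch of the manual sequence and of the awkward failure case, the following C fragment drives three callbacks in freeze / begin / thaw order and reports a thaw failure separately from a begin failure. The quiesce_ops callbacks are hypothetical stand-ins for virDomainFSFreeze(), virDomainBackupBegin() and virDomainFSThaw() so that the example stays self-contained; the demo_* functions exist only to exercise the ordering.

```c
#include <stddef.h>

/* Hypothetical hooks; each returns a negative value on failure. */
struct quiesce_ops {
    int (*freeze)(void *opaque);
    int (*backup_begin)(void *opaque);   /* returns a job id on success */
    int (*thaw)(void *opaque);
};

enum quiesce_result {
    QUIESCE_OK = 0,
    QUIESCE_FREEZE_FAILED,
    QUIESCE_BEGIN_FAILED,
    QUIESCE_THAW_FAILED   /* guest may still be frozen; check *job_id */
};

/*
 * Freeze, start the backup, then always attempt to thaw.  *job_id is
 * set to the job id if the backup started, or -1 otherwise, so that a
 * caller seeing QUIESCE_THAW_FAILED can tell whether a job needs to be
 * aborted as part of cleaning up.
 */
enum quiesce_result quiesced_backup_begin(const struct quiesce_ops *ops,
                                          void *opaque, int *job_id)
{
    int id, thawed;

    *job_id = -1;
    if (ops->freeze(opaque) < 0)
        return QUIESCE_FREEZE_FAILED;
    id = ops->backup_begin(opaque);
    thawed = ops->thaw(opaque);
    if (id >= 0)
        *job_id = id;
    if (thawed < 0)
        return QUIESCE_THAW_FAILED;
    if (id < 0)
        return QUIESCE_BEGIN_FAILED;
    return QUIESCE_OK;
}

/* Demo stand-ins that record the call order in a small log. */
struct demo_log {
    char calls[8];
    size_t n;
    int fail_begin;
    int fail_thaw;
};

static void demo_record(struct demo_log *l, char c)
{
    if (l->n < sizeof(l->calls))
        l->calls[l->n++] = c;
}

int demo_freeze(void *opaque)
{
    demo_record(opaque, 'F');
    return 0;
}

int demo_begin(void *opaque)
{
    struct demo_log *l = opaque;

    demo_record(l, 'B');
    return l->fail_begin ? -1 : 1;
}

int demo_thaw(void *opaque)
{
    struct demo_log *l = opaque;

    demo_record(l, 'T');
    return l->fail_thaw ? -1 : 0;
}
```

The QUIESCE_THAW_FAILED case with a valid job id is exactly the situation that makes a single-call flag harder to get right than the manual three-call sequence.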

So, to summarize, creating a backup will involve the following new APIs:

 * virDomainBackupBegin:
 * @domain: a domain object
 * @diskXml: description of storage to utilize and expose during
 *           the backup, or NULL
 * @checkpointXml: description of a checkpoint to create, or NULL
 * @flags: not used yet, pass 0
 * Start a point-in-time backup job for the specified disks of a
 * running domain.
 * A backup job is mutually exclusive with domain migration
 * (particularly when the job sets up an NBD export, since it is not
 * possible to tell any NBD clients about a server migrating between
 * hosts).  For now, backup jobs are also mutually exclusive with any
 * other block job on the same device, although this restriction may
 * be lifted in a future release. Progress of the backup job can be
 * tracked via virDomainGetJobStats(). The job remains active until a
 * subsequent call to virDomainBackupEnd(), even if it no longer has
 * anything to copy.
 * There are two fundamental backup approaches.  The first, called a
 * push model, instructs the hypervisor to copy the state of the guest
 * disk to the designated storage destination (which may be on the
 * local file system or a network device); in this mode, the
 * hypervisor writes the content of the guest disk to the destination,
 * then emits VIR_DOMAIN_EVENT_ID_BLOCK_JOB_2 when the backup is
 * either complete or failed (the backup image is invalid if the job
 * is ended prior to the event being emitted).  The second, called a
 * pull model, instructs the hypervisor to expose the state of the
 * guest disk over an NBD export; a third-party client can then
 * connect to this export, and read whichever portions of the disk it
 * desires.  In this mode, there is no event; libvirt has to be
 * informed when the third-party NBD client is done and the backup
 * resources can be released.
 * The @diskXml parameter is optional but usually provided, and
 * contains details about the backup, including which backup mode to
 * use, whether the backup is incremental from a previous checkpoint,
 * which disks participate in the backup, the destination for a push
 * model backup, and the temporary storage and NBD server details for
 * a pull model backup.  If omitted, the backup attempts to default to
 * a push mode full backup of all disks, where libvirt generates a
 * filename for each disk by appending a suffix of a timestamp in
 * seconds since the Epoch.  virDomainBackupGetXMLDesc() can be called
 * to learn the actual values selected.  For more information, see
 * formatcheckpoint.html#BackupAttributes.
 * The @checkpointXml parameter is optional; if non-NULL, then libvirt
 * behaves as if virDomainCheckpointCreateXML() were called with
 * @checkpointXml, atomically covering the same guest state that will
 * be part of the backup.  The creation of a new checkpoint allows for
 * future incremental backups.
 * Returns a non-negative job id on success, or negative on failure.
 * This operation returns quickly, such that a user can choose to
 * start a backup job between virDomainFSFreeze() and
 * virDomainFSThaw() in order to create the backup while guest I/O is
 * quiesced.
int virDomainBackupBegin(virDomainPtr domain, const char *diskXml,
                         const char *checkpointXml, unsigned int flags);

Note that this layout says that all disks participating in the backup
job share the same incremental checkpoint as their starting point
(no way to have one backup job where disk A copies data since check1
while disk B copies data since check2).  If we need the latter, then
we could get rid of the 'incremental' parameter, and instead have each
<disk> element within checkpointXml call out an optional <checkpoint>
name as its starting point.  Also, qemu supports exposing multiple
disks through a single NBD server (you then connect multiple clients
to the one server to grab state from each disk).  So the NBD details
are listed in parallel to the <disks>.  Note that since a backup is
NOT a guest-visible action, the backup job does not alter the normal
<domain> XML.

 * virDomainBackupGetXMLDesc:
 * @domain: a domain object
 * @id: the id of an active backup job previously started with
 *      virDomainBackupBegin()
 * @flags: not used yet, pass 0
 * In some cases, a user can start a backup job without supplying all
 * details, and rely on libvirt to fill in the rest (for example,
 * selecting the port used for an NBD export). This API can then be
 * used to learn what default values were chosen.
 * Returns a NUL-terminated UTF-8 encoded XML instance, or NULL in
 * case of error.  The caller must free() the returned value.
char *
virDomainBackupGetXMLDesc(virDomainPtr domain, int id,
                          unsigned int flags);

 * virDomainBackupEnd:
 * @domain: a domain object
 * @id: the id of an active backup job previously started with
 *      virDomainBackupBegin()
 * @flags: bitwise-OR of supported virDomainBackupEndFlags
 * Conclude a point-in-time backup job @id on the given domain.
 * If the backup job uses the push model, but the event marking that
 * all data has been copied has not yet been emitted, then the command
 * fails unless @flags includes VIR_DOMAIN_BACKUP_END_ABORT.  If the
 * event has been issued, or if the backup uses the pull model, the
 * flag has no effect.
 * Returns 0 on success and -1 on failure.
int virDomainBackupEnd(virDomainPtr domain, int id, unsigned int flags);
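
The flag handling described for virDomainBackupEnd() boils down to a small decision table, sketched below. The backup_job struct, the backup_end() helper and the BACKUP_END_ABORT constant are illustrative stand-ins (the latter for VIR_DOMAIN_BACKUP_END_ABORT); rejecting unknown flag bits is an assumption borrowed from how libvirt APIs usually treat flags, not something spelled out above.

```c
enum backup_mode { BACKUP_PUSH, BACKUP_PULL };

/* Stand-in for VIR_DOMAIN_BACKUP_END_ABORT in this sketch. */
#define BACKUP_END_ABORT (1u << 0)

struct backup_job {
    enum backup_mode mode;
    int completion_event_seen;  /* push model: block job event emitted */
    int active;                 /* nonzero until the job is ended */
    int aborted;                /* set if ended before the push completed */
};

/*
 * End a backup job.  Returns 0 on success and -1 on failure.
 *  - push model, event not yet seen: fails unless ABORT is passed, and
 *    with ABORT the destination image must be treated as invalid;
 *  - push model after the event, or pull model: ABORT has no effect.
 */
int backup_end(struct backup_job *job, unsigned int flags)
{
    if (!job->active)
        return -1;
    if (flags & ~BACKUP_END_ABORT)
        return -1;   /* assumed: unknown flag bits are rejected */
    if (job->mode == BACKUP_PUSH && !job->completion_event_seen) {
        if (!(flags & BACKUP_END_ABORT))
            return -1;
        job->aborted = 1;
    }
    job->active = 0;
    return 0;
}
```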

 * virDomainCheckpointCreateXML:
 * @domain: a domain object
 * @xmlDesc: description of the checkpoint to create
 * @flags: bitwise-OR of supported virDomainCheckpointCreateFlags
 * Create a new checkpoint using @xmlDesc on a running @domain.
 * Typically, it is more common to create a new checkpoint as part of
 * kicking off a backup job with virDomainBackupBegin(); however, it
 * is also possible to start a checkpoint without a backup.
 * See formatcheckpoint.html#CheckpointAttributes document for more
 * details on @xmlDesc.
 * If @flags includes VIR_DOMAIN_CHECKPOINT_CREATE_REDEFINE, then this
 * is a request to reinstate checkpoint metadata that was previously
 * discarded, rather than creating a new checkpoint.  When redefining
 * checkpoint metadata, the current checkpoint will not be altered
 * present.  It is an error to request the
 * the domain's disk images are modified according to @xmlDesc, but
 * then the just-created checkpoint has its metadata deleted.  This
 * flag is incompatible with VIR_DOMAIN_CHECKPOINT_CREATE_REDEFINE.
 * Returns an (opaque) new virDomainCheckpointPtr on success, or NULL
 * on failure.
virDomainCheckpointPtr
virDomainCheckpointCreateXML(virDomainPtr domain, const char *xmlDesc,
                             unsigned int flags);

 * virDomainCheckpointDelete:
 * @checkpoint: the checkpoint to remove
 * @flags: bitwise-OR of supported virDomainCheckpointDeleteFlags
 * Removes a checkpoint from the domain.
 * When removing a checkpoint, the record of which portions of the
 * disk were dirtied after the checkpoint will be merged into the
 * record tracked by the parent checkpoint, if any.  Likewise, if the
 * checkpoint being deleted was the current checkpoint, the parent
 * checkpoint becomes the new current checkpoint.
 * any checkpoint metadata tracked by libvirt is removed while keeping
 * the checkpoint contents intact; if a hypervisor does not require
 * any libvirt metadata to track checkpoints, then this flag is
 * silently ignored.
 * Returns 0 on success, -1 on error.
int
virDomainCheckpointDelete(virDomainCheckpointPtr checkpoint,
                          unsigned int flags);
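
A rough model of the delete semantics, under some simplifying assumptions: each checkpoint's dirty record is a single 64-bit word, checkpoints live in a fixed table, and children of the deleted checkpoint are re-parented to its parent (that last part mirrors snapshot deletion and is my assumption, not something stated in the description above). The cp, cp_set and cp_delete names are invented for the sketch.

```c
#include <stdint.h>

#define MAX_CP 8

struct cp {
    int in_use;
    int parent;       /* index of the parent checkpoint, or -1 if none */
    uint64_t dirty;   /* one bit per cluster; 64 clusters for the demo */
};

struct cp_set {
    struct cp items[MAX_CP];
    int current;      /* index of the current checkpoint, or -1 */
};

/*
 * Delete checkpoint idx.  Its dirty record is merged into the parent's
 * record if there is a parent (otherwise it is simply dropped), any
 * children are re-parented, and if it was current, the parent becomes
 * the new current checkpoint.  Returns 0 on success, -1 on error.
 */
int cp_delete(struct cp_set *s, int idx)
{
    int i, parent;

    if (idx < 0 || idx >= MAX_CP || !s->items[idx].in_use)
        return -1;
    parent = s->items[idx].parent;
    if (parent >= 0)
        s->items[parent].dirty |= s->items[idx].dirty;
    for (i = 0; i < MAX_CP; i++)
        if (s->items[i].in_use && s->items[i].parent == idx)
            s->items[i].parent = parent;
    if (s->current == idx)
        s->current = parent;
    s->items[idx].in_use = 0;
    return 0;
}
```

Merging into the parent is what keeps an incremental backup from the parent checkpoint correct after the deletion: the parent's record then covers everything the deleted checkpoint had been tracking.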

// Many additional functions copying heavily from virDomainSnapshot*:

int
virDomainCheckpointList(virDomainPtr domain,
                        virDomainCheckpointPtr **checkpoints,
                        unsigned int flags);

char *
virDomainCheckpointGetXMLDesc(virDomainCheckpointPtr checkpoint,
                              unsigned int flags);

virDomainCheckpointPtr
virDomainCheckpointLookupByName(virDomainPtr domain,
                                const char *name,
                                unsigned int flags);

const char *
virDomainCheckpointGetName(virDomainCheckpointPtr checkpoint);

virDomainPtr
virDomainCheckpointGetDomain(virDomainCheckpointPtr checkpoint);

virConnectPtr
virDomainCheckpointGetConnect(virDomainCheckpointPtr checkpoint);

int
virDomainHasCurrentCheckpoint(virDomainPtr domain, unsigned int flags);

virDomainCheckpointPtr
virDomainCheckpointCurrent(virDomainPtr domain, unsigned int flags);

virDomainCheckpointPtr
virDomainCheckpointGetParent(virDomainCheckpointPtr checkpoint,
                             unsigned int flags);

int
virDomainCheckpointIsCurrent(virDomainCheckpointPtr checkpoint,
                             unsigned int flags);

int
virDomainCheckpointRef(virDomainCheckpointPtr checkpoint);

int
virDomainCheckpointFree(virDomainCheckpointPtr checkpoint);

int
virDomainCheckpointListChildren(virDomainCheckpointPtr checkpoint,
                                virDomainCheckpointPtr **children,
                                unsigned int flags);

Notably, none of the older racy list functions, like
virDomainSnapshotNum, virDomainSnapshotNumChildren, or
virDomainSnapshotListChildrenNames, get a checkpoint counterpart; also,
for now, there is no revert support like virDomainRevertToSnapshot.

Eventually, if we add a way to roll back to the state recorded in an
earlier bitmap, we'll want to tell libvirt that it needs to create a
new bitmap as a child of an existing (non-current) checkpoint.  That
is, if we have:

check1 .... check2 .... active
     bitmap1     bitmap2

and created a backup at the same time as check2, then when we later
roll back to the state of that backup, we would want to end writes to
bitmap2 and declare that check2 is no longer current, and create a new
current check3 with associated bitmap3 and parent check1 to track all
writes since the point of the revert.  Until then, I don't think it's
possible to have more than one child without manually using the
REDEFINE flag to create such scenarios; but the API should not lock us
out of supporting multiple children in the future.

Here's my proposal for user-facing XML documentation, based on

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html>
<html xmlns="";>
    <h1>Checkpoint and Backup XML format</h1>

    <ul id="toc"></ul>

    <h2><a id="CheckpointAttributes">Checkpoint XML</a></h2>

      Libvirt is able to facilitate incremental backups by tracking
      disk checkpoints, or points in time against which it is easy to
      compute which portion of the disk has changed.  Given a full
      backup (a backup created from the creation of the disk to a
      given point in time, coupled with the creation of a disk
      checkpoint at that time), and an incremental backup (a backup
      created from just the dirty portion of the disk between the
      first checkpoint and the second backup operation), it is
      possible to do an offline reconstruction of the state of the
      disk at the time of the second backup, without having to copy as
      much data as a second full backup would require.  Most disk
      checkpoints are created in concert with a backup,
      via <code>virDomainBackupBegin()</code>; however, libvirt also
      exposes enough support to create disk checkpoints independently
      from a backup operation,
      via <code>virDomainCheckpointCreateXML()</code>.
      Attributes of libvirt checkpoints are stored as child elements of
      the <code>domaincheckpoint</code> element.  At checkpoint creation
      time, normally only the <code>name</code>, <code>description</code>,
      and <code>disks</code> elements are settable; the rest of the
      fields are ignored on creation, and will be filled in by
      libvirt for informational purposes
      by <code>virDomainCheckpointGetXMLDesc()</code>.  However, when
      redefining a checkpoint,
      with the <code>VIR_DOMAIN_CHECKPOINT_CREATE_REDEFINE</code> flag
      of <code>virDomainCheckpointCreateXML()</code>, all of the XML
      described here is relevant.
      Checkpoints are maintained in a hierarchy.  A domain can have a
      current checkpoint, which is the most recent checkpoint compared to
      the current state of the domain (although a domain might have
      checkpoints without a current checkpoint, if checkpoints have been
      deleted in the meantime).  Creating or reverting to a checkpoint
      sets that checkpoint as current, and the prior current checkpoint is
      the parent of the new checkpoint.  Branches in the hierarchy can
      be formed by reverting to a checkpoint with a child, then creating
      another checkpoint.
      The top-level <code>domaincheckpoint</code> element may contain
      the following elements:
      <dd>The name for this checkpoint.  If the name is specified when
        initially creating the checkpoint, then the checkpoint will have
        that particular name.  If the name is omitted when initially
        creating the checkpoint, then libvirt will make up a name for
        the checkpoint, based on the time when it was created.
      <dd>A human-readable description of the checkpoint.  If the
        description is omitted when initially creating the checkpoint,
        then this field will be empty.
      <dd>On input, this is an optional listing of specific
        instructions for disk checkpoints; it is needed when making a
        checkpoint on only a subset of the disks associated with a
        domain (in particular, since qemu checkpoints require qcow2
        disks, this element may be needed on input for excluding guest
        disks that are not in qcow2 format); if omitted on input, then
        all disks participate in the checkpoint.  On output, this is
        fully populated to show the state of each disk in the
        checkpoint.  This element has a list of <code>disk</code>
        sub-elements, describing anywhere from one to all of the disks
        associated with the domain.
          <dd>This sub-element describes the checkpoint properties of
            a specific disk.  The attribute <code>name</code> is
            mandatory, and must match either the <code>&lt;target
            dev='name'/&gt;</code> or an unambiguous <code>&lt;source
            file='name'/&gt;</code> of one of
            the <a href="formatdomain.html#elementsDisks">disk
            devices</a> specified for the domain at the time of the
            checkpoint.  The attribute <code>checkpoint</code> is
            optional on input; possible values are <code>no</code>
            when the disk does not participate in this checkpoint;
            or <code>bitmap</code> if the disk will track all changes
            since the creation of this checkpoint via a bitmap, in
            which case another attribute <code>bitmap</code> will be
            the name of the tracking bitmap (defaulting to the
            checkpoint name).
      <dd>The time this checkpoint was created.  The time is specified
        in seconds since the Epoch, UTC (i.e. Unix time).  Readonly.
      <dd>The parent of this checkpoint.  If present, this element
        contains exactly one child element, name.  This specifies the
        name of the parent checkpoint of this one, and is used to
        represent trees of checkpoints.  Readonly.
      <dd>The inactive <a href="formatdomain.html">domain
        configuration</a> at the time the checkpoint was created.

    <h2><a id="BackupAttributes">Backup XML</a></h2>

      Creating a backup, whether full or incremental, is done
      via <code>virDomainBackupBegin()</code>, which takes an XML
      description of the actions to perform.  There are two general
      modes for backups: a push mode (where the hypervisor writes out
      the data to the destination file, which may be local or remote),
      and a pull mode (where the hypervisor creates an NBD server that
      a third-party client can then read as needed, and which requires
      the use of temporary storage, typically local, until the backup
      is complete).
      The instructions for beginning a backup job are provided as
      attributes and elements of the
      top-level <code>domainbackup</code> element.  This element
      includes an optional attribute <code>mode</code> which can be
      either "push" or "pull" (default push).  Where elements are
      optional on creation, <code>virDomainBackupGetXMLDesc()</code>
      can be used to see the actual values selected (for example,
      learning which port the NBD server is using in the pull model,
      or what file names libvirt generated when none were supplied).
      The following child elements are supported:
      <dd>Optional. If this element is present, it must name an
        existing checkpoint of the domain, which will be used to make
        this backup an incremental one (in the push model, only
        changes since the checkpoint are written to the destination;
        in the pull model, the NBD server uses the
        NBD_OPT_SET_META_CONTEXT extension to advertise to the client
        which portions of the export contain changes since the
        checkpoint).  If omitted, a full backup is performed.
      <dd>Present only for a pull mode backup.  Contains the same
        attributes as the <code>protocol</code> element of a disk
        attached via NBD in the domain (such as transport, socket,
        name, port, or tls), necessary to set up an NBD server that
        exposes the content of each disk at the time the backup
        started.
      <dd>This is an optional listing of instructions for disks
        participating in the backup (if omitted, all disks
        participate, and libvirt attempts to generate filenames by
        appending the current timestamp as a suffix). When provided on
        input, disks omitted from the list do not participate in the
        backup.  On output, the list is present but contains only the
        disks participating in the backup job.  This element has a
        list of <code>disk</code> sub-elements, describing anywhere
        from one to all of the disks associated with the domain.
          <dd>This sub-element describes the backup properties of
            a specific disk.  The attribute <code>name</code> is
            mandatory, and must match either the <code>&lt;target
            dev='name'/&gt;</code> or an unambiguous <code>&lt;source
            file='name'/&gt;</code> of one of
            the <a href="formatdomain.html#elementsDisks">disk
            devices</a> specified for the domain at the time of the
            checkpoint.  The optional attribute <code>type</code> can
            be <code>file</code>, <code>block</code>,
            or <code>network</code>; similar to a disk declaration
            for a domain, it controls what additional sub-elements are
            needed to describe the destination (such
            as <code>protocol</code> for a network destination).  In
            push mode backups, the primary sub-element
            is <code>target</code>; in pull mode, the primary sub-element
            is <code>scratch</code>; but either way,
            the primary sub-element describes the file name to be used
            during the backup operation, similar to
            the <code>source</code> sub-element of a domain disk. An
            optional sub-element <code>driver</code> can also be used to
            specify a destination format different from qcow2.

    <h2><a id="example">Examples</a></h2>

    <p>Using this XML to create a checkpoint of just vda on a qemu
      domain with two disks and a prior checkpoint:</p>
&lt;description&gt;Completion of updates after OS install&lt;/description&gt;
    &lt;disk name='vda' checkpoint='bitmap'/&gt;
    &lt;disk name='vdb' checkpoint='no'/&gt;

    <p>will result in XML similar to this from
&lt;description&gt;Completion of updates after OS install&lt;/description&gt;
    &lt;disk name='vda' checkpoint='bitmap' bitmap='1525889631'/&gt;
    &lt;disk name='vdb' checkpoint='no'/&gt;
      &lt;disk type='file' device='disk'&gt;
        &lt;driver name='qemu' type='qcow2'/&gt;
        &lt;source file='/path/to/file1'/&gt;
        &lt;target dev='vda' bus='virtio'/&gt;
      &lt;disk type='file' device='disk' snapshot='external'&gt;
        &lt;driver name='qemu' type='raw'/&gt;
        &lt;source file='/path/to/file2'/&gt;
        &lt;target dev='vdb' bus='virtio'/&gt;

    <p>With that checkpoint created, the qcow2 image is now tracking
      all changes that occur in the image since the checkpoint via
      the persistent bitmap named <code>1525889631</code>.  Now, we
      can make a subsequent call
      to <code>virDomainBackupBegin()</code> to perform an incremental
      backup of just this data, using the following XML to start a
      pull model NBD export of the vda disk:
&lt;domainbackup mode="pull"&gt;
  &lt;server transport="unix" socket="/path/to/server"/&gt;
    &lt;disk name='vda' type='file'/&gt;
      &lt;scratch file='/path/to/file1.scratch'/&gt;

Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266