On 10/9/18 8:29 AM, Nir Soffer wrote:
On Fri, Oct 5, 2018 at 7:58 AM Eric Blake <ebl...@redhat.com> wrote:

On 10/4/18 12:05 AM, Eric Blake wrote:
The following (long) email describes a portion of the work-flow of how
my proposed incremental backup APIs will work, along with the backend
QMP commands that each one executes.  I will reply to this thread with
further examples (the first example is long enough to be its own email).
This is an update to a thread last posted here:
https://www.redhat.com/archives/libvir-list/2018-June/msg01066.html


More to come in part 2.


- Second example: a sequence of incremental backups via pull model

In the first example, we did not create a checkpoint at the time of the
full pull. That means we have no way to track a delta of changes since
that point in time.


Why do we want to support backup without creating a checkpoint?

Fleecing. If you want to examine a portion of the disk at a given point in time, then kicking off a pull model backup gives you access to the state of the disk at that time, and your actions are transient. Ending the job when you are done with the fleece cleans up everything needed to perform the fleece operation, and since you did not intend to capture a full (well, a complete) incremental backup, but were rather grabbing just a subset of the disk, you really don't want that point in time to be recorded as a new checkpoint.
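
For instance, a fleecing session might look roughly like this (a sketch only: the NBD port and export name are placeholders standing in for whatever the pull-mode server in the first example actually advertised, and the name of the command that ends the job is an assumption about this proposal's virsh interface):

$ $virsh backup-begin $dom backup.xml
# read just the regions of interest, frozen at the point backup-begin ran
$ qemu-io -r -f raw -c 'read -v 0 64k' nbd://localhost:10809/sdc
# done fleecing; tear down the job without recording a checkpoint
$ $virsh backup-end $dom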

Also, incremental backups (which are what require checkpoints) are limited to qcow2 disks, but full backups can be performed on any format (including raw disks). If you have a guest that does not use qcow2 disks, you can perform a full backup, but cannot create a checkpoint.
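
As a quick sanity check (a sketch only; $orig1 is the same shell variable used for the first qcow2 disk later in this mail), 'qemu-img info' reports the image format, so you can confirm up front whether a given disk can take part in checkpoints:

$ $qemu_img info $orig1 | grep 'file format'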


If we don't have any real use case, I suggest always requiring a
checkpoint.

But we do have real use cases for backup without a checkpoint.



Let's repeat the full backup (reusing the same
backup.xml from before), but this time, we'll add a new parameter, a
second XML file for describing the checkpoint we want to create.

Actually, it was easy enough to get virsh to write the XML for me
(because it was very similar to existing code in virsh that creates XML
for snapshot creation):

$ $virsh checkpoint-create-as --print-xml $dom check1 testing \
     --diskspec sdc --diskspec sdd | tee check1.xml
<domaincheckpoint>
    <name>check1</name>


We should use an id, not a name, even if the name is also unique, as it is
in most libvirt APIs.

In RHV we will always use a UUID for this.

Nothing prevents you from using a UUID as your name. But this particular choice of XML (<name>) matches what already exists in the snapshot XML.



    <description>testing</description>
    <disks>
      <disk name='sdc'/>
      <disk name='sdd'/>
    </disks>
</domaincheckpoint>

I had to supply two --diskspec arguments to virsh to select just the two
qcow2 disks that I am using in my example (rather than every disk in the
domain, which is the default when <disks> is not present).


So is <disks/> a valid configuration that selects all disks, or does
omitting the "disks" element select all disks?

It's about a one-line change to get whichever behavior you find more useful. Right now, I'm leaning towards: <disks> omitted == backup all disks, <disks> present: you MUST have at least one <disk> subelement that explicitly requests a checkpoint (because any omitted <disk> when <disks> is present is skipped). A checkpoint only makes sense as long as there is at least one disk to create a checkpoint with.

But I could also go with: <disks> omitted == backup all disks, <disks> present but <disk> subelements missing: the missing elements default to being backed up, and you have to explicitly provide <disk name='foo' checkpoint='no'> to skip a particular disk.
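
Under that alternative, a checkpoint description that explicitly opts one disk out might look like this (a sketch only, reusing the disk names from the example above; the checkpoint='no' attribute is the one being proposed in this paragraph):

<domaincheckpoint>
    <name>check1</name>
    <disks>
      <disk name='sdc'/>
      <disk name='sdd' checkpoint='no'/>
    </disks>
</domaincheckpoint>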

Or even: <disks> omitted, or <disks> present but <disk> subelements missing: the missing elements defer to the hypervisor for their default state, and the qemu hypervisor defaults to qcow2 disks being backed up/checkpointed and to non-qcow2 disks being omitted. But this latter one feels like more magic, which is harder to document and liable to go wrong.

A stricter version would be <disks> is mandatory, and no <disk> subelement can be missing (or else the API fails because you weren't explicit in your choice). But that's rather strict, especially since existing snapshots XML handling is not that strict.



I also picked
a name (mandatory) and description (optional) to be associated with the
checkpoint.

The backup.xml file that we plan to reuse still mentions scratch1.img
and scratch2.img as files needed for staging the pull request. However,
any contents in those files could interfere with our second backup
(after all, every cluster written into that file from the first backup
represents a point in time that was frozen at the first backup; but our
second backup will want to read the data as the guest sees it now rather
than what it was at the first backup), so we MUST regenerate the scratch
files. (Perhaps I should have just deleted them at the end of example 1
in my previous email, had I remembered when typing that mail).

$ $qemu_img create -f qcow2 -b $orig1 -F qcow2 scratch1.img
$ $qemu_img create -f qcow2 -b $orig2 -F qcow2 scratch2.img
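
As an optional sanity check (nothing in the workflow requires it), you can confirm that each freshly recreated scratch file points at the intended backing image:

$ $qemu_img info scratch1.img | grep 'backing file'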

Now, to begin the full backup and create a checkpoint at the same time.
Also, this time around, it would be nice if the guest had a chance to
freeze I/O to the disks prior to the point chosen as the checkpoint.
Assuming the guest is trusted, and running the qemu guest agent (qga),
we can do that with:

$ $virsh fsfreeze $dom
$ $virsh backup-begin $dom backup.xml check1.xml
Backup id 1 started
backup used description from 'backup.xml'
checkpoint used description from 'check1.xml'
$ $virsh fsthaw $dom


Great, this answers my (unsent) question about freeze/thaw from part 1 :-)


and eventually, we may decide to add a VIR_DOMAIN_BACKUP_BEGIN_QUIESCE
flag to combine those three steps into a single API (matching what we've
done on some other existing API).  In other words, the sequence of QMP
operations performed during virDomainBackupBegin is quick enough that
they won't stall a freeze operation (at least Windows is picky if you
stall a freeze operation longer than 10 seconds).
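
If that flag materializes, the combined call would presumably look like this (a sketch only, mirroring the proposed signature quoted below; the flag itself does not exist yet):

   virDomainBackupBegin(dom, "<domainbackup ...>",
     "<domaincheckpoint ...>", VIR_DOMAIN_BACKUP_BEGIN_QUIESCE)

with libvirt driving the guest agent freeze before, and the thaw after, the underlying QMP sequence.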


We use fsFreeze/fsThaw directly in RHV since we need to support external
snapshots (e.g. ceph), so we don't need this functionality, but it sounds
like a good idea to make it work like snapshots.

And indeed, a future enhancement will be figuring out how we can create a checkpoint at the same time as a snapshot (as mentioned elsewhere in this email). A snapshot and a checkpoint created at the same atomic point should obviously both be able to happen at a quiescent point in guest I/O.




The tweaked $virsh backup-begin now results in a call to:
   virDomainBackupBegin(dom, "<domainbackup ...>",
     "<domaincheckpoint ...", 0)
and in turn libvirt makes a similar sequence of QMP calls as before,
with a slight modification in the middle:
{"execute":"nbd-server-start",...
{"execute":"blockdev-add",...


This does not work yet for network disks like "rbd" and "glusterfs";
does that mean they will not be supported for backup?

Full backups can happen regardless of underlying format. But incremental backups require checkpoints, and checkpoints require qcow2 persistent bitmaps. As long as you have a qcow2 format on rbd or glusterfs, you should be able to create checkpoints on that image, and therefore perform incremental backups. Storage-wise, during a pull model backup, you would have your qcow2 format on remote glusterfs storage which is where the persistent bitmap is written, and temporarily also have a scratch qcow2 file on the local machine for performing copy-on-write needed to preserve the point in time semantics for as long as the backup operation is running.
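
So for a qcow2 disk that lives on gluster, the scratch file for this example could still be created locally, along these lines (a sketch only: the gluster URI is a made-up placeholder, and it assumes a qemu built with gluster support):

$ $qemu_img create -f qcow2 -b gluster://ghost/vol/guest1.qcow2 -F qcow2 scratch1.img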



{"execute":"transaction",
   "arguments":{"actions":[
    {"type":"blockdev-backup", "data":{
     "device":"$node1", "target":"backup-sdc", "sync":"none",
     "job-id":"backup-sdc" }},
    {"type":"blockdev-backup", "data":{
     "device":"$node2", "target":"backup-sdd", "sync":"none",
     "job-id":"backup-sdd" }}
    {"type":"block-dirty-bitmap-add", "data":{
     "node":"$node1", "name":"check1", "persistent":true}},
    {"type":"block-dirty-bitmap-add", "data":{
     "node":"$node2", "name":"check1", "persistent":true}}
   ]}}
{"execute":"nbd-server-add",...



What if this sequence fails in the middle? Will libvirt handle all failures
and roll back to the previous state?

What are the semantics of "execute": "transaction"? Does it mean that qemu
will handle all possible failures in one of the actions?

qemu already promises that a "transaction" succeeds or fails as a group. As to other failures, the full recovery sequence is handled by libvirt, and looks like:

Fail on "nbd-server-start":
 - nothing to roll back
Fail on first "blockdev-add":
 - nbd-server-stop
Fail on subsequent "blockdev-add" (see the sketch after this list):
 - blockdev-remove on earlier scratch file additions
 - nbd-server-stop
Fail on any "block-dirty-bitmap-add" or "x-block-dirty-bitmap-merge":
 - block-dirty-bitmap-remove on any temporary bitmaps that were created
 - blockdev-remove on all scratch file additions
 - nbd-server-stop
Fail on "transaction":
 - block-dirty-bitmap-remove on all temporary bitmaps
 - blockdev-remove on all additions
 - nbd-server-stop
Fail on "nbd-server-add" or "x-nbd-server-add-bitmap":
 - if a checkpoint was attempted during "transaction":
   -- perform x-block-dirty-bitmap-enable to re-enable the bitmap that was in use prior to the transaction
   -- perform x-block-dirty-bitmap-merge to merge the new bitmap into the re-enabled bitmap
   -- perform block-dirty-bitmap-remove on the new bitmap
 - block-job-cancel
 - block-dirty-bitmap-remove on all temporary bitmaps
 - blockdev-remove on all scratch file additions
 - nbd-server-stop
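
As a concrete illustration of the "fail on subsequent blockdev-add" case above (a sketch only: the node name assumes the sdc scratch node was the one successfully added with that name, and blockdev-del is the stable QMP spelling of the removal step listed above):

{"execute":"blockdev-del", "arguments":{"node-name":"backup-sdc"}}
{"execute":"nbd-server-stop"}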



More to come in part 3.

I still need to finish writing that, but part 3 will be a demonstration of the push model (where qemu writes the backup to a given destination, without a scratch file, and without an NBD server, but where you are limited to what qemu knows how to write).

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
