On Fri, Oct 5, 2018 at 7:58 AM Eric Blake <ebl...@redhat.com> wrote:
> On 10/4/18 12:05 AM, Eric Blake wrote:
> > The following (long) email describes a portion of the work-flow of how
> > my proposed incremental backup APIs will work, along with the backend
> > QMP commands that each one executes. I will reply to this thread with
> > further examples (the first example is long enough to be its own email).
> > This is an update to a thread last posted here:
> > https://www.redhat.com/archives/libvir-list/2018-June/msg01066.html
> >
> > More to come in part 2.
>
> - Second example: a sequence of incremental backups via pull model
>
> In the first example, we did not create a checkpoint at the time of the
> full pull. That means we have no way to track a delta of changes since
> that point in time.
Why do we want to support backup without creating a checkpoint?
If we don't have any real use case, I suggest always requiring a checkpoint.

> Let's repeat the full backup (reusing the same
> backup.xml from before), but this time, we'll add a new parameter, a
> second XML file for describing the checkpoint we want to create.
>
> Actually, it was easy enough to get virsh to write the XML for me
> (because it was very similar to existing code in virsh that creates XML
> for snapshot creation):
>
> $ $virsh checkpoint-create-as --print-xml $dom check1 testing \
>    --diskspec sdc --diskspec sdd | tee check1.xml
> <domaincheckpoint>
>   <name>check1</name>

We should use an id, not a name, even if the name is also unique like in
most libvirt APIs. In RHV we will always use a UUID for this.

>   <description>testing</description>
>   <disks>
>     <disk name='sdc'/>
>     <disk name='sdd'/>
>   </disks>
> </domaincheckpoint>
>
> I had to supply two --diskspec arguments to virsh to select just the two
> qcow2 disks that I am using in my example (rather than every disk in the
> domain, which is the default when <disks> is not present).

So is an empty <disks/> a valid configuration that selects all disks, or is
it only omitting the <disks> element that selects all disks?

> I also picked
> a name (mandatory) and description (optional) to be associated with the
> checkpoint.
>
> The backup.xml file that we plan to reuse still mentions scratch1.img
> and scratch2.img as files needed for staging the pull request. However,
> any contents in those files could interfere with our second backup
> (after all, every cluster written into that file from the first backup
> represents a point in time that was frozen at the first backup; but our
> second backup will want to read the data as the guest sees it now rather
> than what it was at the first backup), so we MUST regenerate the scratch
> files. (Perhaps I should have just deleted them at the end of example 1
> in my previous email, had I remembered when typing that mail).
>
> $ $qemu_img create -f qcow2 -b $orig1 -F qcow2 scratch1.img
> $ $qemu_img create -f qcow2 -b $orig2 -F qcow2 scratch2.img
>
> Now, to begin the full backup and create a checkpoint at the same time.
> Also, this time around, it would be nice if the guest had a chance to
> freeze I/O to the disks prior to the point chosen as the checkpoint.
> Assuming the guest is trusted, and running the qemu guest agent (qga),
> we can do that with:
>
> $ $virsh fsfreeze $dom
> $ $virsh backup-begin $dom backup.xml check1.xml
> Backup id 1 started
> backup used description from 'backup.xml'
> checkpoint used description from 'check1.xml'
> $ $virsh fsthaw $dom

Great, this answers my (unsent) question about freeze/thaw from part 1 :-)

> and eventually, we may decide to add a VIR_DOMAIN_BACKUP_BEGIN_QUIESCE
> flag to combine those three steps into a single API (matching what we've
> done on some other existing API). In other words, the sequence of QMP
> operations performed during virDomainBackupBegin is quick enough that
> it won't stall a freeze operation (at least Windows is picky if you
> stall a freeze operation longer than 10 seconds).

We use fsFreeze/fsThaw directly in RHV since we need to support external
snapshots (e.g. ceph), so we don't need this functionality, but it sounds
like a good idea to make it work the way snapshots do.
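Until such a QUIESCE flag exists, I assume every management app will carry
a small wrapper that guarantees the thaw even when backup-begin fails.
A rough, untested sketch (using only the virsh commands quoted above; the
trap is just there so the guest is never left frozen):

  #!/bin/sh
  # Hypothetical wrapper: freeze guest I/O, start the backup, always thaw.
  dom=$1
  virsh fsfreeze "$dom" || exit 1
  trap 'virsh fsthaw "$dom"' EXIT   # runs even if backup-begin fails
  virsh backup-begin "$dom" backup.xml check1.xml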
>
> The tweaked $virsh backup-begin now results in a call to:
>  virDomainBackupBegin(dom, "<domainbackup ...>",
>   "<domaincheckpoint ...", 0)
> and in turn libvirt makes a similar sequence of QMP calls as before,
> with a slight modification in the middle:
> {"execute":"nbd-server-start",...
> {"execute":"blockdev-add",...

This does not work yet for network disks like "rbd" and "glusterfs".
Does that mean they will not be supported for backup?

> {"execute":"transaction",
>  "arguments":{"actions":[
>   {"type":"blockdev-backup", "data":{
>    "device":"$node1", "target":"backup-sdc", "sync":"none",
>    "job-id":"backup-sdc" }},
>   {"type":"blockdev-backup", "data":{
>    "device":"$node2", "target":"backup-sdd", "sync":"none",
>    "job-id":"backup-sdd" }},
>   {"type":"block-dirty-bitmap-add", "data":{
>    "node":"$node1", "name":"check1", "persistent":true}},
>   {"type":"block-dirty-bitmap-add", "data":{
>    "node":"$node2", "name":"check1", "persistent":true}}
>  ]}}
> {"execute":"nbd-server-add",...

What if this sequence fails in the middle? Will libvirt handle all failures
and roll back to the previous state?

What are the semantics of "execute": "transaction"? Does it mean that qemu
will handle all possible failures in any one of the actions?

(Will continue later)

> The only change was adding more actions to the "transaction" command -
> in addition to kicking off the fleece image in the scratch nodes, it
> ALSO added a persistent bitmap to each of the original images, to track
> all changes made after the point of the transaction. The bitmaps are
> persistent - at this point (well, it's better if you wait until after
> backup-end), you could shut the guest down and restart it, and libvirt
> will still remember that the checkpoint exists, and qemu will continue
> to track guest writes via the bitmap. However, the backup job itself is
> currently live-only, and shutting down the guest while a backup
> operation is in effect will lose track of the backup job. What that
> really means is that if the guest shuts down, your current backup job is
> hosed (you cannot ever get back the point-in-time data from your API
> request - as your next API request will be a new point in time) - but
> you have not permanently ruined the guest, and your recovery is to just
> start a new backup.
>
> Pulling the data out from the backup is unchanged from example 1; virsh
> backup-dumpxml will show details about the job (yes, the job id is still
> 1 for now), and when ready, virsh backup-end will end the job and
> gracefully take down the NBD server with no difference in QMP commands
> from before. Thus, the creation of a checkpoint didn't change any of
> the fundamentals of capturing the current backup, but rather is in
> preparation for the next step.
>
> $ $virsh backup-end $dom 1
> Backup id 1 completed
> $ rm scratch1.img scratch2.img
>
> [We have not yet designed how qemu bitmaps will interact with external
> snapshots - but I see two likely scenarios:
> 1. Down the road, I add a virDomainSnapshotCheckpointCreateXML() API,
> which adds a checkpointXML parameter but otherwise behaves like the
> existing virDomainSnapshotCreateXML - if that API is added in a
> different release than my current API proposals, that's yet another
> libvirt.so rebase to pick up the new API.
> 2. My current proposal of virDomainBackupBegin(dom, "<domainbackup>",
> "<domaincheckpoint>", flags) could instead be tweaked to a single XML
> parameter, virDomainBackupBegin(dom, "
> <domainbackup>
>   <domaincheckpoint> ... </domaincheckpoint>
> </domainbackup>", flags) prior to adding my APIs to libvirt 4.9, then
> down the road, we also tweak <domainsnapshot> to take an optional
> <domaincheckpoint> sub-element, and thus reuse the existing
> virDomainSnapshotCreateXML() to now also create checkpoints without a
> further API addition.
> Speak up now if you have a preference between the two ideas]
>
> Now that we have concluded the full backup and created a checkpoint, we
> can do more things with the checkpoint (it is persistent, after all).
> For example:
>
> $ $virsh checkpoint-list $dom
>  Name                 Creation Time
> --------------------------------------------
>  check1               2018-10-04 15:02:24 -0500
>
> called virDomainListCheckpoints(dom, &array, 0) under the hood to get a
> list of virDomainCheckpointPtr objects, then called
> virDomainCheckpointGetXMLDesc(array[0], 0) to scrape the XML describing
> that checkpoint in order to display information. Or another approach,
> using virDomainCheckpointGetXMLDesc(virDomainCheckpointCurrent(dom, 0), 0):
>
> $ $virsh checkpoint-current $dom | head
> <domaincheckpoint>
>   <name>check1</name>
>   <description>testing</description>
>   <creationTime>1538683344</creationTime>
>   <disks>
>     <disk name='vda' checkpoint='no'/>
>     <disk name='sdc' checkpoint='bitmap' bitmap='check1'/>
>     <disk name='sdd' checkpoint='bitmap' bitmap='check1'/>
>   </disks>
>   <domain type='kvm'>
>
> which shows the current checkpoint (that is, the checkpoint owning the
> bitmap that is still receiving live updates), and which bitmap names in
> the qcow2 files are in use. For convenience, it also recorded the full
> <domain> description at the time the checkpoint was captured (I used
> head to limit the size of this email), so that if you later hot-plug
> things, you still have a record of what state the machine had at the
> time the checkpoint was created.
>
> The XML output of a checkpoint description is normally static, but
> sometimes it is useful to know an approximate size of the guest data
> that has been dirtied since a checkpoint was created (a dynamic value
> that grows as a guest dirties more clusters). For that, it makes sense
> to have a flag to request the dynamic data; it's also useful to have a
> flag that suppresses the (lengthy) <domain> output:
>
> $ $virsh checkpoint-current $dom --size --no-domain
> <domaincheckpoint>
>   <name>check1</name>
>   <description>testing</description>
>   <creationTime>1538683344</creationTime>
>   <disks>
>     <disk name='vda' checkpoint='no'/>
>     <disk name='sdc' checkpoint='bitmap' bitmap='check1' size='1048576'/>
>     <disk name='sdd' checkpoint='bitmap' bitmap='check1' size='65536'/>
>   </disks>
> </domaincheckpoint>
>
> This maps to virDomainCheckpointGetXMLDesc(chk,
> VIR_DOMAIN_CHECKPOINT_XML_NO_DOMAIN | VIR_DOMAIN_CHECKPOINT_XML_SIZE).
> Under the hood, libvirt calls
> {"execute":"query-block"}
> and converts the bitmap size reported by qemu into an estimate of the
> number of bytes that would be required if you were to start a backup
> from that checkpoint right now. Note that the result is just an
> estimate of the storage taken by guest-visible data; you'll probably
> want to use 'qemu-img measure' to convert that into a size of how much a
> matching qcow2 image would require when metadata is added in; also
> remember that the number is constantly growing as the guest writes and
> causes more of the image to become dirty.
> But having a feel for how
> much has changed can be useful for determining if continuing a chain of
> incremental backups still makes more sense, or if enough of the guest
> data has changed that doing a full backup is smarter; it is also useful
> for preallocating how much storage you will need for an incremental backup.
>
> Technically, libvirt mapping a checkpoint size request to a single
> {"execute":"query-block"} works only when querying the size of the
> current bitmap. The command also works when querying the cumulative size
> since an older checkpoint, but under the hood, libvirt must juggle
> things to create a temporary bitmap, call a few
> x-block-dirty-bitmap-merge, query the size of that temporary bitmap,
> then clean things back up again (after all, size(A) + size(B) >=
> size(A|B), depending on how many clusters were touched during both A and
> B's tracking of dirty clusters). Again, a nice benefit of having
> libvirt manage multiple qemu bitmaps under a single libvirt API.
>
> Of course, the real reason we created a checkpoint with our full backup
> is that we want to take an incremental backup next, rather than
> repeatedly taking full backups. For this, we need a one-line
> modification to our backup XML to add an <incremental> element; we also
> want to update our checkpoint XML to start yet another checkpoint when
> we run our first incremental backup.
>
> $ cat > backup.xml <<EOF
> <domainbackup mode='pull'>
>   <server transport='tcp' name='localhost' port='10809'/>
>   <incremental>check1</incremental>
>   <disks>
>     <disk name='$orig1' type='file'>
>       <scratch file='$PWD/scratch1.img'/>
>     </disk>
>     <disk name='sdd' type='file'>
>       <scratch file='$PWD/scratch2.img'/>
>     </disk>
>   </disks>
> </domainbackup>
> EOF
> $ $virsh checkpoint-create-as --print-xml $dom check2 \
>    --diskspec sdc --diskspec sdd | tee check2.xml
> <domaincheckpoint>
>   <name>check2</name>
>   <disks>
>     <disk name='sdc'/>
>     <disk name='sdd'/>
>   </disks>
> </domaincheckpoint>
> $ $qemu_img create -f qcow2 -b $orig1 -F qcow2 scratch1.img
> $ $qemu_img create -f qcow2 -b $orig2 -F qcow2 scratch2.img
>
> And again, it's time to kick off the backup job:
>
> $ $virsh backup-begin $dom backup.xml check2.xml
> Backup id 1 started
> backup used description from 'backup.xml'
> checkpoint used description from 'check2.xml'
>
> This time, the incremental backup causes libvirt to do a bit more work
> under the hood:
>
> {"execute":"nbd-server-start",
>  "arguments":{"addr":{"type":"inet",
>   "data":{"host":"localhost", "port":"10809"}}}}
> {"execute":"blockdev-add",
>  "arguments":{"driver":"qcow2", "node-name":"backup-sdc",
>   "file":{"driver":"file",
>    "filename":"$PWD/scratch1.img"},
>   "backing":"'$node1'"}}
> {"execute":"blockdev-add",
>  "arguments":{"driver":"qcow2", "node-name":"backup-sdd",
>   "file":{"driver":"file",
>    "filename":"$PWD/scratch2.img"},
>   "backing":"'$node2'"}}
> {"execute":"block-dirty-bitmap-add",
>  "arguments":{"node":"$node1", "name":"backup-sdc"}}
> {"execute":"x-block-dirty-bitmap-merge",
>  "arguments":{"node":"$node1", "src_name":"check1",
>   "dst_name":"backup-sdc"}}
> {"execute":"block-dirty-bitmap-add",
>  "arguments":{"node":"$node2", "name":"backup-sdd"}}
> {"execute":"x-block-dirty-bitmap-merge",
>  "arguments":{"node":"$node2", "src_name":"check1",
>   "dst_name":"backup-sdd"}}
> {"execute":"transaction",
>  "arguments":{"actions":[
>   {"type":"blockdev-backup", "data":{
>    "device":"$node1", "target":"backup-sdc", "sync":"none",
>    "job-id":"backup-sdc" }},
>   {"type":"blockdev-backup", "data":{
>    "device":"$node2", "target":"backup-sdd", "sync":"none",
>    "job-id":"backup-sdd" }},
>   {"type":"x-block-dirty-bitmap-disable", "data":{
>    "node":"$node1", "name":"backup-sdc"}},
>   {"type":"x-block-dirty-bitmap-disable", "data":{
>    "node":"$node2", "name":"backup-sdd"}},
>   {"type":"x-block-dirty-bitmap-disable", "data":{
>    "node":"$node1", "name":"check1"}},
>   {"type":"x-block-dirty-bitmap-disable", "data":{
>    "node":"$node2", "name":"check1"}},
>   {"type":"block-dirty-bitmap-add", "data":{
>    "node":"$node1", "name":"check2", "persistent":true}},
>   {"type":"block-dirty-bitmap-add", "data":{
>    "node":"$node2", "name":"check2", "persistent":true}}
>  ]}}
> {"execute":"nbd-server-add",
>  "arguments":{"device":"backup-sdc", "name":"sdc"}}
> {"execute":"nbd-server-add",
>  "arguments":{"device":"backup-sdd", "name":"sdd"}}
> {"execute":"x-nbd-server-add-bitmap",
>  "arguments":{"name":"sdc", "bitmap":"backup-sdc"}}
> {"execute":"x-nbd-server-add-bitmap",
>  "arguments":{"name":"sdd", "bitmap":"backup-sdd"}}
>
> Two things stand out here, different from the earlier full backup. First
> is that libvirt is now creating a temporary non-persistent bitmap,
> merging all data from check1 into the temporary, then freezing writes
> into the temporary bitmap during the transaction, and telling NBD to
> expose the bitmap to clients. The second is that since we want this
> backup to start a new checkpoint, we disable the old bitmap and create a
> new one. The two additions are independent - it is possible to create an
> incremental backup [<incremental> in backup XML] without triggering a
> new checkpoint [presence of non-null checkpoint XML]. In fact, taking
> an incremental backup without creating a checkpoint is effectively doing
> differential backups, where multiple backups started at different times
> each contain all cumulative changes since the same original point in
> time, such that later backups are larger than earlier backups, but you
> no longer have to chain those backups to one another to reconstruct the
> state in any one of the backups.
>
> Now that the pull-model backup job is running, we want to scrape the
> data off the NBD server. Merely reading nbd://localhost:10809/sdc will
> read the full contents of the disk - but that defeats the purpose of
> using the checkpoint in the first place to reduce the amount of data to
> be backed up. So, let's modify our image-scraping loop from the first
> example, to now have one client utilizing the x-dirty-bitmap command
> line extension to drive other clients. Note: that extension is marked
> experimental in part because it has screwy semantics: if you use it, you
> can't reliably read any data from the NBD server, but instead can
> interpret 'qemu-img map' output by treating any "data":false lines as
> dirty, and "data":true entries as unchanged.
>
> $ image_opts=driver=nbd,export=sdc,server.type=inet,
> $ image_opts+=server.host=localhost,server.port=10809,
> $ image_opts+=x-dirty-bitmap=qemu:dirty-bitmap:backup-sdc
> $ $qemu_img create -f qcow2 inc12.img $size_of_orig1
> $ $qemu_img rebase -u -f qcow2 -F raw -b nbd://localhost:10809/sdc \
>    inc12.img
> $ while read line; do
>    [[ $line =~ .*start.:.([0-9]*).*length.:.([0-9]*).*data.:.false.* ]] ||
>      continue
>    start=${BASH_REMATCH[1]} len=${BASH_REMATCH[2]}
>    qemu-io -C -c "r $start $len" -f qcow2 inc12.img
>   done < <($qemu_img map --output=json --image-opts $image_opts)
> $ $qemu_img rebase -u -f qcow2 -b '' inc12.img
>
> As captured, inc12.img is an incomplete qcow2 file (it only includes
> clusters touched by the guest since the last incremental or full
> backup); but since we output into a qcow2 file, we can easily repair the
> damage:
>
> $ $qemu_img rebase -u -f qcow2 -F qcow2 -b full1.img inc12.img
>
> creating the qcow2 chain 'full1.img <- inc12.img' that contains
> identical guest-visible contents as would be present in a full backup
> done at the same moment.
>
> Of course, with the backups now captured, we clean up:
>
> $ $virsh backup-end $dom 1
> Backup id 1 completed
> $ rm scratch1.img scratch2.img
>
> and this time, virDomainBackupEnd() had to do one additional bit of work
> to delete the temporary bitmaps:
>
> {"execute":"nbd-server-remove",
>  "arguments":{"name":"sdc"}}
> {"execute":"nbd-server-remove",
>  "arguments":{"name":"sdd"}}
> {"execute":"nbd-server-stop"}
> {"execute":"block-job-cancel",
>  "arguments":{"device":"backup-sdc"}}
> {"execute":"block-job-cancel",
>  "arguments":{"device":"backup-sdd"}}
> {"execute":"blockdev-del",
>  "arguments":{"node-name":"backup-sdc"}}
> {"execute":"blockdev-del",
>  "arguments":{"node-name":"backup-sdd"}}
> {"execute":"block-dirty-bitmap-remove",
>  "arguments":{"node":"$node1", "name":"backup-sdc"}}
> {"execute":"block-dirty-bitmap-remove",
>  "arguments":{"node":"$node2", "name":"backup-sdd"}}
>
> At this point, it should be fairly obvious that you can create more
> incremental backups, by repeatedly updating the <incremental> line in
> backup.xml, and adjusting the checkpoint XML to move on to a successive
> name. And while incremental backups are the most common (using the
> current active checkpoint as the <incremental> when starting the next),
> the scheme is also set up to permit differential backups from any
> existing checkpoint to the current point in time (since libvirt is
> already creating a temporary bitmap as its basis for the
> x-nbd-server-add-bitmap, all it has to do is just add an appropriate
> number of x-block-dirty-bitmap-merge calls to collect all bitmaps in the
> chain from the requested checkpoint to the current checkpoint).
>
> More to come in part 3.
>
> --
> Eric Blake, Principal Software Engineer
> Red Hat, Inc.           +1-919-301-3266
> Virtualization:  qemu.org | libvirt.org
>
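One more note on the differential case mentioned at the end: if I read this
correctly, a later backup that covers everything since check1, without
starting a new checkpoint, would just keep <incremental>check1</incremental>
in the backup XML and omit the checkpoint XML argument. My guess at how that
would look (the one-argument backup-begin form and the 'sdc' disk name are
my assumptions, not taken from the proposal; scratch files regenerated as
usual):

$ $qemu_img create -f qcow2 -b $orig1 -F qcow2 scratch1.img
$ $qemu_img create -f qcow2 -b $orig2 -F qcow2 scratch2.img
$ cat > diff-backup.xml <<EOF
<domainbackup mode='pull'>
  <server transport='tcp' name='localhost' port='10809'/>
  <incremental>check1</incremental>
  <disks>
    <disk name='sdc' type='file'>
      <scratch file='$PWD/scratch1.img'/>
    </disk>
    <disk name='sdd' type='file'>
      <scratch file='$PWD/scratch2.img'/>
    </disk>
  </disks>
</domainbackup>
EOF
$ $virsh backup-begin $dom diff-backup.xml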
--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list