On Thu, Nov 02, 2017 at 05:40:28PM +0100, Kashyap Chamarthy wrote:
> [Cc: Matt Booth from Nova upstream; so not snipping the email, to retain
> context for Matt.]
>
> On Thu, Nov 02, 2017 at 12:02:23PM +0000, Daniel P. Berrange wrote:
> > I've been thinking about a potential design/impl improvement for the way
> > that OpenStack Nova handles disk images when booting virtual machines,
> > and wondering if some enhancements to qemu-nbd could be beneficial...
>
> Just read through it; very interesting idea. A couple of things inline.
>
> > At a high level, OpenStack has a repository of disk images (Glance), and
> > when we go to boot a VM, Nova copies the disk image out of the repository
> > onto the local host's image cache. When doing this, Nova may also enlarge
> > the disk image (eg if the original image is 10GB in size, it may do a
> > qemu-img resize to 40GB). Nova then creates a qcow2 overlay with a
> > backing file pointing to its local cache. Multiple VMs can be booted in
> > parallel, each with their own overlay pointing to the same backing file.
> >
> > The problem with this approach is that VM startup is delayed while we
> > copy the disk image from the glance repository to the local cache, and
> > again while we do the image resize (though the latter is pretty quick
> > really, since it's just changing metadata in the image and/or host
> > filesystem).
> >
> > One might suggest that we avoid the local disk copy and just point the
> > VM directly at an NBD server running in the remote image repository, but
> > this introduces a centralized point of failure. With the local disk copy,
> > VMs can safely continue running even if the image repository dies.
> > Running from the local image cache can offer better performance too,
> > particularly if the host has SSD storage.
> >
> > Conceptually what I want to start with is a 3 layer chain:
> >
> >    master-disk1.qcow2 (qemu-nbd)
> >        |
> >        | (format=raw, proto=nbd)
> >        |
> >    cache-disk1.qcow2 (qemu-system-XXX)
> >        |
> >        | (format=qcow2, proto=file)
> >        |
> >        +- vm-a-disk1.qcow2 (qemu-system-XXX)
> >
> > NB vm-?-disk1.qcow2 sizes may differ from the backing file. Sometimes OS
> > disk images are built with a fairly small root filesystem size, and the
> > guest OS will grow its root FS to fill the actual disk size allowed to
> > the specific VM instance.
> >
> > The cache-disk1.qcow2 is on each local virt host that needs disk1, and
> > is created when the first VM is launched. Further launched VMs can all
> > use this same cached disk. Now the cache-disk1.qcow2 is not useful as is,
> > because it has no allocated clusters, so after it's created we need to
> > be able to stream content into it from master-disk1.qcow2, in parallel
> > with VM A booting off vm-a-disk1.qcow2.
> >
> > If there was only a single VM, this would be easy enough, because we
> > can use the drive-mirror monitor command to pull master-disk1.qcow2 data
> > into cache-disk1.qcow2 and then remove the backing chain, leaving just:
> >
> >    cache-disk1.qcow2 (qemu-system-XXX)
> >        |
>
> Just for my own understanding: in this hypothetical single VM diagram,
> you denote a QEMU binary ("qemu-system-XXX") for 'cache-disk1.qcow2'
> because it will be issuing 'drive-mirror' / 'blockdev-mirror' to the
> 'qemu-nbd' that exported 'master-disk1.qcow2', and "un-chain" post
> completion of the 'mirror' job. Yes?

In this diagram the same QEMU process has both cache-disk1.qcow2 and
vm-a-disk1.qcow2 open - it's just a regular backing file setup.
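FWIW, that initial 3 layer chain can already be put together with stock
tools; as a rough illustration only (the host name, port and sizes below
are made up, not anything Nova does today):

  # on the image repository host: export the master image read-only
  qemu-nbd --read-only --persistent --format=qcow2 --port=10809 \
      master-disk1.qcow2

  # on the virt host: a local cache layer backed by the NBD export...
  qemu-img create -f qcow2 \
      -o backing_file=nbd://repo-host:10809,backing_fmt=raw \
      cache-disk1.qcow2 40G

  # ...a per-VM overlay backed by the local cache...
  qemu-img create -f qcow2 \
      -o backing_file=cache-disk1.qcow2,backing_fmt=qcow2 \
      vm-a-disk1.qcow2

  # ...and the guest just boots off its own overlay
  qemu-system-x86_64 -m 2048 \
      -drive file=vm-a-disk1.qcow2,format=qcow2,if=virtio

The piece that's missing is populating cache-disk1.qcow2 after the fact,
while guests are already reading through it.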
> >        | (format=qcow2, proto=file)
> >        |
> >        +- vm-a-disk1.qcow2 (qemu-system-XXX)
> >
> > The problem is that many VMs are wanting to use cache-disk1.qcow2 as
> > their disk's backing file, and only one process is permitted to be
> > writing to a disk backing file at any time.
>
> Can you explain a bit more about how many VMs are trying to write to
> the same backing file 'cache-disk1.qcow2'? I'd assume it's just the
> "immutable" local backing store (once the previous 'mirror' job is
> completed), based on which Nova creates a qcow2 overlay for each
> instance it boots.

An arbitrary number of vm-*-disk1.qcow2 files could exist, all using the
same cache-disk1.qcow2 image. It's only limited by how many VMs you can
fit on the host. By definition you can only ever have a single process
writing to a qcow2 file though, otherwise corruption will quickly follow.

> When I pointed this e-mail of yours to Matt Booth on the Freenode Nova
> IRC channel, he said the intermediate image (cache-disk1.qcow2) is a COR
> (Copy-On-Read). I realize what COR is -- every time you read a cluster
> from the backing file, you write that cluster locally, to avoid reading
> it again.

qcow2 doesn't give you COR, only COW. So every read request would have a
miss in cache-disk1.qcow2 and thus have to be fetched from
master-disk1.qcow2. The use of drive-mirror to pull master-disk1.qcow2
contents into cache-disk1.qcow2 makes up for the lack of COR by
populating cache-disk1.qcow2 in the background.

> > So I can't use the drive-mirror in the QEMU processes to deal with
> > this; all QEMUs must see their backing file in a consistent read-only
> > state.
> >
> > I've been wondering if it is possible to add an extra layer of NBD to
> > deal with this scenario, i.e. start off with:
> >
> >    master-disk1.qcow2 (qemu-nbd)
> >        |
> >        | (format=raw, proto=nbd)
> >        |
> >    cache-disk1.qcow2 (qemu-nbd)
> >        |
> >        | (format=raw, proto=nbd)
> >        |
> >        +- vm-a-disk1.qcow2 (qemu-system-XXX)
> >        +- vm-b-disk1.qcow2 (qemu-system-XXX)
> >        +- vm-c-disk1.qcow2 (qemu-system-XXX)
> >
> > In this model 'cache-disk1.qcow2' would be opened read-write by a
> > qemu-nbd server process, but exported read-only to QEMU. qemu-nbd
> > would then do a drive mirror to stream the contents of
> > master-disk1.qcow2 into its cache-disk1.qcow2, concurrently with
> > servicing read requests from many QEMUs' vm-*-disk1.qcow2 files
> > over NBD. When the drive mirror is complete we would again cut
> > the backing file to give:
> >
> >    cache-disk1.qcow2 (qemu-nbd)
> >        |
> >        | (format=raw, proto=nbd)
> >        |
> >        +- vm-a-disk1.qcow2 (qemu-system-XXX)
> >        +- vm-b-disk1.qcow2 (qemu-system-XXX)
> >        +- vm-c-disk1.qcow2 (qemu-system-XXX)
> >
> > Since qemu-nbd no longer needs to write to cache-disk1.qcow2 at this
> > point, we can further pivot all the QEMU servers to make
> > vm-*-disk1.qcow2 use format=qcow2,proto=file, allowing the local
> > qemu-nbd to close the disk image, and potentially exit (assuming it
> > doesn't have other disks to service). This would leave:
> >
> >    cache-disk1.qcow2 (qemu-system-XXX)
> >        |
> >        | (format=qcow2, proto=file)
> >        |
> >        +- vm-a-disk1.qcow2 (qemu-system-XXX)
> >        +- vm-b-disk1.qcow2 (qemu-system-XXX)
> >        +- vm-c-disk1.qcow2 (qemu-system-XXX)
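That populate-and-cut step is the part that already exists as a block
job in the system emulators. As a very rough sketch of the kind of
command a QMP-capable qemu-nbd would need to accept (the node and job
names here are made up; block-stream is shown as the existing job that
copies backing data into an overlay and then drops the backing link,
which is the effect described above in terms of a mirror):

  -> { "execute": "block-stream",
       "arguments": { "job-id": "populate-cache",
                      "device": "cache-disk1" } }
  <- { "return": {} }

  (then wait for the BLOCK_JOB_COMPLETED event for "populate-cache";
   at that point cache-disk1.qcow2 no longer references the remote
   master-disk1.qcow2)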
> > Conceptually QEMU has all the pieces necessary to support this kind of
> > approach to disk images, but they're not exposed by qemu-nbd as it has
> > no QMP interface of its own.
> >
> > Another more minor issue is that the disk image repository may have
> > 1000's of images in it, and I don't want to be running 1000's of
> > qemu-nbd instances. I'd like 1 server to export many disks. I could
> > use iscsi in the disk image repository instead to deal with that,
> > only having the qemu-nbd processes running on the local virt host
> > for the duration of populating cache-disk1.qcow2 from
> > master-disk1.qcow2. The iscsi server admin commands are pretty
> > unpleasant to use compared to QMP though, so it's appealing to use
> > NBD for everything.
> >
> > After all that long background explanation, what I'm wondering is
> > whether there is any interest / desire to extend qemu-nbd to have a
> > more advanced featureset than simply exporting a single disk image
> > which must be listed at startup time:
> >
> >  - Ability to start qemu-nbd up with no initial disk image connected
> >  - Option to have a QMP interface to control qemu-nbd
> >  - Commands to add / remove individual disk image exports
> >  - Commands for doing the drive-mirror / backing file pivot
> >
> > It feels like this wouldn't require significant new functionality in
> > either QMP or the block layer. It ought to be mostly a case of taking
> > existing QMP code and wiring it up in qemu-nbd, and only exposing a
> > whitelisted subset of existing QMP commands related to block backends.
> >
> > One alternative approach to doing this would be to suggest that we
> > should instead just spawn qemu-system-x86_64 with '--machine none' and
> > use that as a replacement for qemu-nbd, since it already has a built-in
> > NBD server which can do many exports at once and arbitrary block jobs.
> >
> > I'm concerned that this could end up being a game of whack-a-mole
> > though, constantly trying to cut out/down all the bits of system
> > emulation in the machine emulators to get its resource overhead to
> > match the low overhead of standalone qemu-nbd.
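For reference, the rough shape the '--machine none' route would take
today is something like this (an untested sketch; the socket path, port
and node name are made up):

  # a long-lived "block server" process, with no guest machine at all
  qemu-system-x86_64 -machine none -nodefaults -nographic \
      -qmp unix:/var/run/image-cache-qmp.sock,server,nowait

  # then over QMP: start the NBD server, attach an image, export it
  -> { "execute": "qmp_capabilities" }
  -> { "execute": "nbd-server-start",
       "arguments": { "addr": { "type": "inet",
                                "data": { "host": "0.0.0.0",
                                          "port": "10810" } } } }
  -> { "execute": "blockdev-add",
       "arguments": { "driver": "qcow2", "node-name": "cache-disk1",
                      "file": { "driver": "file",
                                "filename": "cache-disk1.qcow2" } } }
  -> { "execute": "nbd-server-add",
       "arguments": { "device": "cache-disk1", "writable": false } }

  # the job that populates cache-disk1.qcow2 would then be driven over
  # the same QMP socket while guests read from the NBD export

Regards,
Daniel
-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|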