[Cc: Matt Booth from Nova upstream; not snipping the email, to retain context for Matt.]
On Thu, Nov 02, 2017 at 12:02:23PM +0000, Daniel P. Berrange wrote:
> I've been thinking about a potential design/impl improvement for the way
> that OpenStack Nova handles disk images when booting virtual machines, and
> thinking if some enhancements to qemu-nbd could be beneficial...

Just read through; very interesting idea.  A couple of things inline.

> At a high level, OpenStack has a repository of disk images (Glance), and
> when we go to boot a VM, Nova copies the disk image out of the repository
> onto the local host's image cache. In doing this, Nova may also enlarge
> the disk image (e.g. if the original image has 10GB size, it may do a
> qemu-img resize to 40GB). Nova then creates a qcow2 overlay with backing
> file pointing to its local cache. Multiple VMs can be booted in parallel,
> each with their own overlay pointing to the same backing file.
>
> The problem with this approach is that VM startup is delayed while we copy
> the disk image from the Glance repository to the local cache, and again
> while we do the image resize (though the latter is pretty quick really,
> since it's just changing metadata in the image and/or host filesystem).
>
> One might suggest that we avoid the local disk copy and just point the
> VM directly at an NBD server running in the remote image repository, but
> this introduces a centralized point of failure. With the local disk copy,
> VMs can safely continue running even if the image repository dies. Running
> from the local image cache can offer better performance too, particularly
> with SSD storage.
>
> Conceptually what I want to start with is a 3 layer chain:
>
>    master-disk1.qcow2 (qemu-nbd)
>      |
>      | (format=raw, proto=nbd)
>      |
>    cache-disk1.qcow2 (qemu-system-XXX)
>      |
>      | (format=qcow2, proto=file)
>      |
>      +- vm-a-disk1.qcow2 (qemu-system-XXX)
>
> NB vm-?-disk1.qcow2 sizes may differ from the backing file.
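[Aside, for concreteness: the current Nova flow you describe can be sketched
roughly as below with qemu-img.  All paths and sizes are invented; Nova uses
its own download code where I've written 'cp'.]

```shell
# 1. Copy the image out of Glance into the host-local cache
#    (stand-in for Nova's actual image download).
cp /glance/images/disk1.img /var/lib/nova/cache/cache-disk1.qcow2

# 2. Enlarge the cached image's virtual size -- metadata-only, so quick.
qemu-img resize /var/lib/nova/cache/cache-disk1.qcow2 40G

# 3. Create a per-instance qcow2 overlay pointing at the cached image.
qemu-img create -f qcow2 \
    -b /var/lib/nova/cache/cache-disk1.qcow2 -o backing_fmt=qcow2 \
    /var/lib/nova/instances/vm-a/vm-a-disk1.qcow2
```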
> Sometimes OS disk images are built with a fairly small root filesystem
> size, and the guest OS will grow its root FS to fill the actual disk
> size allowed to the specific VM instance.
>
> The cache-disk1.qcow2 is on each local virt host that needs disk1, and
> created when the first VM is launched. Further launched VMs can all use
> this same cached disk. Now the cache-disk1.qcow2 is not useful as-is,
> because it has no allocated clusters, so after it's created we need to
> be able to stream content into it from master-disk1.qcow2, in parallel
> with the VM A booting off vm-a-disk1.qcow2.
>
> If there was only a single VM, this would be easy enough, because we
> can use the drive-mirror monitor command to pull master-disk1.qcow2 data
> into cache-disk1.qcow2 and then remove the backing chain, leaving just:
>
>    cache-disk1.qcow2 (qemu-system-XXX)
>      |

Just for my own understanding: in this hypothetical single-VM diagram, you
denote a QEMU binary ("qemu-system-XXX") for 'cache-disk1.qcow2' because it
will be issuing 'drive-mirror' / 'blockdev-mirror' to the 'qemu-nbd' that
exported 'master-disk1.qcow2', and "un-chain" post completion of the
'mirror' job.  Yes?

>      | (format=qcow2, proto=file)
>      |
>      +- vm-a-disk1.qcow2 (qemu-system-XXX)
>
> The problem is that many VMs are wanting to use cache-disk1.qcow2 as
> their disk's backing file, and only one process is permitted to be
> writing to a disk backing file at any time.

Can you explain a bit more about how many VMs are trying to write to the
same backing file 'cache-disk1.qcow2'?  I'd assume it's just the
"immutable" local backing store (once the previous 'mirror' job is
completed), based on which Nova creates a qcow2 overlay for each instance
it boots.

When I pointed this e-mail of yours to Matt Booth on the Freenode Nova IRC
channel, he said the intermediate image (cache-disk1.qcow2) is a COR
(Copy-On-Read).
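[To make my reading of the single-VM case concrete, I'd expect the QMP
exchange to look roughly like this -- the device name is illustrative, and
the exact job topology is precisely what your question is about:]

```json
{ "execute": "drive-mirror",
  "arguments": { "device": "drive-virtio-disk0",
                 "target": "/var/lib/nova/cache/cache-disk1.qcow2",
                 "mode": "existing",
                 "format": "qcow2",
                 "sync": "full" } }

... wait for the BLOCK_JOB_READY event, then pivot ...

{ "execute": "block-job-complete",
  "arguments": { "device": "drive-virtio-disk0" } }
```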
I realize what COR is -- every time you read a cluster from the backing
file, you write that cluster locally, to avoid reading it again.

> So I can't use the drive-mirror
> in the QEMU processes to deal with this; all QEMUs must see their
> backing file in a consistent read-only state.
>
> I've been wondering if it is possible to add an extra layer of NBD to
> deal with this scenario. i.e. start off with:
>
>    master-disk1.qcow2 (qemu-nbd)
>      |
>      | (format=raw, proto=nbd)
>      |
>    cache-disk1.qcow2 (qemu-nbd)
>      |
>      | (format=raw, proto=nbd)
>      |
>      +- vm-a-disk1.qcow2 (qemu-system-XXX)
>      +- vm-b-disk1.qcow2 (qemu-system-XXX)
>      +- vm-c-disk1.qcow2 (qemu-system-XXX)
>
> In this model 'cache-disk1.qcow2' would be opened read-write by a
> qemu-nbd server process, but exported read-only to QEMU. qemu-nbd
> would then do a drive-mirror to stream the contents of
> master-disk1.qcow2 into its cache-disk1.qcow2, concurrently with
> servicing read requests from the many QEMUs' vm-*-disk1.qcow2 files
> over NBD. When the drive-mirror is complete, we would again cut
> the backing file to give:
>
>    cache-disk1.qcow2 (qemu-nbd)
>      |
>      | (format=raw, proto=nbd)
>      |
>      +- vm-a-disk1.qcow2 (qemu-system-XXX)
>      +- vm-b-disk1.qcow2 (qemu-system-XXX)
>      +- vm-c-disk1.qcow2 (qemu-system-XXX)
>
> Since qemu-nbd no longer needs to write to cache-disk1.qcow2 at this
> point, we can further pivot all the QEMU servers to make vm-*-disk1.qcow2
> use format=qcow2,proto=file, allowing the local qemu-nbd to close the
> disk image, and potentially exit (assuming it doesn't have other disks
> to service). This would leave:
>
>    cache-disk1.qcow2 (qemu-system-XXX)
>      |
>      | (format=qcow2, proto=file)
>      |
>      +- vm-a-disk1.qcow2 (qemu-system-XXX)
>      +- vm-b-disk1.qcow2 (qemu-system-XXX)
>      +- vm-c-disk1.qcow2 (qemu-system-XXX)
>
> Conceptually QEMU has all the pieces necessary to support this kind of
> approach to disk images, but they're not exposed by qemu-nbd as it has
> no QMP interface of its own.
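[For the final pivot of vm-*-disk1.qcow2 from proto=nbd to proto=file: for
an offline overlay the metadata-only rebase below would do it; for a
running guest the reopen mechanics are hand-wavier, and presumably part of
what the new QMP interface would need to cover.  Paths are invented.]

```shell
# '-u' (unsafe/metadata-only rebase) only rewrites the backing-file
# string recorded in the overlay's qcow2 header; no data is copied.
# Safe here because cache-disk1.qcow2 has identical content whether
# reached over NBD or as a local file.
qemu-img rebase -u \
    -b /var/lib/nova/cache/cache-disk1.qcow2 -F qcow2 \
    /var/lib/nova/instances/vm-a/vm-a-disk1.qcow2
```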
> Another more minor issue is that the disk image repository may have
> 1000's of images in it, and I don't want to be running 1000's of
> qemu-nbd instances. I'd like 1 server to export many disks. I could
> use iSCSI in the disk image repository instead to deal with that,
> only having the qemu-nbd processes running on the local virt host
> for the duration of populating cache-disk1.qcow2 from master-disk1.qcow2.
> The iSCSI server admin commands are pretty unpleasant to use compared
> to QMP though, so it's appealing to use NBD for everything.
>
> After all that long background explanation, what I'm wondering is whether
> there is any interest / desire to extend qemu-nbd to have a more advanced
> feature set than simply exporting a single disk image which must be
> listed at startup time:
>
>  - Ability to start qemu-nbd up with no initial disk image connected
>  - Option to have a QMP interface to control qemu-nbd
>  - Commands to add / remove individual disk image exports
>  - Commands for doing the drive-mirror / backing file pivot
>
> It feels like this wouldn't require significant new functionality in
> either QMP or the block layer. It ought to be mostly a case of taking
> existing QMP code and wiring it up in qemu-nbd, and only exposing a
> whitelisted subset of existing QMP commands related to block backends.
>
> One alternative approach to doing this would be to suggest that we should
> instead just spawn qemu-system-x86_64 with '--machine none' and use that
> as a replacement for qemu-nbd, since it already has a built-in NBD server
> which can do many exports at once and arbitrary block jobs.
>
> I'm concerned that this could end up being a game of whack-a-mole,
> though, constantly trying to cut out/down all the bits of system
> emulation in the machine emulators to get its resource overhead to match
> the low overhead of standalone qemu-nbd.
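[FWIW, the '--machine none' alternative already works today along these
lines -- a guest-less QEMU driven purely over QMP as a multi-export NBD
server.  All socket paths and node names below are invented:]

```shell
# Machine-less QEMU acting only as a QMP-controllable NBD server.
qemu-system-x86_64 -machine none -nodefaults -nographic \
    -qmp unix:/tmp/nbd-qmp.sock,server,nowait \
    -blockdev driver=file,node-name=cache1-file,filename=/var/lib/nova/cache/cache-disk1.qcow2 \
    -blockdev driver=qcow2,node-name=cache1,file=cache1-file

# Then, over the QMP socket:
#   { "execute": "nbd-server-start",
#     "arguments": { "addr": { "type": "unix",
#                              "data": { "path": "/var/run/nbd-cache.sock" } } } }
#   { "execute": "nbd-server-add",
#     "arguments": { "device": "cache1", "writable": false } }
```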
> Regards,
> Daniel
> --
> |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o- https://fstop138.berrange.com :|
> |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

--
/kashyap