* 'Hrvoje Ribicic' via ganeti-devel <[email protected]> [2015-09-01 20:37:49 +0200]:
> On Tue, Sep 1, 2015 at 6:44 PM, Dimitris Aragiorgis <
> [email protected]> wrote:
>
> > * Hrvoje Ribicic <[email protected]> [2015-08-31 17:48:02 +0200]:
> >
> > > On Tue, Aug 18, 2015 at 12:06 PM, Dimitris Aragiorgis <
> > > [email protected]> wrote:
> > >
> > > > This is a design document detailing the refactoring of device
> > > > handling in the KVM Hypervisor. More specifically, it will use the
> > > > latest QEMU device model and modify the current hotplug implementation
> > > > so that both PCI and SCSI devices can be managed.
> > > >
> > > > Signed-off-by: Dimitris Aragiorgis <[email protected]>
> > > > ---
> > > >  Makefile.am             | 1 +
> > > >  doc/design-draft.rst    | 1 +
> > > >  doc/design-scsi-kvm.rst | 220 +++++++++++++++++++++++++++++++++++++++++++++++
> > > >  3 files changed, 222 insertions(+)
> > > >  create mode 100644 doc/design-scsi-kvm.rst
> > > >
> > > > diff --git a/Makefile.am b/Makefile.am
> > > > index 12cda9f..88690a3 100644
> > > > --- a/Makefile.am
> > > > +++ b/Makefile.am
> > > > @@ -717,6 +717,7 @@ docinput = \
> > > >   doc/design-reservations.rst \
> > > >   doc/design-resource-model.rst \
> > > >   doc/design-restricted-commands.rst \
> > > > + doc/design-scsi-kvm.rst \
> > > >   doc/design-shared-storage.rst \
> > > >   doc/design-shared-storage-redundancy.rst \
> > > >   doc/design-ssh-ports.rst \
> > > > diff --git a/doc/design-draft.rst b/doc/design-draft.rst
> > > > index 3f2a1f3..83a392c 100644
> > > > --- a/doc/design-draft.rst
> > > > +++ b/doc/design-draft.rst
> > > > @@ -25,6 +25,7 @@ Design document drafts
> > > >    design-configlock.rst
> > > >    design-multi-storage-htools.rst
> > > >    design-repaird.rst
> > > > +  design-scsi-kvm.rst
> > > >
> > > >  .. vim: set textwidth=72 :
> > > >  .. Local Variables:
> > > > diff --git a/doc/design-scsi-kvm.rst b/doc/design-scsi-kvm.rst
> > > > new file mode 100644
> > > > index 0000000..af46177
> > > > --- /dev/null
> > > > +++ b/doc/design-scsi-kvm.rst
> > > > @@ -0,0 +1,220 @@
> > > > +==========
> > > > +KVM + SCSI
> > > > +==========
> > > > +
> > > > +.. contents:: :depth: 4
> > > > +
> > > > +This is a design document detailing the refactoring of device
> > > > +handling in the KVM Hypervisor. More specifically, it will use
> > > > +the latest QEMU device model and modify the hotplug implementation
> > > > +so that both PCI and SCSI devices can be managed.
> > > > +
> > > > +
> > > > +Current state and shortcomings
> > > > +==============================
> > > > +
> > > > +Ganeti currently supports SCSI virtual devices in the KVM hypervisor by
> > > > +setting the `disk_type` hvparam to `scsi`. Ganeti will eventually
> > > > +instruct QEMU to use the deprecated device model (i.e. -drive if=scsi),
> > > > +which will expose the backing store as an emulated SCSI device. This
> > > > +means that currently SCSI pass-through is not supported.
> > > > +
> > > > +On the other hand, the current hotplug implementation
> > > > +:doc:`design-hotplug` uses the latest QEMU
> > > > +device model (via the -device option) and is tailored to paravirtual
> > > > +devices, which leads to buggy behavior: if we hotplug a disk to an
> > > > +instance that is configured with disk_type=scsi hvparam, the
> > > > +disk which will get hot-plugged eventually will be a VirtIO device
> > > > +(i.e., virtio-blk-pci) on the PCI bus.
> > > > +
> > > > +The current implementation of creating the QEMU command line is
> > > > +error-prone, since an instance might not be able to boot due to PCI slot
> > > > +congestion.
> > > > +
> > > > +
> > > > +Proposed changes
> > > > +================
> > > > +
> > > > +We change the way that the KVM hypervisor handles block devices by
> > > > +introducing the latest QEMU device model for SCSI devices as well, so
> > > > +that scsi-cd, scsi-hd, scsi-block, and scsi-generic device drivers are
> > > > +supported too. Additionally we refactor the hotplug implementation in
> > > > +order to support hotplugging of SCSI devices too. Finally, we change the
> > > > +way we keep track of device info inside runtime files, and the way we
> > > > +place each device upon instance startup.
> > > > +
> > > > +Design decisions
> > > > +================
> > > > +
> > > > +How to identify each device?
> > > > +
> > > > +Currently KVM does not support arbitrary IDs for devices; supported are
> > > > +only names starting with a letter, with max 32 chars length, and only
> > > > +including the '.', '_', '-' special chars. Currently we generate an ID
> > > > +with the following format: <device type>-<part of uuid>-pci-<slot>.
> > > > +This assumes that the device will be plugged in a certain slot on the
> > > > +PCI bus. Since we want to support devices on a SCSI bus too and adding
> > > > +the PCI slot to the ID is redundant, we keep only the first two parts of
> > > > +the existing ID.
> > > > +
> > > > +
> > > > +Which buses does the guest eventually see?
> > > > +
> > > > +By default QEMU starts with a single PCI bus named "pci.0". In case a
> > > > +SCSI controller is added on this bus, a SCSI bus is created with
> > > > +the corresponding name: "scsi.0".
> > > > +Any SCSI disks will be attached on this SCSI bus. Currently Ganeti does
> > > > +not explicitly use a SCSI controller via a command line option, but lets
> > > > +QEMU add one automatically if needed. Here, in case we have a SCSI disk,
> > > > +a SCSI controller is explicitly added via the -device option. For the
> > > > +SCSI controller, we do not specify the PCI slot to use, but let QEMU find
> > > > +the first available (see below).
> > >
> > >
> > > As we are doing this explicitly rather than implicitly, does it change the
> > > order of PCI devices? In patch 2, the SCSI controller addition comes before
> > > the addition of the balloon device, and I think the patch series relies on
> > > the same order being created implicitly by QEMU? If so, this could prevent
> > > migration of an instance using both of these features.
> >
> > During migration we rebuild the QEMU command line on the target node based
> > on the already created runtime file. For old VMs the balloon option will
> > be present in the kvm_cmd entry of the runtime file while the controller
> > option will be missing. This will not change since runtime files are
> > created before starting the instance and kvm_cmd entry changes only
> > after reboot. So migration of old VMs will not be a problem.
>
> Excellent, thanks!
>
> > > > +
> > > > +
> > > > +What type of SCSI controller to use?
> > > > +
> > > > +QEMU uses the `lsi` controller by default. To make this configurable we
> > > > +add a new hvparam, `scsi_controller_type`. The available types will be
> > > > +`lsi`, `megasas`, and `virtio-scsi-pci`.
> > > > +
> > > > +
> > > > +Where to place the devices upon instance startup?
> > > > +
> > > > +By default QEMU boots with the first three slots of the default PCI bus
> > > > +(pci.0) occupied: slot 0 for Host bridge, slot 1 for ISA bridge, and
> > > > +slot 2 for VGA controller. Thereafter, the slots depend on the QEMU
> > > > +options passed in the command line.
> > > > +
> > > > +The main reason that we want to be fully aware of the configuration of a
> > > > +running instance (machine type, PCI and SCSI bus state, devices, etc.)
> > > > +is that in case of migration a QEMU process with the exact same
> > > > +configuration should be created on the target node. The configuration is
> > > > +kept in the runtime file created just before starting the instance.
> > > > +Since hotplug has been introduced, the only thing that can change after
> > > > +starting an instance is the configuration related to NICs and Disks.
> > > > +
> > > > +Before implementing hotplug, Ganeti did not specify PCI slots
> > > > +explicitly, but let QEMU decide how to place the devices on the
> > > > +corresponding bus. This does not work if we want to have hotplug-able
> > > > +devices and migrate-able VMs. Currently, upon runtime file creation, we
> > > > +try to reserve PCI slots based on the hvparams, the disks, and the NICs
> > > > +of the instance. This has three major shortcomings: first, we have to be
> > > > +aware which options modify the PCI bus, which is practically impossible
> > > > +due to the huge amount of QEMU options, second, QEMU may change the
> > > > +default PCI configuration from version to version, and third, we cannot
> > > > +know if the extra options passed by the user via the `kvm_extra` hvparam
> > > > +modify the PCI bus.
> > > > +
> > > > +All the above makes the current implementation error-prone: an instance
> > > > +might not be able to boot if we explicitly add a NIC/Disk on a specific
> > > > +PCI slot that QEMU has already used for another device while parsing
> > > > +its command line options. Besides that, now, we want to use the SCSI bus
> > > > +as well so the above mechanism is insufficient. Here, we decide to put
> > > > +only disks and NICs on specific slots on the corresponding bus, and let
> > > > +QEMU put everything else automatically. To this end, we decide to let
> > > > +the first 8 PCI slots be managed by QEMU, and we start adding PCI
> > > > +devices (VirtIO block and network devices) from the 9th slot and
> > > > +forward. As far as the SCSI bus is concerned, we decide to put each SCSI
> > > > +disk on a different scsi-id (which corresponds to a different target
> > > > +number in SCSI terminology). The SCSI bus will not have any default
> > > > +reservations.
> > > > +
> > >
> > > In the very rare case that someone manages to saturate the first 8 slots
> > > through QEMU reservations and command switches, not managing the
> > > reservations stemming from various options puts a hard limit on what Ganeti
> > > can do. While it might not be a concern now, nor a reasonable concern, this
> > > could be mitigated by picking our reservations from the back side of the
> > > PCI bus, starting from 16 in our case.
> >
> > I thought that too but there is an ugly and probably unwanted situation:
> >
> > If we have 2 disks the first one (disk/0 for Ganeti) will appear inside
> > the VM as /dev/sdb and the second (disk/1) as /dev/sda. So I suggest to
> > keep it as is.
>
> Ack, I did not think of that, and now it seems like an obvious flaw. Thanks!
>
> > > Regarding the number 16: neither the design doc nor the patch series
> > > explain why the PCI bus was shrunk. Any details on that?
> >
> > Count me wrong on this. I wrongly assumed that the PCI bus has only 16
> > slots when I tried addr=0x20 and got 'Property 'virtio-net-pci.addr'
> > doesn't take value '0x20'' (of course value 0x1f is totally acceptable).
> > So I am going to change the design doc and the corresponding patches so
> > that Ganeti:
> >
> > - lets QEMU manage the first 16 slots (0x00 - 0x0f)
> > - places devices (Disks/NICs) from the 17th slot (0x10) onwards
>
> Great, I approve of that completely.
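For the record, the allocation logic would then look roughly like the
following Python sketch (names and constants are illustrative, not the
final code):

  # QEMU is free to use slots 0x00-0x0f for the machine type, the SCSI
  # controller, the balloon device, options from kvm_extra, etc.
  # Ganeti explicitly places Disks/NICs from slot 0x10 onwards.
  QEMU_MANAGED_PCI_SLOTS = 16   # 0x00 - 0x0f
  MAX_PCI_SLOTS = 32            # pci.0 offers slots 0x00 - 0x1f

  def get_free_pci_slot(occupied):
    """Return the first Ganeti-managed PCI slot not in 'occupied'."""
    for slot in range(QEMU_MANAGED_PCI_SLOTS, MAX_PCI_SLOTS):
      if slot not in occupied:
        return slot
    raise RuntimeError("no free PCI slot for a new Disk/NIC")

This way adding a device can only fail with a clean error once we have
placed 16 devices of our own, instead of colliding with whatever QEMU
has already put on the bus.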
> > > > +
> > > > +
> > > > +How to keep track of the bus state of a running instance?
> > > > +
> > > > +To be able to hotplug a device, we need to know which slot is
> > > > +available on the desired bus. Until now, we were using the ``query-pci``
> > > > +QMP command that returns the state of the PCI buses (i.e., which devices
> > > > +occupy which slots). Unfortunately, there is no equivalent for the SCSI
> > > > +buses. We could use the ``info qtree`` HMP command that practically
> > > > +dumps in plain text the whole device tree. This makes it really hard to
> > > > +parse. So we decide to generate the bus state of a running instance
> > > > +through our local runtime files.
> > > > +
> > > > +
> > > > +What info should be kept in runtime files?
> > > > +
> > > > +Runtime files are used for instance migration (to run a QEMU process on
> > > > +the target node with the same configuration) and for hotplug actions (to
> > > > +update the configuration of a running instance so that it can be
> > > > +migrated). Until now we were using devices only on the PCI bus, so only
> > > > +each device's PCI slot should be kept in the runtime file. This is
> > > > +obviously not enough. We decide to replace the `pci` slot of Disk and
> > > > +NIC configuration objects with an `hvinfo` dict. It will contain all
> > > > +necessary info for constructing the appropriate -device QEMU option.
> > > > +Specifically the `driver`, `id`, and `bus` parameters will be present
> > > > +for all kinds of devices. PCI devices will have the `addr` parameter,
> > > > +SCSI devices will have `channel`, `scsi-id`, and `lun`. NICs and Disks
> > > > +will have the extra `netdev` and `drive` parameters respectively.
> > > > +
> > >
> > > Removing the "pci" field does not allow us to perform downgrades. If a
> > > user upgrades the cluster, reboots the instance, and downgrades the
> > > cluster, the instance will become unmigratable, as will any instances
> > > created on the upgraded version. While this is not permanent damage, we
> > > like our users' uptime, and generating the "pci" field redundantly would
> > > mostly prevent this problem. It would be declared deprecated and removed at
> > > a late enough time when downgrades to a version without hvinfo would not be
> > > a problem.
> >
> > Well we can use _UpgradeSerializedRuntime() for the downgrade too. If
> > hvinfo is found and NIC/Disk config objects do not have such a slot then we
> > can easily translate it to a "pci" field. Note that an instance with the
> > new disk types cannot be managed by an old cluster. But I guess we
> > cannot support everything :)
>
> The biggest problem with this approach is that we have very little control
> over which older versions of Ganeti users are running, and it's there that
> we would have to modify _UpgradeSerializedRuntime to create the "pci" field.
> So doing this would necessitate that we push a lot of older versions of
> Ganeti ;)

I see your point. And this should not be an option.

Maybe it is a good time to add another entry inside the runtime file
denoting the Ganeti version that created it. But this can be done in a
separate patch set.
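Something along these lines (a sketch only; the entry name and the
exact place where we stamp it are open):

  import json

  from ganeti import constants   # provides RELEASE_VERSION

  def serialize_runtime(runtime_dict):
    # Record which Ganeti version wrote this runtime file, so that a
    # node running a different version can detect the mismatch up
    # front instead of crashing deep inside FromDict().
    runtime_dict["ganeti_version"] = constants.RELEASE_VERSION
    return json.dumps(runtime_dict)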
> While I agree maintaining the redundant field feels like stabbing oneself
> in the gut, we could do it just before the runtime serialization, which
> minimizes the spread of evil around the system. Unfortunately, the use case
> of "try a newer version, find it a bit too new, downgrade" is quite a
> common one and we should plan for it.
>
> As far as the new disk types go: indeed, there are some things we should
> not even try.

Well, if we leave the stale `pci` field in place along with `hvinfo` as
you suggest, then in case of a downgrade NIC/Disk.FromDict() is going
to give us:

  TypeError: Object NIC doesn't support the parameter 'hvinfo'

It seems that this corner case (upgrade-reboot-downgrade) cannot easily
be supported. If we had the version entry mentioned above we could have
a proper check and a proper error message, and we could fail gracefully
in such a case.
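With such a version entry in place, the loading side could look roughly
like this (hypothetical helper name; the hvinfo layout is the one
described in the design doc):

  def downgrade_device_dict(dev_dict):
    # Drop the new 'hvinfo' dict before FromDict() sees it, and
    # regenerate the legacy 'pci' field from hvinfo's 'addr' so that
    # old code keeps working. SCSI devices have no 'addr' and cannot
    # be expressed in the old format at all, so we fail with a proper
    # error message instead of a TypeError.
    hvinfo = dev_dict.pop("hvinfo", None)
    if hvinfo is None:
      return dev_dict
    if "addr" not in hvinfo:
      raise ValueError("device %s cannot be downgraded: not on the"
                       " PCI bus" % hvinfo.get("id"))
    dev_dict["pci"] = hvinfo["addr"]
    return dev_dict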
> > > If we did not keep this information contained in the runtime file but in
> > > the configuration instead, this would not be an issue as we could use the
> > > existing upgrade and downgrade infrastructure to fix things up. Any idea
> > > what motivated the original choice of keeping the parameters / cmdline
> > > separate from the config?
> >
> > Well I guess the main reason is that gnt-instance modify changes things
> > in config but not in runtime. When hotplug was introduced, it was the
> > only action that could change the running instance and therefore the
> > runtime file should change accordingly (for migrations). Second argument
> > is that this kind of info is too hypervisor-specific (that's why I chose
> > hvinfo) and should not contaminate config.data. So I suggest to keep
> > hiding this kind of info away from confd and let noded handle it. I
> > think the solution with _UpgradeSerializedRuntime() is straightforward
> > and pretty easy to implement.
>
> Thanks! I appreciate the configuration not being a mess, but this solution
> still gives us downgrade issues, and still leads to issues like:
> https://code.google.com/p/ganeti/issues/detail?id=901

Regarding this issue I think we should just use the hvparams stored in
the runtime file rather than the ones that come with the instance
(up_hvp instead of conf_hvp).
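In code that boils down to something like (sketch):

  def pick_hvparams(conf_hvp, up_hvp):
    # up_hvp is the hvparams snapshot stored in the runtime file when
    # the instance was started and matches the live QEMU process;
    # conf_hvp may already contain gnt-instance modify changes that
    # the running VM has never seen.
    if up_hvp is not None:
      return up_hvp
    return conf_hvp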
> > > > +
> > > > +
> > > > +How to deal with existing instances?
> > > > +
> > > > +Only existing instances with paravirtual devices (configured via the
> > > > +disk_type and nic_type hvparam) use the latest QEMU device model. Only
> > > > +these have the `pci` slot filled. We will use the existing
> > > > +_UpgradeSerializedRuntime() method to migrate the old runtime format
> > > > +with `pci` slot in Disk and NIC configuration objects to the new one
> > > > +with `hvinfo` instead. The new hvinfo will contain the old driver
> > > > +(either virtio-blk-pci or virtio-net-pci), the old id
> > > > +(hotdisk-123456-pci-4), the default PCI bus (pci.0), and the old PCI
> > > > +slot (addr=4). This way those devices will still be hotplug-able, and
> > > > +the instance will still be migrate-able. When those instances are
> > > > +rebooted, the hvinfo will be re-generated.
> > > > +
> > > > +
> > > > +Configuration changes
> > > > +---------------------
> > > > +
> > > > +The ``NIC`` and ``Disk`` objects get one extra slot: ``hvinfo``. It is
> > > > +hypervisor-specific and will never reach config.data. In case of the
> > > > +KVM Hypervisor it will contain all necessary info for constructing the
> > > > +-device QEMU option. Existing entries in runtime files that had a `pci`
> > > > +slot will be upgraded to have the corresponding `hvinfo` (see above).
> > > > +
> > > > +The new `scsi_controller_type` hvparam is added to denote what type of
> > > > +SCSI controller should be added to the PCI bus if we have a SCSI disk.
> > > > +Allowed values will be `lsi`, `virtio-scsi-pci`, and `megasas`.
> > > > +We decide to use `lsi` by default since this is the one that QEMU
> > > > +adds automatically if not specified explicitly by an option.
> > > > +
> > > > +
> > > > +Hypervisor changes
> > > > +------------------
> > > > +
> > > > +The current implementation verifies if a hotplug action has succeeded
> > > > +by scanning the PCI bus and searching for a specific device ID. This
> > > > +will change, and we will use the ``query-block`` along with the
> > > > +``query-pci`` QMP command to find block devices that are attached to the
> > > > +SCSI bus as well.
> > > > +
> > > > +Up until now, if the `disk_type` hvparam was set to `scsi`, QEMU would
> > > > +use the deprecated device model and end up using SCSI emulation, e.g.:
> > > > +
> > > > +  ::
> > > > +
> > > > +    -drive file=/var/run/ganeti/instance-disks/test:0,if=scsi,format=raw
> > > > +
> > > > +Now the equivalent, which will also enable hotplugging, will be to set
> > > > +disk_type to `scsi-hd`. The QEMU command line will include:
> > > > +
> > > > +  ::
> > > > +
> > > > +    -drive file=/var/run/ganeti/instance-disks/test:0,if=none,format=raw,id=hotdisk-12345
> > > > +    -device scsi-hd,id=hotdisk-12345,drive=hotdisk-12345,bus=scsi.0,channel=0,scsi-id=0,lun=0
> > > > +
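For reference, in runtime-file terms the disk above would carry an
hvinfo dict that maps one-to-one onto the -device option. The key set
follows the "What info should be kept in runtime files?" section; the
rendering helper below is just a sketch:

  hvinfo = {
    "driver": "scsi-hd",
    "id": "hotdisk-12345",
    "drive": "hotdisk-12345",
    "bus": "scsi.0",
    "channel": 0,
    "scsi-id": 0,
    "lun": 0,
  }

  def to_device_option(hvinfo):
    keys = ["id", "drive", "bus", "channel", "scsi-id", "lun"]
    return "-device " + ",".join(
        [hvinfo["driver"]] + ["%s=%s" % (k, hvinfo[k]) for k in keys])

  # to_device_option(hvinfo) ==
  #   "-device scsi-hd,id=hotdisk-12345,drive=hotdisk-12345,"
  #   "bus=scsi.0,channel=0,scsi-id=0,lun=0"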
> > > > +
> > > > +User interface
> > > > +--------------
> > > > +
> > > > +The `disk_type` hvparam will additionally support the `scsi-hd`,
> > > > +`scsi-block`, and `scsi-generic` values. The first one is equivalent to
> > > > +the existing `scsi` value and will make QEMU emulate a SCSI device,
> > > > +while the last two will add support for SCSI pass-through and will
> > > > +require a real SCSI device on the host.
> > > > +
> > > > +.. vim: set textwidth=72 :
> > > > +.. Local Variables:
> > > > +.. mode: rst
> > > > +.. fill-column: 72
> > > > +.. End:
> > > > --
> > > > 1.7.10.4
> > >
> > > Anyway, a very positive change, and I look forward to it making it in soon!
> >
> > Thanks a lot! I have your points noted and if you agree with my
> > suggestions I'll include them in a v2 patch-set after a first
> > round of review.
>
> Great, looking forward to it!
>
> > Cheers,
> > dimara
>
> Cheers,
> Riba
>
> Hrvoje Ribicic
> Ganeti Engineering
> Google Germany GmbH
> Dienerstr. 12, 80331, München
>
> Geschäftsführer: Graham Law, Christine Elizabeth Flores
> Registergericht und -nummer: Hamburg, HRB 86891
> Sitz der Gesellschaft: Hamburg