* 'Hrvoje Ribicic' via ganeti-devel <[email protected]> [2015-09-01 20:37:49 +0200]:

> On Tue, Sep 1, 2015 at 6:44 PM, Dimitris Aragiorgis <
> [email protected]> wrote:
> 
> > * Hrvoje Ribicic <[email protected]> [2015-08-31 17:48:02 +0200]:
> >
> > > On Tue, Aug 18, 2015 at 12:06 PM, Dimitris Aragiorgis <
> > > [email protected]> wrote:
> > >
> > > > This is a design document detailing the refactoring of device
> > > > handling in the KVM Hypervisor. More specifically, it will use the
> > > > latest QEMU device model and modify the current hotplug implementation
> > > > so that both PCI and SCSI devices can be managed.
> > > >
> > > > Signed-off-by: Dimitris Aragiorgis <[email protected]>
> > > > ---
> > > >  Makefile.am             |    1 +
> > > >  doc/design-draft.rst    |    1 +
> > > >  doc/design-scsi-kvm.rst |  220
> > > > +++++++++++++++++++++++++++++++++++++++++++++++
> > > >  3 files changed, 222 insertions(+)
> > > >  create mode 100644 doc/design-scsi-kvm.rst
> > > >
> > > > diff --git a/Makefile.am b/Makefile.am
> > > > index 12cda9f..88690a3 100644
> > > > --- a/Makefile.am
> > > > +++ b/Makefile.am
> > > > @@ -717,6 +717,7 @@ docinput = \
> > > >         doc/design-reservations.rst \
> > > >         doc/design-resource-model.rst \
> > > >         doc/design-restricted-commands.rst \
> > > > +       doc/design-scsi-kvm.rst \
> > > >         doc/design-shared-storage.rst \
> > > >         doc/design-shared-storage-redundancy.rst \
> > > >         doc/design-ssh-ports.rst \
> > > > diff --git a/doc/design-draft.rst b/doc/design-draft.rst
> > > > index 3f2a1f3..83a392c 100644
> > > > --- a/doc/design-draft.rst
> > > > +++ b/doc/design-draft.rst
> > > > @@ -25,6 +25,7 @@ Design document drafts
> > > >     design-configlock.rst
> > > >     design-multi-storage-htools.rst
> > > >     design-repaird.rst
> > > > +   design-scsi-kvm.rst
> > > >
> > > >  .. vim: set textwidth=72 :
> > > >  .. Local Variables:
> > > > diff --git a/doc/design-scsi-kvm.rst b/doc/design-scsi-kvm.rst
> > > > new file mode 100644
> > > > index 0000000..af46177
> > > > --- /dev/null
> > > > +++ b/doc/design-scsi-kvm.rst
> > > > @@ -0,0 +1,220 @@
> > > > +==========
> > > > +KVM + SCSI
> > > > +==========
> > > > +
> > > > +.. contents:: :depth: 4
> > > > +
> > > > +This is a design document detailing the refactoring of device
> > > > +handling in the KVM Hypervisor. More specifically, it will use
> > > > +the latest QEMU device model and modify the hotplug implementation
> > > > +so that both PCI and SCSI devices can be managed.
> > > > +
> > > > +
> > > > +Current state and shortcomings
> > > > +==============================
> > > > +
> > > > +Ganeti currently supports SCSI virtual devices in the KVM hypervisor by
> > > > +setting the `disk_type` hvparam to `scsi`. Ganeti will eventually
> > > > +instruct QEMU to use the deprecated device model (i.e. -drive if=scsi),
> > > > +which will expose the backing store as an emulated SCSI device. This
> > > > +means that currently SCSI pass-through is not supported.
> > > > +
> > > > +On the other hand, the current hotplug implementation
> > > > +(:doc:`design-hotplug`) uses the latest QEMU
> > > > +device model (via the -device option) and is tailored to paravirtual
> > > > +devices, which leads to buggy behavior: if we hotplug a disk to an
> > > > +instance configured with the disk_type=scsi hvparam, the disk that
> > > > +eventually gets hot-plugged will be a VirtIO device
> > > > +(i.e., virtio-blk-pci) on the PCI bus.
> > > > +
> > > > +The current implementation of creating the QEMU command line is
> > > > +error-prone, since an instance might not be able to boot due to PCI slot
> > > > +congestion.
> > > > +
> > > > +
> > > > +Proposed changes
> > > > +================
> > > > +
> > > > +We change the way that the KVM hypervisor handles block devices by
> > > > +introducing the latest QEMU device model for SCSI devices as well, so
> > > > +that the scsi-cd, scsi-hd, scsi-block, and scsi-generic device drivers
> > > > +are supported too. Additionally, we refactor the hotplug implementation
> > > > +to also support hotplugging of SCSI devices. Finally, we change the
> > > > +way we keep track of device info inside runtime files, and the way we
> > > > +place each device upon instance startup.
> > > > +
> > > > +Design decisions
> > > > +================
> > > > +
> > > > +How to identify each device?
> > > > +
> > > > +Currently KVM does not support arbitrary IDs for devices; only names
> > > > +starting with a letter, at most 32 chars long, and containing only the
> > > > +'.', '_', '-' special chars are supported. Currently we generate an ID
> > > > +with the following format: <device type>-<part of uuid>-pci-<slot>.
> > > > +This assumes that the device will be plugged in a certain slot on the
> > > > +PCI bus. Since we want to support devices on a SCSI bus too and adding
> > > > +the PCI slot to the ID is redundant, we keep only the first two parts of
> > > > +the existing ID.
> > > > +
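(As a side note for reviewers: the proposed ID scheme could be sketched roughly as below. The helper name, its signature, and the choice of uuid prefix are illustrative only, not Ganeti's actual code.)

```python
# Rough sketch of the proposed device ID scheme; the helper and the
# "first uuid segment" choice are assumptions for this example.
def kvm_device_id(dev_type, uuid):
    """Build a QEMU device ID as <device type>-<part of uuid>.

    QEMU only accepts IDs that start with a letter, are at most
    32 chars long, and contain only letters, digits and the '.',
    '_', '-' special chars, so we keep a short uuid prefix and
    drop the old -pci-<slot> suffix.
    """
    dev_id = "%s-%s" % (dev_type, uuid.split("-")[0])
    assert len(dev_id) <= 32, "QEMU rejects IDs longer than 32 chars"
    return dev_id
```

E.g. a disk with uuid `12345678-...` would get the ID `hotdisk-12345678`, independent of which bus or slot it ends up on.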
> > > > +
> > > > +Which buses does the guest eventually see?
> > > > +
> > > > +By default QEMU starts with a single PCI bus named "pci.0". In case a
> > > > +SCSI controller is added on this bus, a SCSI bus is created with
> > > > +the corresponding name: "scsi.0". Any SCSI disks will be attached on
> > > > +this SCSI bus. Currently Ganeti does not explicitly use a SCSI
> > > > +controller via a command line option, but lets QEMU add one
> > > > +automatically if needed. Here, in case we have a SCSI disk, a SCSI
> > > > +controller is explicitly added via the -device option. For the SCSI
> > > > +controller, we do not specify the PCI slot to use, but let QEMU find
> > > > +the first available (see below).
> > > >
> > >
> > > As we are doing this explicitly rather than implicitly, does it change
> > > the order of PCI devices? In patch 2, the SCSI controller addition comes
> > > before the addition of the balloon device, and I think the patch series
> > > relies on the same order being created implicitly by QEMU? If so, this
> > > could prevent migration of an instance using both of these features.
> > >
> > >
> >
> > During migration we rebuild the QEMU command line on the target node based
> > on the already created runtime file. For old VMs the balloon option will
> > be present in the kvm_cmd entry of the runtime file while the controller
> > option will be missing. This will not change since runtime files are
> > created before starting the instance and kvm_cmd entry changes only
> > after reboot. So migration of old VMs will not be a problem.
> >
> 
> Excellent, thanks!
> 
> 
> >
> > > > +
> > > > +
> > > > +What type of SCSI controller to use?
> > > > +
> > > > +QEMU uses the `lsi` controller by default. To make this configurable we
> > > > +add a new hvparam, `scsi_controller_type`. The available types will be
> > > > +`lsi`, `megasas`, and `virtio-scsi-pci`.
> > > > +
> > > > +
> > > > +Where to place the devices upon instance startup?
> > > > +
> > > > +By default QEMU boots with the first three slots of the default PCI bus
> > > > +(pci.0) occupied: slot 0 for Host bridge, slot 1 for ISA bridge, and
> > > > +slot 2 for VGA controller. Thereafter, the slots depend on the QEMU
> > > > +options passed in the command line.
> > > > +
> > > > +The main reason that we want to be fully aware of the configuration of
> > > > +a running instance (machine type, PCI and SCSI bus state, devices,
> > > > +etc.) is that in case of migration a QEMU process with the exact same
> > > > +configuration should be created on the target node. The configuration
> > > > +is kept in the runtime file created just before starting the instance.
> > > > +Since hotplug has been introduced, the only thing that can change after
> > > > +starting an instance is the configuration related to NICs and Disks.
> > > > +
> > > > +Before implementing hotplug, Ganeti did not specify PCI slots
> > > > +explicitly, but let QEMU decide how to place the devices on the
> > > > +corresponding bus. This does not work if we want to have hotplug-able
> > > > +devices and migrate-able VMs. Currently, upon runtime file creation, we
> > > > +try to reserve PCI slots based on the hvparams, the disks, and the NICs
> > > > +of the instance. This has three major shortcomings: first, we have to
> > > > +be aware which options modify the PCI bus, which is practically
> > > > +impossible due to the huge amount of QEMU options; second, QEMU may
> > > > +change the default PCI configuration from version to version; and
> > > > +third, we cannot know if the extra options passed by the user via the
> > > > +`kvm_extra` hvparam modify the PCI bus.
> > > > +
> > > > +All the above makes the current implementation error prone: an
> > > > +instance might not be able to boot if we explicitly add a NIC/Disk on
> > > > +a specific PCI slot that QEMU has already used for another device
> > > > +while parsing its command line options. Besides that, we now want to
> > > > +use the SCSI bus as well, so the above mechanism is insufficient.
> > > > +Here, we decide to put only disks and NICs on specific slots on the
> > > > +corresponding bus, and let QEMU place everything else automatically.
> > > > +To this end, we let the first 8 PCI slots be managed by QEMU, and
> > > > +start adding PCI devices (VirtIO block and network devices) from the
> > > > +9th slot onward. As far as the SCSI bus is concerned, we decide to
> > > > +put each SCSI disk on a different scsi-id (which corresponds to a
> > > > +different target number in SCSI terminology). The SCSI bus will not
> > > > +have any default reservations.
> > > > +
> > > >
> > >
> > > In the very rare case that someone manages to saturate the first 8 slots
> > > through QEMU reservations and command switches, not managing the
> > > reservations stemming from various options puts a hard limit on what
> > > Ganeti can do. While it might not be a concern now, nor a reasonable
> > > concern, this could be mitigated by picking our reservations from the
> > > back side of the PCI bus, starting from 16 in our case.
> > >
> >
> > I thought that too but there is an ugly and probably unwanted situation:
> >
> > If we have 2 disks the first one (disk/0 for Ganeti) will appear inside
> > the VM as /dev/sdb and the second (disk/1) as /dev/sda. So I suggest to
> > keep it as is.
> >
> 
> Ack, I did not think of that, and now it seems like an obvious flaw. Thanks!
> 
> 
> > > Regarding the number 16: neither the design doc nor the patch series
> > > explain why the PCI bus was shrunk. Any details on that?
> > >
> >
> > Count me wrong on this. I wrongly assumed that the PCI bus has only 16
> > slots when I tried addr=0x20 and got 'Property 'virtio-net-pci.addr'
> > doesn't take value '0x20'' (of course value 0x1f is totally acceptable).
> > So I am going to change the design doc and the corresponding patches so
> > that Ganeti:
> >
> >  - lets QEMU manage the first 16 slots (0x00 - 0x0f)
> >  - places devices (Disks/NICs) from the 17th slot (0x10) onwards
> >
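For clarity, that amended placement policy (QEMU manages slots 0x00-0x0f, Ganeti fills from 0x10 up) could be sketched like this; the constant and helper names are invented for illustration and do not appear in the actual patches:

```python
# Illustrative sketch of the amended slot policy; names are made up
# for this example only.
QEMU_RESERVED_SLOTS = 16  # 0x00-0x0f are left for QEMU to manage
PCI_BUS_SLOTS = 32        # PCI device numbers are 5 bits: 0x00-0x1f

def next_free_slot(occupied):
    """Return the lowest Ganeti-managed PCI slot not yet occupied."""
    for slot in range(QEMU_RESERVED_SLOTS, PCI_BUS_SLOTS):
        if slot not in occupied:
            return slot
    raise ValueError("no free slot on pci.0")
```

Allocating upward from 0x10, rather than downward from 0x1f, also keeps disks appearing in the guest in instance order, which matters for the /dev/sda vs /dev/sdb point above.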
> >
> Great, I approve of that completely.
> 
> 
> > >
> > > > +
> > > > +How to keep track of the bus state of a running instance?
> > > > +
> > > > +To be able to hotplug a device, we need to know which slot is
> > > > +available on the desired bus. Until now, we were using the
> > > > +``query-pci`` QMP command that returns the state of the PCI buses
> > > > +(i.e., which devices occupy which slots). Unfortunately, there is no
> > > > +equivalent for the SCSI buses. We could use the ``info qtree`` HMP
> > > > +command that practically dumps the whole device tree in plain text,
> > > > +which makes it really hard to parse. So we decide to generate the bus
> > > > +state of a running instance from our local runtime files.
> > > > +
> > > > +
> > > > +What info should be kept in runtime files?
> > > > +
> > > > +Runtime files are used for instance migration (to run a QEMU process
> > > > +on the target node with the same configuration) and for hotplug
> > > > +actions (to update the configuration of a running instance so that it
> > > > +can be migrated). Until now we were using devices only on the PCI
> > > > +bus, so only each device's PCI slot had to be kept in the runtime
> > > > +file. This is obviously not enough anymore. We decide to replace the
> > > > +`pci` slot of Disk and NIC configuration objects with an `hvinfo`
> > > > +dict. It will contain all necessary info for constructing the
> > > > +appropriate -device QEMU option. Specifically, the `driver`, `id`,
> > > > +and `bus` parameters will be present for all kinds of devices. PCI
> > > > +devices will have the `addr` parameter; SCSI devices will have
> > > > +`channel`, `scsi-id`, and `lun`. NICs and Disks will have the extra
> > > > +`netdev` and `drive` parameters respectively.
> > > > +
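To make the shape of `hvinfo` concrete: a hedged sketch of the entries for a SCSI disk and how they could map to a -device option. The field names follow the design text; the helper functions themselves are invented for illustration.

```python
# Hedged sketch: field names follow the design text, the helpers are
# hypothetical and not part of the patch series.
def make_hvinfo_scsi(driver, dev_id, scsi_id):
    """hvinfo for a disk on the SCSI bus (one scsi-id per disk)."""
    return {"driver": driver, "id": dev_id, "bus": "scsi.0",
            "channel": 0, "scsi-id": scsi_id, "lun": 0}

def hvinfo_to_device_option(hvinfo):
    """Render the -device option value from an hvinfo dict."""
    params = ",".join("%s=%s" % (k, v)
                      for k, v in sorted(hvinfo.items())
                      if k != "driver")
    return "%s,%s" % (hvinfo["driver"], params)
```

E.g. `hvinfo_to_device_option(make_hvinfo_scsi("scsi-hd", "hotdisk-12345", 0))` yields a string equivalent to the scsi-hd -device example quoted further down in the design doc.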
> > > >
> > >
> > > Removing the "pci" field does not allow us to perform downgrades. If a
> > > user upgrades the cluster, reboots the instance, and downgrades the
> > > cluster, the instance will become unmigratable, as will any instances
> > > created on the upgraded version. While this is not permanent damage, we
> > > like our users' uptime, and generating the "pci" field redundantly would
> > > mostly prevent this problem. It would be declared deprecated and removed
> > > at a late enough time when downgrades to a version without hvinfo would
> > > not be a problem.
> > >
> >
> > Well we can use _UpgradeSerializedRuntime() for the downgrade too. If
> > hvinfo is found and NIC/Disk config objects do not have such a slot then we
> > can easily translate it to a "pci" field. Note that an instance with the
> > new disk types can not be managed by an old cluster. But I guess we
> > cannot support everything :)
> >
> 
> The biggest problem with this approach is that we have very little control
> over which older versions of Ganeti users are running, and it's there that
> we would have to modify _UpgradeSerializedRuntime to create the "pci" field.
> So doing this would necessitate that we push a lot of older versions of
> Ganeti ;)
> 

I see your point, so this should not be an option. Maybe it is a good
time to add another entry inside the runtime file denoting the Ganeti
version that created it. But this can be done in a separate patch
set.

> While I agree maintaining the redundant field feels like stabbing oneself
> in the gut, we could do it just before the runtime serialization, which
> minimizes the spread of evil around the system. Unfortunately, the use case
> of "try a newer version, find it a bit too new, downgrade" is quite a
> common one and we should plan for it.
> 
> As far as the new disk types go: indeed, there are some things we should
> not even try.
> 

Well if we leave the stale pci there along with hvinfo as you say, in
case of a downgrade, while using NIC/Disk.FromDict(), we are going to
get:

  TypeError: Object NIC doesn't support the parameter 'hvinfo'

It seems that this corner case (upgrade-reboot-downgrade) cannot be
easily supported. If we had the version mentioned above we could have a
proper check, a proper error message, and we could fail gracefully in
such a case.

> 
> > > If we did not keep this information contained in the runtime file but in
> > > the configuration instead, this would not be an issue as we could use the
> > > existing upgrade and downgrade infrastructure to fix things up. Any idea
> > > what motivated the original choice of keeping the parameters / cmdline
> > > separate from the config?
> > >
> >
> > Well I guess the main reason is that gnt-instance modify changes things
> > in config but not in runtime. When hotplug was introduced, it was the
> > only action that could change the running instance and therefore the
> > runtime file should change accordingly (for migrations). Second argument
> > is that this kind of info is too hypervisor-specific (that's why I chose
> > hvinfo) and should not contaminate config.data. So I suggest to keep
> > hiding this kind of info away from confd and let noded handle it. I
> > think the solution with _UpgradeSerializedRuntime() is straightforward
> > and pretty easy to implement.
> >
> 
> Thanks! I appreciate the configuration not being a mess, but this solution
> still gives us downgrade issues, and still leads to issues like:
> https://code.google.com/p/ganeti/issues/detail?id=901
> 

Regarding this issue I think we should just use the hvparams stored in
the runtime file rather than the ones that come with the instance
(up_hvp instead of conf_hvp).

> 
> >
> > >
> > > > +
> > > > +How to deal with existing instances?
> > > > +
> > > > +Only existing instances with paravirtual devices (configured via the
> > > > +disk_type and nic_type hvparam) use the latest QEMU device model. Only
> > > > +these have the `pci` slot filled. We will use the existing
> > > > +_UpgradeSerializedRuntime() method to migrate the old runtime format
> > > > +with `pci` slot in Disk and NIC configuration objects to the new one
> > > > +with `hvinfo` instead. The new hvinfo will contain the old driver
> > > > +(either virtio-blk-pci or virtio-net-pci), the old id
> > > > +(hotdisk-123456-pci-4), the default PCI bus (pci.0), and the old PCI
> > > > +slot (addr=4). This way those devices will still be hotplug-able, and
> > > > +the instance will still be migrate-able. When those instances are
> > > > +rebooted, the hvinfo will be re-generated.
> > > > +
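A minimal sketch of that pci-to-hvinfo translation: the real work happens inside _UpgradeSerializedRuntime() on Ganeti's actual config objects; the dict-based helper below is only illustrative.

```python
# Hypothetical sketch of the runtime upgrade; `dev` is a dict-like
# stand-in for a Disk/NIC runtime entry, not the real Ganeti object.
def upgrade_device_runtime(dev, default_driver):
    """Translate a legacy runtime entry with a `pci` slot to `hvinfo`."""
    if "pci" not in dev:
        return dev  # nothing to upgrade
    slot = dev.pop("pci")
    dev["hvinfo"] = {
        "driver": default_driver,  # e.g. virtio-blk-pci
        "id": dev["id"],           # keep the old ID as-is
        "bus": "pci.0",            # the default PCI bus
        "addr": slot,              # the old PCI slot
    }
    return dev
```

This keeps upgraded devices hotplug-able and the instance migrate-able until the next reboot regenerates hvinfo from scratch.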
> > > > +
> > > > +Configuration changes
> > > > +---------------------
> > > > +
> > > > +The ``NIC`` and ``Disk`` objects get one extra slot: ``hvinfo``. It is
> > > > +hypervisor-specific and will never reach config.data. In case of the
> > > > +KVM hypervisor it will contain all necessary info for constructing
> > > > +the -device QEMU option. Existing entries in runtime files that had a
> > > > +`pci` slot will be upgraded to have the corresponding `hvinfo` (see
> > > > +above).
> > > > +
> > > > +The new `scsi_controller_type` hvparam is added to denote what type of
> > > > +SCSI controller should be added to the PCI bus if we have a SCSI disk.
> > > > +Allowed values will be `lsi`, `virtio-scsi-pci`, and `megasas`.
> > > > +We decide to use `lsi` by default since this is the one that QEMU
> > > > +adds automatically if not specified explicitly by an option.
> > > > +
> > > > +
> > > > +Hypervisor changes
> > > > +------------------
> > > > +
> > > > +The current implementation verifies if a hotplug action has succeeded
> > > > +by scanning the PCI bus and searching for a specific device ID. This
> > > > +will change, and we will use ``query-block`` along with the
> > > > +``query-pci`` QMP command to find block devices that are attached to
> > > > +the SCSI bus as well.
> > > > +
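The new verification step could look roughly like this; the exact fields returned by ``query-block`` depend on the QEMU version, so the key names here are assumptions, not a definitive implementation:

```python
# Hypothetical sketch of hotplug verification via query-block; the
# "device"/"qdev" field names are assumptions about the QMP output.
def disk_hotplug_succeeded(dev_id, query_block_result):
    """Check whether a hot-plugged disk shows up in `query-block`.

    `query_block_result` is the list of dicts returned by the QMP
    query-block command; the device name may live in the "device"
    or "qdev" field depending on the QEMU version.
    """
    for entry in query_block_result:
        if dev_id in (entry.get("device"), entry.get("qdev")):
            return True
    return False
```

Unlike the query-pci scan, this covers disks on the SCSI bus as well as on pci.0.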
> > > > +Up until now, if the `disk_type` hvparam was set to `scsi`, QEMU
> > > > +would use the deprecated device model and end up using SCSI
> > > > +emulation, e.g.:
> > > > +
> > > > +  ::
> > > > +
> > > > +    -drive file=/var/run/ganeti/instance-disks/test:0,if=scsi,format=raw
> > > > +
> > > > +Now the equivalent, which will also enable hotplugging, will be to set
> > > > +disk_type to `scsi-hd`. The QEMU command line will include:
> > > > +
> > > > +  ::
> > > > +
> > > > +    -drive file=/var/run/ganeti/instance-disks/test:0,if=none,format=raw,id=hotdisk-12345
> > > > +    -device scsi-hd,id=hotdisk-12345,drive=hotdisk-12345,bus=scsi.0,channel=0,scsi-id=0,lun=0
> > > > +
> > > > +
> > > > +User interface
> > > > +--------------
> > > > +
> > > > +The `disk_type` hvparam will additionally support the `scsi-hd`,
> > > > +`scsi-block`, and `scsi-generic` values. The first one is equivalent
> > > > +to the existing `scsi` value and will make QEMU emulate a SCSI device,
> > > > +while the last two will add support for SCSI pass-through and will
> > > > +require a real SCSI device on the host.
> > > > +
> > > > +.. vim: set textwidth=72 :
> > > > +.. Local Variables:
> > > > +.. mode: rst
> > > > +.. fill-column: 72
> > > > +.. End:
> > > > --
> > > > 1.7.10.4
> > > >
> > > >
> > > Anyway, a very positive change, and I look forward to it making it in
> > > soon!
> > >
> >
> > Thanks a lot! I have your points noted and if you agree with my
> > suggestions I'll include them in a v2 patch-set after a first
> > round of review.
> >
> 
> Great, looking forward to it!
> 
> 
> >
> > Cheers,
> > dimara
> >
> >
> Cheers,
> Riba
> 
> 
> > > Hrvoje Ribicic
> > > Ganeti Engineering
> > > Google Germany GmbH
> > > Dienerstr. 12, 80331, München
> > >
> > > Geschäftsführer: Graham Law, Christine Elizabeth Flores
> > > Registergericht und -nummer: Hamburg, HRB 86891
> > > Sitz der Gesellschaft: Hamburg
> >
> 
