On Thu, Jul 11, 2013 at 02:39:24pm +0200, Guido Trotter wrote: > On Thu, Jul 11, 2013 at 2:31 PM, Vangelis Koukis <[email protected]> wrote: > > On Thu, Jul 11, 2013 at 12:27:00pm +0200, Guido Trotter wrote: > >> +list > >> > >> > >> ---------- Forwarded message ---------- > >> From: Guido Trotter <[email protected]> > >> Date: Thu, Jul 11, 2013 at 12:26 PM > >> Subject: Re: [PATCH master] Add hotplug design doc > >> To: Dimitris Aragiorgis <[email protected]> > >> > >> > >> On Fri, Jul 5, 2013 at 7:48 PM, Dimitris Aragiorgis <[email protected]> > >> wrote: > >> > + lists > >> > > >> > * Guido Trotter <[email protected]> [2013-07-05 12:57:33 +0200]: > >> > > >> >> On Fri, Jul 5, 2013 at 8:50 AM, Dimitris Aragiorgis <[email protected]> > >> >> wrote: > >> >> > This is the design behind the first hotplug implementation > >> >> > for the KVM hypervisor. > >> >> > > >> >> > Signed-off-by: Dimitris Aragiorgis <[email protected]> > >> >> > --- > >> >> > > >> >> > Hello team, > >> >> > > >> >> > This is the updated design doc for hotplug. It includes all > >> >> > modifications/ suggestions that have been discussed in the last > >> >> > thread. > >> >> > I will wait for your comments or eventually the final ACK so that I > >> >> > can > >> >> > proceed with implementation patches. > >> >> > > >> >> > Thanks a lot, > >> >> > dimara > >> >> > > >> >> > Makefile.am | 1 + > >> >> > doc/design-draft.rst | 1 + > >> >> > doc/design-hotplug.rst | 222 > >> >> > ++++++++++++++++++++++++++++++++++++++++++++++++ > >> >> > 3 files changed, 224 insertions(+) > >> >> > create mode 100644 doc/design-hotplug.rst > >> >> > > >> >> > diff --git a/Makefile.am b/Makefile.am > >> >> > index 91f3f37..fda6f58 100644 > >> >> > --- a/Makefile.am > >> >> > +++ b/Makefile.am > >> >> > @@ -435,6 +435,7 @@ docinput = \ > >> >> > doc/design-cpu-pinning.rst \ > >> >> > doc/design-device-uuid-name.rst \ > >> >> > doc/design-draft.rst \ > >> >> > + doc/design-hotplug.rst \ > >> >> > doc/design-htools-2.3.rst \ > >> >> > doc/design-http-server.rst \ > >> >> > doc/design-impexp2.rst \ > >> >> > diff --git a/doc/design-draft.rst b/doc/design-draft.rst > >> >> > index 0e454cd..4c1c692 100644 > >> >> > --- a/doc/design-draft.rst > >> >> > +++ b/doc/design-draft.rst > >> >> > @@ -20,6 +20,7 @@ Design document drafts > >> >> > design-internal-shutdown.rst > >> >> > design-glusterfs-ganeti-support.rst > >> >> > design-openvswitch.rst > >> >> > + design-hotplug.rst > >> >> > > >> >> > .. vim: set textwidth=72 : > >> >> > .. Local Variables: > >> >> > diff --git a/doc/design-hotplug.rst b/doc/design-hotplug.rst > >> >> > new file mode 100644 > >> >> > index 0000000..ff4ff95 > >> >> > --- /dev/null > >> >> > +++ b/doc/design-hotplug.rst > >> >> > @@ -0,0 +1,222 @@ > >> >> > +======= > >> >> > +Hotplug > >> >> > +======= > >> >> > + > >> >> > +.. contents:: :depth: 4 > >> >> > + > >> >> > +This is a design document detailing the implementation of device > >> >> > +hotplugging in Ganeti. The logic used is hypervisor agnostic but > >> >> > still > >> >> > +the initial implementation will target the KVM hypervisor. The > >> >> > +implementation adds ``python-fdsend`` as a new dependency. > >> >> > + > >> >> > >> >> Can you please specify, as we agreed, that python-fdsend is an > >> >> optional dependency, and if not present Ganeti will still work, but > >> >> hotplugging won't be possible? > >> >> > >> > > >> > Yes sure. Just like affinity module. BTW only NIC hotplug depends on > >> > fdsend > >> > so we could still support disk hotplug. > >> > > >> >> > + > >> >> > +Current state and shortcomings > >> >> > +============================== > >> >> > + > >> >> > +Currently, Ganeti supports addition/removal/modification of devices > >> >> > +(NICs, Disks) but the actual modification takes place only after > >> >> > +rebooting the instance. To this end an instance cannot change > >> >> > network, > >> >> > +get a new disk etc. without a hard reboot. > >> >> > + > >> >> > +Until now, in case of KVM hypervisor, code does not name devices nor > >> >> > +places them in specific PCI slots. Devices are appended in the KVM > >> >> > +command and Ganeti lets KVM decide where to place them. This means > >> >> > that > >> >> > +there is a possibility a device that resides in PCI slot 5, after a > >> >> > +reboot (due to another device removal) to be moved to another PCI > >> >> > slot > >> >> > +and probably get renamed too (due to udev rules, etc.). > >> >> > + > >> >> > +In order migration to succeed, the process on the target node should > >> >> > be > >> >> > +started with exactly the same machine version, CPU architecture and > >> >> > PCI > >> >> > +configuration with the running process. During instance > >> >> > creation/startup > >> >> > +ganeti creates a KVM runtime file with all the necessary information > >> >> > to > >> >> > +generate the KVM command. This runtime file is used during instance > >> >> > +migration to start a new identical KVM process. The current format > >> >> > +includes the fixed part of the final KVM command, a list of NICs', > >> >> > +and hvparams dict. It does not favor easy manipulations concerning > >> >> > +disks, because they are encapsulated in the fixed KVM command. > >> >> > + > >> >> > +Proposed changes > >> >> > +================ > >> >> > + > >> >> > +For the case of the KVM hypervisor, QEMU exposes 32 PCI slots to the > >> >> > +instance. Disks and NICs occupy some of these slots. Recent versions > >> >> > of > >> >> > +QEMU have introduced monitor commands that allow addition/removal of > >> >> > PCI > >> >> > +devices. Devices are referenced based on their name or position on > >> >> > the > >> >> > +virtual PCI bus. To be able to use these commands, we need to be > >> >> > able to > >> >> > +assign each device a unique name. > >> >> > + > >> >> > +To keep track where each device is plugged into, we add the > >> >> > +``pci`` slot to Disk and NIC objects, but we save it only in runtime > >> >> > +files, since it is hypervisor specific info. This is added for easy > >> >> > +object manipulation and is ensured not to be written back to the > >> >> > config. > >> >> > + > >> >> > +We propose to make use of QEMU 1.0 monitor commands so that > >> >> > +modifications to devices take effect instantly without the need for > >> >> > hard > >> >> > +reboot. The only change exposed to the end-user will be the addition > >> >> > of > >> >> > +a ``--hotplug`` option to the ``gnt-instance modify`` command. > >> >> > + > >> >> > +Upon hotplugging the PCI configuration of an instance is changed. > >> >> > +Runtime files should be updated correspondingly. Currently this is > >> >> > +impossible in case of disk hotplug because disks are included in > >> >> > command > >> >> > +line entry of the runtime file, contrary to NICs that are correctly > >> >> > +treated separately. We change the format of runtime files, we remove > >> >> > +disks from the fixed KVM command and create new entry containing them > >> >> > +only. KVM options concerning disk are generated during > >> >> > +``_ExecuteKVMCommand()``, just like NICs. > >> >> > + > >> >> > +Design decisions > >> >> > +================ > >> >> > + > >> >> > +Which should be each device ID? Currently KVM does not support > >> >> > arbitrary > >> >> > +IDs for devices; supported are only names starting with a letter, > >> >> > max 32 > >> >> > +chars length, and only including '.' '_' '-' special chars. > >> >> > +We use the device pci slot and name it after <device type>-pci-<slot> > >> >> > +(for debugging purposes we could add a part of uuid as well). > >> >> > >> >> Didn't we decide for just <device-type>-<part-of-uuid>-<slot> ? > >> >> > >> > > >> > Well I did that in order kvm command to be readable and not full of > >> > random numbers. Adding the part of uuid is simple just another line of > >> > code and > >> > nothing more. OK. I 'll change it to <device-type>-<part-of-uuid>-<slot>. > >> > > >> >> > + > >> >> > +Who decides where to hotplug each device? As long as this is a > >> >> > +hypervisor specific matter, there is no point for the master node to > >> >> > +decide such a thing. Master node just has to request noded to > >> >> > hotplug a > >> >> > +device. To this end, hypervisor specific code should parse the > >> >> > current > >> >> > +PCI configuration (i.e. ``info pci`` QEMU monitor command), find the > >> >> > first > >> >> > +available slot and hotplug the device. Having noded to decide where > >> >> > to > >> >> > +hotplug a device we ensure that no error will occur due to duplicate > >> >> > +slot assignment (if masterd keeps track of PCI reservations and noded > >> >> > +fails to return the PCI slot that the device was plugged into then > >> >> > next > >> >> > +hotplug will fail). > >> >> > + > >> >> > +Where should we keep track of devices' PCI slots? As already > >> >> > mentioned, > >> >> > +we must keep track of devices PCI slots to successfully migrate > >> >> > +instances. First option is to save this info to config data, which > >> >> > would > >> >> > +allow us to place each device at the same PCI slot after reboot. This > >> >> > +would require to make the hypervisor return the PCI slot chosen for > >> >> > each > >> >> > +device, and storing this information to config data. Additionally the > >> >> > +whole instance configuration should be returned with PCI slots filled > >> >> > +after instance start and each instance should keep track of current > >> >> > PCI > >> >> > +reservations. We decide not to go towards this direction in order to > >> >> > +keep it simple and do not add hypervisor specific info to > >> >> > configuration > >> >> > +data (``pci_reservations`` at instance level and ``pci`` at device > >> >> > +level). For the aforementioned reason, we decide to store this info > >> >> > only > >> >> > +in KVM runtime files. > >> >> > + > >> >> > +Where to place the devices upon instance startup? QEMU has by > >> >> > default 4 > >> >> > +pre-occupied PCI slots. So, hypervisor can use the remaining ones for > >> >> > +disks and NICs. Currently, PCI configuration is not preserved after > >> >> > +reboot. Each time an instance starts, KVM assigns PCI slots to > >> >> > devices > >> >> > +based on their ordering in Ganeti configuration, i.e. the second disk > >> >> > +will be placed after the first, the third NIC after the second, etc. > >> >> > +Since we decided that there is no need to keep track of devices PCI > >> >> > +slots, there is no need to change current functionality. > >> >> > + > >> >> > +How to deal with existing instances? Hotplug depends on runtime file > >> >> > +manipulation. It stores there pci info and every device the kvm > >> >> > process is > >> >> > +currently using. Existing files have no pci info in devices and have > >> >> > block > >> >> > +devices encapsulated inside kvm_cmd entry. Thus hotplugging of > >> >> > existing devices > >> >> > +will not be possible. > >> >> > Still migration and hotplugging of new devices will > >> >> > +succeed. The workaround will happen upon loading kvm runtime: if we > >> >> > detect old > >> >> > +style format we will add an empty list for block devices and upon > >> >> > saving kvm > >> >> > +runtime we will include this empty list as well. Switching entirely > >> >> > to new > >> >> > +format will happen upon instance reboot. > >> >> > + > >> >> > + > >> >> > +Configuration changes > >> >> > +--------------------- > >> >> > + > >> >> > +The ``NIC`` and ``Disk`` objects get one extra slot: ``pci``. It > >> >> > refers to > >> >> > +PCI slot that the device gets plugged into. > >> >> > + > >> >> > +In order to be able to live migrate successfully, runtime files > >> >> > should > >> >> > +be updated every time a live modification (hotplug) takes place. To > >> >> > this > >> >> > +end we change the format of runtime files. The KVM options referring > >> >> > to > >> >> > +instance's disks are no longer recorded as part of the KVM command > >> >> > line. > >> >> > +Disks are treated separately, just as we treat NICs right now. We > >> >> > insert > >> >> > +and remove entries to reflect the current PCI configuration. > >> >> > + > >> >> > + > >> >> > +Backend changes > >> >> > +--------------- > >> >> > + > >> >> > +Introduce one new RPC call: > >> >> > + > >> >> > +- hotplug_device(DEVICE_TYPE, ACTION, device, ...) > >> >> > + > >> >> > +where DEVICE_TYPE can be either NIC or Disk, and ACTION either > >> >> > REMOVE or ADD. > >> >> > + > >> >> > +Hypervisor changes > >> >> > +------------------ > >> >> > + > >> >> > +We implement hotplug on top of the KVM hypervisor. We take advantage > >> >> > of > >> >> > +QEMU 1.0 monitor commands (``device_add``, ``device_del``, > >> >> > +``drive_add``, ``drive_del``, ``netdev_add``,`` netdev_del``). QEMU > >> >> > +refers to devices based on their id. We use ``uuid`` to name them > >> >> > +properly. If a device is about to be hotplugged we parse the output > >> >> > of > >> >> > +``info pci`` and find the occupied PCI slots. We choose the first > >> >> > +available and the whole device object is appended to the > >> >> > corresponding > >> >> > +entry in the runtime file. > >> >> > + > >> >> > +Concerning NIC handling, we build on the top of the existing logic > >> >> > +(first create a tap with _OpenTap() and then pass its file > >> >> > descriptor to > >> >> > +the KVM process). To this end we need to pass access rights to the > >> >> > +corresponding file descriptor over the monitor socket (UNIX domain > >> >> > +socket). The open file is passed as a socket-level control message > >> >> > +(SCM), using the ``fdsend`` python library. > >> >> > + > >> >> > + > >> >> > +User interface > >> >> > +-------------- > >> >> > + > >> >> > +The new ``--hotplug`` option to gnt-instance modify is introduced, > >> >> > which > >> >> > +forces live modifications. > >> >> > + > >> >> > + > >> >> > +Enabling hotplug > >> >> > +++++++++++++++++ > >> >> > + > >> >> > +Hotplug will be optional during gnt-instance modify. For existing > >> >> > +instance, after installing a version that supports hotplugging we > >> >> > +have the restriction that hotplug will not be supported for existing > >> >> > +devices. The reason is that old runtime files lack of: > >> >> > + > >> >> > +1. Device pci configuration info. > >> >> > + > >> >> > +2. Separate block device entry. > >> >> > + > >> >> > +Hotplug will be supported only for KVM in the first implementation. > >> >> > For > >> >> > +all other hypervisors, backend will raise an Exception case hotplug > >> >> > is > >> >> > +requested. > >> >> > + > >> >> > + > >> >> > +NIC hotplug > >> >> > ++++++++++++ > >> >> > + > >> >> > +The user can add/modify/remove NICs either with hotplugging or not. > >> >> > If a > >> >> > +NIC is to be added a tap is created first and configured properly > >> >> > with > >> >> > +kvm-vif-bridge script. Then the instance gets a new network > >> >> > interface. > >> >> > +Since there is no QEMU monitor command to modify a NIC, we modify a > >> >> > NIC > >> >> > +by temporary removing the existing one and adding a new with the new > >> >> > +configuration. When removing a NIC the corresponding tap gets > >> >> > removed as > >> >> > +well. > >> >> > + > >> >> > >> >> Please specify that this (modify as add/remove) is a potentially > >> >> dangerous operation and there will be warnings. > >> >> > >> > > >> > I will handle it just like we handle migrations. On the client side add > >> > a "BIG WARNING. Continue? [y/N]" > >> > > >> >> > +:: > >> >> > + > >> >> > + gnt-instance modify --net add --hotplug test > >> >> > + gnt-instance modify --net 1:mac=aa:00:00:55:44:33 --hotplug test > >> >> > + gnt-instance modify --net 1:remove --hotplug test > >> >> > + > >> >> > + > >> >> > +Disk hotplug > >> >> > +++++++++++++ > >> >> > + > >> >> > +The user can add and remove disks with hotplugging or not. QEMU > >> >> > monitor > >> >> > +supports resizing of disks, however the initial implementation will > >> >> > +support only disk addition/deletion. > >> >> > + > >> >> > +:: > >> >> > + > >> >> > + gnt-instance modify --disk add:size=1G --hotplug test > >> >> > + gnt-instance modify --net 1:remove --hotplug test > >> >> > + > >> >> > +.. vim: set textwidth=72 : > >> >> > +.. Local Variables: > >> >> > +.. mode: rst > >> >> > +.. fill-column: 72 > >> >> > +.. End: > >> >> > >> >> Please finally specify the status about supporting non-root and chroot > >> >> with hotplug. > >> >> Will this work from the first version? > >> >> > >> > > >> > Well after testing and digging (most of it done by psomas [cc]) we > >> > mention the > >> > following: > >> > > >> > - nic hotplugging succeeds both with uid pools and chroot. > >> > - disk hotplugging will fail. I propose for those cases, check hvparams > >> > on hypervisor level and if security_model is other than SM_NONE or if > >> > use_chroot is True just report a warning and continue. The device will > >> > be available after reboot. > >> > > >> > KVM 1.2 or 1.3 has introduced add-fd command which may solve the problem > >> > but > >> > I haven't tested it at all. Debian jessie has still 1.1.2 so there is > >> > no reason to hurry. > >> > > >> > Are you OK with the above? If yes can I could send you a design doc > >> > interdiff and then the rest of the patches. > >> > > >> > >> All sounds good, with one note: there were reports that it was > >> possible to do hotplug of disks inside a chroot by linking or creating > >> the device there, hotplugging it, and then removing the device. Would > >> it make sense to do this? (or do say that it's going to be done in the > >> next version). > >> > >> Thanks, > >> > >> Guido > >> > > > > Hello Guido, > > > > we've actually had an entensive discussion internally about the problems > > with device hotplugging when uid pooling, chroot or both are enabled. > > > > Thanks for the extra information. > > > These problems do not have anything to do with hotplugging per se, > > they're a direct result of the way uid pooling and chroot work. > > > > I'll try to summarize the problems here and provide a few approaches we > > want to explore. The summary: We will extend the design doc so that it > > mentions that hotplugging is not supported when KVM runs in a chroot, or > > when the security model is not None. We will also incorporate a > > discussion of possible solutions, as outlined below, but I think it'd be > > best not to commit to a specific solution yet ("it's going to be done > > <this way> in the next version"). > > > > Yes, please add the content below to the design doc, as it's going to > be useful for the future. > > > The problems: > > * uid pool: KVM starts as root, opens e.g., > > /var/run/ganeti/instance-disks/snf-40416:0 which points to /dev/drbd3, > > which is owned by root:disk, then it setuid()s itself to user 'pool152'. > > After a while, we ask the KVM process to hotplug a new disk, e.g., > > /var/run/ganeti/instance-disks/snf-40416:1 -> /dev/drbd4, > > which is owned by root:disk. KVM cannot open the block device, because > > it no longer runs as root. > > > > * chroot: same as above, this time KVM chroot's itself to the empty > > directory /var/run/chroot-hypervisor after initialization. We ask it to > > hotplug > > /var/run/ganeti/instance-disks/snf-40416:1 -> /dev/drbd4 > > and it fails, because there is no such device node in the chroot > > directory. > > > > Possible solutions: > > * Latest KVM supports passing file descriptors through the monitor UNIX > > domain socket. This way, the priviledged Ganeti noded can open the > > block device itself, then pass the open file descriptor through the > > monitor socket to the chrooted KVM process which runs under a pooled > > uid. This will not work with earlier KVM versions, e.g., the one in > > Debian Jessie. > > > > No problem at all, we should still start supporting it as we can, as > there will definitely be backports for wheezy, or people using it on > other distros. Plus with our "use feature when detected" method it > should be easy to use it "when it's there" > > > * For uid pool, we can have noded chown() the block device to the > > corresponding uid, so the KVM process can open() it successfully. > > Once KVM has opened the device, we can chown() the device node back to > > its previous owner, presumably root. This has the disadvantage that > > should the operation be aborted (e.g., noded dies), the device node > > will be left with a non-root owner, which could be a security risk. > > > > Ack, indeed. Perhaps this behavior could be used as a fallback *if* > the above doesn't work. Or even not supported at all, and we could say > "hotplug+uid pool only works with the right version of qemu" > > > Another workaround could be chown()ing the device node in a pre hook, > > but this is a very ugly hack, that has all of the disadvantages of the > > above solution, without any advantages. > > > > * For chroot, we have to create the device nodes inside the chroot > > itself. Once KVM has opened the device node, we can remove it, for > > security reasons. Linking the device node will not work, since name > > resolution for symlinks is also restricted by the chroot. Again, > > if the process is aborted half way, the device node remains. We could > > also do a bind mount of /dev inside the chroot, but this would defeat > > the whole purpose of running inside a chroot in the first place. > > Indeed, it should be just "create" and then "delete". Of course in > this case and the one above cluster verify should check whether there > are leftovers. > > > > > Finally, we can combine both approaches: If a KVM process runs as > > non-root inside a chroot, we have to mknod() and chown() a device node > > inside the chroot, and remove it immediately afterwards. > > > > We will extend the design doc to mention specifically that hotplugging > > is incompatible with uid pooling or chroot, due to the way they work, > > and also warn the user at runtime. We will also include a summary of the > > options above, which we will implement in a later version. > > > > How does all this sound? > > > > Sounds perfect. I'd be even ok with just the fd version being > implemented, saying that we depend on a newer version of qemu: nobody > says we have to support all possible features on plain wheezy. > > Thanks, > > Guido
[+list] Thanks Guido, We will send a final update to the design doc right away, then patches for hotplug support. We'll continue with implementing the fd version, and the mknod()/chown() version as a fallback in case of older qemu. Vangelis.
signature.asc
Description: Digital signature
