Re: [Qemu-devel] live migration vs device assignment (motivation)

Alexander Duyck Fri, 25 Dec 2015 14:32:21 -0800

On Thu, Dec 24, 2015 at 11:03 PM, Lan Tianyu <[email protected]> wrote:
> Merry Christmas.
> Sorry for later response due to personal affair.
>
> On 2015年12月14日 03:30, Alexander Duyck wrote:
>>> > These sounds we need to add a faked bridge for migration and adding a
>>> > driver in the guest for it. It also needs to extend PCI bus/hotplug
>>> > driver to do pause/resume other devices, right?
>>> >
>>> > My concern is still that whether we can change PCI bus/hotplug like that
>>> > without spec change.
>>> >
>>> > IRQ should be general for any devices and we may extend it for
>>> > migration. Device driver also can make decision to support migration
>>> > or not.
>> The device should have no say in the matter.  Either we are going to
>> migrate or we will not.  This is why I have suggested my approach as
>> it allows for the least amount of driver intrusion while providing the
>> maximum number of ways to still perform migration even if the device
>> doesn't support it.
>
> Even if the device driver doesn't support migration, you still want to
> migrate VM? That maybe risk and we should add the "bad path" for the
> driver at least.


At a minimum we should have support for hot-plug if we are expecting
to support migration.  You would simply have to hot-plug the device
before you start migration and then return it after.  That is how the
current bonding approach for this works if I am not mistaken.

The advantage we are looking to gain is to avoid removing/disabling
the device for as long as possible.  Ideally we want to keep the
device active through the warm-up period, but if the guest doesn't do
that we should still be able to fall back on the older approaches if
needed.

>>
>> The solution I have proposed is simple:
>>
>> 1.  Extend swiotlb to allow for a page dirtying functionality.
>>
>>      This part is pretty straight forward.  I'll submit a few patches
>> later today as RFC that can provided the minimal functionality needed
>> for this.
>
> Very appreciate to do that.
>
>>
>> 2.  Provide a vendor specific configuration space option on the QEMU
>> implementation of a PCI bridge to act as a bridge between direct
>> assigned devices and the host bridge.
>>
>>      My thought was to add some vendor specific block that includes a
>> capabilities, status, and control register so you could go through and
>> synchronize things like the DMA page dirtying feature.  The bridge
>> itself could manage the migration capable bit inside QEMU for all
>> devices assigned to it.  So if you added a VF to the bridge it would
>> flag that you can support migration in QEMU, while the bridge would
>> indicate you cannot until the DMA page dirtying control bit is set by
>> the guest.
>>
>>      We could also go through and optimize the DMA page dirtying after
>> this is added so that we can narrow down the scope of use, and as a
>> result improve the performance for other devices that don't need to
>> support migration.  It would then be a matter of adding an interrupt
>> in the device to handle an event such as the DMA page dirtying status
>> bit being set in the config space status register, while the bit is
>> not set in the control register.  If it doesn't get set then we would
>> have to evict the devices before the warm-up phase of the migration,
>> otherwise we can defer it until the end of the warm-up phase.
>>
>> 3.  Extend existing shpc driver to support the optional "pause"
>> functionality as called out in section 4.1.2 of the Revision 1.1 PCI
>> hot-plug specification.
>
> Since your solution has added a faked PCI bridge. Why not notify the
> bridge directly during migration via irq and call device driver's
> callback in the new bridge driver?
>
> Otherwise, the new bridge driver also can check whether the device
> driver provides migration callback or not and call them to improve the
> passthough device's performance during migration.

This is basically what I had in mind.  Though I would take things one
step further.  You don't need to add any new call-backs if you make
use of the existing suspend/resume logic.  For a VF this does exactly
what you would need since the VFs don't support wake on LAN so it will
simply clear the bus master enable and put the netdev in a suspended
state until the resume can be called.

The PCI hot-plug specification calls out that the OS can optionally
implement a "pause" mechanism which is meant to be used for high
availability type environments.  What I am proposing is basically
extending the standard SHPC capable PCI bridge so that we can support
the DMA page dirtying for everything hosted on it, add a vendor
specific block to the config space so that the guest can notify the
host that it will do page dirtying, and add a mechanism to indicate
that all hot-plug events during the warm-up phase of the migration are
pause events instead of full removals.

I've been poking around in the kernel and QEMU code and the part I
have been trying to sort out is how to get QEMU based pci-bridge to
use the SHPC driver because from what I can tell the driver never
actually gets loaded on the device as it is left in the control of
ACPI hot-plug.

- Alex
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Qemu-devel] live migration vs device assignment (motivation)

Reply via email to