On Sun, Dec 13, 2015 at 7:47 AM, Lan, Tianyu <tianyu....@intel.com> wrote:
>
> On 12/11/2015 1:16 AM, Alexander Duyck wrote:
>>
>> On Thu, Dec 10, 2015 at 6:38 AM, Lan, Tianyu <tianyu....@intel.com> wrote:
>>>
>>> On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
>>>>>
>>>>> Ideally, it would be possible to leave the guest driver unmodified,
>>>>> but that requires the hypervisor or qemu to be aware of the device,
>>>>> which means we may need a driver in the hypervisor or qemu to
>>>>> handle the device on behalf of the guest driver.
>>>>
>>>> Can you answer the question of when you use your code -
>>>> at the start of migration or just before the end?
>>>
>>> Just before stopping the VCPU in this version; we inject the VF
>>> mailbox irq to notify the driver if the irq handler is installed.
>>> The Qemu side will also check this via the faked PCI migration
>>> capability, and the driver will set the status during its device
>>> open() or resume() callback.
>>
>> The VF mailbox interrupt is a very bad idea.  Really the device should
>> be in a reset state on the other side of a migration.  It doesn't make
>> sense to have the interrupt firing if the device is not configured.
>> This is one of the things that is preventing you from being able to
>> migrate the device while the interface is administratively down or the
>> VF driver is not loaded.
>
> In my opinion, if the VF driver is not loaded and the hardware hasn't
> started working, the device state doesn't need to be migrated.
>
> We could add a flag for the driver to check whether a migration
> happened while the device was down, reinitialize the hardware, and
> clear the flag when the system tries to bring it back up.
>
> We could also add a migration core to the Linux kernel and provide
> some helper functions to make it easier to add migration support to
> drivers.  The migration core would be in charge of syncing status
> with Qemu.
>
> Example:
> migration_register()
>     The driver provides
>     - Callbacks to be called before and after migration, or for the
>       bad path
>     - The irq it prefers to use to deal with migration events

You would be better off just using function pointers in the pci_driver
struct and letting the PCI driver registration take care of all that.
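As a rough illustration, the hooks could sit right next to the existing
power-management callbacks (none of these migrate_* fields exist in
struct pci_driver today; the names are made up):

/*
 * Illustrative fragment only: hypothetical migration hooks added to
 * struct pci_driver alongside callbacks that already exist.
 */
struct pci_driver {
	const char			*name;
	const struct pci_device_id	*id_table;
	int  (*probe)(struct pci_dev *dev, const struct pci_device_id *id);
	void (*remove)(struct pci_dev *dev);
	int  (*suspend)(struct pci_dev *dev, pm_message_t state);
	int  (*resume)(struct pci_dev *dev);
	/* New, optional migration hooks: */
	int  (*migrate_prepare)(struct pci_dev *dev);	/* before final stop */
	int  (*migrate_complete)(struct pci_dev *dev);	/* after restart */
	void (*migrate_abort)(struct pci_dev *dev);	/* "bad path" recovery */
};

A driver that doesn't fill these in simply doesn't get migration
notifications, which keeps the intrusion into existing drivers minimal.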
> migration_event_check()
>     The driver calls it in the irq handler.  Migration core code will
>     check the migration status and call the driver's callbacks when a
>     migration happens.

No, this is still a bad idea.  You haven't addressed what you do when
the device has had its interrupts disabled, such as when it is in the
down state.

This is the biggest issue I see with your whole patch set.  It requires
the driver to contain certain changes and to be in a certain state.
You cannot put those expectations on the guest.  You really need to try
to move as much of this out to existing functionality as possible.

>> My thought on all this is that it might make sense to move this
>> functionality into a PCI-to-PCI bridge device and make it a
>> requirement that all direct-assigned devices have to exist behind
>> that device in order to support migration.  That way you would be
>> working with a directly emulated device that would likely already
>> support hot-plug anyway.  Then it would just be a matter of coming up
>> with a few Qemu-specific extensions to add to the device itself.  The
>> same approach would likely be portable enough that you could achieve
>> it with PCIe as well, via the same configuration space being present
>> on the upstream side of a PCIe port or maybe a PCIe switch of some
>> sort.
>>
>> It would then be possible to signal, via your vendor-specific PCI
>> capability on that device, that all devices behind this bridge
>> require DMA page dirtying.  You could use that configuration space,
>> in addition to the interrupt already provided for hot-plug, to signal
>> things like when you are starting migration, and possibly even extend
>> the shpc functionality so that if this capability is present you have
>> the option to pause/resume instead of remove/probe the device on
>> certain hot-plug events.  The fact is there may be some use for a
>> pause/resume type approach for PCIe hot-plug in the near future
>> anyway.  From the sounds of it, Apple has required it for all
>> Thunderbolt device drivers so that they can halt the device in order
>> to shuffle resources around; perhaps we should look at something
>> similar for Linux.
>>
>> The other advantage of grouping functions on one bridge is things
>> like reset domains.  The PCI error handling logic will want to be
>> able to reset any devices that experienced an error in the event of
>> something such as a surprise removal.  By grouping all of the devices
>> you could disable/reset/enable them as one logical group in the event
>> of something such as the "bad path" approach Michael has mentioned.
>
> This sounds like we need to add a fake bridge for migration and a
> driver in the guest for it.  It also requires extending the PCI
> bus/hotplug driver to pause/resume other devices, right?
>
> My concern is still whether we can change the PCI bus/hotplug code
> like that without a spec change.
>
> IRQs should be general for any device, and we may extend them for
> migration.  The device driver can also decide whether or not to
> support migration.

The device should have no say in the matter.  Either we are going to
migrate or we will not.  This is why I have suggested my approach: it
allows for the least amount of driver intrusion while providing the
maximum number of ways to still perform migration even if the device
doesn't support it.

The solution I have proposed is simple:
1. Extend swiotlb to allow for a page dirtying functionality.

   This part is pretty straightforward.  I'll submit a few patches
   later today as an RFC that can provide the minimal functionality
   needed for this.  A rough sketch of the idea follows this list.

2. Provide a vendor-specific configuration space option on the QEMU
   implementation of a PCI bridge to act as a bridge between direct
   assigned devices and the host bridge.

   My thought was to add a vendor-specific block that includes
   capability, status, and control registers so you could go through
   and synchronize things like the DMA page dirtying feature.  The
   bridge itself could manage the migration-capable bit inside QEMU for
   all devices assigned to it.  So if you added a VF to the bridge, it
   would flag that you can support migration in QEMU, while the bridge
   would indicate you cannot until the DMA page dirtying control bit is
   set by the guest.  (See the register sketch after this list.)

   We could also go through and optimize the DMA page dirtying after
   this is added so that we can narrow down the scope of use and, as a
   result, improve the performance for other devices that don't need to
   support migration.

   It would then be a matter of adding an interrupt in the device to
   handle an event such as the DMA page dirtying status bit being set
   in the config space status register while the bit is not set in the
   control register.  If it doesn't get set, then we would have to
   evict the devices before the warm-up phase of the migration;
   otherwise we can defer it until the end of the warm-up phase.

3. Extend the existing shpc driver to support the optional "pause"
   functionality as called out in section 4.1.2 of the Revision 1.1
   PCI hot-plug specification.

   Note I say "extend" here instead of "add".  Basically, what we
   should do is provide a means of quiescing the device without
   unloading the driver.  This is called out as something the OS vendor
   can optionally implement in the PCI hot-plug specification.  On OSes
   that don't support this, it would just be treated as a standard
   hot-plug event.  We could add capability, status, and control bits
   in the vendor-specific configuration block for this as well: the
   status bit would indicate that the host wants to pause instead of
   remove, and the control bit would indicate that the guest supports
   "pause" in its OS.  We could then optionally disable guest migration
   while the VF is present and pause is not supported.  (A sketch of
   the pause path is at the end of this mail.)

   To support this we would need to add a timer, and if a new device is
   not inserted within some period of time (60 seconds, for example),
   or if a different device is inserted, we need to unload the original
   driver from the device.  In addition, we would need to verify
   whether drivers can call the remove function after having called
   suspend without resume.  If not, we could look at adding a recovery
   function to remove the driver from the device in the case of a
   suspend with either a failed resume or no resume call.  Once again,
   this would probably be useful for those cases where power management
   suspend/resume runs into an issue, like somebody causing a surprise
   removal while a device was suspended.
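Roughly what I have in mind for the swiotlb piece in item 1 (the
dirtying helper is made up; the actual RFC patches may well differ):

#include <linux/pfn.h>
#include <linux/types.h>

/* Hypothetical hook that feeds the hypervisor's dirty page log. */
void migration_mark_page_dirty(unsigned long pfn);

/*
 * Mark every guest page backing a bounce-buffer copy as dirty so the
 * warm-up phase of the migration picks the pages up.  This would be
 * called from the swiotlb copy path whenever data flows back toward
 * the original buffer (i.e. the DMA_FROM_DEVICE direction).
 */
static void swiotlb_dirty_range(phys_addr_t paddr, size_t size)
{
	unsigned long pfn = PFN_DOWN(paddr);
	unsigned long last = PFN_DOWN(paddr + size - 1);

	for (; pfn <= last; pfn++)
		migration_mark_page_dirty(pfn);
}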
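For the vendor-specific block in item 2, something like the following
layout is what I'm picturing (offsets and bit names are invented;
nothing here is defined by the PCI spec or by an existing QEMU device):

/* Registers following the standard vendor-specific capability header
 * (capability ID 0x09, next pointer, length). */
#define MIG_VNDR_CAP	0x04	/* RO: features the host offers */
#define MIG_VNDR_STATUS	0x08	/* RO: what the host is requesting */
#define MIG_VNDR_CTRL	0x0c	/* RW: what the guest acknowledges */

#define MIG_DIRTY_TRACK	(1 << 0) /* DMA page dirtying */
#define MIG_PAUSE	(1 << 1) /* "pause" hot-plug semantics */

/*
 * Handshake: QEMU sets a bit in STATUS and raises the bridge interrupt;
 * the guest sets the matching bit in CTRL once it is ready.  A STATUS
 * bit that stays set while the CTRL bit is clear is what would force
 * evicting the devices before the warm-up phase.
 */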
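And the "pause" path in item 3 might look something like this in the
hot-plug driver (the handler and the timer are invented; shpchp has no
such hook today, and this leans on the legacy pci_driver suspend
callback):

#include <linux/jiffies.h>
#include <linux/pci.h>
#include <linux/timer.h>

static struct timer_list pause_timer;	/* fires if no resume arrives */

static void handle_pause_event(struct pci_dev *pdev)
{
	pm_message_t msg = { .event = PM_EVENT_SUSPEND };

	/* Quiesce the device without unbinding the driver. */
	if (pdev->driver && pdev->driver->suspend)
		pdev->driver->suspend(pdev, msg);

	/*
	 * If no matching insertion shows up in time, or a different
	 * device is inserted, fall back to a normal removal and unload
	 * the original driver.
	 */
	mod_timer(&pause_timer, jiffies + 60 * HZ);
}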