On 8/20/2020 6:33 AM, Jason Zeng wrote:
> On Wed, Aug 19, 2020 at 05:15:11PM -0400, Steven Sistare wrote:
>> On 8/9/2020 11:50 PM, Jason Zeng wrote:
>>> On Fri, Aug 07, 2020 at 04:38:12PM -0400, Steven Sistare wrote:
>>>> On 8/6/2020 6:22 AM, Jason Zeng wrote:
>>>>> Hi Steve,
>>>>>
>>>>> On Thu, Jul 30, 2020 at 08:14:34AM -0700, Steve Sistare wrote:
>>>>>> @@ -3182,6 +3207,51 @@ static Property vfio_pci_dev_properties[] = {
>>>>>>      DEFINE_PROP_END_OF_LIST(),
>>>>>>  };
>>>>>>
>>>>>> +static int vfio_pci_post_load(void *opaque, int version_id)
>>>>>> +{
>>>>>> +    int vector;
>>>>>> +    MSIMessage msg;
>>>>>> +    Error *err = 0;
>>>>>> +    VFIOPCIDevice *vdev = opaque;
>>>>>> +    PCIDevice *pdev = &vdev->pdev;
>>>>>> +
>>>>>> +    if (msix_enabled(pdev)) {
>>>>>> +        vfio_msix_enable(vdev);
>>>>>> +        pdev->msix_function_masked = false;
>>>>>> +
>>>>>> +        for (vector = 0; vector < vdev->pdev.msix_entries_nr; vector++) {
>>>>>> +            if (!msix_is_masked(pdev, vector)) {
>>>>>> +                msg = msix_get_message(pdev, vector);
>>>>>> +                vfio_msix_vector_use(pdev, vector, msg);
>>>>>> +            }
>>>>>> +        }
>>>>>
>>>>> It looks to me like the MSIX re-init here may lose device IRQs and
>>>>> impact device hardware state?
>>>>>
>>>>> The re-init will cause the kernel vfio driver to connect the device
>>>>> MSIX vectors to new eventfds and a new KVM instance. But before
>>>>> that, device IRQs will be routed to the previous eventfd. It looks
>>>>> like these IRQs will be lost.
>>>>
>>>> Thanks Jason, that sounds like a problem. I could try reading and
>>>> saving an event from the eventfd before shutdown, and injecting it
>>>> into the eventfd after restart, but that would be racy unless I
>>>> disable interrupts. Or, unconditionally inject a spurious interrupt
>>>> after restart to kick it, in case an interrupt was lost.
>>>>
>>>> Do you have any other ideas?
>>>
>>> Maybe we can consider also handing over the eventfd file descriptor,
>>
>> I believe preserving this descriptor in isolation is not sufficient.
>> We would also need to preserve the KVM instance it is linked to.
>>
>>> or even the KVM fds, to the new Qemu?
>>>
>>> If the KVM fds can be preserved, we will just need to restore the
>>> Qemu KVM-side state. But I am not sure how complicated the
>>> implementation would be.
>>
>> That should work, but I fear it would require many code changes in QEMU
>> to re-use descriptors at object creation time and suppress the initial
>> configuration ioctls, so it's not my first choice for a solution.
>>
>>> If we only preserve the eventfd fd, we can attach the old eventfd to
>>> vfio devices. But it may turn out that we always inject an interrupt
>>> unconditionally, because kernel KVM irqfd eventfd handling is a bit
>>> different from normal user-land eventfd read/write: it does not
>>> decrease the counter in the eventfd context. So if we read the
>>> eventfd from the new Qemu, it will always have a non-zero counter,
>>> which requires an interrupt injection.
>>
>> Good to know, thanks.
>>
>> I will try creating a new eventfd and injecting an interrupt
>> unconditionally. I need a test case that demonstrates losing an
>> interrupt and fixing it with injection. Any advice? My stress tests
>> with a virtual function nic and a directly assigned nvme block device
>> have never failed across live update.
>>
>
> I am not familiar with nvme devices. For a nic device, to my
> understanding, a stress test will not generate many IRQs, because the
> nic driver usually enables NAPI, which takes only the first interrupt,
> then disables interrupts and starts polling. It re-enables interrupts
> only after some packet quota is reached or the traffic quiesces for a
> while. But anyway, if the test runs long enough, the number of IRQs
> should still be large, so I am not sure why it doesn't trigger any
> issue. Maybe we can study the IRQ pattern of the test and see how we
> can design a test case, or see whether our assumption is wrong?
>
>
>>>>> And the re-init will make the device go through the procedure of
>>>>> disabling MSIX, enabling INTX, and re-enabling MSIX and vectors.
>>>>> So if the device is active, its hardware state will be impacted?
>>>>
>>>> Again thanks. vfio_msix_enable() does indeed call
>>>> vfio_disable_interrupts(). For a quick experiment, I deleted that
>>>> call in the post_load code path, and it seems to work fine, but I
>>>> need to study it more.
>>>
>>> vfio_msix_vector_use() will also trigger this procedure in the kernel.
>>
>> Because that code path calls VFIO_DEVICE_SET_IRQS? Or something else?
>> Can you point to what it triggers in the kernel?
>
> In vfio_msix_vector_use(), I see that vfio_disable_irqindex() will be
> invoked if "vdev->nr_vectors < nr + 1" is true. Since the 'vdev' is
> re-inited, this condition should be true, and vfio_disable_irqindex()
> will trigger VFIO_DEVICE_SET_IRQS with VFIO_IRQ_SET_DATA_NONE, which
> will cause the kernel to disable MSIX.
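To spell out the disable path Jason describes: vfio_disable_irqindex()
boils down to a single VFIO_DEVICE_SET_IRQS ioctl with no payload. A
minimal sketch against a raw vfio device fd (illustrative only, not the
actual QEMU helper):

    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Tear down MSI-X on a vfio device fd. DATA_NONE with count == 0
     * asks the kernel to disable all vectors on the index, which is
     * the request vfio_disable_irqindex() ends up making. */
    static int disable_msix(int device_fd)
    {
        struct vfio_irq_set irq_set = {
            .argsz = sizeof(irq_set),
            .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER,
            .index = VFIO_PCI_MSIX_IRQ_INDEX,
            .start = 0,
            .count = 0,
        };

        return ioctl(device_fd, VFIO_DEVICE_SET_IRQS, &irq_set);
    }

So any path that reaches vfio_disable_irqindex() while the device is
live detaches the vectors from their eventfds, which is exactly what we
want to avoid across the exec.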
>>
>>> It looks like we shouldn't trigger any kernel vfio actions here?
>>> Because we preserve the vfio fds, their kernel state shouldn't be
>>> touched. Here we may only need to restore the Qemu state.
>>> Re-connecting to the KVM instance should happen automatically when we
>>> set up the KVM irqfds with the same eventfd.
>>>
>>> BTW, if I remember correctly, it is not enough to only save the MSIX
>>> state in the snapshot. We should also save the Qemu-side pci config
>>> space cache in the snapshot, because Qemu's copy is not exactly the
>>> same as the kernel's copy. I encountered this before, but I don't
>>> remember which field it was.
>>
>> FYI all, Jason told me offline that qemu may emulate some pci
>> capabilities and hence keeps state in the shadow config that is never
>> written to the kernel. I need to study that.
>>
>
> Sorry, I read the code again and see that Qemu does write every
> config-space write to the kernel in vfio_pci_write_config(). Now I am
> also confused about what I was seeing previously :(. But it seems we
> still need to look at the kernel code to see if a mismatch is possible
> for the config space cache between Qemu and the kernel.
>
> FYI, some discussion about VFIO PCI config space saving/restoring in
> the live migration scenario:
> https://lists.gnu.org/archive/html/qemu-devel/2020-06/msg06964.html
>

I have coded a solution for much of the "lost interrupts" issue. cprsave
preserves the vfio err, req, and msi irq eventfds across exec:

    vdev->err_notifier
    vdev->req_notifier
    vdev->msi_vectors[i].interrupt
    vdev->msi_vectors[i].kvm_interrupt

The KVM instance is destroyed and recreated as before. The eventfd
descriptors are found and reused during vfio_realize using
event_notifier_init_fd. No calls to VFIO_DEVICE_SET_IRQS are made before
or after the exec. The descriptors are attached to the new KVM instance
via the usual ioctls on the existing code paths.

It works. I issue cprsave, send an interrupt, wait a few seconds, then
issue cprload. The interrupt fires immediately after cprload. I tested
interrupt delivery to the kvm_irqchip and to qemu.
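For concreteness, a rough sketch of the reuse during vfio_realize, where
cpr_find_fd() is a hypothetical lookup that returns a descriptor
preserved across the exec, or -1 (the actual code will be in the V2
series):

    /* Sketch against the hw/vfio/pci.c structures.
     * event_notifier_init_fd() is the existing QEMU helper that wraps
     * an already-open descriptor instead of creating a new eventfd. */
    static void vfio_vector_init_notifier(VFIOPCIDevice *vdev, int nr)
    {
        EventNotifier *e = &vdev->msi_vectors[nr].interrupt;
        int fd = cpr_find_fd(vdev->vbasedev.name, nr);  /* hypothetical */

        if (fd >= 0) {
            /* Wrap the inherited fd: the kernel vfio irq stays bound to
             * it, so no VFIO_DEVICE_SET_IRQS call is needed, and the new
             * KVM instance re-attaches via the usual irqfd ioctls. */
            event_notifier_init_fd(e, fd);
        } else {
            event_notifier_init(e, 0);
        }
    }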
It does not support Posted Interrupts, as that involves state attached
to the VMCS, which is destroyed with the KVM instance. That needs more
study and likely a kernel enhancement. I will post the full code as part
of the V2 patch series.

- Steve