On 8/20/2020 6:33 AM, Jason Zeng wrote:
> On Wed, Aug 19, 2020 at 05:15:11PM -0400, Steven Sistare wrote:
>> On 8/9/2020 11:50 PM, Jason Zeng wrote:
>>> On Fri, Aug 07, 2020 at 04:38:12PM -0400, Steven Sistare wrote:
>>>> On 8/6/2020 6:22 AM, Jason Zeng wrote:
>>>>> Hi Steve,
>>>>>
>>>>> On Thu, Jul 30, 2020 at 08:14:34AM -0700, Steve Sistare wrote:
>>>>>> @@ -3182,6 +3207,51 @@ static Property vfio_pci_dev_properties[] = {
>>>>>>      DEFINE_PROP_END_OF_LIST(),
>>>>>>  };
>>>>>>
>>>>>> +static int vfio_pci_post_load(void *opaque, int version_id)
>>>>>> +{
>>>>>> +    int vector;
>>>>>> +    MSIMessage msg;
>>>>>> +    Error *err = 0;
>>>>>> +    VFIOPCIDevice *vdev = opaque;
>>>>>> +    PCIDevice *pdev = &vdev->pdev;
>>>>>> +
>>>>>> +    if (msix_enabled(pdev)) {
>>>>>> +        vfio_msix_enable(vdev);
>>>>>> +        pdev->msix_function_masked = false;
>>>>>> +
>>>>>> +        for (vector = 0; vector < vdev->pdev.msix_entries_nr; vector++) {
>>>>>> +            if (!msix_is_masked(pdev, vector)) {
>>>>>> +                msg = msix_get_message(pdev, vector);
>>>>>> +                vfio_msix_vector_use(pdev, vector, msg);
>>>>>> +            }
>>>>>> +        }
>>>>>
>>>>> It looks to me like the MSIX re-init here may lose device IRQs and
>>>>> impact device hardware state?
>>>>>
>>>>> The re-init will cause the kernel vfio driver to connect the device
>>>>> MSIX vectors to new eventfds and a new KVM instance. But before
>>>>> that, device IRQs will be routed to the previous eventfd. It looks
>>>>> like these IRQs will be lost.
>>>>
>>>> Thanks Jason, that sounds like a problem. I could try reading and
>>>> saving an event from the eventfd before shutdown, and injecting it
>>>> into the eventfd after restart, but that would be racy unless I
>>>> disable interrupts. Or, unconditionally inject a spurious interrupt
>>>> after restart to kick it, in case an interrupt was lost.
>>>>
>>>> Do you have any other ideas?
>>>
>>> Maybe we can consider also handing over the eventfd file descriptor,
>>
>> I believe preserving this descriptor in isolation is not sufficient.
>> We would also need to preserve the KVM instance it is linked to.
>>
>>> or even the KVM fds, to the new Qemu?
>>>
>>> If the KVM fds can be preserved, we will just need to restore the
>>> Qemu KVM-side state. But I am not sure how complicated the
>>> implementation would be.
>>
>> That should work, but I fear it would require many code changes in QEMU
>> to re-use descriptors at object creation time and suppress the initial
>> configuration ioctls, so it's not my first choice for a solution.
>>
>>> If we only preserve the eventfd fd, we can attach the old eventfd to
>>> vfio devices. But it may turn out that we always inject an interrupt
>>> unconditionally, because kernel KVM irqfd eventfd handling is a bit
>>> different from normal user-land eventfd read/write: it does not
>>> decrease the counter in the eventfd context. So if we read the
>>> eventfd from the new Qemu, it will always have a non-zero counter,
>>> which requires an interrupt injection.
>>
>> Good to know, thanks.
>>
>> I will try creating a new eventfd and injecting an interrupt
>> unconditionally. I need a test case that demonstrates losing an
>> interrupt and fixing it with injection. Any advice? My stress tests
>> with a virtual function nic and a directly assigned nvme block device
>> have never failed across live update.
>>
>
> I am not familiar with nvme devices. For a nic device, to my
> understanding, a stress test will not generate many IRQs, because the
> nic driver usually enables NAPI, which takes only the first interrupt,
> then disables interrupts and starts polling. It re-enables interrupts
> only after some packet quota is reached or the traffic quiesces for a
> while. But anyway, if the test runs long enough, the number of IRQs
> should still be large, so I am not sure why it doesn't trigger any
> issue. Maybe we can study the IRQ pattern of the test and see how we
> can design a test case, or see whether our assumption is wrong?
>
>
>>>>> And the re-init will make the device go through the procedure of
>>>>> disabling MSIX, enabling INTX, and re-enabling MSIX and vectors.
>>>>> So if the device is active, its hardware state will be impacted?
>>>>
>>>> Again thanks. vfio_msix_enable() does indeed call
>>>> vfio_disable_interrupts(). For a quick experiment, I deleted that
>>>> call in the post_load code path, and it seems to work fine, but I
>>>> need to study it more.
>>>
>>> vfio_msix_vector_use() will also trigger this procedure in the kernel.
>>
>> Because that code path calls VFIO_DEVICE_SET_IRQS? Or something else?
>> Can you point to what it triggers in the kernel?
>
> In vfio_msix_vector_use(), I see that vfio_disable_irqindex() will be
> invoked if "vdev->nr_vectors < nr + 1" is true. Since the 'vdev' is
> re-inited, this condition should be true, and vfio_disable_irqindex()
> will trigger VFIO_DEVICE_SET_IRQS with VFIO_IRQ_SET_DATA_NONE, which
> will cause the kernel to disable MSIX.
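To spell out the disable path Jason describes: vfio_disable_irqindex()
boils down to a single VFIO_DEVICE_SET_IRQS ioctl with no payload. A
minimal sketch against a raw vfio device fd (illustrative only, not the
actual QEMU helper):

    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Tear down MSI-X on a vfio device fd. DATA_NONE with count == 0
     * asks the kernel to disable all vectors on the index, which is
     * the request vfio_disable_irqindex() ends up making. */
    static int disable_msix(int device_fd)
    {
        struct vfio_irq_set irq_set = {
            .argsz = sizeof(irq_set),
            .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER,
            .index = VFIO_PCI_MSIX_IRQ_INDEX,
            .start = 0,
            .count = 0,
        };

        return ioctl(device_fd, VFIO_DEVICE_SET_IRQS, &irq_set);
    }

So any path that reaches vfio_disable_irqindex() while the device is
live detaches the vectors from their eventfds, which is exactly what we
want to avoid across the exec.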
>>
>>> It looks like we shouldn't trigger any kernel vfio actions here?
>>> Because we preserve the vfio fds, their kernel state shouldn't be
>>> touched. Here we may only need to restore the Qemu state.
>>> Re-connecting to the KVM instance should happen automatically when we
>>> set up the KVM irqfds with the same eventfd.
>>>
>>> BTW, if I remember correctly, it is not enough to only save the MSIX
>>> state in the snapshot. We should also save the Qemu-side pci config
>>> space cache in the snapshot, because Qemu's copy is not exactly the
>>> same as the kernel's copy. I encountered this before, but I don't
>>> remember which field it was.
>>
>> FYI all, Jason told me offline that qemu may emulate some pci
>> capabilities and hence keeps state in the shadow config that is never
>> written to the kernel. I need to study that.
>>
>
> Sorry, I read the code again and see that Qemu does write every
> config-space write to the kernel in vfio_pci_write_config(). Now I am
> also confused about what I was seeing previously :(. But it seems we
> still need to look at the kernel code to see if a mismatch is possible
> for the config space cache between Qemu and the kernel.
>
> FYI, some discussion about VFIO PCI config space saving/restoring in
> the live migration scenario:
> https://lists.gnu.org/archive/html/qemu-devel/2020-06/msg06964.html
>

I have coded a solution for much of the "lost interrupts" issue. cprsave
preserves the vfio err, req, and msi irq eventfds across exec:

    vdev->err_notifier
    vdev->req_notifier
    vdev->msi_vectors[i].interrupt
    vdev->msi_vectors[i].kvm_interrupt

The KVM instance is destroyed and recreated as before. The eventfd
descriptors are found and reused during vfio_realize using
event_notifier_init_fd. No calls to VFIO_DEVICE_SET_IRQS are made before
or after the exec. The descriptors are attached to the new KVM instance
via the usual ioctls on the existing code paths.

It works. I issue cprsave, send an interrupt, wait a few seconds, then
issue cprload. The interrupt fires immediately after cprload. I tested
interrupt delivery to the kvm_irqchip and to qemu.
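For concreteness, a rough sketch of the reuse during vfio_realize, where
cpr_find_fd() is a hypothetical lookup that returns a descriptor
preserved across the exec, or -1 (the actual code will be in the V2
series):

    /* Sketch against the hw/vfio/pci.c structures.
     * event_notifier_init_fd() is the existing QEMU helper that wraps
     * an already-open descriptor instead of creating a new eventfd. */
    static void vfio_vector_init_notifier(VFIOPCIDevice *vdev, int nr)
    {
        EventNotifier *e = &vdev->msi_vectors[nr].interrupt;
        int fd = cpr_find_fd(vdev->vbasedev.name, nr);  /* hypothetical */

        if (fd >= 0) {
            /* Wrap the inherited fd: the kernel vfio irq stays bound to
             * it, so no VFIO_DEVICE_SET_IRQS call is needed, and the new
             * KVM instance re-attaches via the usual irqfd ioctls. */
            event_notifier_init_fd(e, fd);
        } else {
            event_notifier_init(e, 0);
        }
    }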
It does not support Posted Interrupts, as that involves state attached
to the VMCS, which is destroyed with the KVM instance. That needs more
study and likely a kernel enhancement. I will post the full code as part
of the V2 patch series.

- Steve