Hello,

More information on how the hardware works:
- Each VF has its own memory, MMRs, etc. That means the resources are not
  shared.
- Each VF has its own bus number, device number and function number. That
  means the requester ID is separate for each VF.

There is also a VF save/restore area for the switch. A VF's regular memory
(not MMRs) is still accessible after a switch out. But while VF 1 is
scheduled, a read of an MMR of VF 0 could return the value of the same MMR
in VF 1, because VF 1 is switched on and the PF processor is busy servicing
VF 1. This could confuse the guest VF driver, so the unmap-and-block
technique (or another technique achieving the same goal) is required.

I hope this information narrows down the problem to solve.

Best regards

Benoît Canet

> On Monday 03 Jun 2013 at 12:57:45 (-0600), Alex Williamson wrote:
> On Mon, 2013-06-03 at 14:34 -0400, Don Dutile wrote:
> > On 06/03/2013 02:02 PM, Alex Williamson wrote:
> > > On Mon, 2013-06-03 at 18:33 +0200, Benoît Canet wrote:
> > >> Hello,
> > >>
> > >> I plan to write a PF driver for an SR-IOV card and make the VFs work
> > >> with QEMU's VFIO passthrough, so I am asking the following design
> > >> question before trying to write and push code.
> > >>
> > >> After SR-IOV is enabled on this hardware, only one VF can be active
> > >> at a given time.
> > >
> > > Is this actually an SR-IOV device, or are you trying to write a driver
> > > that emulates SR-IOV for a PF?
> > >
> > >> The PF host kernel driver acts as a scheduler.
> > >> Every few milliseconds it switches which VF is the currently active
> > >> function while disabling the other VFs.
> > >>
> > that's time-sharing of hw, which sw doesn't see ... so, ok.
> >
> > >> One consequence of how the hardware works is that the MMR regions of
> > >> the switched-off VFs must be unmapped, and their I/O accesses should
> > >> block until the VF is switched on again.
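To make the required semantics concrete, here is a minimal userspace sketch of the unmap-and-block behaviour: a read of an inactive VF's MMR space blocks until the PF scheduler switches that VF back in, so a guest can never observe another VF's registers through the time-shared hardware. All names here (vf_sched, vf_switch, vf_mmr_read) are hypothetical; the real logic would live in the PF kernel driver and would unmap/remap the IOMMU and guest mappings rather than take a mutex.

```c
/* Userspace simulation of the PF scheduler's unmap-and-block semantics.
 * A mutex/condvar pair stands in for "unmap the MMR BAR and block the
 * access"; the real driver would manipulate IOMMU/EPT mappings instead. */
#include <pthread.h>
#include <stdint.h>

#define NUM_VFS 2

struct vf_sched {
    pthread_mutex_t lock;
    pthread_cond_t  switched;      /* signalled on every VF switch       */
    int             active_vf;     /* only this VF's MMRs are reachable  */
    uint32_t        mmr[NUM_VFS];  /* stand-in for per-VF register space */
};

/* PF scheduler: make 'vf' the active function and wake blocked readers.
 * The real driver would also save the old VF's state and restore the
 * new VF's state here (the VF save/restore area mentioned above). */
static void vf_switch(struct vf_sched *s, int vf)
{
    pthread_mutex_lock(&s->lock);
    s->active_vf = vf;
    pthread_cond_broadcast(&s->switched);
    pthread_mutex_unlock(&s->lock);
}

/* Guest-side MMR read: blocks while the VF is switched out, so it can
 * never return the value of the same MMR in a different VF. */
static uint32_t vf_mmr_read(struct vf_sched *s, int vf)
{
    uint32_t val;

    pthread_mutex_lock(&s->lock);
    while (s->active_vf != vf)
        pthread_cond_wait(&s->switched, &s->lock);
    val = s->mmr[vf];
    pthread_mutex_unlock(&s->lock);
    return val;
}
```

This only models the correctness requirement, not the 100 ns budget; it shows why a read of a switched-out VF must park rather than fall through to the hardware.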
> >
> > This violates the spec., and does impact sw -- how can one assign such a
> > VF to a guest -- it does not work indep. of other VFs.
> >
> > > MMR = Memory Mapped Register?
> > >
> > > This seems contradictory to the SR-IOV spec, which states:
> > >
> > >   Each VF contains a non-shared set of physical resources required
> > >   to deliver Function-specific services, e.g., resources such as
> > >   work queues, data buffers, etc. These resources can be directly
> > >   accessed by an SI without requiring VI or SR-PCIM intervention.
> > >
> > > Furthermore, each VF should have a separate requester ID. What's being
> > > suggested here seems like maybe that's not the case. If true, it would
> >
> > I didn't read it that way above. I read it as the PCIe end is timeshared
> > btwn VFs (& PFs?) .... with some VFs disappearing (from a driver
> > perspective) as if the device was hot unplugged w/o notification. That
> > will probably cause read-timeouts & SME's, bringing down most
> > enterprise-level systems.
>
> Perhaps I'm reading too much into it, but using the same requester ID
> would seem like justification for why the device needs to be unmapped.
> Otherwise we could just stop QEMU and leave the mappings alone if we
> just want to make sure access to the device is blocked while the device
> is swapped out. Not the best overall throughput algorithm, but maybe a
> proof of concept. Need more info about how the device actually behaves
> to know for sure. Thanks,
>
> Alex
>
> > > make iommu groups challenging. Is there any VF save/restore around
> > > the scheduling?
> > >
> > >> Each IOMMU map/unmap should be done in less than 100ns.
> > >
> > > I think that may be a lot to ask if we need to unmap the regions in
> > > the guest and in the iommu. If the "VFs" used different requester
> > > IDs, iommu unmapping wouldn't be necessary. I experimented with
> > > switching between trapped (read/write) access to memory regions and
> > > mmap'd (direct mapping) for handling legacy interrupts. There was a
> > > noticeable performance penalty switching per interrupt.
> > >
> > >> As the kernel iommu module is being called by the VFIO driver, the
> > >> PF driver cannot interface with it.
> > >>
> > >> Currently the only interface of the VFIO code is for the userland
> > >> QEMU process, and I fear that notifying QEMU that it should do the
> > >> unmap/block would take more than 100ns.
> > >>
> > >> Also, blocking the I/O access in QEMU under the BQL would freeze
> > >> QEMU.
> > >>
> > >> Do you have any idea on how to write this required map and
> > >> block/unmap feature?
> > >
> > > It seems like there are several options, but I'm doubtful that any of
> > > them will meet 100ns. If this is completely fake SR-IOV and there's
> > > not a different requester ID per VF, I'd start with seeing if you can
> > > even do the iommu_unmap/iommu_map of the MMIO BARs in under 100ns.
> > > If that's close to your limit, then your only real option for QEMU is
> > > to freeze it, which still involves getting multiple (maybe many)
> > > vCPUs out of VM mode. That's not free either. If by some miracle you
> > > have time to spare, you could remap the regions to trapped mode and
> > > let the vCPUs run while vfio blocks on read/write.
> > >
> > > Maybe there's even a question whether mmap'd mode is worthwhile for
> > > this device. Trapping every read/write is orders of magnitude slower,
> > > but allows you to handle the "wait for VF" on the kernel side.
> > >
> > > If you can provide more info on the device design/constraints, maybe
> > > we can come up with better options. Thanks,
> > >
> > > Alex
> > >
> > > _______________________________________________
> > > iommu mailing list
> > > io...@lists.linux-foundation.org
> > > https://lists.linuxfoundation.org/mailman/listinfo/iommu