On Mon, 6 Jun 2016 16:09:11 +0800 Peter Xu <pet...@redhat.com> wrote:
> On Thu, Jun 02, 2016 at 07:34:17AM -0600, Alex Williamson wrote:
> > On Thu, 02 Jun 2016 13:09:27 +0000
> > "Aviv B.D." <bd.a...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > In case of hot plug vfio device there should not be any active mapping
> > > to this device prior the device addition.
> >
> > Counter example - a device is hot added to a guest booted with iommu=pt.
>
> I got the same question with Aviv...
>
> For hot-plug devices

First, let's just remove "hot-plug" from this discussion because I'm
afraid someone is going to say "let's just not support hotplug devices".
The issue of moving a device to a pre-populated domain can occur any
time a device attaches to a new domain.  It might occur if a device is
added to a vfio driver in the L1 guest where it's already managing a
device.  It might occur any time the device is released from an L1 vfio
user and returned to the static identity map in the L1 guest kernel.

> , even if it is using iommu=pt, shouldn't it still
> follow the steps that first init vfio device, then configure device
> context entry? Let me list the steps for device addition in case I got
> any mistake:
>
> 1. user add new VFIO device A
>
> 2. vfio_listener_region_add() called for device A on the IOMMU mr,
>    here we should create the iommu notifier. However since the context
>    entry still does not exist, memory_region_iommu_replay() will got
>    all invalid IOTLB (IOMMU_NONE entries)
>
> 3. guest kernel found the device, enabled the device, filled in
>    context entry for device A with "pass-through" (so the SLPTPTR is
>    invalid)
>
> 4. guest sent context invalidation to QEMU vIOMMU since we have CM=1
>    set for guest vIOMMU
>
> 5. QEMU vIOMMU handle the invalidation, trigger VFIO notify to do
>    correct VFIO mapping for device A
>
> Though here step 5 should still be missing (IIUC Aviv's patch 3 still
> not handled context invalidation).
> Just want to know whether we can
> avoid the replay operation for Intel vIOMMUs (for Intel only, because
> Intel has context invalidation and cache mode support, not sure about
> other platform)?

I'm not sure why you're so eager to avoid implementing a replay callback
for VT-d.  What happens when VT-d is not enabled by the guest?
vfio/pci.c calls pci_device_iommu_address_space() to get the address
space for the device, which results in vtd_find_add_as(), which gives us
a unique VTDAddressSpace.as for that device.  With VT-d not in use by
the guest, when do steps 3-5 occur?  I agree with your reasoning when
VT-d is enabled, but if the BIOS/UEFI does not enable VT-d, are we going
to exclude all use cases of an assigned device prior to the guest
enabling VT-d?

On that same topic, I'm dubious that we have the flexibility in our
memory API to really support an IOMMU like VT-d, and the layout of
having a VTDAddressSpace per device, rather than exposing shared address
spaces, has implications on the efficiency and locked memory
requirements for vfio.  In the above case with VT-d disabled, the
VTDAddressSpace should alias to the system memory AddressSpace and
dynamically switch to a per-device address space when VT-d is enabled.
AFAICT, we don't have anything resembling that sort of feature, so we
rely on the IOMMU driver to replay, perhaps even configuring its own
MemoryListener-to-IOMMU-notifier gateway, which is also something that
doesn't currently exist.

Additionally, if there are things unique to VT-d (for instance, if VT-d
is enabled and we can rely on the sequence of events you've set forth),
then yes, the replay mechanism should do nothing.  But only the VT-d
code can decide that, which implies that vfio always needs to call the
replay function, and a new MemoryRegionIOMMUOps callback needs to decide
what to do given the current state of the vIOMMU.  Thanks,

Alex