On Wed, 27 Mar 2019 12:46:41 +0100 Auger Eric <eric.au...@redhat.com> wrote:
> Hi Alex, > > On 3/26/19 11:55 PM, Alex Williamson wrote: > > Conventional PCI buses pre-date requester IDs. An IOMMU cannot > > distinguish by devfn & bus between devices in a conventional PCI > > topology and therefore we cannot assign them separate AddressSpaces. > > By taking this requester ID aliasing into account, QEMU better matches > > the bare metal behavior and restrictions, and enables shared > > AddressSpace configurations that are otherwise not possible with > > guest IOMMU support. > > > > For the latter case, given any example where an IOMMU group on the > > host includes multiple devices: > > > > $ ls /sys/kernel/iommu_groups/1/devices/ > > 0000:00:01.0 0000:01:00.0 0000:01:00.1 > > > > If we incorporate a vIOMMU into the VM configuration, we're restricted > > that we can only assign one of the endpoints to the guest because a > > second endpoint will attempt to use a different AddressSpace. VFIO > > only supports IOMMU group level granularity at the container level, > > preventing this second endpoint from being assigned: > > > > qemu-system-x86_64 -machine q35... \ > > -device intel-iommu,intremap=on \ > > -device pcie-root-port,addr=1e.0,id=pcie.1 \ > > -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \ > > -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1 > > > > qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \ > > 0000:01:00.1: group 1 used in multiple address spaces > > > > However, when QEMU incorporates proper aliasing, we can make use of a > > PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that > > provides the downstream devices with the same AddressSpace, ex: > > > > qemu-system-x86_64 -machine q35... \ > > -device intel-iommu,intremap=on \ > > -device pcie-pci-bridge,addr=1e.0,id=pci.1 \ > > -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \ > > -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1 > > > > While the utility of this hack may be limited, this AddressSpace > > aliasing is the correct behavior for QEMU to emulate bare metal. > > > > Signed-off-by: Alex Williamson <alex.william...@redhat.com> > > --- > > hw/pci/pci.c | 33 +++++++++++++++++++++++++++++++-- > > 1 file changed, 31 insertions(+), 2 deletions(-) > > > > diff --git a/hw/pci/pci.c b/hw/pci/pci.c > > index 35451c1e9987..38467e676f1f 100644 > > --- a/hw/pci/pci.c > > +++ b/hw/pci/pci.c > > @@ -2594,12 +2594,41 @@ AddressSpace > > *pci_device_iommu_address_space(PCIDevice *dev) > > { > > PCIBus *bus = pci_get_bus(dev); > > PCIBus *iommu_bus = bus; > > + uint8_t devfn = dev->devfn; > > > > while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) { > > - iommu_bus = pci_get_bus(iommu_bus->parent_dev); > > + PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev); > > + > > + /* > > + * Determine which requester ID alias should be used for the device > > + * based on the PCI topology. There are no requester IDs on > > convetional > conventional Oops, fixed > > + * PCI buses, therefore we push the alias up to the parent on each > > non- > > + * express bus. Which alias we use depends on whether this is a > > legacy > > + * PCI bridge or PCIe-to-PCI/X bridge as in chapter 2.3 of the > > PCIe-to- > > + * PCI bridge spec. Note that we cannot use pci_requester_id() > > here > > + * because the resulting BDF depends on the secondary bridge > > register > Didn't you mean secondary bus number register? Yes, fixed > > + * programming. We also cannot lookup the PCIBus from the bus > > number > > + * at this point for the iommu_fn. Also, requester_id_cache is the > > + * alias to the root bus, which is usually, but not necessarily > > always > > + * where we'll find our iommu_fn. > > + */ > > + if (!pci_bus_is_express(iommu_bus)) { > > + PCIDevice *parent = iommu_bus->parent_dev; > > + > > + if (pci_is_express(parent) && > > + pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) { > > + devfn = PCI_DEVFN(0, 0); > > + bus = iommu_bus; > > + } else { > > + devfn = parent->devfn; > > + bus = parent_bus; > > + } > > + } > > + > > + iommu_bus = parent_bus; > > } > > if (iommu_bus && iommu_bus->iommu_fn) { > > - return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, > > dev->devfn); > > + return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn); > I think it would make sense to comment this iommu_fn() callback's role > somewhere. While I agree, it seems a bit out of scope of this patch. Or are you suggesting that this patch fundamentally changing the role rather than trying to make it work as intended? Thanks, Alex