> On Feb 10, 2022, at 7:26 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
> 
> On Thu, Feb 10, 2022 at 04:49:33PM -0700, Alex Williamson wrote:
>> On Thu, 10 Feb 2022 18:28:56 -0500
>> "Michael S. Tsirkin" <m...@redhat.com> wrote:
>> 
>>> On Thu, Feb 10, 2022 at 04:17:34PM -0700, Alex Williamson wrote:
>>>> On Thu, 10 Feb 2022 22:23:01 +0000
>>>> Jag Raman <jag.ra...@oracle.com> wrote:
>>>> 
>>>>>> On Feb 10, 2022, at 3:02 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
>>>>>> 
>>>>>> On Thu, Feb 10, 2022 at 12:08:27AM +0000, Jag Raman wrote:
>>>>>>> 
>>>>>>> Thanks for the explanation, Alex. Thanks to everyone else in the
>>>>>>> thread who helped to clarify this problem.
>>>>>>> 
>>>>>>> We have implemented the memory isolation based on the discussion in
>>>>>>> the thread. We will send the patches out shortly.
>>>>>>> 
>>>>>>> Devices such as “name” and “e1000” worked fine. But I’d like to note
>>>>>>> that the LSI device (TYPE_LSI53C895A) had some problems - it doesn’t
>>>>>>> seem to be IOMMU aware. In LSI’s case, the kernel driver is asking
>>>>>>> the device to read instructions from the CPU VA
>>>>>>> (lsi_execute_script() -> read_dword()), which is forbidden when the
>>>>>>> IOMMU is enabled. Specifically, the driver is asking the device to
>>>>>>> access other BAR regions by using the BAR address programmed in the
>>>>>>> PCI config space. This happens even without the vfio-user patches.
>>>>>>> For example, we could enable the IOMMU using the “-device
>>>>>>> intel-iommu” QEMU option and also add the following to the kernel
>>>>>>> command line: “intel_iommu=on iommu=nopt”. In this case, we could
>>>>>>> see an IOMMU fault.
>>>>>> 
>>>>>> So, a device accessing its own BAR is different. Basically, these
>>>>>> transactions never go on the bus at all, never mind get to the IOMMU.
>>>>> 
>>>>> Hi Michael,
>>>>> 
>>>>> In the LSI case, I did notice that it went to the IOMMU. The device is
>>>>> reading the BAR address as if it were a DMA address.
>>>>> 
>>>>>> I think it's just used as a handle to address internal device memory.
>>>>>> This kind of trick is not universal, but not terribly unusual.
>>>>>> 
>>>>>> 
>>>>>>> Unfortunately, we started off our project with the LSI device. So
>>>>>>> that led to all the confusion about what is expected at the server
>>>>>>> end in terms of vectoring/address translation. It gave the
>>>>>>> impression that the request was still on the CPU side of the PCI
>>>>>>> root complex, but the actual problem was with the device driver
>>>>>>> itself.
>>>>>>> 
>>>>>>> I’m wondering how to deal with this problem. Would it be OK if we
>>>>>>> mapped the device’s BAR into the IOVA, at the same CPU VA programmed
>>>>>>> in the BAR registers? This would help devices such as LSI to
>>>>>>> circumvent this problem. One problem with this approach is that it
>>>>>>> has the potential to collide with another legitimate IOVA address.
>>>>>>> Kindly share your thoughts on this.
>>>>>>> 
>>>>>>> Thank you!
>>>>>> 
>>>>>> I am not 100% sure what you plan to do, but it sounds fine since even
>>>>>> if it collides, with traditional PCI, devices must never initiate cycles
>>>>>> 
>>>>> 
>>>>> OK sounds good, I’ll create a mapping of the device BARs in the IOVA.
>>>> 
>>>> I don't think this is correct.
>>>> Look for instance at ACPI _TRA support,
>>>> where a system can specify a translation offset such that, for example,
>>>> a CPU access to a device is required to add the provided offset to the
>>>> bus address of the device. A system using this could have multiple
>>>> root bridges, where each is given the same, overlapping MMIO aperture.
>>>> From the processor perspective, each MMIO range is unique and possibly
>>>> none of those devices have a zero _TRA; there could be system memory at
>>>> the equivalent flat memory address.
>>> 
>>> I am guessing there are reasons to have these in ACPI besides firmware
>>> vendors wanting to find corner cases in device implementations though
>>> :). E.g. it's possible something else is tweaking DMA in similar ways.
>>> I can't say for sure, and I wonder why we care as long as QEMU does not
>>> have _TRA.
>> 
>> How many complaints do we get about running out of I/O port space on
>> q35 because we allow an arbitrary number of root ports? What if we
>> used _TRA to provide the full I/O port range per root port? 32-bit
>> MMIO could be duplicated as well.
> 
> It's an interesting idea. To clarify what I said, I suspect some devices
> are broken in the presence of translating bridges unless DMA
> is also translated to match.
> 
> I agree it's a mess though, in that some devices, when given their own
> BAR to DMA to, will probably just satisfy the access from internal
> memory, while others will ignore that and send it up as DMA,
> and both types are probably out there in the field.
> 
> 
>>>> So if the transaction actually hits this bus, which I think is what
>>>> making use of the device AddressSpace implies, I don't think it can
>>>> assume that it's simply reflected back at itself. Conventional PCI and
>>>> PCI Express may be software compatible, but there's a reason we don't
>>>> see IOMMUs that provide both translation and isolation in conventional
>>>> topologies.
>>>> 
>>>> Is this more a bug in the LSI device emulation model? For instance in
>>>> vfio-pci, if I want to access an offset into a BAR from within QEMU, I
>>>> don't care what address is programmed into that BAR; I perform an
>>>> access relative to the vfio file descriptor region representing that
>>>> BAR space. I'd expect that any viable device emulation model does the
>>>> same: an access to device memory uses an offset from an internal
>>>> resource, irrespective of the BAR address.
>>> 
>>> However, using the BAR seems like a reasonable shortcut allowing the
>>> device to use the same 64-bit address to refer to system
>>> and device RAM interchangeably.
>> 
>> A shortcut that breaks when an IOMMU is involved.
> 
> Maybe. But if that's how hardware behaves, we have little choice but to
> emulate it.
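To make the two behaviours being discussed concrete, here is a rough
sketch. It is not the actual lsi53c895a.c code; the helper names, the
bar_base/bar_size bookkeeping and the internal_read() callback are made
up for illustration. The first path treats the driver-supplied address
as a bus (DMA) address, which is roughly what the LSI model does today
and what faults once a vIOMMU translates the device's address space; the
second satisfies the access internally from the offset into the BAR,
along the lines Alex describes for vfio-pci.

/*
 * Illustrative only -- not the actual lsi53c895a.c code.  Helper names,
 * bar_base/bar_size and internal_read() are made up for this sketch.
 */
#include "qemu/osdep.h"
#include "qemu/bswap.h"
#include "hw/pci/pci.h"

/*
 * Path 1: treat the driver-supplied address as a bus (DMA) address and
 * go through the device's DMA address space.  With a vIOMMU enabled the
 * address is translated as an IOVA, so a driver that hands the device a
 * CPU/BAR address here triggers an IOMMU fault.
 */
static uint32_t fetch_via_dma(PCIDevice *dev, dma_addr_t addr)
{
    uint32_t buf = 0;

    pci_dma_read(dev, addr, &buf, sizeof(buf));
    return le32_to_cpu(buf);
}

/*
 * Path 2: notice that the address falls inside one of the device's own
 * BARs and satisfy the access from internal state using only the offset,
 * never putting a transaction on the bus.  bar_base/bar_size stand for
 * whatever the model tracks for that BAR; internal_read() stands in for
 * reading device-internal memory.
 */
static uint32_t fetch_maybe_internal(PCIDevice *dev, dma_addr_t addr,
                                     hwaddr bar_base, uint64_t bar_size,
                                     uint32_t (*internal_read)(PCIDevice *,
                                                               hwaddr offset))
{
    if (addr >= bar_base && addr - bar_base < bar_size) {
        return internal_read(dev, addr - bar_base);
    }
    return fetch_via_dma(dev, addr);
}
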
I was wondering if we could map the BARs into the IOVA for a limited set
of devices - the ones designed before IOMMUs, such as lsi53c895a. This
would ensure that we follow the spec as closely as possible without
breaking existing devices? (A rough sketch of what such a mapping could
look like is at the end of this mail.)

--
Jag

> 
>>>> It would seem strange if the driver is actually programming the device
>>>> to DMA to itself, and if that's actually happening, I'd wonder if this
>>>> driver is actually compatible with an IOMMU on bare metal.
>>> 
>>> You know, it's hardware after all. As long as things work, vendors
>>> happily ship the device. They don't really worry about theoretical
>>> issues ;).
>> 
>> We're talking about a 32-bit conventional PCI device from the previous
>> century. IOMMUs are no longer theoretical, but not all drivers have
>> kept up. It's maybe not the best choice as the de facto standard
>> device, imo. Thanks,
>> 
>> Alex
> 
> More importantly, lots of devices most likely don't support arbitrary
> configurations and break if you try to create something that matches the
> spec but never, or only rarely, occurs on bare metal.
> 
> -- 
> MST
> 
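For reference, here is a minimal sketch of what mapping a BAR into the
device's DMA address space could look like on the QEMU side. It is only
an illustration: map_bar_into_dma_as() is not an existing function, and
dma_root is assumed to be a plain container MemoryRegion used as the
root of a per-device DMA address space (as on the vfio-user server
side); with a vIOMMU in the path, the equivalent mapping would have to
be established at the IOMMU level instead.

/*
 * Illustrative sketch only: one way "map the BAR into the IOVA" could be
 * spelled in QEMU.  'dma_root' is assumed to be a container MemoryRegion
 * acting as the root of the per-device DMA address space; the function
 * name and the point at which it would be called (e.g. whenever the
 * guest reprograms the BAR) are not part of any existing API.
 */
#include "qemu/osdep.h"
#include "hw/pci/pci.h"
#include "exec/memory.h"

static void map_bar_into_dma_as(PCIDevice *dev, MemoryRegion *dma_root,
                                int bar, MemoryRegion *alias)
{
    PCIIORegion *r = &dev->io_regions[bar];

    if (!r->memory || r->addr == PCI_BAR_UNMAPPED) {
        return;                 /* BAR not registered or not mapped yet */
    }

    /* Alias the BAR's MemoryRegion into the DMA address space at the same
     * address the guest programmed, so a "DMA" to that address reaches the
     * device's own BAR instead of faulting. */
    memory_region_init_alias(alias, OBJECT(dev), "bar-self-map",
                             r->memory, 0, memory_region_size(r->memory));

    /* Negative priority so that a genuine IOVA mapping at the same address
     * still wins -- this is the collision concern raised earlier. */
    memory_region_add_subregion_overlap(dma_root, r->addr, alias, -1);
}

The negative priority is meant to let a legitimate guest-established IOVA
mapping at the same address take precedence; it only decides who wins in
a collision, it does not remove the collision concern discussed above.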