> On Feb 10, 2022, at 7:26 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
> 
> On Thu, Feb 10, 2022 at 04:49:33PM -0700, Alex Williamson wrote:
>> On Thu, 10 Feb 2022 18:28:56 -0500
>> "Michael S. Tsirkin" <m...@redhat.com> wrote:
>> 
>>> On Thu, Feb 10, 2022 at 04:17:34PM -0700, Alex Williamson wrote:
>>>> On Thu, 10 Feb 2022 22:23:01 +0000
>>>> Jag Raman <jag.ra...@oracle.com> wrote:
>>>> 
>>>>>> On Feb 10, 2022, at 3:02 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
>>>>>> 
>>>>>> On Thu, Feb 10, 2022 at 12:08:27AM +0000, Jag Raman wrote:    
>>>>>>> 
>>>>>>> Thanks for the explanation, Alex. Thanks to everyone else in the thread 
>>>>>>> who
>>>>>>> helped to clarify this problem.
>>>>>>> 
>>>>>>> We have implemented the memory isolation based on the discussion in the
>>>>>>> thread. We will send the patches out shortly.
>>>>>>> 
>>>>>>> Devices such as “name" and “e1000” worked fine. But I’d like to note 
>>>>>>> that
>>>>>>> the LSI device (TYPE_LSI53C895A) had some problems - it doesn’t seem
>>>>>>> to be IOMMU aware. In LSI’s case, the kernel driver is asking the 
>>>>>>> device to
>>>>>>> read instructions from the CPU VA (lsi_execute_script() -> 
>>>>>>> read_dword()),
>>>>>>> which is forbidden when IOMMU is enabled. Specifically, the driver is 
>>>>>>> asking
>>>>>>> the device to access other BAR regions by using the BAR address 
>>>>>>> programmed
>>>>>>> in the PCI config space. This happens even without vfio-user patches. 
>>>>>>> For example,
>>>>>>> we could enable IOMMU using “-device intel-iommu” QEMU option and also
>>>>>>> adding the following to the kernel command-line: “intel_iommu=on 
>>>>>>> iommu=nopt”.
>>>>>>> In this case, we could see an IOMMU fault.    
>>>>>> 
>>>>>> So, device accessing its own BAR is different. Basically, these
>>>>>> transactions never go on the bus at all, never mind get to the IOMMU.    
>>>>> 
>>>>> Hi Michael,
>>>>> 
>>>>> In LSI case, I did notice that it went to the IOMMU. The device is 
>>>>> reading the BAR
>>>>> address as if it was a DMA address.
>>>>> 
>>>>>> I think it's just used as a handle to address internal device memory.
>>>>>> This kind of trick is not universal, but not terribly unusual.
>>>>>> 
>>>>>> 
>>>>>>> Unfortunately, we started off our project with the LSI device. So that 
>>>>>>> lead to all the
>>>>>>> confusion about what is expected at the server end in-terms of
>>>>>>> vectoring/address-translation. It gave an impression as if the request 
>>>>>>> was still on
>>>>>>> the CPU side of the PCI root complex, but the actual problem was with 
>>>>>>> the
>>>>>>> device driver itself.
>>>>>>> 
>>>>>>> I’m wondering how to deal with this problem. Would it be OK if we 
>>>>>>> mapped the
>>>>>>> device’s BAR into the IOVA, at the same CPU VA programmed in the BAR 
>>>>>>> registers?
>>>>>>> This would help devices such as LSI to circumvent this problem. One 
>>>>>>> problem
>>>>>>> with this approach is that it has the potential to collide with another 
>>>>>>> legitimate
>>>>>>> IOVA address. Kindly share your thought on this.
>>>>>>> 
>>>>>>> Thank you!    
>>>>>> 
>>>>>> I am not 100% sure what do you plan to do but it sounds fine since even
>>>>>> if it collides, with traditional PCI device must never initiate cycles   
>>>>>>  
>>>>> 
>>>>> OK sounds good, I’ll create a mapping of the device BARs in the IOVA.  
>>>> 
>>>> I don't think this is correct.  Look for instance at ACPI _TRA support
>>>> where a system can specify a translation offset such that, for example,
>>>> a CPU access to a device is required to add the provided offset to the
>>>> bus address of the device.  A system using this could have multiple
>>>> root bridges, where each is given the same, overlapping MMIO aperture.  
>>>>> From the processor perspective, each MMIO range is unique and possibly  
>>>> none of those devices have a zero _TRA, there could be system memory at
>>>> the equivalent flat memory address.  
>>> 
>>> I am guessing there are reasons to have these in acpi besides firmware
>>> vendors wanting to find corner cases in device implementations though
>>> :). E.g. it's possible something else is tweaking DMA in similar ways. I
>>> can't say for sure and I wonder why do we care as long as QEMU does not
>>> have _TRA.
>> 
>> How many complaints do we get about running out of I/O port space on
>> q35 because we allow an arbitrary number of root ports?  What if we
>> used _TRA to provide the full I/O port range per root port?  32-bit
>> MMIO could be duplicated as well.
> 
> It's an interesting idea. To clarify what I said, I suspect some devices
> are broken in presence of translating bridges unless DMA
> is also translated to match.
> 
> I agree it's a mess though, in that some devices when given their own
> BAR to DMA to will probably just satisfy the access from internal
> memory, while others will ignore that and send it up as DMA
> and both types are probably out there in the field.
> 
> 
>>>> So if the transaction actually hits this bus, which I think is what
>>>> making use of the device AddressSpace implies, I don't think it can
>>>> assume that it's simply reflected back at itself.  Conventional PCI and
>>>> PCI Express may be software compatible, but there's a reason we don't
>>>> see IOMMUs that provide both translation and isolation in conventional
>>>> topologies.
>>>> 
>>>> Is this more a bug in the LSI device emulation model?  For instance in
>>>> vfio-pci, if I want to access an offset into a BAR from within QEMU, I
>>>> don't care what address is programmed into that BAR, I perform an
>>>> access relative to the vfio file descriptor region representing that
>>>> BAR space.  I'd expect that any viable device emulation model does the
>>>> same, an access to device memory uses an offset from an internal
>>>> resource, irrespective of the BAR address.  
>>> 
>>> However, using BAR seems like a reasonable shortcut allowing
>>> device to use the same 64 bit address to refer to system
>>> and device RAM interchangeably.
>> 
>> A shortcut that breaks when an IOMMU is involved.
> 
> Maybe. But if that's how hardware behaves, we have little choice but
> emulate it.

I was wondering if we could map the BARs into the IOVA for a limited set of
devices - the ones which are designed before IOMMU such as lsi53c895a.
This would ensure that we follow the spec to the best without breaking
existing devices?

--
Jag

> 
>>>> It would seem strange if the driver is actually programming the device
>>>> to DMA to itself and if that's actually happening, I'd wonder if this
>>>> driver is actually compatible with an IOMMU on bare metal.  
>>> 
>>> You know, it's hardware after all. As long as things work vendors
>>> happily ship the device. They don't really worry about theoretical issues 
>>> ;).
>> 
>> We're talking about a 32-bit conventional PCI device from the previous
>> century.  IOMMUs are no longer theoretical, but not all drivers have
>> kept up.  It's maybe not the best choice as the de facto standard
>> device, imo.  Thanks,
>> 
>> Alex
> 
> More importantly lots devices most likely don't support arbitrary
> configurations and break if you try to create something that matches the
> spec but never or even rarely occurs on bare metal.
> 
> -- 
> MST
> 

Reply via email to