> On Feb 10, 2022, at 5:53 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
> 
> On Thu, Feb 10, 2022 at 10:23:01PM +0000, Jag Raman wrote:
>> 
>> 
>>> On Feb 10, 2022, at 3:02 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
>>> 
>>> On Thu, Feb 10, 2022 at 12:08:27AM +0000, Jag Raman wrote:
>>>> 
>>>> 
>>>>> On Feb 2, 2022, at 12:34 AM, Alex Williamson <alex.william...@redhat.com> wrote:
>>>>> 
>>>>> On Wed, 2 Feb 2022 01:13:22 +0000
>>>>> Jag Raman <jag.ra...@oracle.com> wrote:
>>>>> 
>>>>>>> On Feb 1, 2022, at 5:47 PM, Alex Williamson <alex.william...@redhat.com> wrote:
>>>>>>> 
>>>>>>> On Tue, 1 Feb 2022 21:24:08 +0000
>>>>>>> Jag Raman <jag.ra...@oracle.com> wrote:
>>>>>>> 
>>>>>>>>> On Feb 1, 2022, at 10:24 AM, Alex Williamson <alex.william...@redhat.com> wrote:
>>>>>>>>> 
>>>>>>>>> On Tue, 1 Feb 2022 09:30:35 +0000
>>>>>>>>> Stefan Hajnoczi <stefa...@redhat.com> wrote:
>>>>>>>>> 
>>>>>>>>>> On Mon, Jan 31, 2022 at 09:16:23AM -0700, Alex Williamson wrote:    
>>>>>>>>>>> On Fri, 28 Jan 2022 09:18:08 +0000
>>>>>>>>>>> Stefan Hajnoczi <stefa...@redhat.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> On Thu, Jan 27, 2022 at 02:22:53PM -0700, Alex Williamson wrote:
>>>>>>>>>>>>> If the goal here is to restrict DMA between devices, ie. peer-to-peer
>>>>>>>>>>>>> (p2p), why are we trying to re-invent what an IOMMU already does?
>>>>>>>>>>>> 
>>>>>>>>>>>> The issue Dave raised is that vfio-user servers run in separate
>>>>>>>>>>>> processes from QEMU with shared memory access to RAM but no direct
>>>>>>>>>>>> access to non-RAM MemoryRegions. The virtiofs DAX Window BAR is one
>>>>>>>>>>>> example of a non-RAM MemoryRegion that can be the source/target of DMA
>>>>>>>>>>>> requests.
>>>>>>>>>>>> 
>>>>>>>>>>>> I don't think IOMMUs solve this problem but luckily the vfio-user
>>>>>>>>>>>> protocol already has messages that vfio-user servers can use as a
>>>>>>>>>>>> fallback when DMA cannot be completed through the shared memory RAM
>>>>>>>>>>>> accesses.
>>>>>>>>>>>> 
>>>>>>>>>>>>> In fact, it seems like an IOMMU does this better in providing an IOVA
>>>>>>>>>>>>> address space per BDF.  Is the dynamic mapping overhead too much?
>>>>>>>>>>>>> What physical hardware properties or specifications could we leverage
>>>>>>>>>>>>> to restrict p2p mappings to a device?  Should it be governed by
>>>>>>>>>>>>> machine type to provide consistency between devices?  Should each
>>>>>>>>>>>>> "isolated" bus be in a separate root complex?  Thanks,
>>>>>>>>>>>> 
>>>>>>>>>>>> There is a separate issue in this patch series regarding isolating the
>>>>>>>>>>>> address space where BAR accesses are made (i.e. the global
>>>>>>>>>>>> address_space_memory/io). When one process hosts multiple vfio-user
>>>>>>>>>>>> server instances (e.g. a software-defined network switch with multiple
>>>>>>>>>>>> ethernet devices), then each instance needs isolated memory and io
>>>>>>>>>>>> address spaces so that vfio-user clients don't cause collisions when
>>>>>>>>>>>> they map BARs to the same address.
>>>>>>>>>>>>
>>>>>>>>>>>> I think the separate root complex idea is a good solution. This patch
>>>>>>>>>>>> series takes a different approach by adding the concept of isolated
>>>>>>>>>>>> address spaces into hw/pci/.
>>>>>>>>>>> 
>>>>>>>>>>> This all still seems pretty sketchy, BARs cannot overlap within the
>>>>>>>>>>> same vCPU address space, perhaps with the exception of when they're
>>>>>>>>>>> being sized, but DMA should be disabled during sizing.
>>>>>>>>>>> 
>>>>>>>>>>> Devices within the same VM context with identical BARs would need to
>>>>>>>>>>> operate in different address spaces.  For example a translation offset
>>>>>>>>>>> in the vCPU address space would allow unique addressing to the devices,
>>>>>>>>>>> perhaps using the translation offset bits to address a root complex and
>>>>>>>>>>> masking those bits for downstream transactions.
>>>>>>>>>>> 
>>>>>>>>>>> In general, the device simply operates in an address space, ie. an
>>>>>>>>>>> IOVA.  When a mapping is made within that address space, we perform a
>>>>>>>>>>> translation as necessary to generate a guest physical address.  The
>>>>>>>>>>> IOVA itself is only meaningful within the context of the address space,
>>>>>>>>>>> there is no requirement or expectation for it to be globally unique.
>>>>>>>>>>>
>>>>>>>>>>> If the vfio-user server is making some sort of requirement that IOVAs
>>>>>>>>>>> are unique across all devices, that seems very, very wrong.  Thanks,
>>>>>>>>>> 
>>>>>>>>>> Yes, BARs and IOVAs don't need to be unique across all devices.
>>>>>>>>>> 
>>>>>>>>>> The issue is that there can be as many guest physical address spaces as
>>>>>>>>>> there are vfio-user clients connected, so per-client isolated address
>>>>>>>>>> spaces are required. This patch series has a solution to that problem
>>>>>>>>>> with the new pci_isol_as_mem/io() API.
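
[For context, a rough sketch of the idea behind that API -- the function name
comes from this series' description, but the signature and the fallback below
are assumptions for illustration, not code from the patches:]

    /* Conceptual sketch only: each PCIBus can expose its own isolated
     * memory address space instead of the global address_space_memory,
     * so BAR mappings made by different vfio-user clients cannot collide. */
    AddressSpace *as = pci_isol_as_mem(pci_dev);   /* assumed signature */
    if (!as) {
        as = &address_space_memory;   /* bus without isolation configured */
    }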
>>>>>>>>> 
>>>>>>>>> Sorry, this still doesn't follow for me.  A server that hosts multiple
>>>>>>>>> devices across many VMs (I'm not sure if you're referring to the device
>>>>>>>>> or the VM as a client) needs to deal with different address spaces per
>>>>>>>>> device.  The server needs to be able to uniquely identify every DMA,
>>>>>>>>> which must be part of the interface protocol.  But I don't see how that
>>>>>>>>> imposes a requirement of an isolated address space.  If we want the
>>>>>>>>> device isolated because we don't trust the server, that's where an IOMMU
>>>>>>>>> provides per device isolation.  What is the restriction of the
>>>>>>>>> per-client isolated address space and why do we need it?  The server
>>>>>>>>> needing to support multiple clients is not a sufficient answer to
>>>>>>>>> impose new PCI bus types with an implicit restriction on the VM.
>>>>>>>> 
>>>>>>>> Hi Alex,
>>>>>>>> 
>>>>>>>> I believe there are two separate problems with running PCI devices in
>>>>>>>> the vfio-user server. The first one is concerning memory isolation and
>>>>>>>> second one is vectoring of BAR accesses (as explained below).
>>>>>>>> 
>>>>>>>> In our previous patches (v3), we used an IOMMU to isolate memory
>>>>>>>> spaces. But we still had trouble with the vectoring. So we implemented
>>>>>>>> separate address spaces for each PCIBus to tackle both problems
>>>>>>>> simultaneously, based on the feedback we got.
>>>>>>>> 
>>>>>>>> The following gives an overview of issues concerning vectoring of
>>>>>>>> BAR accesses.
>>>>>>>> 
>>>>>>>> The device’s BAR regions are mapped into the guest physical address
>>>>>>>> space. The guest writes the guest PA of each BAR into the device’s BAR
>>>>>>>> registers. To access the BAR regions of the device, QEMU uses
>>>>>>>> address_space_rw() which vectors the physical address access to the
>>>>>>>> device BAR region handlers.  
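
[For illustration, a minimal sketch of that dispatch path -- "bar_gpa" is a
placeholder for whatever guest physical address the guest wrote into the BAR
register:]

    /* An access to a guest physical address that falls inside a mapped BAR
     * is routed by the memory API to that BAR's MemoryRegionOps callbacks. */
    uint32_t val = 0x1;
    address_space_rw(&address_space_memory, bar_gpa,
                     MEMTXATTRS_UNSPECIFIED, &val, sizeof(val),
                     true /* is_write */);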
>>>>>>> 
>>>>>>> The guest physical address written to the BAR is irrelevant from the
>>>>>>> device perspective, this only serves to assign the BAR an offset within
>>>>>>> the address_space_mem, which is used by the vCPU (and possibly other
>>>>>>> devices depending on their address space).  There is no reason for the
>>>>>>> device itself to care about this address.  
>>>>>> 
>>>>>> Thank you for the explanation, Alex!
>>>>>> 
>>>>>> The confusion on my part is whether we are inside the device already when
>>>>>> the server receives a request to access BAR region of a device. Based on
>>>>>> your explanation, I get that your view is the BAR access request has
>>>>>> propagated into the device already, whereas I was under the impression
>>>>>> that the request is still on the CPU side of the PCI root complex.
>>>>> 
>>>>> If you are getting an access through your MemoryRegionOps, all the
>>>>> translations have been made, you simply need to use the hwaddr as the
>>>>> offset into the MemoryRegion for the access.  Perform the read/write to
>>>>> your device, no further translations required.
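
[A minimal MemoryRegionOps sketch of that point -- "MyDevState" and its
register array are made up for illustration; the key detail is that "addr" is
simply the offset into the BAR:]

    static uint64_t mydev_bar_read(void *opaque, hwaddr addr, unsigned size)
    {
        MyDevState *s = opaque;

        /* All translation is already done; addr is the offset into the BAR. */
        return s->regs[addr / 4];
    }

    static void mydev_bar_write(void *opaque, hwaddr addr,
                                uint64_t val, unsigned size)
    {
        MyDevState *s = opaque;

        s->regs[addr / 4] = val;
    }

    static const MemoryRegionOps mydev_bar_ops = {
        .read = mydev_bar_read,
        .write = mydev_bar_write,
        .endianness = DEVICE_LITTLE_ENDIAN,
    };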
>>>>> 
>>>>>> Your view makes sense to me - once the BAR access request reaches the
>>>>>> client (on the other side), we could consider that the request has reached
>>>>>> the device.
>>>>>> 
>>>>>> On a separate note, if devices don’t care about the values in BAR
>>>>>> registers, why do the default PCI config handlers intercept and map
>>>>>> the BAR region into address_space_mem?
>>>>>> (pci_default_write_config() -> pci_update_mappings())
>>>>> 
>>>>> This is the part that's actually placing the BAR MemoryRegion as a
>>>>> sub-region into the vCPU address space.  I think if you track it,
>>>>> you'll see PCIDevice.io_regions[i].address_space is actually
>>>>> system_memory, which is used to initialize address_space_system.
>>>>> 
>>>>> The machine assembles PCI devices onto buses as instructed by the
>>>>> command line or hot plug operations.  It's the responsibility of the
>>>>> guest firmware and guest OS to probe those devices, size the BARs, and
>>>>> place the BARs into the memory hierarchy of the PCI bus, ie. system
>>>>> memory.  The BARs are necessarily in the "guest physical memory" for
>>>>> vCPU access, but it's essentially only coincidental that PCI devices
>>>>> might be in an address space that provides a mapping to their own BAR.
>>>>> There's no reason to ever use it.
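
[And the registration side, sketched for the same hypothetical device: the
device only hands its MemoryRegion to the PCI core. pci_update_mappings()
later places it into the bus's address space at whatever base the guest
programs, and the device never needs to know that address:]

    static void mydev_realize(PCIDevice *pdev, Error **errp)
    {
        MyDevState *s = MYDEV(pdev);   /* hypothetical QOM cast macro */

        memory_region_init_io(&s->bar0, OBJECT(pdev), &mydev_bar_ops, s,
                              "mydev-bar0", 4096);
        pci_register_bar(pdev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &s->bar0);
    }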
>>>>> 
>>>>> In the vIOMMU case, we can't know that the device address space
>>>>> includes those BAR mappings or if they do, that they're identity mapped
>>>>> to the physical address.  Devices really need to not infer anything
>>>>> about an address.  Think about real hardware, a device is told by
>>>>> driver programming to perform a DMA operation.  The device doesn't know
>>>>> the target of that operation, it's the guest driver's responsibility to
>>>>> make sure the IOVA within the device address space is valid and maps to
>>>>> the desired target.  Thanks,
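
[The DMA side, again only a sketch with placeholder names: the device hands
the memory API an IOVA and lets pci_dma_read() resolve it through the
device's own address space (the vIOMMU-translated one when a vIOMMU is
present); the device never interprets the address itself:]

    uint64_t desc;
    if (pci_dma_read(pdev, iova, &desc, sizeof(desc)) != MEMTX_OK) {
        /* translation failed, e.g. the IOVA was never mapped by the driver */
    }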
>>>> 
>>>> Thanks for the explanation, Alex. Thanks to everyone else in the thread who
>>>> helped to clarify this problem.
>>>> 
>>>> We have implemented the memory isolation based on the discussion in the
>>>> thread. We will send the patches out shortly.
>>>> 
>>>> Devices such as “name” and “e1000” worked fine. But I’d like to note that
>>>> the LSI device (TYPE_LSI53C895A) had some problems - it doesn’t seem to be
>>>> IOMMU aware. In LSI’s case, the kernel driver is asking the device to read
>>>> instructions from the CPU VA (lsi_execute_script() -> read_dword()), which
>>>> is forbidden when the IOMMU is enabled. Specifically, the driver is asking
>>>> the device to access other BAR regions by using the BAR address programmed
>>>> in the PCI config space. This happens even without the vfio-user patches.
>>>> For example, we could enable the IOMMU using the “-device intel-iommu”
>>>> QEMU option and also add the following to the kernel command line:
>>>> “intel_iommu=on iommu=nopt”. In this case, we could see an IOMMU fault.
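
[To illustrate the failure pattern (this is not the actual lsi53c895a.c code):
the guest driver hands the device its own BAR address -- a guest PA -- which
the device then issues as a DMA address. Without a vIOMMU the two spaces
happen to coincide; with intel-iommu the value is treated as an IOVA that was
never mapped, so the translation faults:]

    /* "script_addr" is what the guest driver programmed: here it is the guest
     * physical address of one of the device's own BARs, not an IOVA that the
     * driver mapped through the IOMMU. */
    uint32_t insn;
    pci_dma_read(pdev, script_addr, &insn, sizeof(insn));   /* vIOMMU fault */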
>>> 
>>> So, device accessing its own BAR is different. Basically, these
>>> transactions never go on the bus at all, never mind get to the IOMMU.
>> 
>> Hi Michael,
>> 
>> In LSI case, I did notice that it went to the IOMMU.
> 
> Hmm do you mean you analyzed how a physical device works?
> Or do you mean in QEMU?

I mean in QEMU, I did not analyze a physical device.
> 
>> The device is reading the BAR
>> address as if it was a DMA address.
> 
> I got that, my understanding of PCI was that a device cannot be both a
> master and a target of a transaction at the same time. Could not find
> this in the spec though, maybe I remember incorrectly.

I see, OK. If this were to happen in a real device, PCI would raise
an error because the master and target of a transaction can’t be
the same. So you believe that this access is handled inside the
device, and doesn’t go out.

Thanks!
--
Jag

> 
>>> I think it's just used as a handle to address internal device memory.
>>> This kind of trick is not universal, but not terribly unusual.
>>> 
>>> 
>>>> Unfortunately, we started off our project with the LSI device. So that led
>>>> to all the confusion about what is expected at the server end in terms of
>>>> vectoring/address-translation. It gave the impression that the request was
>>>> still on the CPU side of the PCI root complex, but the actual problem was
>>>> with the device driver itself.
>>>> 
>>>> I’m wondering how to deal with this problem. Would it be OK if we mapped
>>>> the device’s BAR into the IOVA, at the same CPU VA programmed in the BAR
>>>> registers? This would help devices such as LSI to circumvent this problem.
>>>> One problem with this approach is that it has the potential to collide
>>>> with another legitimate IOVA address. Kindly share your thoughts on this.
>>>> 
>>>> Thank you!
>>> 
>>> I am not 100% sure what you plan to do, but it sounds fine: even if it
>>> collides, with traditional PCI, devices must never initiate cycles
>> 
>> OK sounds good, I’ll create a mapping of the device BARs in the IOVA.
>> 
>> Thank you!
>> --
>> Jag
>> 
>>> within their own BAR range, and PCIe is software-compatible with PCI. So
>>> devices won't be able to access this IOVA even if it was programmed in
>>> the IOMMU.
>>> 
>>> As was mentioned elsewhere on this thread, devices accessing each
>>> other's BAR is a different matter.
>>> 
>>> I do not remember which rules apply to multiple functions of a
>>> multi-function device though. I think with traditional PCI
>>> they will never go out on the bus, but with e.g. SRIOV they
>>> probably would go out? Alex, any idea?
>>> 
>>> 
>>>> --
>>>> Jag
>>>> 
>>>>> 
>>>>> Alex
