Hi, since I'm late to the party I'll reply to the entire thread in one go.
On Fri, Sep 19, 2025 at 06:22:45AM +0000, Kasireddy, Vivek wrote:
> I think using a PCI BAR Address works just fine in this case because the Xe
> driver bound to PF on the Host can easily determine that it belongs to one
> of the VFs and translate it into VRAM Address.

There are PCIe bridges that support address translation, and they might
apply different translations for different PASIDs, so this determination
would need to walk the device tree on both guest and host in a way that
does not confer trust to the guest or allow it to gain access to
resources through race conditions.

The difficulty here is that you are building a communication mechanism
that bypasses a trust boundary in the virtualization framework, so it
becomes part of the virtualization framework. I believe we can avoid
that to some extent by exchanging handles instead of raw pointers.

I can see the point in using the dmabuf API, because it integrates well
with existing 3D APIs in userspace, although I don't quite understand
what the VK_KHR_external_memory_dma_buf extension actually does besides
defining a flag bit -- it seems the heavy lifting is done by the
VK_KHR_external_memory_fd extension anyway.

But yes, we probably want the interface to be compatible with existing
sharing APIs on the host side at least, to allow the guest's "on-screen"
images to be easily imported. There is some potential for a shortcut
here as well: giving these buffers directly to the host's desktop
compositor instead of having an application react to updates by copying
the data from the area shared with the VF to the area shared between
the application and the compositor -- that would also be a reason to
remain close to the existing interface.

It's not entirely necessary for this interface to be a dma_buf, as long
as we have a conversion between a file descriptor and a BO. On the
other hand, it may be desirable to allow re-exporting it as a dma_buf
if we want to access it from another device as well. I'm not sure that
is a likely use case though; even the horrible contraption I'm building
here, which has a Thunderbolt device send data directly to VRAM, does
not require that, because the guest would process the data and then
send a different buffer to the host. Still, it would be nice for
completeness.

The other thing that seems to be looming on the horizon is that dma_buf
is too limited for VRAM buffers: once a buffer is imported, it is
pinned as well, but we'd like to keep it moveable (there was another
thread on the xe mailing list about that). That might be even more
important if we have limited BAR space, because then we might not want
to make the memory accessible through the BAR unless it is imported by
something that needs access through the BAR -- which we've established
the main use case doesn't (because it doesn't even need any kind of
access).

I think passing objects between trust domains should take the form of
an opaque handle that is not predictable and refers to an internal data
structure with the actual parameters (so we pass these internally as
well, and avoid all the awkwardness of host and guest having different
world views). It doesn't matter if that path is slow; it should only be
used rather seldom (at VM start and when the VM changes screen
resolution).
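
Roughly what I have in mind on the host side (just a sketch, not actual
Xe code -- all names here are made up, and it assumes a 64-bit host so
the handle fits into an xarray index):

#include <linux/types.h>
#include <linux/random.h>
#include <linux/xarray.h>

/* The real parameters live host-side; the guest only ever sees the handle. */
struct vf_buf_desc {
	u64 vram_offset;	/* resolved by the PF driver, never taken from the guest */
	u64 size;
	u32 vf_id;
};

static DEFINE_XARRAY(vf_buf_handles);

/* Publish a descriptor and return an unguessable handle for the guest. */
static int vf_buf_publish(struct vf_buf_desc *desc, u64 *handle)
{
	u64 h;
	int ret;

	do {
		do {
			h = get_random_u64();	/* unpredictable, non-sequential */
		} while (!h);
		ret = xa_insert(&vf_buf_handles, h, desc, GFP_KERNEL);
	} while (ret == -EBUSY);	/* collision: just pick another handle */

	if (!ret)
		*handle = h;
	return ret;
}

/* Resolve a handle presented by the guest; NULL just means "no such object". */
static struct vf_buf_desc *vf_buf_lookup(u64 handle)
{
	return xa_load(&vf_buf_handles, handle);
}

Creation and lookup only happen on the rare path (VM start, mode
change), so it does not matter that this is not particularly fast, and
the guest's view of addresses never enters into it.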
For VM startup, we probably want to provision guest "on-screen" memory
and semaphores really early -- maybe it makes sense to just give each VF
a sensible shared mapping like 16 MB (rounded up from 2*1080p*32bit) by
default, and/or present a ROM with EFI and OpenFirmware drivers -- can
VFs do that on current hardware?

On Tue, Sep 23, 2025 at 05:53:06AM +0000, Kasireddy, Vivek wrote:
> IIUC, it is a common practice among GPU drivers including Xe and Amdgpu
> to never expose VRAM Addresses and instead have BAR addresses as DMA
> addresses when exporting dmabufs to other devices.

Yes, because that is how the other devices access that memory.

> The problem here is that the CPU physical (aka BAR Address) is only
> usable by the CPU.

The address you receive from mapping a dma_buf for a particular device
is not a CPU physical address, even if the two happen to be identical
on pretty much all PC hardware, because it is uncommon to configure the
root bridge with a translation there.

On my POWER9 machine, the situation is a bit different: a range in the
lower 4 GB is reserved for 32-bit BARs, the memory with those physical
addresses is remapped so it appears after the end of physical RAM from
the point of view of PCIe devices, and the 32-bit BARs appear at the
base of the PCIe bus (after the legacy ports).

So, as an example (reality is a bit more complex :> ), the memory map
might look like

  0000000000000000..0000001fffffffff  RAM
  0060000000000000..006001ffffffffff  PCIe domain 1
  0060020000000000..006003ffffffffff  PCIe domain 2
  ...

and the phys_addr_t I get on the CPU refers to this mapping. However, a
device attached to PCIe domain 1 would see

  0000000000000000..000000000000ffff  Legacy I/O in PCIe domain 1
  0000000000010000..00000000000fffff  Legacy VGA mappings
  0000000000100000..000000007fffffff  32-bit BARs in PCIe domain 1
  0000000080000000..00000000ffffffff  RAM (accessible to 32-bit devices)
  0000000100000000..0000001fffffffff  RAM (requires 64-bit addressing)
  0000002000000000..000000207fffffff  RAM (CPU physical address 0..2GB)
  0060000080000000..006001ffffffffff  64-bit BARs in PCIe domain 1
  0060020000000000..006003ffffffffff  PCIe domain 2

This allows 32-bit devices to access other 32-bit devices on the same
bus, and (some) physical memory, but we need to sacrifice the 1:1
mapping for host memory. The actual mapping is a bit more complex,
because 64-bit BARs get mapped into the "32-bit" space to keep them
accessible to 32-bit cards in the same domain, and this would also be a
valid reason not to extend the BAR size even if we can. The default
256 MB aperture ends up in the "32-bit" range, so unless the BAR is
resized and reallocated, the CPU and DMA addresses for the aperture
*will* differ.

So when a DMA buffer is created that ends up in the first 2 GB of RAM,
the dma_addr_t returned for this device will have 0x2000000000 added to
it, because that is the address the device has to use, and DMA buffers
for 32-bit devices will be taken from the 2GB..4GB range, because
neither the first 2 GB nor anything beyond 4 GB is accessible to such a
device.

If there is a 32-bit BAR at 0x10000000 in domain 1, then the CPU will
see it at 0x60000010000000, but mapping it from another device in the
same domain will return a dma_addr_t of 0x10000000 -- because that is
the address that is routeable in the PCIe fabric, this is the BAR
address configured into the device so it will actually respond, and the
TLP will not leave the bus because it is downstream of the root bridge,
so it does not affect the physical RAM.
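
In kernel terms there is already an interface that expresses "give me
the address this other device's BAR has from *your* point of view"; a
minimal sketch, with peer and target being hypothetical devices in the
same PCIe domain:

#include <linux/dma-mapping.h>
#include <linux/pci.h>

static dma_addr_t map_peer_bar(struct pci_dev *peer, struct pci_dev *target,
			       int bar)
{
	/* CPU view of the BAR, e.g. 0x60000010000000 in the example above */
	phys_addr_t bar_phys = pci_resource_start(target, bar);
	size_t bar_len = pci_resource_len(target, bar);
	dma_addr_t addr;

	/*
	 * What comes back is whatever is routeable for "peer" -- in the
	 * example above 0x10000000, not the CPU physical address.
	 */
	addr = dma_map_resource(&peer->dev, bar_phys, bar_len,
				DMA_BIDIRECTIONAL, 0);
	if (dma_mapping_error(&peer->dev, addr))
		return DMA_MAPPING_ERROR;

	return addr;
}

The translation is per-device, which is exactly why a raw address is
meaningless outside the context of the device it was mapped for.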
Actual numbers will be different to handle even more corner cases and I
don't remember exactly how many zeroes are in each range, but you get
the idea -- and this is before we've even started creating virtual
machines with a different view of physical addresses.

On Tue, Sep 23, 2025 at 06:01:34AM +0000, Kasireddy, Vivek wrote:
> - The Xe Graphics driver running inside the Linux VM creates a buffer
> (Gnome Wayland compositor's framebuffer) in the VF's portion (or share)
> of the VRAM and this buffer is shared with Qemu. Qemu then requests
> vfio-pci driver to create a dmabuf associated with this buffer.

That's a bit late. What is EFI supposed to do?

Simon
