On Wed, Jan 28, 2026 at 11:14:58AM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 27, 2026 at 04:48:36PM -0800, Matthew Brost wrote:
> > Add an IOVA interface to the DRM pagemap layer. This provides a semantic
> > wrapper around the dma-map IOVA alloc/link/sync/unlink/free API while
> > remaining flexible enough to support future high-speed interconnects
> > between devices.
> 
> I don't think this is a very clear justification.
> 
> "IOVA" and dma_addr_t should be strictly reserved for communication
> that flows through the interconnect that Linux struct device is aware
> of (ie the PCIe fabric). It should not ever be used for "high speed
> interconnects" implying some private and hidden things like
> xgmi/nvlink/ualink type stuff.
> 

Yes, the future here is xgmi/nvlink/ualink type stuff. I agree we (DRM
pagemap, GPU SVM, Xe) need a refactor to avoid using dma_addr_t in any
of these interfaces once we unify around xgmi/nvlink/ualink, since
dma_addr_t doesn't make much sense there. This is a PoC of the code
structure. s/IOVA/something else/ in the interface names may make sense
too.

> I can't think of any reason why you'd want to delegate constructing
> the IOVA to some other code. I can imagine you'd want to get a pfn
> list from someplace else and turn that into a mapping.
>

Yes, this is exactly what I envision here. First, let me explain the
possible addressing modes on the UAL fabric:

 - Physical (akin to IOMMU passthrough)
 - Virtual (akin to IOMMU enabled)

Physical mode is straightforward — resolve the PFN to a cross-device
physical address, then install it into the initiator’s page tables along
with a bit indicating routing over the network. In this mode, the vfuncs
here are basically NOPs.

Virtual mode is the tricky one. There are addressing modes where a
virtual address must be allocated at the target device (i.e., the
address on the wire is translated at the target via a page-table walk).
This is why the code is structured the way it is, and why I envision a
UAL API that mirrors dma-map. At the initiator, the target's virtual
address is installed in the page tables along with a bit indicating
routing over the network.

Let me give some examples of what this would look like in a few of the
vfuncs — see [1] for the dma-map implementation. Also ignore dma_addr_t
abuse for now.

[1] https://patchwork.freedesktop.org/patch/701149/?series=160587&rev=3

struct xe_svm_iova_cookie {
        struct dma_iova_state state;
        struct ual_iova_state ual_state;
};

static void *xe_drm_pagemap_device_iova_alloc(struct drm_pagemap *dpagemap,
                                              struct device *dev, size_t length,
                                              enum dma_data_direction dir)
{
        struct device *pgmap_dev = dpagemap->drm->dev;
        struct xe_svm_iova_cookie *cookie;
        static bool locking_proved = false;
        int err;

        xe_drm_pagemap_device_iova_prove_locking(&locking_proved);

        if (pgmap_dev == dev)
                return NULL;

        cookie = kzalloc(sizeof(*cookie), GFP_KERNEL);
        if (!cookie)
                return NULL;

        if (ual_distance(pgmap_dev, dev) < 0) {
                dma_iova_try_alloc(dev, &cookie->state,
                                   length >= SZ_2M ? SZ_2M : 0, length);
                if (dma_use_iova(&cookie->state))
                        return cookie;
        } else {
                err = ual_iova_try_alloc(pgmap_dev, &cookie->ual_state,
                                         length >= SZ_2M ? SZ_2M : 0,
                                         length);
                if (err)
                        return ERR_PTR(err);

                if (ual_use_iova(&cookie->ual_state))
                        return cookie;
        }

        kfree(cookie);
        return NULL;
}

So here, in physical mode 'ual_use_iova' would return false; in virtual
mode it returns true.

This function is also interesting because ual_iova_try_alloc in virtual
mode can allocate memory for PTEs on the target device. This is why the
kernel doc explanation for Context, along with
xe_drm_pagemap_device_iova_prove_locking, is important to ensure that
all the locking is correct.
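
For illustration, the prove-locking helper could be as simple as a
once-only might_alloc() assertion (just a sketch):

static void xe_drm_pagemap_device_iova_prove_locking(bool *proved)
{
        if (*proved)
                return;

        /*
         * The alloc vfunc may allocate target-side PTE memory with
         * GFP_KERNEL, so assert once against lockdep that the calling
         * context tolerates reclaim.
         */
        might_alloc(GFP_KERNEL);

        *proved = true;
}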

Now this function:

static struct drm_pagemap_addr
xe_drm_pagemap_device_iova_link(struct drm_pagemap *dpagemap,
                                struct device *dev, struct page *page,
                                size_t length, size_t offset, void *cookie,
                                enum dma_data_direction dir)
{
        struct device *pgmap_dev = dpagemap->drm->dev;
        struct xe_svm_iova_cookie *__cookie = cookie;
        struct xe_device *xe = to_xe_device(dpagemap->drm);
        enum drm_interconnect_protocol proto;
        dma_addr_t addr;
        int err;

        if (dma_use_iova(&__cookie->state)) {
                addr = __cookie->state.addr + offset;
                proto = XE_INTERCONNECT_P2P;
                err = dma_iova_link(dev, &__cookie->state,
                                    xe_page_to_pcie(page), offset, length, dir,
                                    DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO);
        } else {
                addr = __cookie->ual_state.addr + offset;
                proto = XE_INTERCONNECT_VRAM;   /* Also means over fabric */
                err = ual_iova_link(dev, &__cookie->ual_state,
                                    xe_page_to_pcie(page), offset, length, dir);
        }
        if (err)
                addr = DMA_MAPPING_ERROR;

        return drm_pagemap_addr_encode(addr, proto, ilog2(length), dir);
}

Note that the above function can only be called in virtual mode (i.e.,
the first function returns an IOVA cookie). Here we’d jam the target’s
PTEs with physical page addresses (reclaim-safe) and return the network
virtual address.
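
For completeness, the unlink/free side would just mirror this. A sketch
(the vfunc name and signature are made up here, and ual_iova_destroy()
is a hypothetical UAL counterpart of dma_iova_destroy()):

static void xe_drm_pagemap_device_iova_free(struct drm_pagemap *dpagemap,
                                            struct device *dev, size_t length,
                                            void *cookie,
                                            enum dma_data_direction dir)
{
        struct xe_svm_iova_cookie *__cookie = cookie;

        if (!__cookie)
                return;         /* physical mode - nothing to tear down */

        if (dma_use_iova(&__cookie->state)) {
                /* Unlink and free the PCIe P2P IOVA in one go */
                dma_iova_destroy(dev, &__cookie->state, length, dir,
                                 DMA_ATTR_SKIP_CPU_SYNC);
        } else {
                /* Hypothetical UAL teardown, frees target-side PTEs too */
                ual_iova_destroy(dpagemap->drm->dev, &__cookie->ual_state,
                                 length, dir);
        }

        kfree(__cookie);
}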

Lastly, a physical UAL example (i.e., the first function returns NULL):

static struct drm_pagemap_addr
xe_drm_pagemap_device_map(struct drm_pagemap *dpagemap,
                          struct device *dev,
                          struct page *page,
                          unsigned int order,
                          enum dma_data_direction dir)
{
        struct device *pgmap_dev = dpagemap->drm->dev;
        enum drm_interconnect_protocol prot;
        dma_addr_t addr;

        if (pgmap_dev == dev || ual_distance(pgmap_dev, dev) >= 0) {
                addr = xe_page_to_dpa(page);
                prot = XE_INTERCONNECT_VRAM;
        } else {
                addr = dma_map_resource(dev,
                                        xe_page_to_pcie(page),
                                        PAGE_SIZE << order, dir,
                                        DMA_ATTR_SKIP_CPU_SYNC);
                prot = XE_INTERCONNECT_P2P;
        }

        return drm_pagemap_addr_encode(addr, prot, order, dir);
}

So, in case it isn't clear: these vfuncs hide from the DRM common layer
whether PCIe P2P is being used (IOMMU in passthrough or enabled) or UAL
is being used (physical or virtual). They manage the resources for the
connection and provide the information needed to program the initiator
PTEs (the address plus a "use interconnect" vs. "use PCIe P2P" bit).
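
As a sketch of the initiator side (assuming drm_pagemap_addr exposes
.addr and .proto as in the series above; the PTE bit names are made
up):

static u64 xe_svm_pte_from_addr(struct drm_pagemap_addr addr)
{
        u64 pte = addr.addr;

        /* The returned protocol alone picks the routing bit */
        if (addr.proto == XE_INTERCONNECT_VRAM)
                pte |= XE_PTE_ROUTE_FABRIC;     /* local VRAM or fabric */
        else if (addr.proto == XE_INTERCONNECT_P2P)
                pte |= XE_PTE_ROUTE_P2P;

        return pte;
}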

This reasoning is why it would be nice if drivers were allowed to use
the dma-map IOVA alloc/link/sync/unlink/free API directly for PCIe P2P.

> My understanding of all the private interconnects is you get an
> interconnect address and program it directly into the device HW,
> possibly with a "use interconnect" bit, and the device never touches
> the PCIe fabric at all.
> 

Yes, but see the physical vs. virtual explanation above. The "use
interconnect" bit is just one part of this.

Matt

> Jason
