Hi Alex,

On 29/05/2026 00:15, Alex Williamson wrote:

On Wed, 27 May 2026 03:23:10 -0700
Matt Evans <[email protected]> wrote:

A VFIO DMABUF can export a subset of a BAR to userspace by fd; add
support for mmap() of this fd.  This provides another route for a
process to map BARs, except one where the process can only map a specific
subset of a BAR represented by the exported DMABUF.

mmap() support enables userspace driver designs that safely delegate
access to BAR sub-ranges to other client processes by sharing a DMABUF
fd, without having to share the (omnipotent) VFIO device fd with them.

Since the main VFIO BAR mmap() is now DMABUF-aware, this path reuses
the existing vm_ops.  But, since the lifecycle of an exported DMABUF
is still decoupled from that of the device fd it came from, the device
fd might now be closed concurrent with a VMA fault.

Extra synchronisation is added to deal with the possibility of a fault
racing with the DMABUF cleanup path.  (Note that this differs to a
DMABUF implicitly created on the mmap() path, which holds ownership of
the device fd and so prevents close-during-fault scenarios in order to
maintain the same user-facing behaviour on close.)  It does this by
temporarily taking a VFIO device registration to ensure vdev remains
valid, then vdev->memory_lock can be taken.

Suggest some general rewording of the commit log here, quite confusing.

OK!

Signed-off-by: Matt Evans <[email protected]>
---
  drivers/vfio/pci/vfio_pci_core.c   | 79 ++++++++++++++++++++++++++----
  drivers/vfio/pci/vfio_pci_dmabuf.c | 27 ++++++++++
  drivers/vfio/pci/vfio_pci_priv.h   |  2 +
  3 files changed, 99 insertions(+), 9 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index cfea59806a4f..41e049fa9a8a 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -12,6 +12,8 @@
#include <linux/aperture.h>
  #include <linux/device.h>
+#include <linux/dma-buf.h>
+#include <linux/dma-resv.h>
  #include <linux/eventfd.h>
  #include <linux/file.h>
  #include <linux/interrupt.h>
@@ -1742,19 +1744,77 @@ static vm_fault_t vfio_pci_mmap_huge_fault(struct 
vm_fault *vmf,
        vm_fault_t ret = VM_FAULT_SIGBUS;
/*
-        * We can rely on the existence of both a DMABUF (priv) and
-        * the VFIO device it was exported from (vdev).  This fault's
-        * VMA was established using vfio_pci_core_mmap_prep_dmabuf()
-        * which transfers ownership of the VFIO device fd to the
-        * DMABUF, and so the VFIO device is held open because the
-        * VMA's vm_file (DMABUF) is open.
+        * The only thing this can rely on is that the DMABUF relating
+        * to the VMA's vm_file exists (priv).
         *
-        * Since vfio_pci_dma_buf_cleanup() cannot have happened,
-        * vdev must be valid; we can take memory_lock.
+        * A DMABUF for a VFIO device fd mmap() holds a reference to
+        * the original VFIO device fd, but an explicitly-exported
+        * DMABUF does not.  The original fd might have closed,
+        * meaning this fault can race with
+        * vfio_pci_dma_buf_cleanup(), meaning priv->vdev might be
+        * NULL, and the VFIO device registration might have been
+        * dropped.
+        *
+        * With the goal of taking vdev->memory_lock in a world where
+        * vdev might not still exist:
+        *
+        * 1. Take the resv lock on the DMABUF:
+        *  - If racing cleanup got in first, the buffer is revoked;
+        *    stop/exit if so.
+        *  - If we got in first, the buffer is not revoked so vdev is
+        *    non-NULL, accessible, and cleanup _has not yet put the
+        *    VFIO device registration_.  So, the device refcount must
+        *    be >0.
+        *
+        * 2. Take vfio_device registration (refcount guaranteed >0
+        *    hereafter).
+        *
+        * 3. Unlock the DMABUF's resv lock:
+        *  - A racing cleanup can now complete.
+        *  - But, the device refcount >0, meaning the vfio_device
+        *    (and vfio_pcie_core device vdev) have not yet been
+        *    freed.  vdev is accessible, even if the DMABUF has been
+        *    revoked or cleanup has happened, because
+        *    vfio_unregister_group_dev() can't complete.
+        *
+        * 4. Take the vdev->memory_lock
+        *  - Either the DMABUF is usable, or has been cleaned up.
+        *    Whichever, it can no longer change under us.
+        *  - Test the DMABUF revocation status again: if it was
+        *    revoked between 1 and 4 return a SIGBUS. Otherwise,
+        *    return a PFN.
+        *  - It's not necessary to also take the resv lock, because
+        *    the status/vdev can't change while memory_lock is held.
+        *
+        * 5. Unlock, done.
         */
+
+       dma_resv_lock(priv->dmabuf->resv, NULL);
        vdev = READ_ONCE(priv->vdev);

I think you've again avoided the need for the READ_ONCE() by getting it
under dma_resv_lock(), so it's still unnecessary.

Reviewed, you're right ofc. This originally went in when I was using a different approach to resolve the race. I've tweaked the comment and actually it can be further simplified as this !vdev test can be removed:

+       if (priv->revoked || !vdev) {
+               pr_debug_ratelimited("%s VA 0x%lx, pgoff 0x%lx: DMABUF 
revoked/cleaned up\n",
+                                    __func__, vmf->address, vma->vm_pgoff);
+               dma_resv_unlock(priv->dmabuf->resv);
+               return VM_FAULT_SIGBUS;
+       }
+       /* vdev is usable */

...and (plain) vdev read moved here. If (holding the dma_resv_lock()) it's not revoked then vdev is usable/valid.

+
+       if (!vfio_device_try_get_registration(&vdev->vdev)) {
+               /*
+                * If vdev != NULL (above), the registration should
+                * already be >0 and so this try_get should never
+                * fail.
+                */
+               dev_warn(&vdev->pdev->dev, "%s: Unexpected registration 
failure\n",
+                        __func__);
+               dma_resv_unlock(priv->dmabuf->resv);
+               return VM_FAULT_SIGBUS;
+       }
+       dma_resv_unlock(priv->dmabuf->resv);
+
        scoped_guard(rwsem_read, &vdev->memory_lock) {
+               /* Revocation status must be re-read, under memory_lock */
                if (!priv->revoked) {
                        int pres = vfio_pci_dma_buf_find_pfn(priv, vma,
                                                             vmf->address,
@@ -1773,6 +1833,7 @@ static vm_fault_t vfio_pci_mmap_huge_fault(struct 
vm_fault *vmf,
                                    vma->vm_pgoff, (unsigned int)ret);
        }
+ vfio_device_put_registration(&vdev->vdev);
        return ret;
  }
@@ -1781,7 +1842,7 @@ static vm_fault_t vfio_pci_mmap_page_fault(struct vm_fault *vmf)
        return vfio_pci_mmap_huge_fault(vmf, 0);
  }
-static const struct vm_operations_struct vfio_pci_mmap_ops = {
+const struct vm_operations_struct vfio_pci_mmap_ops = {
        .fault = vfio_pci_mmap_page_fault,
  #ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
        .huge_fault = vfio_pci_mmap_huge_fault,
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c 
b/drivers/vfio/pci/vfio_pci_dmabuf.c
index 733607371082..4b3b15655f1d 100644
--- a/drivers/vfio/pci/vfio_pci_dmabuf.c
+++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
@@ -27,6 +27,32 @@ static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
return 0;
  }
+
+static int vfio_pci_dma_buf_mmap(struct dma_buf *dmabuf, struct vm_area_struct 
*vma)
+{
+       struct vfio_pci_dma_buf *priv = dmabuf->priv;
+
+       if (priv->revoked)
+               return -ENODEV;

Questionable validity to testing revoked without a lock, but doesn't
this also fail to follow the "map regardless, sort it out on fault"
paradigm used elsewhere in vfio-pci?  Thanks,

Kind of, though this was intentional. (I had removed a crappy comment here from the RFC, but should re-add something better...)

You're right that, if we don't test this here, the fault handler ensures safety.

In the revoke ioctl^Wfeature header UAPI comment I'd suggested a revocation prevents a new map (in addition to prevents access to an existing map) with the intention of making userspace easier to debug:

 - I'd rather trace an mmap() failure than a SIGBUS (eg. if another
   process has revoked my DMABUF and I didn't get the memo).  For a
   prior revoke this is a reliable & useful test.

 - Even if there's an active race such as userspace doing an
   ioctl(RESET) at the same time as an mmap() (!) then obviously this
   check won't always catch it, but when it does then that's useful.

I'd prefer to keep this belt-and-braces check to early-fail (with comment), if you don't mind much. It _is_ testing revoked outside of any synchronisation as anything we can lock against here doesn't have a useful effect AFAICT. (Taking the example of a concurrent ioctl(RESET), a revoke still manifests "before" (caught here) or "after" (caught in fault handler).) Maybe READ_ONCE() wouldn't hurt, if only to indicate this.


With thanks,


Matt



Alex

+       if ((vma->vm_flags & VM_SHARED) == 0)
+               return -EINVAL;
+
+       /*
+        * dma_buf_mmap_internal() has asserted that the VMA is
+        * contained within the DMABUF size before calling this.
+        */
+
+       vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+       vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
+
+       /* See comments in vfio_pci_core_mmap() re VM_ALLOW_ANY_UNCACHED. */
+       vm_flags_set(vma, VM_ALLOW_ANY_UNCACHED | VM_IO | VM_PFNMAP |
+                    VM_DONTEXPAND | VM_DONTDUMP);
+       vma->vm_private_data = priv;
+       vma->vm_ops = &vfio_pci_mmap_ops;
+
+       return 0;
+}
  #endif /* CONFIG_VFIO_PCI_DMABUF */
static void vfio_pci_dma_buf_done(struct kref *kref)
@@ -94,6 +120,7 @@ static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
  static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
  #ifdef CONFIG_VFIO_PCI_DMABUF
        .attach = vfio_pci_dma_buf_attach,
+       .mmap = vfio_pci_dma_buf_mmap,
  #endif
        .map_dma_buf = vfio_pci_dma_buf_map,
        .unmap_dma_buf = vfio_pci_dma_buf_unmap,
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index 10833aabd7fb..db2e2aeae88f 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -38,6 +38,8 @@ struct vfio_pci_dma_buf {
        u8 revoked : 1;
  };
+extern const struct vm_operations_struct vfio_pci_mmap_ops;
+
  bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev);
  void vfio_pci_intx_unmask(struct vfio_pci_core_device *vdev);


Reply via email to