On Wed, 27 May 2026 03:23:10 -0700
Matt Evans <[email protected]> wrote:

> A VFIO DMABUF can export a subset of a BAR to userspace by fd; add
> support for mmap() of this fd.  This provides another route for a
> process to map BARs, except one where the process can only map a specific
> subset of a BAR represented by the exported DMABUF.
> 
> mmap() support enables userspace driver designs that safely delegate
> access to BAR sub-ranges to other client processes by sharing a DMABUF
> fd, without having to share the (omnipotent) VFIO device fd with them.
> 
> Since the main VFIO BAR mmap() is now DMABUF-aware, this path reuses
> the existing vm_ops.  But, since the lifecycle of an exported DMABUF
> is still decoupled from that of the device fd it came from, the device
> fd might now be closed concurrent with a VMA fault.
> 
> Extra synchronisation is added to deal with the possibility of a fault
> racing with the DMABUF cleanup path.  (Note that this differs to a
> DMABUF implicitly created on the mmap() path, which holds ownership of
> the device fd and so prevents close-during-fault scenarios in order to
> maintain the same user-facing behaviour on close.)  It does this by
> temporarily taking a VFIO device registration to ensure vdev remains
> valid, then vdev->memory_lock can be taken.

Suggest some general rewording of the commit log here, quite confusing.
 
> Signed-off-by: Matt Evans <[email protected]>
> ---
>  drivers/vfio/pci/vfio_pci_core.c   | 79 ++++++++++++++++++++++++++----
>  drivers/vfio/pci/vfio_pci_dmabuf.c | 27 ++++++++++
>  drivers/vfio/pci/vfio_pci_priv.h   |  2 +
>  3 files changed, 99 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c 
> b/drivers/vfio/pci/vfio_pci_core.c
> index cfea59806a4f..41e049fa9a8a 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -12,6 +12,8 @@
>  
>  #include <linux/aperture.h>
>  #include <linux/device.h>
> +#include <linux/dma-buf.h>
> +#include <linux/dma-resv.h>
>  #include <linux/eventfd.h>
>  #include <linux/file.h>
>  #include <linux/interrupt.h>
> @@ -1742,19 +1744,77 @@ static vm_fault_t vfio_pci_mmap_huge_fault(struct 
> vm_fault *vmf,
>       vm_fault_t ret = VM_FAULT_SIGBUS;
>  
>       /*
> -      * We can rely on the existence of both a DMABUF (priv) and
> -      * the VFIO device it was exported from (vdev).  This fault's
> -      * VMA was established using vfio_pci_core_mmap_prep_dmabuf()
> -      * which transfers ownership of the VFIO device fd to the
> -      * DMABUF, and so the VFIO device is held open because the
> -      * VMA's vm_file (DMABUF) is open.
> +      * The only thing this can rely on is that the DMABUF relating
> +      * to the VMA's vm_file exists (priv).
>        *
> -      * Since vfio_pci_dma_buf_cleanup() cannot have happened,
> -      * vdev must be valid; we can take memory_lock.
> +      * A DMABUF for a VFIO device fd mmap() holds a reference to
> +      * the original VFIO device fd, but an explicitly-exported
> +      * DMABUF does not.  The original fd might have closed,
> +      * meaning this fault can race with
> +      * vfio_pci_dma_buf_cleanup(), meaning priv->vdev might be
> +      * NULL, and the VFIO device registration might have been
> +      * dropped.
> +      *
> +      * With the goal of taking vdev->memory_lock in a world where
> +      * vdev might not still exist:
> +      *
> +      * 1. Take the resv lock on the DMABUF:
> +      *  - If racing cleanup got in first, the buffer is revoked;
> +      *    stop/exit if so.
> +      *  - If we got in first, the buffer is not revoked so vdev is
> +      *    non-NULL, accessible, and cleanup _has not yet put the
> +      *    VFIO device registration_.  So, the device refcount must
> +      *    be >0.
> +      *
> +      * 2. Take vfio_device registration (refcount guaranteed >0
> +      *    hereafter).
> +      *
> +      * 3. Unlock the DMABUF's resv lock:
> +      *  - A racing cleanup can now complete.
> +      *  - But, the device refcount >0, meaning the vfio_device
> +      *    (and vfio_pcie_core device vdev) have not yet been
> +      *    freed.  vdev is accessible, even if the DMABUF has been
> +      *    revoked or cleanup has happened, because
> +      *    vfio_unregister_group_dev() can't complete.
> +      *
> +      * 4. Take the vdev->memory_lock
> +      *  - Either the DMABUF is usable, or has been cleaned up.
> +      *    Whichever, it can no longer change under us.
> +      *  - Test the DMABUF revocation status again: if it was
> +      *    revoked between 1 and 4 return a SIGBUS. Otherwise,
> +      *    return a PFN.
> +      *  - It's not necessary to also take the resv lock, because
> +      *    the status/vdev can't change while memory_lock is held.
> +      *
> +      * 5. Unlock, done.
>        */
> +
> +     dma_resv_lock(priv->dmabuf->resv, NULL);
>       vdev = READ_ONCE(priv->vdev);

I think you've again avoided the need for the READ_ONCE() by getting it
under dma_resv_lock(), so it's still unnecessary.

>  
> +     if (priv->revoked || !vdev) {
> +             pr_debug_ratelimited("%s VA 0x%lx, pgoff 0x%lx: DMABUF 
> revoked/cleaned up\n",
> +                                  __func__, vmf->address, vma->vm_pgoff);
> +             dma_resv_unlock(priv->dmabuf->resv);
> +             return VM_FAULT_SIGBUS;
> +     }
> +     /* vdev is usable */
> +
> +     if (!vfio_device_try_get_registration(&vdev->vdev)) {
> +             /*
> +              * If vdev != NULL (above), the registration should
> +              * already be >0 and so this try_get should never
> +              * fail.
> +              */
> +             dev_warn(&vdev->pdev->dev, "%s: Unexpected registration 
> failure\n",
> +                      __func__);
> +             dma_resv_unlock(priv->dmabuf->resv);
> +             return VM_FAULT_SIGBUS;
> +     }
> +     dma_resv_unlock(priv->dmabuf->resv);
> +
>       scoped_guard(rwsem_read, &vdev->memory_lock) {
> +             /* Revocation status must be re-read, under memory_lock */
>               if (!priv->revoked) {
>                       int pres = vfio_pci_dma_buf_find_pfn(priv, vma,
>                                                            vmf->address,
> @@ -1773,6 +1833,7 @@ static vm_fault_t vfio_pci_mmap_huge_fault(struct 
> vm_fault *vmf,
>                                   vma->vm_pgoff, (unsigned int)ret);
>       }
>  
> +     vfio_device_put_registration(&vdev->vdev);
>       return ret;
>  }
>  
> @@ -1781,7 +1842,7 @@ static vm_fault_t vfio_pci_mmap_page_fault(struct 
> vm_fault *vmf)
>       return vfio_pci_mmap_huge_fault(vmf, 0);
>  }
>  
> -static const struct vm_operations_struct vfio_pci_mmap_ops = {
> +const struct vm_operations_struct vfio_pci_mmap_ops = {
>       .fault = vfio_pci_mmap_page_fault,
>  #ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
>       .huge_fault = vfio_pci_mmap_huge_fault,
> diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c 
> b/drivers/vfio/pci/vfio_pci_dmabuf.c
> index 733607371082..4b3b15655f1d 100644
> --- a/drivers/vfio/pci/vfio_pci_dmabuf.c
> +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
> @@ -27,6 +27,32 @@ static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
>  
>       return 0;
>  }
> +
> +static int vfio_pci_dma_buf_mmap(struct dma_buf *dmabuf, struct 
> vm_area_struct *vma)
> +{
> +     struct vfio_pci_dma_buf *priv = dmabuf->priv;
> +
> +     if (priv->revoked)
> +             return -ENODEV;

Questionable validity to testing revoked without a lock, but doesn't
this also fail to follow the "map regardless, sort it out on fault"
paradigm used elsewhere in vfio-pci?  Thanks,

Alex

> +     if ((vma->vm_flags & VM_SHARED) == 0)
> +             return -EINVAL;
> +
> +     /*
> +      * dma_buf_mmap_internal() has asserted that the VMA is
> +      * contained within the DMABUF size before calling this.
> +      */
> +
> +     vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> +     vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
> +
> +     /* See comments in vfio_pci_core_mmap() re VM_ALLOW_ANY_UNCACHED. */
> +     vm_flags_set(vma, VM_ALLOW_ANY_UNCACHED | VM_IO | VM_PFNMAP |
> +                  VM_DONTEXPAND | VM_DONTDUMP);
> +     vma->vm_private_data = priv;
> +     vma->vm_ops = &vfio_pci_mmap_ops;
> +
> +     return 0;
> +}
>  #endif /* CONFIG_VFIO_PCI_DMABUF */
>  
>  static void vfio_pci_dma_buf_done(struct kref *kref)
> @@ -94,6 +120,7 @@ static void vfio_pci_dma_buf_release(struct dma_buf 
> *dmabuf)
>  static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
>  #ifdef CONFIG_VFIO_PCI_DMABUF
>       .attach = vfio_pci_dma_buf_attach,
> +     .mmap = vfio_pci_dma_buf_mmap,
>  #endif
>       .map_dma_buf = vfio_pci_dma_buf_map,
>       .unmap_dma_buf = vfio_pci_dma_buf_unmap,
> diff --git a/drivers/vfio/pci/vfio_pci_priv.h 
> b/drivers/vfio/pci/vfio_pci_priv.h
> index 10833aabd7fb..db2e2aeae88f 100644
> --- a/drivers/vfio/pci/vfio_pci_priv.h
> +++ b/drivers/vfio/pci/vfio_pci_priv.h
> @@ -38,6 +38,8 @@ struct vfio_pci_dma_buf {
>       u8 revoked : 1;
>  };
>  
> +extern const struct vm_operations_struct vfio_pci_mmap_ops;
> +
>  bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev);
>  void vfio_pci_intx_unmask(struct vfio_pci_core_device *vdev);
>  

Reply via email to