Hi all,
There were various suggestions in the September 2025 thread "[TECH
TOPIC] vfio, iommufd: Enabling user space drivers to vend more
granular access to client processes" [0], and LPC discussions, around
improving the situation for multi-process userspace driver designs.
This RFC series implements some of these ideas.
(Thanks for feedback on v1! Revised series, with changes noted
inline.)
Background: Multi-process USDs
==============================
The userspace driver scenario discussed in that thread involves a
primary process driving a PCIe function through VFIO/iommufd, which
manages the function-wide ownership/lifecycle. The function is
designed to provide multiple distinct programming interfaces (for
example, several independent MMIO register frames in one function),
and the primary process delegates control of these interfaces to
multiple independent client processes (which do the actual work).
This scenario clearly relies on a HW design that provides appropriate
isolation between the programming interfaces.
The two key needs are:
  1. Mechanisms to safely delegate a subset of the device MMIO
     resources to a client process without over-sharing wider access
     (or influence over whole-device activities, such as reset).
  2. Mechanisms to allow a client process to do its own iommufd
     management w.r.t. its address space, in a way that's isolated
     from DMA relating to other clients.
mmap() of VFIO DMABUFs
======================
This RFC addresses #1 in "vfio/pci: Support mmap() of a VFIO DMABUF",
implementing the proposals in [0] to add mmap() support to the
existing VFIO DMABUF exporter.
This enables a userspace driver to define DMABUF ranges corresponding
to sub-ranges of a BAR, and grant a given client (via a shared fd)
the capability to access (only) those sub-ranges. The VFIO device fds
would be kept private to the primary process. All the client can do
with that fd is map (or iomap via iommufd) that specific subset of
resources, so the impact of bugs/malice is contained.
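
How the DMABUF fd reaches the client is not specified above; one common
option is SCM_RIGHTS over an AF_UNIX socket. A minimal sketch under that
assumption (send_fd() and recv_fd() are illustrative helper names, not
part of this series):

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Control-message buffer sized and aligned for exactly one fd. */
union fd_cmsg {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Primary side: pass 'fd' over a connected AF_UNIX socket. 0 on success. */
static int send_fd(int sock, int fd)
{
	char byte = 0;	/* one data byte carries the ancillary message */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	assert(sock >= 0 && fd >= 0);
	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Client side: receive one fd. Returns the new fd, or -1 on error. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	assert(sock >= 0);
	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The client would then mmap() the received fd; the VFIO device fd itself
never leaves the primary process.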
(We'll follow up on #2 separately, as a related-but-distinct problem.
PASIDs are one way to achieve per-client isolation of DMA; another
could be sharing of a single IOVA space via 'constrained' iommufds.)
New in v2: To achieve this, the existing VFIO BAR mmap() path is
converted to use DMABUFs behind the scenes, in "vfio/pci: Convert BAR
mmap() to use a DMABUF" plus new helper functions, as Jason/Christian
suggested in the v1 discussion [3].
This means:
- Both regular and new DMABUF BAR mappings share the same vm_ops,
  i.e. mmap()ing DMABUFs is a smaller change on top of the existing
  mmap().
- The zapping of mappings occurs via vfio_pci_dma_buf_move(), and the
  vfio_pci_zap_bars() originally paired with the _move()s can go
  away. Each DMABUF has a unique address_space.
- It's a step towards future iommufd VFIO Type1 emulation
  implementing P2P, since iommufd can now get a DMABUF from a VA that
  it's mapping for IO; the VMAs' vm_file is that of the backing
  DMABUF.
Revocation/reclaim
==================
Mapping a BAR subset is useful, but the lifetime of access granted to
a client needs to be managed well. A protocol between the primary
process and the client can indicate when the client is done and when
it's safe to reuse the resources elsewhere, but cleanup can't rely on
the client cooperating.
For robustness, we enable the driver to make the resources
guaranteed-inaccessible when it chooses, so that it can re-assign them
to other uses in future.
"vfio/pci: Permanently revoke a DMABUF on request" adds a new VFIO
device fd ioctl, VFIO_DEVICE_PCI_DMABUF_REVOKE. This takes the fd of a
DMABUF previously exported from that same device and permanently
revokes it: importers are notified/detached, PTEs for any mappings are
zapped, and no future attachment/import/map/access is possible by any
means.
A primary driver process would use this operation when the client's
tenure ends to reclaim "loaned-out" MMIO interfaces, at which point
the interfaces could be safely re-used.
New in v2: The ioctl() is on the VFIO device fd, rather than the DMABUF
fd. A DMABUF is revoked using code common to vfio_pci_dma_buf_move(),
selectively zapping mappings (after waiting for completion of the
dma_buf_invalidate_mappings() request).
BAR mapping access attributes
=============================
Inspired by Alex [Mastro] and Jason's comments in [0] and Mahmoud's
work in [1] with the goal of controlling CPU access attributes for
VFIO BAR mappings (e.g. WC), we can decorate DMABUFs with access
attributes that are then used by a mapping's PTEs.
I've proposed reserving a field in struct
vfio_device_feature_dma_buf's flags to specify an attribute for its
ranges. Although that keeps the (UAPI) struct unchanged, it means all
ranges in a DMABUF share the same attribute. I feel a single
attribute-to-mmap() relation is logical/reasonable. An application
can also create multiple DMABUFs to describe any BAR layout and mix of
attributes.
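
Since one DMABUF carries one attribute, a BAR layout with mixed
attributes needs at least one DMABUF per distinct attribute. A sketch
of how userspace might group its ranges before exporting; the demo_*
names and attribute values are hypothetical and do not reflect the
series' UAPI encoding:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical attribute names for illustration only. */
enum demo_attr { DEMO_ATTR_DEFAULT, DEMO_ATTR_WC, DEMO_ATTR_COUNT };

struct demo_range {
	uint64_t offset;	/* byte offset into the BAR */
	uint64_t length;	/* byte length of the range */
	enum demo_attr attr;	/* desired CPU mapping attribute */
};

/*
 * Copy the ranges wanting 'attr' into out[] (room for n entries) and
 * return how many were copied. Each non-empty group would be exported
 * as one DMABUF carrying that attribute.
 */
static size_t demo_group_by_attr(const struct demo_range *in, size_t n,
				 enum demo_attr attr, struct demo_range *out)
{
	size_t i, count = 0;

	assert(in && out);
	for (i = 0; i < n; i++)
		if (in[i].attr == attr)
			out[count++] = in[i];
	return count;
}

/* Minimum DMABUF count: one per attribute with at least one range. */
static size_t demo_dmabufs_needed(const struct demo_range *in, size_t n)
{
	int seen[DEMO_ATTR_COUNT] = { 0 };
	size_t i, needed = 0;

	assert(in);
	for (i = 0; i < n; i++) {
		assert(in[i].attr < DEMO_ATTR_COUNT);
		if (!seen[in[i].attr]) {
			seen[in[i].attr] = 1;
			needed++;
		}
	}
	return needed;
}
```

For a layout of two default-attribute ranges and one WC range, this
yields two groups, i.e. two DMABUFs.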
Tests
=====
(Still sharing the [RFC ONLY] userspace test/demo program for context,
not for merge.)
It illustrates & tests various map/revoke cases, but doesn't use the
existing VFIO selftests and relies on a (tweaked) QEMU EDU function.
I'm (still) working on integrating the scenarios into the existing
VFIO selftests.
This code has been tested with DMABUFs of single/multiple ranges,
aliasing mmap()s, aliasing ranges across DMABUFs, vm_pgoff > 0,
revocation, and shutdown/cleanup scenarios; hugepage mappings also
seem to work correctly. I've lightly tested WC mappings too (by
checking that the resulting PTEs have the correct attributes).
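
For context on the vm_pgoff > 0 case: it corresponds to a client
passing a non-zero, page-aligned offset to mmap(). A sketch of such a
call, using a two-page temp file as a stand-in for a DMABUF fd
(map_window() and make_standin_fd() are hypothetical helpers, not taken
from the test program):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Map 'len' bytes of 'fd' starting at byte 'offset'. The offset must be
 * page-aligned; the kernel sees it as vm_pgoff = offset / page size.
 * Returns NULL on failure.
 */
static void *map_window(int fd, off_t offset, size_t len)
{
	long pg = sysconf(_SC_PAGESIZE);
	void *p;

	assert(fd >= 0 && len > 0);
	if (pg <= 0 || offset < 0 || offset % pg)
		return NULL;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset);
	return p == MAP_FAILED ? NULL : p;
}

/* Stand-in fd: a two-page file with 'marker' at the start of page 1. */
static int make_standin_fd(uint32_t marker)
{
	char path[] = "/tmp/pgoff_standin_XXXXXX";
	long pg = sysconf(_SC_PAGESIZE);
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);
	if (ftruncate(fd, 2 * pg) ||
	    pwrite(fd, &marker, sizeof(marker), pg) != (ssize_t)sizeof(marker)) {
		close(fd);
		return -1;
	}
	return fd;
}
```

Mapping at offset one page and reading the first word returns the
marker, showing that the window starts at the second page of the fd.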
Fin
===
v2 is based on next-20260310 (to build on Leon's recent series
"vfio: Wait for dma-buf invalidation to complete" [2]).
Please share your thoughts! I'd like to de-RFC if we feel this
approach is now fair.
Many thanks,
Matt
References:
[0]:
https://lore.kernel.org/linux-iommu/[email protected]/
[1]: https://lore.kernel.org/all/[email protected]/
[2]:
https://lore.kernel.org/linux-iommu/20260205-nocturnal-poetic-chamois-f566ad@houat/T/#m310cd07011e3a1461b6fda45e3f9b886ba76571a
[3]: https://lore.kernel.org/all/[email protected]/
--------------------------------------------------------------------------------
Changelog:
v2: Respin based on the feedback/suggestions:
- Transform the existing VFIO BAR mmap path to also use DMABUFs behind
  the scenes, and then simply share that code for explicitly-mapped
  DMABUFs.
- Refactor the export itself out of vfio_pci_core_feature_dma_buf, so
  that it is shared with a new vfio_pci_core_mmap_prep_dmabuf helper
  used by the regular VFIO mmap to create a DMABUF.
- Revoke buffers using a VFIO device fd ioctl.
v1: https://lore.kernel.org/all/[email protected]/
Matt Evans (10):
  vfio/pci: Set up VFIO barmap before creating a DMABUF
  vfio/pci: Clean up DMABUFs before disabling function
  vfio/pci: Add helper to look up PFNs for DMABUFs
  vfio/pci: Add a helper to create a DMABUF for a BAR-map VMA
  vfio/pci: Convert BAR mmap() to use a DMABUF
  vfio/pci: Remove vfio_pci_zap_bars()
  vfio/pci: Support mmap() of a VFIO DMABUF
  vfio/pci: Permanently revoke a DMABUF on request
  vfio/pci: Add mmap() attributes to DMABUF feature
  [RFC ONLY] selftests: vfio: Add standalone vfio_dmabuf_mmap_test
drivers/vfio/pci/Kconfig | 3 +-
drivers/vfio/pci/Makefile | 3 +-
drivers/vfio/pci/vfio_pci_config.c | 18 +-
drivers/vfio/pci/vfio_pci_core.c | 123 +--
drivers/vfio/pci/vfio_pci_dmabuf.c | 425 +++++++--
drivers/vfio/pci/vfio_pci_priv.h | 46 +-
include/uapi/linux/vfio.h | 42 +-
tools/testing/selftests/vfio/Makefile | 1 +
.../vfio/standalone/vfio_dmabuf_mmap_test.c | 837 ++++++++++++++++++
9 files changed, 1339 insertions(+), 159 deletions(-)
 create mode 100644 tools/testing/selftests/vfio/standalone/vfio_dmabuf_mmap_test.c
--
2.47.3