On Tue, Oct 22, 2024 at 02:08:29PM -0600, Alex Williamson wrote:
> Thanks to work by Peter Xu, support is introduced in Linux v6.12 to
> allow pfnmap insertions at PMD and PUD levels of the page table. This
> means that provided a properly aligned mmap, the vfio driver is able
> to map MMIO at significantly larger intervals than PAGE_SIZE. For
> example on x86_64 (the only architecture currently supporting huge
> pfnmaps for PUD), rather than 4KiB mappings, we can map device MMIO
> using 2MiB and even 1GiB page table entries.
> 
> Typically mmap will already provide PMD aligned mappings, so devices
> with moderately sized MMIO ranges, even GPUs with standard 256MiB BARs,
> will already take advantage of this support. However in order to better
> support devices exposing multi-GiB MMIO, such as 3D accelerators or GPUs
> with resizable BARs enabled, we need to manually align the mmap.
> 
> There doesn't seem to be a way for userspace to easily learn about PMD
> and PUD mapping level sizes, therefore this takes the simple approach
> to align the mapping to the power-of-two size of the region, up to 1GiB,
> which is currently the maximum alignment we care about.
> 
> Cc: Peter Xu <pet...@redhat.com>
> Signed-off-by: Alex Williamson <alex.william...@redhat.com>
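To make the alignment choice concrete: "power-of-two size of the region"
here means the largest power of two that evenly divides the region size,
capped at 1GiB. A small standalone illustration of just that computation
(not QEMU code; __builtin_ctzll stands in for QEMU's ctz64(), and the
region sizes are made-up examples):

    /* Largest power-of-two factor of size, capped at 1 GiB. */
    #include <stdio.h>
    #include <stdint.h>

    static uint64_t region_align(uint64_t size)
    {
        uint64_t align = 1ULL << __builtin_ctzll(size);
        uint64_t cap = 1ULL << 30;          /* 1 GiB */

        return align < cap ? align : cap;
    }

    int main(void)
    {
        /* 256MiB -> 256MiB, 3GiB -> 1GiB, 16GiB -> 1GiB (capped) */
        printf("%#llx %#llx %#llx\n",
               (unsigned long long)region_align(256ULL << 20),
               (unsigned long long)region_align(3ULL << 30),
               (unsigned long long)region_align(16ULL << 30));
        return 0;
    }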
For the longer term, maybe QEMU can provide a function to reserve a range
of mmap with some specific alignment requirement. For example, currently
qemu_ram_mmap() does mostly the same thing (and it hides a hugetlb fix on
ppc only with 7197fb4058, which isn't a concern here). Then the complexity
can hide in that function.

Kind of a comment for the future only.

Reviewed-by: Peter Xu <pet...@redhat.com>

Thanks!

> ---
>  hw/vfio/helpers.c | 32 ++++++++++++++++++++++++++++++--
>  1 file changed, 30 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c
> index b9e606e364a2..913796f437f8 100644
> --- a/hw/vfio/helpers.c
> +++ b/hw/vfio/helpers.c
> @@ -27,6 +27,7 @@
>  #include "trace.h"
>  #include "qapi/error.h"
>  #include "qemu/error-report.h"
> +#include "qemu/units.h"
>  #include "monitor/monitor.h"
>  
>  /*
> @@ -406,8 +407,35 @@ int vfio_region_mmap(VFIORegion *region)
>      prot |= region->flags & VFIO_REGION_INFO_FLAG_WRITE ? PROT_WRITE : 0;
>  
>      for (i = 0; i < region->nr_mmaps; i++) {
> -        region->mmaps[i].mmap = mmap(NULL, region->mmaps[i].size, prot,
> -                                     MAP_SHARED, region->vbasedev->fd,
> +        size_t align = MIN(1ULL << ctz64(region->mmaps[i].size), 1 * GiB);
> +        void *map_base, *map_align;
> +
> +        /*
> +         * Align the mmap for more efficient mapping in the kernel.  Ideally
> +         * we'd know the PMD and PUD mapping sizes to use as discrete alignment
> +         * intervals, but we don't.  As of Linux v6.12, the largest PUD size
> +         * supporting huge pfnmap is 1GiB (ARCH_SUPPORTS_PUD_PFNMAP is only set
> +         * on x86_64).  Align by power-of-two size, capped at 1GiB.
> +         *
> +         * NB. qemu_memalign() and friends actually allocate memory, whereas
> +         * the region size here can exceed host memory, therefore we manually
> +         * create an oversized anonymous mapping and clean it up for alignment.
> +         */
> +        map_base = mmap(0, region->mmaps[i].size + align, PROT_NONE,
> +                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> +        if (map_base == MAP_FAILED) {
> +            ret = -errno;
> +            goto no_mmap;
> +        }
> +
> +        map_align = (void *)ROUND_UP((uintptr_t)map_base, (uintptr_t)align);
> +        munmap(map_base, map_align - map_base);
> +        munmap(map_align + region->mmaps[i].size,
> +               align - (map_align - map_base));
> +
> +        region->mmaps[i].mmap = mmap(map_align, region->mmaps[i].size, prot,
> +                                     MAP_SHARED | MAP_FIXED,
> +                                     region->vbasedev->fd,
>                                       region->fd_offset +
>                                       region->mmaps[i].offset);
>          if (region->mmaps[i].mmap == MAP_FAILED) {
> -- 
> 2.46.2
> 

-- 
Peter Xu
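For illustration, a rough sketch of what such a shared "reserve an aligned
range" helper could look like. The name mmap_reserve_aligned() and its
exact shape are invented for this sketch and are not an existing QEMU API;
it simply factors out the over-reserve-and-trim trick the patch uses:

    /*
     * Sketch only: reserve a virtual address range of `size` bytes aligned
     * to `align` (a power of two).  Over-reserve with an anonymous
     * PROT_NONE mapping, then trim the slack before and after the aligned
     * window.  The caller maps the real resource on top with MAP_FIXED.
     */
    #include <stdint.h>
    #include <sys/mman.h>

    static void *mmap_reserve_aligned(size_t size, size_t align)
    {
        size_t reserve = size + align;
        uint8_t *base, *aligned;

        base = mmap(NULL, reserve, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED) {
            return MAP_FAILED;
        }

        aligned = (uint8_t *)(((uintptr_t)base + align - 1) &
                              ~((uintptr_t)align - 1));

        /* Trim the slack before and after the aligned window. */
        if (aligned > base) {
            munmap(base, aligned - base);
        }
        if (aligned + size < base + reserve) {
            munmap(aligned + size, base + reserve - (aligned + size));
        }

        return aligned;
    }

A caller would then mmap() the real resource over the returned address with
MAP_SHARED | MAP_FIXED, as the patch does for the vfio region; this is the
same reserve-then-map-fixed pattern that qemu_ram_mmap() already follows
internally, which is the parallel mentioned above.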