On Mon, 6 Apr 2026 23:46:53 +0000 David Matlack <[email protected]> wrote:
> On 2026-04-03 04:44 PM, Rubin Du wrote: > > Add a new VFIO PCI driver for NVIDIA GPUs that enables DMA testing > > via the Falcon (Fast Logic Controller) microcontrollers. This driver > > extracts and adapts the DMA test functionality from NVIDIA's > > gpu-admin-tools project and integrates it into the existing VFIO > > selftest framework. > > > > Falcons are general-purpose microcontrollers present on NVIDIA GPUs > > that can perform DMA operations between system memory and device > > memory. By leveraging Falcon DMA, this driver allows NVIDIA GPUs to > > be tested alongside Intel IOAT and DSA devices using the same > > selftest infrastructure. > > > > The driver is named 'nv_falcon' to reflect that it specifically > > controls the Falcon microcontrollers for DMA operations, rather > > than exposing general GPU functionality. > > > > Reference implementation: > > https://github.com/NVIDIA/gpu-admin-tools > > > > Signed-off-by: Rubin Du <[email protected]> > > Will anyone from Nvidia be able to help review the correctness of the > driver? You can tell by the versioning snafu that we've seen a bunch of versions of this internally, so we think it's correct. We also have the reference implementation in the above link if anyone else wants to compare. > > +static int gpu_poll_register(struct vfio_pci_device *device, > > + const char *name, u32 offset, > > + u32 expected, u32 mask, u32 timeout_ms) > > +{ > > + struct gpu_device *gpu = to_gpu_device(device); > > + struct timespec start, now; > > + u64 elapsed_ms; > > + u32 value; > > + > > + clock_gettime(CLOCK_MONOTONIC, &start); > > + > > + for (;;) { > > + value = gpu_read32(gpu, offset); > > + if ((value & mask) == expected) > > + return 0; > > + > > + clock_gettime(CLOCK_MONOTONIC, &now); > > + elapsed_ms = (now.tv_sec - start.tv_sec) * 1000 > > + + (now.tv_nsec - start.tv_nsec) / 1000000; > > + > > + if (elapsed_ms >= timeout_ms) > > + break; > > + > > + usleep(1000); > > + } > > + > > + dev_err(device, > > + "Timeout polling %s (0x%x): value=0x%x expected=0x%x mask=0x%x > > after %llu ms\n", > > + name, offset, value, expected, mask, > > + (unsigned long long)elapsed_ms); > > nit: You can replace ll with l in the format string and drop the cast to > unsigned long long. > > We only support VFIO selftests on 64-bit architectures, and that matches > what existing printfs for u64 use in VFIO selftests. Yup > > +static bool fsp_check_ofa_dma_support(struct vfio_pci_device *device) > > +{ > > + struct gpu_device *gpu = to_gpu_device(device); > > + u32 val = gpu_read32(gpu, NV_OFA_DMA_SUPPORT_CHECK_REG); > > + > > + return (val >> 16) != 0xbadf; > > +} > > + > > +static int size_to_dma_encoding(u64 size) > > +{ > > + size = min_t(u64, size, NV_FALCON_DMA_MAX_TRANSFER_SIZE); > > It should be impossible for size to be greater than > NV_FALCON_DMA_MAX_TRANSFER_SIZE. This should be dropped or converted > into a VFIO_ASSERT_LE(). Ok, I think the below tests can also become asserts. > > + > > + if (!size || (size & 0x3)) > > + return -EINVAL; > > + > > + return ffs(size) - 3; > > Sashiko pointed out that this can lead to partial memcpys: > > . If a non-power-of-two size is passed (for example, 24 bytes), ffs(size) will > . return the lowest set bit (bit 4, yielding an encoding for 8 bytes). > . > . Because nv_gpu_memcpy_wait() performs exactly one chunk transfer and does > not > . loop over the remainder, could this silently drop the remaining bytes and > . cause a partial data copy? Should this check if the size is a power of 2, or > . should the caller loop to handle remainders? > > https://sashiko.dev/#/patchset/20260403234444.350867-1-rubind%40nvidia.com?part=4 > > I think this nuance needs to be handled within the nv_falcon driver. > vfio_pci_driver_memcpy_start() just guarantees that size <= > NV_FALCON_DMA_MAX_TRANSFER_SIZE. > > Perhaps this is why you had the loop in an earlier version of the > patchset, in which case it was my mistake to ask you to remove it! > > When you add the loop back please add a comment to the loop to explain > that it is necessary since the region needs to be sliced up into > power-of-2 sizes for Falcon. Yes, shouldn't have been removed. We can't do arbitrary sizes, only power-of-2 down to 4-bytes. We can make this more self documenting and assert on memcpy_start for the size alignment requirement. > > +const struct vfio_pci_driver_ops nv_falcon_ops = { > > + .name = "nv_falcon", > > + .probe = nv_gpu_probe, > > + .init = nv_gpu_init, > > + .remove = nv_gpu_remove, > > + .memcpy_start = nv_gpu_memcpy_start, > > + .memcpy_wait = nv_gpu_memcpy_wait, > > +}; > > Any particular reason these functions are named nv_gpu_*() instead of > nv_falcon_*() Yes, these can be made more consistent. Thanks, Alex

