On Thu, Jan 29, 2026 at 12:14:27PM +0200, Leon Romanovsky wrote:
> On Wed, Jan 28, 2026 at 01:04:49PM -0800, Matthew Brost wrote:
> > On Wed, Jan 28, 2026 at 09:45:40PM +0200, Leon Romanovsky wrote:
> > > On Wed, Jan 28, 2026 at 11:29:23AM -0800, Matthew Brost wrote:
> > > > On Wed, Jan 28, 2026 at 01:55:31PM -0400, Jason Gunthorpe wrote:
> > > > > On Wed, Jan 28, 2026 at 09:46:44AM -0800, Matthew Brost wrote:
> > > > > >
> > > > > > It is intended to fill holes. The input pages come from the migrate_vma_* functions, which can return a sparsely populated array of pages for a region (e.g., it scans a 2M range but only finds several of the 512 pages eligible for migration). As a result, if (!page) is true for many entries.
> > > > >
> > > > > This is migration?? So something is DMA'ing from A -> B - why put holes in the first place? Can you tightly pack the pages in the IOVA?
> > > >
> > > > This could probably be made to work. I think it would be an initial pass to figure out the IOVA size then tightly pack.
> > > >
> > > > Let me look at this. Probably better too, as installing dummy pages is a non-zero cost since I assume dma_iova_link is a radix tree walk.
> > > >
> > > > > If there is no iommu then the addresses are scattered all over anyhow so it can't be relying on some dma_addr_t relationship?
> > > >
> > > > Scattered dma-addresses are already handled in the copy code, likewise holes, so that's a non-issue.
> > > >
> > > > > You don't have to fully populate the allocated iova, you can link from A-B and then unlink from A-B even if B is less than the total size requested.
> > > > >
> > > > > The hmm users have the holes because hmm is dynamically adding/removing pages as it runs and it can't do anything to pack the mapping.
> > > > >
> > > > > > > IOVA space? If so, what necessitates those holes? You can have less mapped than IOVA and dma_iova_*() API can handle it.
> > > > > >
> > > > > > I was actually going to ask you about this, so I’m glad you brought it up here. Again, this is a hack to avoid holes — the holes are never touched by our copy function, but rather skipped, so we just jam in a dummy address so the entire IOVA range has valid IOMMU pages.
> > > > >
> > > > > I would say what you are doing is trying to optimize unmap by
> > > >
> > > > Yes, and make the code simplish.
> > > >
> > > > > unmapping everything in one shot instead of just the mapped areas, and the WARN_ON is telling you that it isn't allowed to unmap across a hole.
> > > > >
> > > > > > at the moment I’m not sure whether this warning affects actual functionality or if we could just delete it.
> > > > >
> > > > > It means the iommu page table stopped unmapping when it hit a hole and there is a bunch of left over maps in the page table that shouldn't be there. So yes, it is serious and cannot be deleted.
> > > >
> > > > Cool, this explains the warning.
> > > >
> > > > > This is a possible option to teach things to detect the holes and ignore them..
> > > >
> > > > Another option — and IMO probably the best one — as it makes potential usages with holes the simplest at the driver level. Let me look at this too.
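
For concreteness, the tight-packing pass I mentioned above would look roughly like this on our side. This is a completely untested sketch against the dma_iova_* API; the function name and the details around it are made up:

#include <linux/dma-mapping.h>
#include <linux/mm.h>

/*
 * Untested sketch of the two-pass tight-packing idea: figure out how much
 * IOVA we actually need, then link only the pages migrate_vma_*() gave us
 * back to back so the mapping has no holes. All names are made up.
 */
static int pack_and_link_pages(struct device *dev, struct dma_iova_state *state,
			       struct page **pages, unsigned long npages,
			       enum dma_data_direction dir)
{
	unsigned long i, present = 0;
	size_t offset = 0;
	int err;

	/* Pass 1: count the pages migrate_vma_*() actually populated. */
	for (i = 0; i < npages; i++)
		if (pages[i])
			present++;

	if (!present)
		return 0;

	/* IOVA sized for exactly the packed pages, no dummy entries. */
	if (!dma_iova_try_alloc(dev, state, 0, present << PAGE_SHIFT))
		return -EOPNOTSUPP;	/* fall back to the per-page dma_map_page() path */

	/* Pass 2: link the present pages contiguously. */
	for (i = 0; i < npages; i++) {
		if (!pages[i])
			continue;

		err = dma_iova_link(dev, state, page_to_phys(pages[i]),
				    offset, PAGE_SIZE, dir, 0);
		if (err)
			goto err_destroy;
		offset += PAGE_SIZE;
	}

	err = dma_iova_sync(dev, state, 0, offset);
	if (err)
		goto err_destroy;

	return 0;

err_destroy:
	dma_iova_destroy(dev, state, offset, dir, 0);
	return err;
}

The copy side would then walk a running offset into the packed range instead of using the original page index, but since it already copes with scattered addresses and holes, that looks manageable.
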
> > >
> > > It would be ideal if we could code a more general solution. In HMM we release pages one by one, and it would be preferable to have a single-shot unmap routine instead, similar to NVMe, which releases all IOVA space with one call to dma_iova_destroy().
> > >
> > > HMM chain:
> > >
> > > ib_umem_odp_unmap_dma_pages()
> > >  -> for (...)
> > >   -> hmm_dma_unmap_pfn()
> > >
> > > After giving more thought to my earlier suggestion to use hmm_pfn_to_phys(), I began to wonder why you did not use the hmm_dma_*() API instead?
> >
> > That is ill-suited for high-speed fabrics, but so is our existing implementation — we’re just in slightly better shape (?). It also seems ill-suited [1][2][3] for variable page sizes (which are possible with our API), as well as the way we currently program device PTEs in our driver. We also receive PFNs from the migrate_vma_* layer, which must also be mapped.
> >
> > I also believe the hmm_dma_* code predates the DRM code being merged, or was merged around the same time.
> >
> > We could work to unify the HMM helpers and make them usable, but that won’t happen overnight. The HMM layer needs quite a bit of work to be usable, and then we’d have to propagate everything upward through DRM/Xe and any new users. Let me play around with this a bit though to get a rough idea of what would need to be done here.
> >
> > [1] https://elixir.bootlin.com/linux/v6.18.6/source/drivers/infiniband/core/umem_odp.c#L255
> > [2] https://elixir.bootlin.com/linux/v6.18.6/source/drivers/infiniband/core/umem_odp.c#L193
> > [3] https://elixir.bootlin.com/linux/v6.18.6/source/drivers/infiniband/core/umem_odp.c#L104
> >
> > Also, there is some odd stuff going on... Why sync after every mapping [4]?
>
> Right now, the hmm_dma_map_pfn() user is a page-based one, so we need to sync after every pagefault.
>
Right. On GPUs we typically fault in chunks because it involves a copy to/from the device, which is expensive, and it’s much more efficient to transfer larger sizes. (IIRC, faulting 2M with 512 × 4K pages versus 512 separate 4K faults was about 58× faster in the former case.) THP device pages (+mTHP) make the “one fault == one page” model more palatable, but memory gets fragmented, and if THP/mTHP allocations fail, we really need the path where we can move multiple pages in a single fault. Because of that, we wouldn’t want to sync until all dma-mappings are linked.

Also, I dug into hmm_dma_map_pfn; it doesn’t handle device-private pages either, which is something we need.

As a long-term goal, having a function like hmm_dma_map_pfn that handled multiple pages of various sizes, supported device-private memory, sparsely populated regions, unified handling with migration PFNs, handled high-speed fabric vs. P2P, and simply returned a mapping that a driver could take and program into PTEs would be great. Perhaps that’s a goal we can work toward eventually, though making this generic across many driver/subsystem use cases seems difficult. The code I'm writing here makes this generic across DRM, but my driver (Xe) is the only user thus far, so it's really impossible to know whether I've got this right until another vendor jumps in, which hopefully will happen soonish.

On the dma_iova_with_holes_destroy() idea quoted further down, I've put a rough sketch of the driver-side teardown I'm thinking of at the bottom of this mail.

Matt

> > Blindly doing BIDIRECTIONAL [5]...
>
> It was promoted from old code, callers can provide direction.
>
> > [4] https://elixir.bootlin.com/linux/v6.18.6/source/mm/hmm.c#L826
> > [5] https://elixir.bootlin.com/linux/v6.18.6/source/mm/hmm.c#L821
> >
> > > > Do you think we need a flag somewhere for 'ignore holes' or can I just blindly skip them?
> > >
> > > Better if we will have something like a dma_iova_with_holes_destroy() function call to make sure that we don't hurt performance of existing dma_iova_destroy() users.
> >
> > Yes, I think this is the best route for the time being. Let me look at this.
> >
> > Matt
> >
> > > Thanks
> > >
> > > > Matt
> > > >
> > > > > Jason
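
Here is the rough driver-side teardown sketch I referred to above, i.e. roughly the shape a dma_iova_with_holes_destroy()-style flow would take from our side. Again completely untested, names are made up, and it ignores the variable page size case for simplicity:

#include <linux/dma-mapping.h>
#include <linux/mm.h>

/*
 * Untested sketch: unlink only the runs that were actually linked, skipping
 * the holes from the sparse migrate_vma_*() page array, then free the IOVA
 * space. Assumes the current layout where page index i maps to IOVA offset
 * i << PAGE_SHIFT.
 */
static void unlink_with_holes(struct device *dev, struct dma_iova_state *state,
			      struct page **pages, unsigned long npages,
			      enum dma_data_direction dir)
{
	size_t run_start = 0, run_len = 0;
	unsigned long i;

	for (i = 0; i < npages; i++) {
		if (pages[i]) {
			if (!run_len)
				run_start = (size_t)i << PAGE_SHIFT;
			run_len += PAGE_SIZE;
			continue;
		}

		/* Hit a hole: unlink the run accumulated so far, if any. */
		if (run_len)
			dma_iova_unlink(dev, state, run_start, run_len, dir, 0);
		run_len = 0;
	}

	if (run_len)
		dma_iova_unlink(dev, state, run_start, run_len, dir, 0);

	dma_iova_free(dev, state);
}

This would let us drop the dummy-page hack entirely and never unmaps across a hole, which should keep the WARN_ON quiet; whether the loop lives in the driver or behind a new helper like the one you suggest is the open question.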
