On Tue, Jun 09, 2026 at 08:40:03AM +0000, Nishit Sharma wrote: > This update addresses correctness and stability issues in unified memory > migration on > multi-tile (such as PVC) systems where multiple tiles share a single device > context. > In that environment, local and peer memory paths can appear equivalent at the > device level, > which may cause cross-tile traffic to use an address form that is only valid > on the owning tile. > Under page-fault-driven migration, that can intermittently fail when one tile > accesses pages > currently resident on another tile. > The fix introduces explicit peer-aware mapping semantics so cross-tile > accesses are resolved > through a valid peer-accessible DMA path rather than tile-local addressing. >
There a bunch of different fixes in this patch that warrent individual patches which makes this a bit hard to review but will try. > Signed-off-by: Nishit Sharma <[email protected]> > --- > drivers/gpu/drm/drm_pagemap.c | 6 ++- > drivers/gpu/drm/xe/xe_pt.c | 14 +++-- > drivers/gpu/drm/xe/xe_svm.c | 97 ++++++++++++++++++++++++++++++++++- > drivers/gpu/drm/xe/xe_sync.c | 13 +++++ > include/drm/drm_gpusvm.h | 1 + > include/drm/drm_pagemap.h | 6 +++ > 6 files changed, 132 insertions(+), 5 deletions(-) > > diff --git a/drivers/gpu/drm/drm_pagemap.c b/drivers/gpu/drm/drm_pagemap.c > index 15c78eca180b..e781cbdfa494 100644 > --- a/drivers/gpu/drm/drm_pagemap.c > +++ b/drivers/gpu/drm/drm_pagemap.c > @@ -267,11 +267,15 @@ drm_pagemap_migrate_map_device_private_pages(struct > device *dev, > goto next; > > num_local_pages += NR_PAGES(order); > + addr = dpagemap->ops->device_map(dpagemap, dev, page, > order, dir); Let me start with the problem statement as I understand it so everyone here can align. I believe the issue on PVC is that a single device has two pgmaps (one for each tile), so the unseen if statement above can fall into the else clause, which skips the can_migrate_same_pagemap step. Xe sets can_migrate_same_pagemap to false, which results in a bounce to system memory when the if statement is true. We can then correctly copy the data from system memory to VRAM—this is a limitation of the copy function that should eventually be fixed, but that’s a different issue. However, this behavior is skipped when the pgmaps are from a remote tile on PVC. In that case, we call ->device_map, which checks for local VRAM based on a device check, and our copy function fails. I think what we really need is to bring congruence between the unseen if statement here and our ->device_map() virtual function. The easiest way to do this would be to blindly change the unseen if statement to a device-based check. This assumes that multi-tile devices can read/write VRAM across different tiles using our copy engines. For PVC, this works. I believe this should also work for future devices (though that may need confirmation). AMD is also starting to use this common code, and I’m not sure whether they have a multi-tile/pgmap-per-device concept or similar cross-tile communication. Another option would be to wire in a flag via drm_pagemap_migrate_details that controls whether the unseen check is pgmap-based or device-based. Then we can rely on ->device_map() to determine, based on the device, whether it should consider the device or pgmap when setting up the mapping. Let’s get Thomas’s input on this. Also, please include dri-devel on GPUSVM patches, as well as [email protected], who is working on GPUSVM for AMD. This part of the patch is an individual fix (I'll call fix A for reference) and should be sent either as a standalone patch or as part of a patch series addressing PVC. > } else { > num_peer_pages += NR_PAGES(order); > + if (dpagemap->ops->peer_map) > + addr = dpagemap->ops->peer_map(dpagemap, dev, > page, order, dir); With above, adding ->peer_map() doesn't seem like the right solution. > + else > + addr = dpagemap->ops->device_map(dpagemap, dev, > page, order, dir); > } > > - addr = dpagemap->ops->device_map(dpagemap, dev, page, order, > dir); > if (dma_mapping_error(dev, addr.addr)) > return -EFAULT; > > diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c > index 2669ff5ee747..e0aed463373b 100644 > --- a/drivers/gpu/drm/xe/xe_pt.c > +++ b/drivers/gpu/drm/xe/xe_pt.c > @@ -564,9 +564,17 @@ xe_pt_stage_bind_entry(struct xe_ptw *parent, pgoff_t > offset, > xe_walk->vma->gpuva.flags |= > XE_VMA_PTE_64K; > pte |= XE_PTE_PS64; > - } else if (XE_WARN_ON(xe_walk->needs_64K && > - is_vram)) { > - return -EINVAL; > + } else if (xe_walk->needs_64K && > + is_vram) { > + /* > + * needs_64K VM requires 64K PTEs for > VRAM but > + * the 64K PTE cannot be formed: > partial VRAM > + * eviction or partial migration > created a mixed > + * SRAM+VRAM 64K window. Return -ENOSPC > so the > + * page fault handler can evict > remaining VRAM > + * and retry with SRAM-only mappings. > + */ > + return -ENOSPC; This is different fix, I'll call 'fix B' for reference. Again please break this out into its own patch for review. On its surface, this explaination seems reasonable. > } > } > } > diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c > index e1651e70c8f0..150391c9878b 100644 > --- a/drivers/gpu/drm/xe/xe_svm.c > +++ b/drivers/gpu/drm/xe/xe_svm.c > @@ -602,8 +602,39 @@ static int xe_svm_copy(struct page **pages, > vr = xe_page_to_vr(spage); > gt = xe_migrate_exec_queue(vr->migrate)->gt; > xe = vr->xe; > + } else if (spage && xe_page_to_vr(spage) != vr) { > + /* > + * On multi-tile devices (e.g. PVC/Xe_HPC), migrate_vma > + * may return VRAM pages from different tiles within the > + * same batch because all tiles share the same > + * pgmap_owner. Flush the current copy chunk using the > + * old tile before switching to the new tile's VRAM > region. > + */ > + if (vram_addr != XE_VRAM_ADDR_INVALID) { > + xe_svm_copy_kb_stats_incr(gt, dir, > + (i - pos) * (PAGE_SIZE / > SZ_1K)); > + if (sram) > + __fence = > xe_migrate_from_vram(vr->migrate, > + i - pos, vram_addr, > + &pagemap_addr[pos], > + pre_migrate_fence); > + else > + __fence = > xe_migrate_to_vram(vr->migrate, > + i - pos, > &pagemap_addr[pos], > + vram_addr, > pre_migrate_fence); > + if (IS_ERR(__fence)) { > + err = PTR_ERR(__fence); > + goto err_out; > + } > + pre_migrate_fence = NULL; > + dma_fence_put(fence); > + fence = __fence; > + } > + vr = xe_page_to_vr(spage); > + gt = xe_migrate_exec_queue(vr->migrate)->gt; > + xe = vr->xe; > + vram_addr = XE_VRAM_ADDR_INVALID; I think we resolve 'fix A' differently this part can be dropped, not 100% sure about this without jumping on a PVC device though. > } > - XE_WARN_ON(spage && xe_page_to_vr(spage) != vr); > > /* > * CPU page and device page valid, capture physical address on > @@ -1148,6 +1179,14 @@ bool xe_svm_range_needs_migrate_to_vram(struct > xe_svm_range *range, struct xe_vm > > if (range_size < SZ_64K && !supports_4K_migration(vm->xe)) { > drm_dbg(&vm->xe->drm, "Platform doesn't support SZ_4K range > migration\n"); > + if (!supports_4K_migration(vm->xe) && > + (!IS_ALIGNED(xe_svm_range_start(range), SZ_64K) || > + !IS_ALIGNED(range_size, SZ_64K))) { > + drm_dbg(&vm->xe->drm, > + "Range not 64K-aligned for VRAM migration, > start=0x%lx size=0x%lx\n", > + xe_svm_range_start(range), (unsigned > long)range_size); > + return false; > + } I think this part of 'fix B'? > return false; > } > > @@ -1227,6 +1266,7 @@ static int __xe_svm_handle_pagefault(struct xe_vm *vm, > struct xe_vma *vma, > ktime_t start = xe_gt_stats_ktime_get(), bind_start, get_pages_start; > int err; > > + ctx.local_dpagemap = xe_tile_local_pagemap(tile); This field looks to only be set (i.e., it is unused). > lockdep_assert_held_write(&vm->lock); > xe_assert(vm->xe, xe_vma_is_cpu_addr_mirror(vma)); > > @@ -1360,6 +1400,45 @@ static int __xe_svm_handle_pagefault(struct xe_vm *vm, > struct xe_vma *vma, > goto retry; > } > > + if (err == -ENOSPC) { > + /* > + * 64K PTE not possible for needs_64K VM: mixed SRAM+VRAM in > + * the DMA address array prevented xe_pt_scan_64K from forming > + * a 64K PTE. Three invariants must hold for recovery: > + * > + * 1. Evict unconditionally via drm_gpusvm_range_evict: the > + * advisory has_devmem_pages flag may be stale (false) while > + * VRAM DMA entries still exist in the dma_addr array. > + * > + * 2. Explicitly unmap pages after eviction. > + * drm_gpusvm_range_evict clears has_dma_mapping via the MMU > + * notifier ONLY when a migration actually occurs. On a > + * second recovery attempt the pages may already be in SRAM > + * (no migration, no notifier), leaving has_dma_mapping true > + * with stale VRAM entries in dma_addr. > + * drm_gpusvm_range_unmap_pages clears this unconditionally, > + * guaranteeing that drm_gpusvm_get_pages takes the full HMM > + * path and returns fresh SRAM DMA entries. > + * > + * 3. Use goto retry (not goto get_pages) to re-lookup the > + * range with fresh dpagemap and device_private_page_owner. > + * After eviction the notifier sets tile_invalidated so > + * xe_svm_range_is_valid returns false and we proceed to > + * get_pages. --migrate_try_count becomes -1 (< 0) at > + * retry, skipping VRAM re-migration. > + */ > + range_debug(range, "PAGE FAULT - EVICT PARTIAL VRAM RETRY"); > + drm_gpusvm_range_evict(&vm->svm.gpusvm, &range->base); > + drm_gpusvm_range_unmap_pages(&vm->svm.gpusvm, &range->base, > + &ctx); The drm_gpusvm_range_unmap_pages shouldn't be needed - the notifier should handle this. > + ctx.timeslice_ms <<= 1; /* Double timeslice if we have to retry > */ I don't think the timeslice needs to be increased here. This only used when races occur. > + ctx.devmem_only = false; I'm wondering for 'fix B' if we could just avoid trying to migrate to VRAM if our range is less than 64k? I thought to code already did that? Can you explain condition for 'fix B' being required. I'm having a hard time reasoning when the problem occurs. > + ctx.devmem_possible = false; > + ctx.check_pages_threshold = 0; > + migrate_try_count = 0; > + goto retry; Part of 'fix B'. > + } > + > return err; > } > > @@ -1695,6 +1774,21 @@ xe_drm_pagemap_device_map(struct drm_pagemap *dpagemap, > return drm_pagemap_addr_encode(addr, prot, order, dir); > } > > +static struct drm_pagemap_addr > +xe_drm_pagemap_peer_map(struct drm_pagemap *dpagemap, > + struct device *dev, > + struct page *page, > + unsigned int order, > + enum dma_data_direction dir) > +{ > + dma_addr_t addr = dma_map_resource(dev, > + xe_page_to_pcie(page), > + PAGE_SIZE << order, dir, > + DMA_ATTR_SKIP_CPU_SYNC); > + > + return drm_pagemap_addr_encode(addr, XE_INTERCONNECT_P2P, order, dir); > +} > + > static void xe_drm_pagemap_device_unmap(struct drm_pagemap *dpagemap, > struct device *dev, > const struct drm_pagemap_addr *addr) > @@ -1742,6 +1836,7 @@ static void xe_pagemap_destroy(struct drm_pagemap > *dpagemap, bool from_atomic_or > > static const struct drm_pagemap_ops xe_drm_pagemap_ops = { > .device_map = xe_drm_pagemap_device_map, > + .peer_map = xe_drm_pagemap_peer_map, > .device_unmap = xe_drm_pagemap_device_unmap, > .populate_mm = xe_drm_pagemap_populate_mm, > .destroy = xe_pagemap_destroy, > diff --git a/drivers/gpu/drm/xe/xe_sync.c b/drivers/gpu/drm/xe/xe_sync.c > index 37866768d64c..d2aa888f2d03 100644 > --- a/drivers/gpu/drm/xe/xe_sync.c > +++ b/drivers/gpu/drm/xe/xe_sync.c > @@ -346,6 +346,7 @@ xe_sync_in_fence_get(struct xe_sync_entry *sync, int > num_sync, > > if (q->flags & EXEC_QUEUE_FLAG_VM) { > struct xe_exec_queue *__q; > +#if 0 > struct xe_tile *tile; > u8 id; > > @@ -354,6 +355,18 @@ xe_sync_in_fence_get(struct xe_sync_entry *sync, int > num_sync, > for_each_tlb_inval(i) > num_fence++; > } > +#endif No if 0 in code. > + /* Count fences for main queue */ > + num_fence++; > + for_each_tlb_inval(i) > + num_fence++; > + > + /* Count fences for main queue */ > + list_for_each_entry(__q, &q->multi_gt_list, multi_gt_link) { > + num_fence++; > + for_each_tlb_inval(i) > + num_fence++; > + } The above code here is a different fix - 'fix C' and should be done in a different patch. Also worth noting this is fixed here too [1] but since this part of larger refactor that will take a bit to merge, fine with a isolated fix going in ahead of [1]. [1] https://patchwork.freedesktop.org/patch/707885/?series=149888&rev=3 > > fences = kmalloc_objs(*fences, num_fence); > if (!fences) > diff --git a/include/drm/drm_gpusvm.h b/include/drm/drm_gpusvm.h > index 8a4d7134a9a7..6f03111fc8ef 100644 > --- a/include/drm/drm_gpusvm.h > +++ b/include/drm/drm_gpusvm.h > @@ -250,6 +250,7 @@ struct drm_gpusvm_ctx { > unsigned int devmem_possible :1; > unsigned int devmem_only :1; > unsigned int allow_mixed :1; > + struct drm_pagemap *local_dpagemap; This looks unused aside from being set. If this is needed, move above the bitfields. > }; > > int drm_gpusvm_init(struct drm_gpusvm *gpusvm, > diff --git a/include/drm/drm_pagemap.h b/include/drm/drm_pagemap.h > index 95eb4b66b057..414a8a19bfe7 100644 > --- a/include/drm/drm_pagemap.h > +++ b/include/drm/drm_pagemap.h > @@ -87,6 +87,12 @@ struct drm_pagemap_ops { > unsigned int order, > enum dma_data_direction dir); > > + struct drm_pagemap_addr (*peer_map)(struct drm_pagemap *dpagemap, > + struct device *dev, > + struct page *page, > + unsigned int order, > + enum dma_data_direction dir); > + I don't think this needed if 'fix A' is done differently. Matt > /** > * @device_unmap: Unmap a device address previously obtained using > @device_map. > * > -- > 2.43.0 >
