On Sat, Oct 25, 2025 at 02:04:12PM +0200, Thomas Hellström wrote:
> Data present in foreign device memory may cause migration to fail.
> For now, retry once after first migrating to system.
> 
> Signed-off-by: Thomas Hellström <[email protected]>
> ---
>  drivers/gpu/drm/xe/xe_svm.c | 19 +++++++++++++++----
>  1 file changed, 15 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
> index 9814f95cb212..41e075aa015c 100644
> --- a/drivers/gpu/drm/xe/xe_svm.c
> +++ b/drivers/gpu/drm/xe/xe_svm.c
> @@ -1529,13 +1529,24 @@ struct drm_pagemap *xe_vma_resolve_pagemap(struct xe_vma *vma, struct xe_tile *t
>  int xe_svm_alloc_vram(struct xe_svm_range *range, const struct drm_gpusvm_ctx *ctx,
>  		      struct drm_pagemap *dpagemap)
>  {
> +	int err, retries = 1;
> +
>  	xe_assert(range_to_vm(&range->base)->xe,
>  		  range->base.pages.flags.migrate_devmem);
>  	range_debug(range, "ALLOCATE VRAM");
>  
> -	return drm_pagemap_populate_mm(dpagemap, xe_svm_range_start(range),
> -				       xe_svm_range_end(range),
> -				       range->base.gpusvm->mm,
> -				       ctx->timeslice_ms);
> +retry:
> +	err = drm_pagemap_populate_mm(dpagemap, xe_svm_range_start(range),
> +				      xe_svm_range_end(range),
> +				      range->base.gpusvm->mm,
> +				      ctx->timeslice_ms);
> +	if ((err == -EBUSY || err == -EFAULT) && retries--) {
I don't think this is what we want to do here. -EFAULT indicates that the
pages are entirely present somewhere in device memory. This could be either
on the local device or on a foreign device, but we don't have enough
information here to determine which case it is.

If this is on our local device, we're always good. This could occur playing
mremap games.

If it's on a foreign device, things get trickier. If our interconnect
supports atomics (e.g., UAL), we're still good. But if the interconnect
doesn't support atomics (e.g., PCIe P2P) and this is an atomic fault, then
we need to move the memory. Also, if there's no path between the device
memories, then of course we need to move the memory. Again, we don't have
enough information here to make the correct decision.

We really need to call drm_gpusvm_range_get_pages to gather the CPU pages
in order to make this kind of decision. Ideally, the logic should be built
into drm_gpusvm_range_get_pages to understand atomic migration
requirements. Once drm_gpusvm_range_get_pages returns, we can take
appropriate action. Initially, for simplicity, this might just be a bounce
to system memory. Later, it could evolve into a direct device-to-device
move.

The logic inside drm_gpusvm_range_get_pages would likely involve
devmem_only combined with a drm_pagemap passed in, which can detect
connectivity and atomic support between devices, based on the drm_pagemap
extracted from the ZDD.

Let me know if this makes sense, or if you have thought about doing this
in a follow-up.

Matt

> +		range_debug(range, "ALLOCATE VRAM - Retry.");
> +
> +		drm_gpusvm_range_evict(range->base.gpusvm, &range->base);
> +		goto retry;
> +	}
> +
> +	return err;
>  }
>  
>  static struct drm_pagemap_addr
> -- 
> 2.51.0
> 
