On Sat, Oct 25, 2025 at 02:04:12PM +0200, Thomas Hellström wrote:
> Data present in foreign device memory may cause migration to fail.
> For now, retry once after first migrating to system.
> 
> Signed-off-by: Thomas Hellström <[email protected]>
> ---
>  drivers/gpu/drm/xe/xe_svm.c | 19 +++++++++++++++----
>  1 file changed, 15 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
> index 9814f95cb212..41e075aa015c 100644
> --- a/drivers/gpu/drm/xe/xe_svm.c
> +++ b/drivers/gpu/drm/xe/xe_svm.c
> @@ -1529,13 +1529,24 @@ struct drm_pagemap *xe_vma_resolve_pagemap(struct 
> xe_vma *vma, struct xe_tile *t
>  int xe_svm_alloc_vram(struct xe_svm_range *range, const struct 
> drm_gpusvm_ctx *ctx,
>                     struct drm_pagemap *dpagemap)
>  {
> +     int err, retries = 1;
> +
>       xe_assert(range_to_vm(&range->base)->xe, 
> range->base.pages.flags.migrate_devmem);
>       range_debug(range, "ALLOCATE VRAM");
>  
> -     return drm_pagemap_populate_mm(dpagemap, xe_svm_range_start(range),
> -                                    xe_svm_range_end(range),
> -                                    range->base.gpusvm->mm,
> -                                    ctx->timeslice_ms);
> +retry:
> +     err = drm_pagemap_populate_mm(dpagemap, xe_svm_range_start(range),
> +                                   xe_svm_range_end(range),
> +                                   range->base.gpusvm->mm,
> +                                   ctx->timeslice_ms);
> +     if ((err == -EBUSY || err == -EFAULT) && retries--) {

I don't think this is what we want to do here. -EFAULT indicates that
the pages are entirely present somewhere in device memory. This could be
either on the local device or on a foreign device, but we don’t have
enough information here to determine which case it is.

If the pages are on our local device, we're always good. That case can
occur e.g. when playing mremap games.

If they're on a foreign device, things get trickier. If our interconnect
supports atomics (e.g., UAL), we're still good. But if the interconnect
doesn't support atomics (e.g., PCIe P2P) and this is an atomic fault,
then we need to move the memory. Likewise, if there's no path between
the device memories at all, then of course we need to move the memory.

Again, we don’t have enough information here to make the correct
decision. We really need to call drm_gpusvm_range_get_pages to gather
the CPU pages in order to make this kind of decision. Ideally, the logic
should be built into drm_gpusvm_range_get_pages to understand atomic
migration requirements.

Once drm_gpusvm_range_get_pages returns, we can take appropriate action.
Initially, for simplicity, this might just be a bounce to system memory.
Later, it could evolve into a direct device-to-device move.

The logic inside drm_gpusvm_range_get_pages would likely involve
devmem_only combined with a drm_pagemap passed in; connectivity and
atomic support between devices could then be detected based on the
drm_pagemap extracted from the ZDD.

Let me know if this makes sense, or if you have thought about doing
this in a follow-up.

Matt

> +             range_debug(range, "ALLOCATE VRAM - Retry.");
> +
> +             drm_gpusvm_range_evict(range->base.gpusvm, &range->base);
> +             goto retry;
> +     }
> +
> +     return err;
>  }
>  
>  static struct drm_pagemap_addr
> -- 
> 2.51.0
> 
