On 2026-04-07 03:45, Zhang, Tiantian (Celine) wrote:
[AMD Official Use Only - AMD Internal Distribution Only]
Hi @Yang, Philip <mailto:[email protected]>,
Could you please help to review this patch, thanks a lot~
Best Regards,
Celine Zhang
-----Original Message-----
From: YuanShang Mao (River) <[email protected]>
Sent: Wednesday, April 1, 2026 5:56 PM
To: Yang, Philip <[email protected]>
Cc: Koenig, Christian <[email protected]>;
[email protected]; Zhang, Tiantian (Celine)
<[email protected]>
Subject: RE: [PATCH] drm/amdkfd: check if vm ready in svm map and
unmap to gpu
[AMD Official Use Only - AMD Internal Distribution Only]
Hi @Yang, Philip
Could help review this patch?
Thanks
River
-----Original Message-----
From: Koenig, Christian <[email protected]
<mailto:[email protected]>>
Sent: Tuesday, March 31, 2026 7:32 PM
To: YuanShang Mao (River) <[email protected]
<mailto:[email protected]>>; [email protected]
<mailto:[email protected]>; Yang, Philip
<[email protected] <mailto:[email protected]>>
Subject: Re: [PATCH] drm/amdkfd: check if vm ready in svm map and
unmap to gpu
On 3/26/26 11:36, YuanShang wrote:
> Don't map or unmap svm range to gpu if vm is not ready for updates.
>
> Why: DRM entity may already be killed when the svm worker try to
> update gpu vm.
>
> Signed-off-by: YuanShang <[email protected]
<mailto:[email protected]>>
Looks correct to me, but I think somebody else already added those checks.
@Philip is that correct? If not please help reviewing the patch.
Thanks,
Christian.
> ---
> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> index 8167fe642341..7f905a7805fa 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> @@ -1366,6 +1366,12 @@ svm_range_unmap_from_gpu(struct amdgpu_device
> *adev, struct amdgpu_vm *vm,
>
> pr_debug("CPU[0x%llx 0x%llx] -> GPU[0x%llx 0x%llx]\n", start,
last,
> gpu_start, gpu_end);
> +
> + if (!amdgpu_vm_ready(vm)) {
> + pr_debug("VM not ready, canceling unmap\n");
> + return -EINVAL;
> + }
> +
The change looks fine, but it is unnecessary after checking the details
of amdgpu_vm_ready.
It is impossible the "DRM entity may already be killed when the svm
worker try to update gpu vm",
guessing the svm worker is p->svms.restore_work, svm_range_list_fini
cancel the work or wait for
it to finish. kfd_process_wq_release does svm_range_list_fini first,
then fput(pdd->drm_file) to reduce
the vm refcount, then calls amdgpu_vm_fini, to destroy drm sched entity.
If you see the real issue, please post the dmesg log to help understand.
Regards,
Philip
> return amdgpu_vm_update_range(adev, vm, false, true, true,
false, NULL, gpu_start,
> gpu_end, init_pte_value, 0, 0, NULL, NULL,
> fence); @@ -1443,6 +1449,11 @@
> svm_range_map_to_gpu(struct kfd_process_device *pdd, struct
svm_range *prange,
> pr_debug("svms 0x%p [0x%lx 0x%lx] readonly %d\n", prange->svms,
> last_start, last_start + npages - 1, readonly);
>
> + if (!amdgpu_vm_ready(vm)) {
> + pr_debug("VM not ready, canceling map\n");
> + return -EINVAL;
> + }
> +
> for (i = offset; i < offset + npages; i++) {
> uint64_t gpu_start;
> uint64_t gpu_end;