On 12/22/25 04:33, Bingxi Guo wrote:
> issue:When one of multiple processes sharing the same amdgpu device

That is illegal to begin with. Multiple processes shouldn't share the
same device fd.

> fd is killed, amdgpu_flush runs but amdgpu_drm_release does not, so the
> vm's entitys have been stopped but bos still alive. Later, when the
> KFD fd is closed, the driver unmaps BOs from the GPU VM, clears the
> freed BO list, and normally submits SDMA jobs plus an
> amdgpu_tlb_fence_work to wait on the job's finished fences which will
> not be signaled.

The problem is in the KFD code, not the SDMA backend. The check here is
especially racy, since the entity status can change as soon as you drop
the lock.

So clear NAK to that.

Regards,
Christian.

>
> add check if entity is NULL or has been stopped, if so, don't submit
> sdma jobs and create amdgpu_tlb_fence_work
>
> Signed-off-by: Bingxi Guo <[email protected]>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> index 36805dcfa159..e57d496a06e1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> @@ -49,6 +49,13 @@ static int amdgpu_vm_sdma_alloc_job(struct amdgpu_vm_update_params *p,
>  	unsigned int ndw;
>  	int r;
>
> +	spin_lock(&entity->lock);
> +	if (entity->stopped) {
> +		spin_unlock(&entity->lock);
> +		return -EINVAL;
> +	}
> +	spin_unlock(&entity->lock);
> +
>  	/* estimate how many dw we need */
>  	ndw = AMDGPU_VM_SDMA_MIN_NUM_DW;
>  	if (p->pages_addr)
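
To illustrate why a check of this shape is racy, here is a minimal
userspace sketch. It is not kernel code: struct toy_entity, the toy_*
helpers and the window callback are all hypothetical names invented for
this example, and the "lock" is just a flag used to replay one possible
interleaving in a single thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a scheduler entity; not the kernel struct. */
struct toy_entity {
	bool locked;	/* models entity->lock in a single-threaded replay */
	bool stopped;	/* models entity->stopped */
};

static void toy_lock(struct toy_entity *e)
{
	assert(!e->locked);
	e->locked = true;
}

static void toy_unlock(struct toy_entity *e)
{
	assert(e->locked);
	e->locked = false;
}

/* What a concurrent teardown path could do once it gets the lock. */
static void toy_stop(struct toy_entity *e)
{
	toy_lock(e);
	e->stopped = true;
	toy_unlock(e);
}

/*
 * Same shape as the proposed check. `window` replays whatever another
 * thread manages to run between the unlock and the submission.
 *
 * Returns -1 if the check rejected the job. Otherwise reports the
 * entity state at the moment the job would actually be pushed:
 * 0 = still running, 1 = already stopped.
 */
static int toy_check_then_submit(struct toy_entity *e,
				 void (*window)(struct toy_entity *))
{
	toy_lock(e);
	if (e->stopped) {
		toy_unlock(e);
		return -1;
	}
	toy_unlock(e);

	if (window)
		window(e);	/* the check result is stale from here on */

	return e->stopped ? 1 : 0;
}
```

Calling toy_check_then_submit(&e, toy_stop) on a fresh entity passes the
check and still ends up submitting to a stopped entity, which is the
case the check was meant to prevent.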
