Issue: When one of several processes sharing the same amdgpu device fd is killed, amdgpu_flush() runs but amdgpu_drm_release() does not, so the VM's entities have been stopped while its BOs are still alive. Later, when the KFD fd is closed, the driver unmaps the BOs from the GPU VM, clears the freed BO list, and, as in the normal path, submits SDMA jobs plus an amdgpu_tlb_fence_work that waits on the jobs' finished fences. In this state those fences will never be signaled.

Add a check for whether the entity is NULL or has been stopped; if so, do not submit SDMA jobs or create an amdgpu_tlb_fence_work.

Signed-off-by: Bingxi Guo <[email protected]>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
index 36805dcfa159..e57d496a06e1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
@@ -49,6 +49,13 @@ static int amdgpu_vm_sdma_alloc_job(struct amdgpu_vm_update_params *p,
 	unsigned int ndw;
 	int r;
 
+	spin_lock(&entity->lock);
+	if (entity->stopped) {
+		spin_unlock(&entity->lock);
+		return -EINVAL;
+	}
+	spin_unlock(&entity->lock);
+
 	/* estimate how many dw we need */
 	ndw = AMDGPU_VM_SDMA_MIN_NUM_DW;
 	if (p->pages_addr)
-- 
2.43.0
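
For readers who want to try the early-return pattern outside the kernel, here is a minimal userspace sketch of the check added in the hunk above. It is only a model under stated assumptions: struct model_entity, model_entity_init(), model_entity_stop(), model_alloc_job() and the placeholder size 64 are all hypothetical names and values invented for illustration, and a pthread mutex stands in for the entity's spinlock. It is not the driver code.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a scheduler entity (not the kernel struct). */
struct model_entity {
	pthread_mutex_t lock;	/* models the entity lock */
	bool stopped;		/* models the "stopped" flag */
};

static void model_entity_init(struct model_entity *e)
{
	pthread_mutex_init(&e->lock, NULL);
	e->stopped = false;
}

/* Models what happens to the entity once the owning process has been flushed. */
static void model_entity_stop(struct model_entity *e)
{
	pthread_mutex_lock(&e->lock);
	e->stopped = true;
	pthread_mutex_unlock(&e->lock);
}

/*
 * Models the guarded allocation: read the flag under the lock and refuse to
 * create a job (returning -EINVAL) when the entity is already stopped.
 * On success *ndw receives a placeholder size; on failure it is untouched.
 */
static int model_alloc_job(struct model_entity *e, unsigned int *ndw)
{
	bool stopped;

	pthread_mutex_lock(&e->lock);
	stopped = e->stopped;
	pthread_mutex_unlock(&e->lock);

	if (stopped)
		return -EINVAL;

	*ndw = 64;	/* placeholder value, assumed for the sketch */
	return 0;
}
```

In this model the flag is only sampled: the lock is dropped before the job would be built, so the entity could still be stopped right after the check. Whether that window matters for the real driver is not something the sketch can answer.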
