Issue: When one of multiple processes sharing the same amdgpu device
fd is killed, amdgpu_flush runs but amdgpu_drm_release does not, so the
VM's entities have been stopped while its BOs are still alive. Later,
when the KFD fd is closed, the driver unmaps BOs from the GPU VM, clears
the freed BO list, and normally submits SDMA jobs plus an
amdgpu_tlb_fence_work to wait on the jobs' finished fences. Because the
entities are already stopped, those fences will never be signaled.

Fix: Check whether the entity is NULL or has been stopped; if so, do
not submit SDMA jobs or create an amdgpu_tlb_fence_work.

Signed-off-by: Bingxi Guo <[email protected]>
---
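For reviewers, here is a minimal user-space sketch of the pattern the
patch relies on: test a "stopped" flag under the entity lock and bail
out before any work is queued. It is not driver code. struct
model_entity, model_stop() and model_alloc_job() are hypothetical names,
and a pthread mutex stands in for the entity spinlock. Unlike the patch,
the model keeps the lock held while it accepts the job.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the scheduler entity; only the fields the
 * check touches are modelled. */
struct model_entity {
	pthread_mutex_t lock;    /* stands in for the entity spinlock */
	bool stopped;            /* set once the entity has been flushed */
	unsigned int submitted;  /* jobs accepted so far */
};

/* Models the flush path: after this, no new job may be accepted. */
static void model_stop(struct model_entity *e)
{
	pthread_mutex_lock(&e->lock);
	e->stopped = true;
	pthread_mutex_unlock(&e->lock);
}

/* Models the early return: refuse to queue work on a stopped entity. */
static int model_alloc_job(struct model_entity *e)
{
	pthread_mutex_lock(&e->lock);
	if (e->stopped) {
		pthread_mutex_unlock(&e->lock);
		return -EINVAL;
	}
	e->submitted++;
	pthread_mutex_unlock(&e->lock);
	return 0;
}

/* Usage: a job is accepted before the stop and rejected after it. */
static void model_selfcheck(void)
{
	struct model_entity e = {
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.stopped = false,
		.submitted = 0,
	};

	assert(model_alloc_job(&e) == 0);
	model_stop(&e);
	assert(model_alloc_job(&e) == -EINVAL);
	assert(e.submitted == 1);
}
```

In the model, a rejected call leaves no job behind, so nothing is left
waiting on a fence that would never signal.
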
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
index 36805dcfa159..e57d496a06e1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
@@ -49,6 +49,13 @@ static int amdgpu_vm_sdma_alloc_job(struct amdgpu_vm_update_params *p,
        unsigned int ndw;
        int r;
 
+       spin_lock(&entity->lock);
+       if (entity->stopped) {
+               spin_unlock(&entity->lock);
+               return -EINVAL;
+       }
+       spin_unlock(&entity->lock);
+
        /* estimate how many dw we need */
        ndw = AMDGPU_VM_SDMA_MIN_NUM_DW;
        if (p->pages_addr)
-- 
2.43.0
