On 2017年09月28日 22:55, Nicolai Hähnle wrote:
From: Nicolai Hähnle <[email protected]> Highly concurrent Piglit runs can trigger a race condition where a pending SDMA job on a buffer object is never executed because the corresponding process is killed (perhaps due to a crash). Since the job's fences were never signaled, the buffer object was effectively leaked. Worse, the buffer was stuck wherever it happened to be at the time, possibly in VRAM.
Indeed good catch. Cheers, David Zhou
The symptom was user space processes stuck in interruptible waits with kernel stacks like: [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250 [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0 [<ffffffffbc5e82d2>] reservation_object_wait_timeout_rcu+0x1c2/0x300 [<ffffffffc03ce56f>] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 [ttm] [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm] [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm] [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm] [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm] [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 [amdgpu] [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu] [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 [amdgpu] [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 [amdgpu] [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm] [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu] [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0 [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90 [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad [<ffffffffffffffff>] 0xffffffffffffffff Signed-off-by: Nicolai Hähnle <[email protected]> Acked-by: Christian König <[email protected]> --- drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c index 54eb77cffd9b..32a99e980d78 100644 --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c @@ -220,22 +220,27 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched, amd_sched_entity_is_idle(entity)); amd_sched_rq_remove_entity(rq, entity); if (r) { struct amd_sched_job *job;/* Park the kernel for a moment to make sure it isn't processing* our enity. */ kthread_park(sched->thread); kthread_unpark(sched->thread); - while (kfifo_out(&entity->job_queue, &job, sizeof(job))) + while (kfifo_out(&entity->job_queue, &job, sizeof(job))) { + struct amd_sched_fence *s_fence = job->s_fence; + amd_sched_fence_scheduled(s_fence); + amd_sched_fence_finished(s_fence); + dma_fence_put(&s_fence->finished); sched->ops->free_job(job); + }}kfifo_free(&entity->job_queue); }static void amd_sched_entity_wakeup(struct dma_fence *f, struct dma_fence_cb *cb){ struct amd_sched_entity *entity = container_of(cb, struct amd_sched_entity, cb); entity->dependency = NULL;
_______________________________________________ amd-gfx mailing list [email protected] https://lists.freedesktop.org/mailman/listinfo/amd-gfx
