On 11.10.2017 at 18:30, Michel Dänzer wrote:
On 28/09/17 04:55 PM, Nicolai Hähnle wrote:
From: Nicolai Hähnle <[email protected]>

Highly concurrent Piglit runs can trigger a race condition where a pending
SDMA job on a buffer object is never executed because the corresponding
process is killed (perhaps due to a crash). Since the job's fences were
never signaled, the buffer object was effectively leaked. Worse, the
buffer was stuck wherever it happened to be at the time, possibly in VRAM.

The symptom was user space processes stuck in interruptible waits with
kernel stacks like:

     [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250
     [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0
     [<ffffffffbc5e82d2>] reservation_object_wait_timeout_rcu+0x1c2/0x300
     [<ffffffffc03ce56f>] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 [ttm]
     [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm]
     [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm]
     [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm]
     [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm]
     [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 [amdgpu]
     [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu]
     [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 [amdgpu]
     [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 [amdgpu]
     [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm]
     [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
     [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0
     [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90
     [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad
     [<ffffffffffffffff>] 0xffffffffffffffff

Signed-off-by: Nicolai Hähnle <[email protected]>
Acked-by: Christian König <[email protected]>
Since Christian's commit which introduced the problem (6af0883ed977
"drm/amdgpu: discard commands of killed processes") is in 4.14, we need
a solution for that. Should we backport Nicolai's five commits fixing
the problem, or revert 6af0883ed977?


While looking into this, I noticed that the following commits by
Christian in 4.14 each also cause hangs for me when running the piglit
gpu profile on Tonga:

457e0fee04b0 "drm/amdgpu: remove the GART copy hack"
1d00402b4da2 "drm/amdgpu: fix amdgpu_ttm_bind"

Are there fixes for these that can be backported to 4.14, or do they
need to be reverted there?

Well, I'm not aware that either of those two can cause problems.

For "drm/amdgpu: remove the GART copy hack" in particular, I don't have the slightest idea how that could be an issue. It just removes an unused code path.

Is amd-staging-drm-next stable for you?

Thanks,
Christian.
_______________________________________________
amd-gfx mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
