Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
Pre-existing issues: - [High] Early fence signaling allows concurrent unmapping of shared BOs before the hung GPU is stopped, reproducing the exact pagefaults the patch intended to fix. - [High] Concurrent GPU progress during hang recovery causes the recovery logic to double-increment the fence and silently drop a subsequent, unexecuted submit. -- commit 3c515db3794c4f1061da14c3adffcf64bbdd2fc5 Author: Jie Zhang <[email protected]> drm/msm: Recover HW before retire hung submit During recovery, it is not safe to retire the hung submit before we recover the GPU. Retiring the submit triggers BO free and that can result in GPU pagefaults since the GPU may be actively accessing those BOs. To fix this, retire the submits after gpu recovery is complete in recover_worker(). > diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c > index 18ed00e5f143b..9ac7740a87f01 100644 > --- a/drivers/gpu/drm/msm/msm_gpu.c > +++ b/drivers/gpu/drm/msm/msm_gpu.c > @@ -552,11 +552,11 @@ static void recover_worker(struct kthread_work *work) > msm_update_fence(ring->fctx, fence); > } [Severity: High] This isn't a bug introduced by this patch, but can concurrent GPU progress during hang recovery cause the recovery logic to double-increment the fence and silently drop a subsequent, unexecuted submit? In recover_worker(), the hung submit is identified early using: submit = find_submit(cur_ring, cur_ring->memptrs->fence + 1); Because the GPU is not halted during the time it takes to capture the crash state, it may legitimately finish the hung submit and concurrently write the next fence value to ring->memptrs->fence. Later in the function (just above this hunk), the loop reads the updated fence: uint32_t fence = ring->memptrs->fence; ... if (ring == cur_ring) ring->memptrs->fence = ++fence; This explicitly increments the already-updated fence and writes the next value back to shared memory. retire_submits() will then retire the unexecuted submit corresponding to that next fence, skipping it entirely while userspace receives a success signal. [Severity: High] This is a pre-existing issue, but does early fence signaling still allow concurrent unmapping of shared BOs before the hung GPU is stopped, reproducing the exact pagefaults this patch intended to fix? While this patch moves retire_submits() to happen after gpu->funcs->recover(), the call to msm_update_fence() remains before the recovery. Because msm_update_fence() advances the fence context's completed_fence to include the hung submit, any concurrent thread calling dma_fence_is_signaled() (such as userspace polling a sync_file) will observe the updated completed_fence. This causes dma_fence_is_signaled() to wake up all waiters and allows them to unmap the shared memory from the GPU page tables while the GPU is still actively executing. Should the loop containing msm_update_fence() also be moved after gpu->funcs->recover() to completely close this window? > > + gpu->funcs->recover(gpu); > + > /* retire completed submits, plus the one that hung: */ > retire_submits(gpu); > > - gpu->funcs->recover(gpu); > - > /* > * Replay all remaining submits starting with highest priority > * ring -- Sashiko AI review ยท https://sashiko.dev/#/patchset/[email protected]?part=2
