Thank you for your contribution! Sashiko AI review found 2 potential issue(s) 
to consider:

Pre-existing issues:
- [Critical] The indirect buffer (IB) allocation size is hardcoded to 64 bytes, 
but the initialization loop writes up to 1024 bytes based on the ring's 
alignment mask, leading to a severe out-of-bounds write.
- [High] The error handling branch for `dma_fence_wait` is dead code because 
uninterruptible waits with infinite timeouts do not return errors, resulting in 
silently ignored hardware failures.
--

commit 154fe15a3810311348d132047ef742f38daab7b7
Author: Wentao Liang <[email protected]>
drm/amdgpu: fix fence reference leak in amdgpu_gfx_run_cleaner_shader_job

This commit addresses a potential fence reference leak in the error path of
amdgpu_gfx_run_cleaner_shader_job(). It adds a dma_fence_put() before jumping
to the error label when dma_fence_wait() fails.

>     However, if dma_fence_wait() fails (e.g., due to a signal interruption),
>     the code jumps to the error label without calling dma_fence_put(),
>     resulting in a reference leak.

[Severity: High]
This isn't a bug introduced by this patch, but is it possible for
dma_fence_wait() to fail due to a signal interruption here?

Since dma_fence_wait(f, false) calls dma_fence_wait_timeout() with
intr = false and an infinite timeout, it seems it cannot be interrupted by
signals or timeout, and will always return 0.

[Severity: Critical]
This is a pre-existing issue, but can the initialization loop write past the
end of the allocation?

Looking earlier in amdgpu_gfx_run_cleaner_shader_job():

drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c:amdgpu_gfx_run_cleaner_shader_job() {
        ...
        r = amdgpu_job_alloc_with_ib(ring->adev, &entity, owner,
                                     64, 0, &job,
                                     AMDGPU_KERNEL_JOB_ID_CLEANER_SHADER);
        ...
        ib = &job->ibs[0];
        for (i = 0; i <= ring->funcs->align_mask; ++i)
                ib->ptr[i] = ring->funcs->nop;
        ...
}

The indirect buffer is allocated with a hardcoded size of 64 bytes, but the
loop writes up to ring->funcs->align_mask + 1 dwords.

If the alignment mask is 0xff (255) for modern GFX compute rings, this would
write 256 dwords (1024 bytes), which exceeds the 64 byte allocation and may
corrupt subsequent blocks in the memory pool.

> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> index ff5a55f5f3c9e..2784306400a81 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> @@ -1729,8 +1729,10 @@ static int amdgpu_gfx_run_cleaner_shader_job(struct 
> amdgpu_ring *ring)
>       f = amdgpu_job_submit(job);
>  
>       r = dma_fence_wait(f, false);
> -     if (r)
> +     if (r) {
> +             dma_fence_put(f);
>               goto err;
> +     }
        ^^^^

[Severity: High]
This is a pre-existing issue, but does this make the error handling branch
dead code?

If dma_fence_wait() always returns 0, the newly added dma_fence_put() would
be unreachable.

Additionally, if the cleaner shader job hangs and the fence completes with an
error, would the failure be silently ignored since f->error is not checked?

>  
>       dma_fence_put(f);
>

-- 
Sashiko AI review ยท 
https://sashiko.dev/#/patchset/[email protected]?part=1

Reply via email to