Thank you for your contribution! Sashiko AI review found 1 potential issue(s) 
to consider:
- [High] The refactored `amdgpu_gfx_handle_priv_fault` removes the `if 
(!adev->gfx.disable_kq)` guard, which can cause spurious scheduler faults and 
GPU resets when kernel queues are disabled.
--

commit 29fea06b4689e38bf654d74c9ec057d4896972a0
Author: Ulisses Paixao <[email protected]>

drm/amd/amdgpu: remove duplicated code in gfx_v11 and gfx_v12

This commit refactors duplicated ring-searching logic from
gfx_v11_0_handle_priv_fault and gfx_v12_0_handle_priv_fault into a common
function amdgpu_gfx_handle_priv_fault in amdgpu_gfx.c.

> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> index 0506b90f318e4..120843a8d1ba2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
[ ... ]
> @@ -855,6 +855,40 @@ int amdgpu_gfx_enable_kgq(struct amdgpu_device *adev, 
> int xcc_id)
>       return r;
>  }
>  
> +/**
> + * amdgpu_gfx_handle_priv_fault - Handle privileged instruction fault
> + *
> + * @adev: amdgpu_device pointer
> + * @me_id: micro-engine ID of the faulty ring
> + * @pipe_id: pipe ID of the faulty ring
> + * @queue_id: queue ID of the faulty ring
> + *
> + * This function handles privileged instruction faults by identifying
> + * the faulty ring (gfx or compute) and triggering a scheduler fault
> + */
> +void amdgpu_gfx_handle_priv_fault(struct amdgpu_device *adev,
> +                                     u8 me_id, u8 pipe_id, u8 queue_id)
> +{
> +     struct amdgpu_ring *ring;
> +     int i;
> +
> +     for (i = 0; i < adev->gfx.num_gfx_rings; i++) {
> +             ring = &adev->gfx.gfx_ring[i];
> +             if (ring->me == me_id && ring->pipe == pipe_id &&
> +                 ring->queue == queue_id) {
> +                     drm_sched_fault(&ring->sched);

[Severity: High]
Is it possible we lost the check for !adev->gfx.disable_kq during this
refactoring?

If kernel queues are disabled (e.g. when MES is managing user queues), the
kernel rings might still exist but not be actively handling submissions.
If a privileged instruction fault originates from a user queue with an
overlapping me_id, pipe_id, and queue_id, could this match an idle kernel
ring and incorrectly trigger drm_sched_fault(&ring->sched)?

This could lead to an unwarranted GPU reset via the TDR handler rather than
leaving the event for proper user queue fault handling.

> +                     return;
> +             }
> +     }
[ ... ]
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> index 6346f16c4e613..83fc2dbe8f7f5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> @@ -6684,37 +6684,12 @@ static void gfx_v11_0_handle_priv_fault(struct 
> amdgpu_device *adev,
>                                       struct amdgpu_iv_entry *entry)
>  {
>       u8 me_id, pipe_id, queue_id;
> -     struct amdgpu_ring *ring;
> -     int i;
>  
>       me_id = (entry->ring_id & 0x0c) >> 2;
>       pipe_id = (entry->ring_id & 0x03) >> 0;
>       queue_id = (entry->ring_id & 0x70) >> 4;
>  
> -     if (!adev->gfx.disable_kq) {

[Severity: High]
For context, here is where the guard was dropped from the original version
specific handler.

> -             switch (me_id) {
> -             case 0:
[ ... ]

-- 
Sashiko AI review ยท 
https://sashiko.dev/#/patchset/[email protected]?part=1

Reply via email to