Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider: - [High] The refactored `amdgpu_gfx_handle_priv_fault` removes the `if (!adev->gfx.disable_kq)` guard, which can cause spurious scheduler faults and GPU resets when kernel queues are disabled. --
commit 29fea06b4689e38bf654d74c9ec057d4896972a0 Author: Ulisses Paixao <[email protected]> drm/amd/amdgpu: remove duplicated code in gfx_v11 and gfx_v12 This commit refactors duplicated ring-searching logic from gfx_v11_0_handle_priv_fault and gfx_v12_0_handle_priv_fault into a common function amdgpu_gfx_handle_priv_fault in amdgpu_gfx.c. > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c > index 0506b90f318e4..120843a8d1ba2 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c [ ... ] > @@ -855,6 +855,40 @@ int amdgpu_gfx_enable_kgq(struct amdgpu_device *adev, > int xcc_id) > return r; > } > > +/** > + * amdgpu_gfx_handle_priv_fault - Handle privileged instruction fault > + * > + * @adev: amdgpu_device pointer > + * @me_id: micro-engine ID of the faulty ring > + * @pipe_id: pipe ID of the faulty ring > + * @queue_id: queue ID of the faulty ring > + * > + * This function handles privileged instruction faults by identifying > + * the faulty ring (gfx or compute) and triggering a scheduler fault > + */ > +void amdgpu_gfx_handle_priv_fault(struct amdgpu_device *adev, > + u8 me_id, u8 pipe_id, u8 queue_id) > +{ > + struct amdgpu_ring *ring; > + int i; > + > + for (i = 0; i < adev->gfx.num_gfx_rings; i++) { > + ring = &adev->gfx.gfx_ring[i]; > + if (ring->me == me_id && ring->pipe == pipe_id && > + ring->queue == queue_id) { > + drm_sched_fault(&ring->sched); [Severity: High] Is it possible we lost the check for !adev->gfx.disable_kq during this refactoring? If kernel queues are disabled (e.g. when MES is managing user queues), the kernel rings might still exist but not be actively handling submissions. If a privileged instruction fault originates from a user queue with an overlapping me_id, pipe_id, and queue_id, could this match an idle kernel ring and incorrectly trigger drm_sched_fault(&ring->sched)? This could lead to an unwarranted GPU reset via the TDR handler rather than leaving the event for proper user queue fault handling. > + return; > + } > + } [ ... ] > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c > b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c > index 6346f16c4e613..83fc2dbe8f7f5 100644 > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c > @@ -6684,37 +6684,12 @@ static void gfx_v11_0_handle_priv_fault(struct > amdgpu_device *adev, > struct amdgpu_iv_entry *entry) > { > u8 me_id, pipe_id, queue_id; > - struct amdgpu_ring *ring; > - int i; > > me_id = (entry->ring_id & 0x0c) >> 2; > pipe_id = (entry->ring_id & 0x03) >> 0; > queue_id = (entry->ring_id & 0x70) >> 4; > > - if (!adev->gfx.disable_kq) { [Severity: High] For context, here is where the guard was dropped from the original version specific handler. > - switch (me_id) { > - case 0: [ ... ] -- Sashiko AI review ยท https://sashiko.dev/#/patchset/[email protected]?part=1
