On 10/16, Alex Deucher wrote:
> On Thu, Oct 16, 2025 at 5:00 PM Rodrigo Siqueira <[email protected]> wrote:
> >
> > When trying to unload amdgpu in the SteamDeck (TTY mode), the following
> > set of errors happens and the system gets unstable:
> >
> > [..]
> >  [drm] Initialized amdgpu 3.64.0 for 0000:04:00.0 on minor 0
> >  amdgpu 0000:04:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on gfx_0.0.0 (-110).
> >  amdgpu 0000:04:00.0: amdgpu: ib ring test failed (-110).
> > [..]
> >  amdgpu 0000:04:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001E SMN_C2PMSG_82:0x00000000
> >  amdgpu 0000:04:00.0: amdgpu: Failed to disable gfxoff!
> >  amdgpu 0000:04:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001E SMN_C2PMSG_82:0x00000000
> >  amdgpu 0000:04:00.0: amdgpu: Failed to disable gfxoff!
> > [..]
> >
> > When the driver initializes the GPU, the PSP validates all the firmware
> > loaded, and after that, it is not possible to load any other firmware
> > unless the device is reset. What is happening in the load/unload
> > situation is that PSP halts the GC engine because it suspects that
> > something is amiss. To address this issue, this commit ensures that the
> > GPU is reset (mode 2 reset) in the unload sequence.
> >
> > Suggested-by: Alex Deucher <[email protected]>
> > Signed-off-by: Rodrigo Siqueira <[email protected]>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 13 ++++++++++++-
> >  1 file changed, 12 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 0d5585bc3b04..78009b93855b 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -3613,7 +3613,7 @@ static void amdgpu_device_smu_fini_early(struct amdgpu_device *adev)
> >
> >  static int amdgpu_device_ip_fini_early(struct amdgpu_device *adev)
> >  {
> > -       int i, r;
> > +       int i, r, current_reset_method;
> >
> >         for (i = 0; i < adev->num_ip_blocks; i++) {
> >                 if (!adev->ip_blocks[i].version->funcs->early_fini)
> > @@ -3649,6 +3649,17 @@ static int amdgpu_device_ip_fini_early(struct amdgpu_device *adev)
> >                                 "failed to release exclusive mode on fini\n");
> >         }
> >
> > +       /* Reset the device before entirely removing it to avoid load issues
> > +        * caused by firmware validation.
> > +        */
> > +       current_reset_method = amdgpu_reset_method;
> > +       amdgpu_reset_method = AMD_RESET_METHOD_MODE2;
> 
> This would only be needed if the user has overridden the reset method
> via a kernel module parameter.  If they've done that they get to keep
> the pieces.  MODE2 reset is only used on certain chips so this won't
> work generally. Better to just drop this.  amdgpu_asic_reset() will
> automatically default to the right reset method for the chip.
> Alternative is to set AMD_RESET_METHOD_NONE which is the automatic
> setting.

I'll send a V3 without the mode 2 reset method setup.
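
To illustrate the point about the automatic default, here is a toy model in
plain C. It is only a sketch under stated assumptions: the enum, struct, and
function names below are hypothetical stand-ins and not the amdgpu
definitions. The only thing it models is that, when the override is left at
its automatic value, each chip falls back to its own default reset method, so
the unload path has no reason to force one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; NOT the amdgpu definitions. */
enum reset_method {
	RESET_METHOD_NONE = -1,	/* automatic: let the chip decide */
	RESET_METHOD_MODE1,
	RESET_METHOD_MODE2,
};

struct fake_chip {
	enum reset_method default_method;	/* chip's own default */
	enum reset_method last_used;		/* set by fake_asic_reset() */
};

/* Models a user-settable override; stays NONE unless the user changes it. */
static enum reset_method reset_method_param = RESET_METHOD_NONE;

/* An explicit override wins; otherwise use the chip's default. */
static enum reset_method pick_reset_method(const struct fake_chip *chip)
{
	if (reset_method_param != RESET_METHOD_NONE)
		return reset_method_param;
	return chip->default_method;
}

/* Records which method was chosen instead of touching hardware. */
static int fake_asic_reset(struct fake_chip *chip)
{
	assert(chip != NULL);
	chip->last_used = pick_reset_method(chip);
	return 0;
}

/* Unload path: reset without saving, forcing, or restoring any method. */
static int fake_fini_early(struct fake_chip *chip)
{
	return fake_asic_reset(chip);
}
```

In this model, forcing one method in the unload path would only matter when
the user has already set the override, which is the case discussed above.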

Thanks a lot

> 
> Alex
> 
> > +       r = amdgpu_asic_reset(adev);
> > +       if (r)
> > +               dev_err(adev->dev, "asic reset on %s failed\n", __func__);
> > +
> > +       amdgpu_reset_method = current_reset_method;
> > +
> >         return 0;
> >  }
> >
> > --
> > 2.51.0
> >

-- 
Rodrigo Siqueira
