On Thu, Oct 16, 2025 at 5:00 PM Rodrigo Siqueira <[email protected]> wrote:
>
> When trying to unload amdgpu in the SteamDeck (TTY mode), the following
> set of errors happens and the system gets unstable:
>
> [..]
>  [drm] Initialized amdgpu 3.64.0 for 0000:04:00.0 on minor 0
>  amdgpu 0000:04:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on gfx_0.0.0 (-110).
>  amdgpu 0000:04:00.0: amdgpu: ib ring test failed (-110).
> [..]
>  amdgpu 0000:04:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001E SMN_C2PMSG_82:0x00000000
>  amdgpu 0000:04:00.0: amdgpu: Failed to disable gfxoff!
>  amdgpu 0000:04:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001E SMN_C2PMSG_82:0x00000000
>  amdgpu 0000:04:00.0: amdgpu: Failed to disable gfxoff!
> [..]
>
> When the driver initializes the GPU, the PSP validates all the firmware
> loaded, and after that, it is not possible to load any other firmware
> unless the device is reset. What is happening in the load/unload
> situation is that PSP halts the GC engine because it suspects that
> something is amiss. To address this issue, this commit ensures that the
> GPU is reset (mode 2 reset) in the unload sequence.
>
> Suggested-by: Alex Deucher <[email protected]>
> Signed-off-by: Rodrigo Siqueira <[email protected]>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 0d5585bc3b04..78009b93855b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3613,7 +3613,7 @@ static void amdgpu_device_smu_fini_early(struct amdgpu_device *adev)
>
>  static int amdgpu_device_ip_fini_early(struct amdgpu_device *adev)
>  {
> -       int i, r;
> +       int i, r, current_reset_method;
>
>         for (i = 0; i < adev->num_ip_blocks; i++) {
>                 if (!adev->ip_blocks[i].version->funcs->early_fini)
> @@ -3649,6 +3649,17 @@ static int amdgpu_device_ip_fini_early(struct amdgpu_device *adev)
>                                 "failed to release exclusive mode on fini\n");
>         }
>
> +       /* Reset the device before entirely removing it to avoid load issues
> +        * caused by firmware validation.
> +        */
> +       current_reset_method = amdgpu_reset_method;
> +       amdgpu_reset_method = AMD_RESET_METHOD_MODE2;

This would only be needed if the user has overridden the reset method
via a kernel module parameter.  If they've done that, they get to keep
the pieces.  MODE2 reset is only used on certain chips, so this won't
work generally.  Better to just drop the override: amdgpu_asic_reset()
will automatically default to the right reset method for the chip.
The alternative is to set AMD_RESET_METHOD_NONE, which is the automatic
setting.
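
As a rough illustration of the fallback described above, here is a
minimal standalone sketch.  It is not the amdgpu code: every name in it
(reset_method_param, struct chip, pick_reset_method) is a hypothetical
stand-in, and it assumes a simplified model in which a user override
wins whenever one is set and the chip's own default is used otherwise.

```c
/* Hypothetical stand-ins for illustration; not the real amdgpu definitions. */
enum reset_method {
	RESET_METHOD_NONE  = -1,	/* automatic: defer to the chip's default */
	RESET_METHOD_MODE1 = 0,
	RESET_METHOD_MODE2 = 1,
};

/* A chip only knows which reset method it prefers. */
struct chip {
	const char *name;
	enum reset_method default_method;
};

/* Models a module parameter; NONE means the user did not override anything. */
static enum reset_method reset_method_param = RESET_METHOD_NONE;

/* Pick the method a reset would use: user override first, chip default otherwise. */
static enum reset_method pick_reset_method(const struct chip *c)
{
	if (reset_method_param != RESET_METHOD_NONE)
		return reset_method_param;

	return c->default_method;
}
```

With the parameter left at its automatic value, a chip that defaults to
MODE2 gets MODE2 and a chip that defaults to MODE1 gets MODE1, so in
this model the caller has no need to force a method around the reset
call.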

Alex

> +       r = amdgpu_asic_reset(adev);
> +       if (r)
> +               dev_err(adev->dev, "asic reset on %s failed\n", __func__);
> +
> +       amdgpu_reset_method = current_reset_method;
> +
>         return 0;
>  }
>
> --
> 2.51.0
>
