Thanks for sharing previous context on this. 

After a further code check, I found something interesting in si_dpm_late_init
regarding its early-return check on dpm_enabled.
The sequence that sets the temperature range during boot is:
1. In the IP hw_init (si_dpm_hw_init), which runs ahead of late_init, the
temperature range is set as part of si_thermal_start_thermal_controller.
2. si_dpm_hw_init sets adev->pm.dpm_enabled to true unconditionally.
3. In si_dpm_late_init, the temperature range setting is still executed,
because the check there is "if (!adev->pm.dpm_enabled) return 0". It looks
like we should skip it when dpm, including the temperature range, has already
been set up.
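
The three steps above can be modeled in a few lines of standalone C. This is
only a sketch: struct model_dev, model_set_temperature_range(),
model_hw_init(), model_late_init() and model_boot() are hypothetical stand-ins
for the driver structures and functions, not the actual si_dpm.c code. It just
counts how often the temperature range gets programmed during one boot.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the relevant amdgpu device state. */
struct model_dev {
	bool dpm_enabled;        /* models adev->pm.dpm_enabled */
	int temp_range_writes;   /* how many times the range was programmed */
};

/* Models si_set_temperature_range(); always succeeds in this sketch. */
static int model_set_temperature_range(struct model_dev *dev)
{
	dev->temp_range_writes++;
	return 0;
}

/* Steps 1 and 2: hw_init programs the range, then sets dpm_enabled. */
static int model_hw_init(struct model_dev *dev)
{
	/* This model does not look at the return value. */
	(void)model_set_temperature_range(dev);
	dev->dpm_enabled = true;
	return 0;
}

/* Step 3: late_init only bails out when DPM is NOT enabled. */
static int model_late_init(struct model_dev *dev)
{
	if (!dev->dpm_enabled)
		return 0;
	return model_set_temperature_range(dev);
}

/* Runs one boot and returns the number of temperature range writes. */
static int model_boot(void)
{
	struct model_dev dev = { .dpm_enabled = false, .temp_range_writes = 0 };

	model_hw_init(&dev);
	model_late_init(&dev);
	return dev.temp_range_writes;
}
```

In this model, model_boot() reports two writes: one from hw_init and one from
late_init, because the check in late_init never triggers once hw_init has set
dpm_enabled.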

So my guess is that the random failure in enabling/disabling the thermal alert
is possibly because the amdgpu driver does not check the return value when
setting the temperature range in the hw_init phase. The FW may randomly not
have finished that request yet when the driver immediately issues the same
setting cycle again, so the FW complains and returns an error code to the
driver. This may explain why a delay helps in such a case. Or am I
misunderstanding this because of my limited knowledge here?
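
To make the guess above concrete, here is a toy model of it in standalone C.
Everything in it is an assumption for illustration: the real firmware
interface is not modeled, struct fake_fw, fake_fw_tick(), fake_fw_set_range()
and second_request_result() are hypothetical names, and the "busy for a few
ticks, reject a request while busy" behaviour is only my hypothesis, not
something confirmed from firmware documentation.

```c
#include <errno.h>

/* Hypothetical firmware model: busy for a few ticks after each request. */
struct fake_fw {
	int busy_ticks;   /* > 0 while the previous request is still in flight */
};

#define FAKE_FW_LATENCY 3   /* arbitrary number of ticks per request */

/* Lets time pass: one tick of firmware progress. */
static void fake_fw_tick(struct fake_fw *fw)
{
	if (fw->busy_ticks > 0)
		fw->busy_ticks--;
}

/* Accepts a request only when idle, otherwise rejects it with -EINVAL. */
static int fake_fw_set_range(struct fake_fw *fw)
{
	if (fw->busy_ticks > 0)
		return -EINVAL;
	fw->busy_ticks = FAKE_FW_LATENCY;
	return 0;
}

/*
 * Issues the request twice, like hw_init followed by late_init, waiting
 * delay_ticks in between. Returns the result of the second request.
 */
static int second_request_result(int delay_ticks)
{
	struct fake_fw fw = { .busy_ticks = 0 };
	int i;

	(void)fake_fw_set_range(&fw);   /* first request, result unchecked */
	for (i = 0; i < delay_ticks; i++)
		fake_fw_tick(&fw);
	return fake_fw_set_range(&fw);  /* second, identical request */
}
```

With no delay, second_request_result(0) returns -EINVAL, which is -22 on
Linux, the same value as in the reported late_init failure. With a delay of at
least FAKE_FW_LATENCY ticks the second request succeeds, which is how a delay
would hide the problem in this model.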

Hi Zhenneng,

Additionally, can you please try to modify the check to return early in 
si_dpm_late_init when adev->pm.dpm_enabled is true?
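
As a rough illustration of the experiment I am asking for, the sketch below
models late_init with the check inverted. It is standalone C, not a patch:
struct exp_dev, exp_late_init() and exp_writes_from_late_init() are
hypothetical names, and the counter increment only stands in for the call to
si_set_temperature_range().

```c
#include <stdbool.h>

/* Hypothetical stand-in for the device state touched by this experiment. */
struct exp_dev {
	bool dpm_enabled;       /* models adev->pm.dpm_enabled */
	int temp_range_writes;  /* counts temperature range programming */
};

/* Models si_dpm_late_init() with the check changed for the experiment. */
static int exp_late_init(struct exp_dev *dev)
{
	/* Experiment: return early when DPM has already been enabled. */
	if (dev->dpm_enabled)
		return 0;

	dev->temp_range_writes++;   /* stands in for si_set_temperature_range() */
	return 0;
}

/* Returns the number of writes late_init adds for a given DPM state. */
static int exp_writes_from_late_init(bool dpm_enabled)
{
	struct exp_dev dev = { .dpm_enabled = dpm_enabled, .temp_range_writes = 0 };

	exp_late_init(&dev);
	return dev.temp_range_writes;
}
```

With dpm_enabled set to true, as hw_init leaves it, exp_late_init() adds no
further temperature range write in this model. The other branch is only there
because the check is inverted literally.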

[I have also dropped some public mailing lists, as this issue looks amdgpu
driver specific] :)

> -----Original Message-----
> From: 李真能 <lizhenn...@kylinos.cn>
> Sent: Monday, March 13, 2023 9:05 AM
> To: Chen, Guchun <guchun.c...@amd.com>; Deucher, Alexander
> <alexander.deuc...@amd.com>
> Cc: David Airlie <airl...@linux.ie>; Pan, Xinhui <xinhui....@amd.com>;
> linux-ker...@vger.kernel.org; dri-de...@lists.freedesktop.org; amd-
> g...@lists.freedesktop.org; Daniel Vetter <dan...@ffwll.ch>; Koenig, Christian
> <christian.koe...@amd.com>
> Subject: Re: [PATCH] drm/amdgpu: resove reboot exception for si oland
> 
> This bug is first reported here:
> 
> https://lore.kernel.org/lkml/1a620e7c-5b71-3d16-001a-
> 0d79b292a...@amd.com/
> 
> I modified the patch according to the mailing list discussion, and ran reboot
> tests tens of thousands of times on about 10 arm64 machines; no bug was
> reported.
> 
> On 2023/3/10 16:18, Chen, Guchun wrote:
> >> -----Original Message-----
> >> From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of
> >> Zhenneng Li
> >> Sent: Friday, March 10, 2023 3:40 PM
> >> To: Deucher, Alexander <alexander.deuc...@amd.com>
> >> Cc: David Airlie <airl...@linux.ie>; Pan, Xinhui
> >> <xinhui....@amd.com>; linux-ker...@vger.kernel.org;
> >> dri-de...@lists.freedesktop.org; Zhenneng Li <lizhenn...@kylinos.cn>;
> >> amd-gfx@lists.freedesktop.org; Daniel Vetter <dan...@ffwll.ch>;
> >> Koenig, Christian <christian.koe...@amd.com>
> >> Subject: [PATCH] drm/amdgpu: resove reboot exception for si oland
> >>
> >> During reboot tests on the arm64 platform, boot may fail.
> >>
> >> The error messages are as follows:
> >> [    6.996395][ 7] [  T295] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR* late_init of IP block <si_dpm> failed -22
> >> [    7.006919][ 7] [  T295] amdgpu 0000:04:00.0: amdgpu_device_ip_late_init failed
> >> [    7.014224][ 7] [  T295] amdgpu 0000:04:00.0: Fatal error during GPU init
> >> ---
> >>   drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c | 3 ---
> >>   1 file changed, 3 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
> >> b/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
> >> index d6d9e3b1b2c0..dee51c757ac0 100644
> >> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
> >> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
> >> @@ -7632,9 +7632,6 @@ static int si_dpm_late_init(void *handle)
> >>    if (!adev->pm.dpm_enabled)
> >>            return 0;
> >>
> >> -  ret = si_set_temperature_range(adev);
> >> -  if (ret)
> >> -          return ret;
> > si_set_temperature_range should be platform agnostic. Can you please
> > elaborate more?
> >
> > Regards,
> > Guchun
> >
> >>   #if 0 //TODO ?
> >>    si_dpm_powergate_uvd(adev, true);
> >>   #endif
> >> --
> >> 2.25.1
