Thanks for sharing the previous context on this. After a further code check, I found something interesting in si_dpm_late_init regarding its early return based on dpm_enabled. The sequence to enable the temperature range at boot is:

1. In the IP hw_init (si_dpm_hw_init), which runs ahead of late_init, the temperature range is set as part of si_thermal_start_thermal_controller.
2. adev->pm.dpm_enabled is set to true unconditionally in si_dpm_hw_init.
3. In si_dpm_late_init, the temperature range setting is still executed, because the check there is "if (!adev->pm.dpm_enabled) return 0", which only returns early when dpm is not enabled.

It looks like we should skip it when dpm, including the temperature range, has already been set up.
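
To make the sequence above concrete, here is a minimal userspace sketch, not driver code. It assumes the temperature range request can be modelled as a simple counter. All model_* names and the skip_when_enabled parameter are hypothetical, introduced only for illustration; only the dpm_enabled flag and the "if (!dpm_enabled) return 0" check mirror what is in si_dpm.c.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the device state. Only dpm_enabled mirrors a
 * field named in this thread (adev->pm.dpm_enabled). */
struct model_dev {
    bool dpm_enabled;
    int temp_range_requests; /* how often the range was sent to the FW */
};

/* Models si_set_temperature_range(): it only counts the request here. */
static int model_set_temperature_range(struct model_dev *dev)
{
    dev->temp_range_requests++;
    return 0;
}

/* Models steps 1 and 2: hw_init programs the range, then sets the flag. */
static int model_hw_init(struct model_dev *dev)
{
    (void)model_set_temperature_range(dev); /* step 1 */
    dev->dpm_enabled = true;                /* step 2, unconditional */
    return 0;
}

/* Models step 3. With skip_when_enabled == false this follows the current
 * check; with true it models the suggested skip (an assumption, not a
 * tested fix). */
static int model_late_init(struct model_dev *dev, bool skip_when_enabled)
{
    if (!dev->dpm_enabled)
        return 0;
    if (skip_when_enabled)
        return 0;
    return model_set_temperature_range(dev);
}

/* Runs hw_init followed by late_init and returns the number of
 * temperature range requests issued during this modelled boot. */
static int model_boot(bool skip_when_enabled)
{
    struct model_dev dev = { false, 0 };

    model_hw_init(&dev);
    model_late_init(&dev, skip_when_enabled);
    return dev.temp_range_requests;
}
```

Calling model_boot(false) issues the request twice in one boot, while model_boot(true) issues it once. The sketch says nothing about FW timing; it only shows that the current check does not prevent the second request.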

So I guess the random failure in enabling/disabling the thermal alert is possibly because the amdgpu driver does not check the return value when setting the temperature range in the hw_init phase: the FW may randomly not have finished that request yet, and the driver immediately issues the same setting again, so the FW complains and returns an error code to the driver. This may explain why a delay works in such a case. Or am I understanding this wrongly due to my own limitations?

Hi Zhenneng,

Additionally, can you please try modifying the check in si_dpm_late_init to return early when adev->pm.dpm_enabled is true?

[I also dropped some public mailing lists, as this issue looks amdgpu driver specific] :)

> -----Original Message-----
> From: 李真能 <lizhenn...@kylinos.cn>
> Sent: Monday, March 13, 2023 9:05 AM
> To: Chen, Guchun <guchun.c...@amd.com>; Deucher, Alexander
> <alexander.deuc...@amd.com>
> Cc: David Airlie <airl...@linux.ie>; Pan, Xinhui <xinhui....@amd.com>;
> linux-ker...@vger.kernel.org; dri-de...@lists.freedesktop.org;
> amd-g...@lists.freedesktop.org; Daniel Vetter <dan...@ffwll.ch>;
> Koenig, Christian <christian.koe...@amd.com>
> Subject: Re: [PATCH] drm/amdgpu: resove reboot exception for si oland
>
> This bug was first reported here:
>
> https://lore.kernel.org/lkml/1a620e7c-5b71-3d16-001a-0d79b292a...@amd.com/
>
> I modified the patch according to the mailing list discussion, and I ran
> the reboot test tens of thousands of times across about 10 machines on
> arm64; no bug was reported.
>
> On 2023/3/10 16:18, Chen, Guchun wrote:
> >> -----Original Message-----
> >> From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of
> >> Zhenneng Li
> >> Sent: Friday, March 10, 2023 3:40 PM
> >> To: Deucher, Alexander <alexander.deuc...@amd.com>
> >> Cc: David Airlie <airl...@linux.ie>; Pan, Xinhui
> >> <xinhui....@amd.com>; linux-ker...@vger.kernel.org;
> >> dri-de...@lists.freedesktop.org; Zhenneng Li <lizhenn...@kylinos.cn>;
> >> amd-gfx@lists.freedesktop.org; Daniel Vetter <dan...@ffwll.ch>;
> >> Koenig, Christian <christian.koe...@amd.com>
> >> Subject: [PATCH] drm/amdgpu: resove reboot exception for si oland
> >>
> >> During reboot tests on an arm64 platform, boot may fail.
> >>
> >> The error messages are as follows:
> >> [ 6.996395][ 7] [ T295] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR* late_init of IP block <si_dpm> failed -22
> >> [ 7.006919][ 7] [ T295] amdgpu 0000:04:00.0: amdgpu_device_ip_late_init failed
> >> [ 7.014224][ 7] [ T295] amdgpu 0000:04:00.0: Fatal error during GPU init
> >> ---
> >>  drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c | 3 ---
> >>  1 file changed, 3 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c b/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
> >> index d6d9e3b1b2c0..dee51c757ac0 100644
> >> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
> >> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
> >> @@ -7632,9 +7632,6 @@ static int si_dpm_late_init(void *handle)
> >>      if (!adev->pm.dpm_enabled)
> >>          return 0;
> >>
> >> -    ret = si_set_temperature_range(adev);
> >> -    if (ret)
> >> -        return ret;
> > si_set_temperature_range should be platform agnostic. Can you please
> > elaborate more?
> >
> > Regards,
> > Guchun
> >
> >>  #if 0 //TODO ?
> >>      si_dpm_powergate_uvd(adev, true);
> >>  #endif
> >> --
> >> 2.25.1