Hi all,

This patch fixes a thermal runaway / emergency reboot observed on
SMU IP 14.0.x APUs when power_dpm_force_performance_level is set to
"high" and a sustained heavy compute workload is run.

Background
----------
Setting force_performance_level=high is meant to bias gfxclk towards
its maximum while still allowing the firmware's thermal/PPT throttler
to clamp the clock when limits are reached. The current driver
implements this by sending SetHardMinGfxClk + SetSoftMaxGfxClk in
smu_v14_0_0_set_soft_freq_limited_range(), pinning HardMin to peak
gfxclk.

In PMFW clock arbitration, however, HardMin has higher priority than
SoftMax. Once HardMin is pinned to peak, the throttler's attempts to
lower gfxclk through SoftMax are silently overridden. Throttling is
effectively disabled for the duration of force_performance_level=high.

Symptom
-------
Under sustained heavy compute load, gfxclk stays at peak with no
throttling headroom. GPU temperature climbs rapidly and, on platforms
with aggressive thermal trip behaviour, the system enters an emergency
shutdown / reboot before OS-level thermal handlers can react.

Fix
---
Replace SetHardMinGfxClk with SetSoftMinGfxclk in the APU path. The
driver still requests peak performance, but the firmware throttler is
free to clamp gfxclk via SoftMax when thermal/PPT limits are reached.
SoftMax handling is unchanged. No other clock domains are affected.
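
To illustrate the difference, here is a toy model of the arbitration
order described above. It is not PMFW or driver code: the struct, the
arbitrate() helper and the priority order SoftMin < SoftMax < HardMin
are assumptions made for illustration only, derived from the behaviour
described in this cover letter.

```c
/* Toy model only: hypothetical names, not PMFW or amdgpu code. */
struct gfxclk_limits {
	unsigned int hard_min;	/* MHz, 0 = not requested */
	unsigned int soft_min;	/* MHz, 0 = not requested */
	unsigned int soft_max;	/* MHz, lowered by the throttler */
};

/*
 * Assumed priority, lowest to highest: SoftMin, SoftMax, HardMin.
 * Each higher-priority limit is applied after the lower ones.
 */
static unsigned int arbitrate(const struct gfxclk_limits *l)
{
	unsigned int clk = l->soft_min;

	if (clk > l->soft_max)
		clk = l->soft_max;	/* SoftMax caps a SoftMin request */
	if (clk < l->hard_min)
		clk = l->hard_min;	/* HardMin overrides SoftMax */

	return clk;
}
```

With a peak of 2900 MHz and the throttler lowering SoftMax to 1800 MHz
(both values made up), this model returns 2900 when HardMin is pinned
to peak and 1800 when the same request is made through SoftMin.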

Validation
----------
Tested on an SMU IP 14.0.x APU at force_performance_level=high with
sustained heavy compute. Throttling now engages under load, gfxclk
and temperature stay within safe operating limits, and the previously
observed thermal runaway / reboot no longer occurs. Light-load and
idle behaviour is unchanged.

Review feedback welcome.

Thanks,
Priya

Priya Hosur (1):
  drm/amd/pm: smu_v14_0_0: use SoftMin for gfxclk in
    set_soft_freq_limited_range

 drivers/gpu/drm/amd/pm/swsmu/smu14/smu_v14_0_0_ppt.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

-- 
2.43.0
