pm: correct mem_busy_percent display due to calculation errors

Wang, Yang(Kevin) Thu, 26 Mar 2026 16:18:12 -0700

[AMD Official Use Only - AMD Internal Distribution Only]

>> What about returning -EBUSY or -EAGAIN if the value is negative?


However, this wrong value will still affect how 'pp_dpm_uclk' and 'gpu_metrics' 
show their results.
While the 'mem_busy_percent' sysfs node is fine as it is, the KMD still needs a 
special code path to handle it properly.
Also, for 'gpu_metrics', the driver must return some value to fill the buffer 
no matter what to avoid breaking node function.

For "activity" variables, we can safely assume they should only be between 0 
and 100.
Besides, PMFW uses a similar method internally, So, I think limiting the value 
to 0 is a reasonable fix.

Best Regards,
Kevin

-----Original Message-----
From: Alex Deucher <[email protected]>
Sent: Friday, March 27, 2026 5:33 AM
To: Lazar, Lijo <[email protected]>
Cc: Wang, Yang(Kevin) <[email protected]>; [email protected]; 
Deucher, Alexander <[email protected]>; Zhang, Hawking 
<[email protected]>; Feng, Kenneth <[email protected]>
Subject: Re: [PATCH] drm/amd/pm: correct mem_busy_percent display due to 
calculation errors

On Thu, Mar 26, 2026 at 12:37 AM Lazar, Lijo <[email protected]> wrote:
>
>
>
> On 26-Mar-26 7:34 AM, Yang Wang wrote:
> > PMFW may return invalid values due to internal calculation errors.
> > so, the kmd driver must validate and sanitize the returned values to
> > prevent issues caused by firmware calculation errors.
> >
> > For example, values 0xfffe (-2) and 0xffff (-1) are treated as
> > invalid and clamped to 0.
> >
>
> The problem with clamping is that the issue takes a different
> direction after that.
>
> Presently, the issue is reported as garbage values reported in activity.
> With clamping, the issue could get reported as 100% activity with
> light load or 0% activity with a heavy load. That will take the debug
> in a different direction.
>
> Instead, isn't it better to keep this as some errata and let user apps
> filter out garbage values? The previous or next sample could reflect
> the activity correctly. Later fix can be added to newer firmware, if
> that is an option.

What about returning -EBUSY or -EAGAIN if the value is negative?

Alex

>
> Thanks,
> Lijo
>
>
> > this applies to devices with CAB (Cache As Buffer) functionality.
> >
> > Closes: https://gitlab.freedesktop.org/drm/amd/-/work_items/4905
> >
> > Signed-off-by: Yang Wang <[email protected]>
> > ---
> >   drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h   | 17 +++++++++++++++++
> >   .../drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c    | 10 +++++-----
> >   .../drm/amd/pm/swsmu/smu13/smu_v13_0_7_ppt.c    | 10 +++++-----
> >   .../drm/amd/pm/swsmu/smu14/smu_v14_0_2_ppt.c    | 10 +++++-----
> >   4 files changed, 32 insertions(+), 15 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
> > b/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
> > index 609f5ab07d8a..365946c43e11 100644
> > --- a/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
> > +++ b/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
> > @@ -2164,4 +2164,21 @@ static inline void smu_feature_init(struct 
> > smu_context *smu, int feature_num)
> >       smu_feature_list_clear_all(smu, SMU_FEATURE_LIST_ALLOWED);
> >   }
> >
> > +/*
> > + * smu_safe_u16_nn - Make u16 safe by filtering negative overflow
> > +errors
> > + * @val: Input u16 value, may contain invalid negative overflows
> > + *
> > + * Convert u16 to non-negative value. Cast to s16 to detect
> > +negative values
> > + * caused by calculation errors. Return 0 for negative errors,
> > +return
> > + * original value if valid.
> > + *
> > + * Return: Valid u16 value or 0
> > + */
> > +static inline u16 smu_safe_u16_nn(u16 val) {
> > +    s16 tmp = (s16)val;
> > +
> > +    return tmp < 0 ? 0 : val;
> > +}
> > +
> >   #endif
> > diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c
> > b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c
> > index 9be7a2af560d..16f69b548ca4 100644
> > --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c
> > +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c
> > @@ -774,13 +774,13 @@ static int smu_v13_0_0_get_smu_metrics_data(struct 
> > smu_context *smu,
> >                       *value = metrics->AverageGfxclkFrequencyPreDs;
> >               break;
> >       case METRICS_AVERAGE_FCLK:
> > -             if (metrics->AverageUclkActivity <= SMU_13_0_0_BUSY_THRESHOLD)
> > +             if (smu_safe_u16_nn(metrics->AverageUclkActivity) <=
> > + SMU_13_0_0_BUSY_THRESHOLD)
> >                       *value = metrics->AverageFclkFrequencyPostDs;
> >               else
> >                       *value = metrics->AverageFclkFrequencyPreDs;
> >               break;
> >       case METRICS_AVERAGE_UCLK:
> > -             if (metrics->AverageUclkActivity <= SMU_13_0_0_BUSY_THRESHOLD)
> > +             if (smu_safe_u16_nn(metrics->AverageUclkActivity) <=
> > + SMU_13_0_0_BUSY_THRESHOLD)
> >                       *value = metrics->AverageMemclkFrequencyPostDs;
> >               else
> >                       *value = metrics->AverageMemclkFrequencyPreDs;
> > @@ -801,7 +801,7 @@ static int smu_v13_0_0_get_smu_metrics_data(struct 
> > smu_context *smu,
> >               *value = metrics->AverageGfxActivity;
> >               break;
> >       case METRICS_AVERAGE_MEMACTIVITY:
> > -             *value = metrics->AverageUclkActivity;
> > +             *value =
> > + smu_safe_u16_nn(metrics->AverageUclkActivity);
> >               break;
> >       case METRICS_AVERAGE_VCNACTIVITY:
> >               *value = max(metrics->Vcn0ActivityPercentage,
> > @@ -2086,7 +2086,7 @@ static ssize_t smu_v13_0_0_get_gpu_metrics(struct 
> > smu_context *smu,
> >
> > metrics->AvgTemperature[TEMP_VR_MEM1]);
> >
> >       gpu_metrics->average_gfx_activity = metrics->AverageGfxActivity;
> > -     gpu_metrics->average_umc_activity = metrics->AverageUclkActivity;
> > +     gpu_metrics->average_umc_activity =
> > + smu_safe_u16_nn(metrics->AverageUclkActivity);
> >       gpu_metrics->average_mm_activity = 
> > max(metrics->Vcn0ActivityPercentage,
> >
> > metrics->Vcn1ActivityPercentage);
> >
> > @@ -2103,7 +2103,7 @@ static ssize_t smu_v13_0_0_get_gpu_metrics(struct 
> > smu_context *smu,
> >       else
> >               gpu_metrics->average_gfxclk_frequency =
> > metrics->AverageGfxclkFrequencyPreDs;
> >
> > -     if (metrics->AverageUclkActivity <= SMU_13_0_0_BUSY_THRESHOLD)
> > +     if (smu_safe_u16_nn(metrics->AverageUclkActivity) <=
> > + SMU_13_0_0_BUSY_THRESHOLD)
> >               gpu_metrics->average_uclk_frequency = 
> > metrics->AverageMemclkFrequencyPostDs;
> >       else
> >               gpu_metrics->average_uclk_frequency =
> > metrics->AverageMemclkFrequencyPreDs;
> > diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_7_ppt.c
> > b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_7_ppt.c
> > index 5cc15545da6e..34a5973b9a06 100644
> > --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_7_ppt.c
> > +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_7_ppt.c
> > @@ -784,13 +784,13 @@ static int smu_v13_0_7_get_smu_metrics_data(struct 
> > smu_context *smu,
> >               *value = metrics->AverageGfxclkFrequencyPreDs;
> >               break;
> >       case METRICS_AVERAGE_FCLK:
> > -             if (metrics->AverageUclkActivity <= SMU_13_0_7_BUSY_THRESHOLD)
> > +             if (smu_safe_u16_nn(metrics->AverageUclkActivity) <=
> > + SMU_13_0_7_BUSY_THRESHOLD)
> >                       *value = metrics->AverageFclkFrequencyPostDs;
> >               else
> >                       *value = metrics->AverageFclkFrequencyPreDs;
> >               break;
> >       case METRICS_AVERAGE_UCLK:
> > -             if (metrics->AverageUclkActivity <= SMU_13_0_7_BUSY_THRESHOLD)
> > +             if (smu_safe_u16_nn(metrics->AverageUclkActivity) <=
> > + SMU_13_0_7_BUSY_THRESHOLD)
> >                       *value = metrics->AverageMemclkFrequencyPostDs;
> >               else
> >                       *value = metrics->AverageMemclkFrequencyPreDs;
> > @@ -815,7 +815,7 @@ static int smu_v13_0_7_get_smu_metrics_data(struct 
> > smu_context *smu,
> >               *value = metrics->AverageGfxActivity;
> >               break;
> >       case METRICS_AVERAGE_MEMACTIVITY:
> > -             *value = metrics->AverageUclkActivity;
> > +             *value =
> > + smu_safe_u16_nn(metrics->AverageUclkActivity);
> >               break;
> >       case METRICS_AVERAGE_SOCKETPOWER:
> >               *value = metrics->AverageSocketPower << 8; @@ -2092,7
> > +2092,7 @@ static ssize_t smu_v13_0_7_get_gpu_metrics(struct smu_context 
> > *smu,
> >
> > metrics->AvgTemperature[TEMP_VR_MEM1]);
> >
> >       gpu_metrics->average_gfx_activity = metrics->AverageGfxActivity;
> > -     gpu_metrics->average_umc_activity = metrics->AverageUclkActivity;
> > +     gpu_metrics->average_umc_activity =
> > + smu_safe_u16_nn(metrics->AverageUclkActivity);
> >       gpu_metrics->average_mm_activity = 
> > max(metrics->Vcn0ActivityPercentage,
> >
> > metrics->Vcn1ActivityPercentage);
> >
> > @@ -2105,7 +2105,7 @@ static ssize_t smu_v13_0_7_get_gpu_metrics(struct 
> > smu_context *smu,
> >       else
> >               gpu_metrics->average_gfxclk_frequency =
> > metrics->AverageGfxclkFrequencyPreDs;
> >
> > -     if (metrics->AverageUclkActivity <= SMU_13_0_7_BUSY_THRESHOLD)
> > +     if (smu_safe_u16_nn(metrics->AverageUclkActivity) <=
> > + SMU_13_0_7_BUSY_THRESHOLD)
> >               gpu_metrics->average_uclk_frequency = 
> > metrics->AverageMemclkFrequencyPostDs;
> >       else
> >               gpu_metrics->average_uclk_frequency =
> > metrics->AverageMemclkFrequencyPreDs;
> > diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu14/smu_v14_0_2_ppt.c
> > b/drivers/gpu/drm/amd/pm/swsmu/smu14/smu_v14_0_2_ppt.c
> > index 28c1b084fe62..aaec3a251e0f 100644
> > --- a/drivers/gpu/drm/amd/pm/swsmu/smu14/smu_v14_0_2_ppt.c
> > +++ b/drivers/gpu/drm/amd/pm/swsmu/smu14/smu_v14_0_2_ppt.c
> > @@ -660,13 +660,13 @@ static int smu_v14_0_2_get_smu_metrics_data(struct 
> > smu_context *smu,
> >                       *value = metrics->AverageGfxclkFrequencyPreDs;
> >               break;
> >       case METRICS_AVERAGE_FCLK:
> > -             if (metrics->AverageUclkActivity <= SMU_14_0_2_BUSY_THRESHOLD)
> > +             if (smu_safe_u16_nn(metrics->AverageUclkActivity) <=
> > + SMU_14_0_2_BUSY_THRESHOLD)
> >                       *value = metrics->AverageFclkFrequencyPostDs;
> >               else
> >                       *value = metrics->AverageFclkFrequencyPreDs;
> >               break;
> >       case METRICS_AVERAGE_UCLK:
> > -             if (metrics->AverageUclkActivity <= SMU_14_0_2_BUSY_THRESHOLD)
> > +             if (smu_safe_u16_nn(metrics->AverageUclkActivity) <=
> > + SMU_14_0_2_BUSY_THRESHOLD)
> >                       *value = metrics->AverageMemclkFrequencyPostDs;
> >               else
> >                       *value = metrics->AverageMemclkFrequencyPreDs;
> > @@ -687,7 +687,7 @@ static int smu_v14_0_2_get_smu_metrics_data(struct 
> > smu_context *smu,
> >               *value = metrics->AverageGfxActivity;
> >               break;
> >       case METRICS_AVERAGE_MEMACTIVITY:
> > -             *value = metrics->AverageUclkActivity;
> > +             *value =
> > + smu_safe_u16_nn(metrics->AverageUclkActivity);
> >               break;
> >       case METRICS_AVERAGE_VCNACTIVITY:
> >               *value = max(metrics->AverageVcn0ActivityPercentage,
> > @@ -2146,7 +2146,7 @@ static ssize_t smu_v14_0_2_get_gpu_metrics(struct 
> > smu_context *smu,
> >
> > metrics->AvgTemperature[TEMP_VR_MEM1]);
> >
> >       gpu_metrics->average_gfx_activity = metrics->AverageGfxActivity;
> > -     gpu_metrics->average_umc_activity = metrics->AverageUclkActivity;
> > +     gpu_metrics->average_umc_activity =
> > + smu_safe_u16_nn(metrics->AverageUclkActivity);
> >       gpu_metrics->average_mm_activity = 
> > max(metrics->AverageVcn0ActivityPercentage,
> >
> > metrics->Vcn1ActivityPercentage);
> >
> > @@ -2158,7 +2158,7 @@ static ssize_t smu_v14_0_2_get_gpu_metrics(struct 
> > smu_context *smu,
> >       else
> >               gpu_metrics->average_gfxclk_frequency =
> > metrics->AverageGfxclkFrequencyPreDs;
> >
> > -     if (metrics->AverageUclkActivity <= SMU_14_0_2_BUSY_THRESHOLD)
> > +     if (smu_safe_u16_nn(metrics->AverageUclkActivity) <=
> > + SMU_14_0_2_BUSY_THRESHOLD)
> >               gpu_metrics->average_uclk_frequency = 
> > metrics->AverageMemclkFrequencyPostDs;
> >       else
> >               gpu_metrics->average_uclk_frequency =
> > metrics->AverageMemclkFrequencyPreDs;
>

RE: [PATCH] drm/amd/pm: correct mem_busy_percent display due to calculation errors

Reply via email to