Reading GPU metrics has several constraints that the old implementation didn't honour:

- reads at offset > 0 should be served from the data sampled when offset == 0 
was read, so the result is coherent
- GPU metrics payloads larger than 4 KiB should be supported
- a binary attribute should be used
- ideally, the data for a given file handle should be invalidated 
periodically, so you can keep the handle open and seek back to 0 to capture 
fresh data (the invalidation itself should only happen on an offset == 0 read)

I'm not arguing one way or the other about whether this patch is correct or 
ideal, just explaining what it does and why.
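For what it's worth, the chunk-serving step the patch implements (the min of the requested count and the remaining snapshot bytes, with EOF once the offset passes the snapshot size) can be sketched and exercised in isolation. This is a stand-in, not the kernel code:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the patch's copy step: serve a read of `count` bytes at
 * `off` from an in-memory snapshot of `size` bytes.  Returns the bytes
 * copied, or 0 at EOF (off >= size) -- the point at which the patch
 * frees the per-reader entry.  Mirrors the kernel's
 * min_t(size_t, count, entry->size - off) clamp. */
static size_t snap_read(const unsigned char *snap, size_t size,
			unsigned char *buf, size_t off, size_t count)
{
	if (off >= size)
		return 0;	/* EOF */
	if (count > size - off)
		count = size - off;
	memcpy(buf, snap + off, count);
	return count;
}
```

Since kernfs splits reads at PAGE_SIZE boundaries, a >4 KiB snapshot is reassembled from several such calls at increasing offsets, all served from the same per-reader buffer.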

Tom


________________________________________
From: Koenig, Christian <[email protected]>
Sent: Friday, April 10, 2026 10:00
To: StDenis, Tom; [email protected]
Cc: Prosyak, Vitaly
Subject: Re: [PATCH] drm/amd/pm: Change gpu_metrics over to binary with 
per-reader snapshots (v4)

On 4/6/26 18:25, Tom St Denis wrote:
> The gpu_metrics sysfs file carries a binary blob but was implemented as
> a text device_attribute, which imposes a hard PAGE_SIZE-1 cap and makes
> it impossible for userspace to distinguish a successful short read from
> a truncated one (stat reports 4 KiB regardless of actual payload size).
>
> Convert gpu_metrics to a bin_attribute.  The declared file size is an
> upper bound (128 KiB); the real payload length is the byte count
> returned before EOF.
>
> To guarantee that multi-chunk reads (which kernfs splits at PAGE_SIZE
> boundaries) return a coherent snapshot, each reader (identified by its
> struct file pointer) gets its own cached metrics buffer via an xarray.
> When offset is 0, PMFW is sampled and stored in that reader's entry;
> subsequent offsets are served from the same snapshot.  The entry is
> freed on EOF.  A mutex serialises xarray mutations and PMFW access.
>
> Stale entries from readers that close without reaching EOF (e.g.,
> killed processes) are lazily evicted after 30 seconds whenever any
> new reader starts.
>
> This per-reader approach avoids the cross-contamination problem of
> a single shared cache: concurrent readers each see their own coherent
> PMFW snapshot rather than risking one reader's off=0 overwriting the
> buffer while another reader is mid-way through a multi-chunk read.
>
> V3: Per-reader snapshot state, so each open fd gets its own cached
>     gpu_metrics buffer and all chunks for that reader come from the
>     same sample (Vitaly)
>
> V4: Remove the residual PAGE_SIZE cap that v3 still carried over from
>     v1. The whole point of the bin_attribute conversion is to
>     lift the old PAGE_SIZE-1 limit, but v3 still had:
>       - WARN_ON_ONCE(len > PAGE_SIZE) with silent truncation to 4K
>       - kzalloc(PAGE_SIZE) for the snapshot buffer (fixed allocation)
>       - a separate memcpy from metrics into that fixed buffer
>     In v4 the snapshot buffer is allocated with kmemdup() sized to the
>     actual payload returned by amdgpu_dpm_get_gpu_metrics(), so payloads
>     larger than PAGE_SIZE work correctly.  The only remaining upper bound
>     is AMDGPU_GPU_METRICS_BIN_ATTR_MAX_SZ (128 KiB), which is the
>     declared bin_attribute file size and serves as a sanity check —
>     exceeding it now returns -EOVERFLOW instead of silently truncating.
>     (Tom)
>
> Signed-off-by: Tom St Denis <[email protected]>
> Acked-by: Vitaly Prosyak <[email protected]>
> Change-Id: Ib4fa233d9a25a396f1cc7d5fcf74f5f6578329f5
> ---
>  drivers/gpu/drm/amd/pm/amdgpu_pm.c      | 208 +++++++++++++++++++++---
>  drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h |   4 +
>  drivers/gpu/drm/amd/pm/inc/amdgpu_pm.h  |   1 -
>  3 files changed, 188 insertions(+), 25 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c 
> b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> index a4d8e667eafb..303a2d643b10 100644
> --- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> +++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> @@ -34,8 +34,19 @@
>  #include <linux/nospec.h>
>  #include <linux/pm_runtime.h>
>  #include <linux/string_choices.h>
> +#include <linux/sysfs.h>
> +#include <linux/sizes.h>
> +#include <linux/xarray.h>
>  #include <asm/processor.h>
>
> +/*
> + * Sysfs reports this as the file size (stat/ls); kernfs also uses it to cap
> + * read offsets.  Actual payload length is the return value of
> + * amdgpu_dpm_get_gpu_metrics() and must not exceed this.
> + */
> +#define AMDGPU_GPU_METRICS_BIN_ATTR_MAX_SZ   SZ_128K
> +
> +
>  #define MAX_NUM_OF_FEATURES_PER_SUBSET               8
>  #define MAX_NUM_OF_SUBSETS                   8
>
> @@ -1734,43 +1745,170 @@ static ssize_t amdgpu_get_pm_metrics(struct device 
> *dev,
>   * DOC: gpu_metrics
>   *
>   * The amdgpu driver provides a sysfs API for retrieving current gpu
> - * metrics data. The file gpu_metrics is used for this. Reading the
> - * file will dump all the current gpu metrics data.
> + * metrics data.  The binary sysfs file gpu_metrics is used for this.
> + * Reading the file returns the raw metrics blob.  The sysfs file size
> + * is an upper bound for inode metadata; the real length is the amount
> + * returned before EOF.
> + *
> + * Metrics are sampled atomically per reader: the first read at offset 0
> + * captures a snapshot into a per-fd cache; subsequent reads at higher
> + * offsets (for payloads that span multiple pages) are served from that
> + * same snapshot.  Concurrent readers each get their own cache.
>   *
>   * These data include temperature, frequency, engines utilization,
>   * power consume, throttler status, fan speed and cpu core statistics(
>   * available for APU only). That's it will give a snapshot of all sensors
>   * at the same time.
>   */
> -static ssize_t amdgpu_get_gpu_metrics(struct device *dev,
> -                                   struct device_attribute *attr,
> -                                   char *buf)
> +
> +/* Per-reader snapshot entry, keyed by struct file pointer in the xarray */
> +struct gpu_metrics_snap_entry {
> +     void    *data;
> +     size_t  size;
> +     ktime_t timestamp;
> +};
> +
> +/* Evict stale entries from readers that closed without reaching EOF */
> +#define GPU_METRICS_SNAP_STALE_NS    (30ULL * NSEC_PER_SEC)
> +
> +static void gpu_metrics_evict_stale_locked(struct xarray *xa,
> +                                        unsigned long skip_key)
> +{
> +     struct gpu_metrics_snap_entry *entry;
> +     unsigned long idx;
> +     ktime_t cutoff;
> +
> +     cutoff = ktime_sub(ktime_get(), ns_to_ktime(GPU_METRICS_SNAP_STALE_NS));
> +
> +     xa_for_each(xa, idx, entry) {
> +             if (idx != skip_key && ktime_before(entry->timestamp, cutoff)) {
> +                     xa_erase(xa, idx);
> +                     kfree(entry->data);
> +                     kfree(entry);
> +             }
> +     }
> +}
> +
> +static void gpu_metrics_free_all(struct xarray *xa)
>  {
> +     struct gpu_metrics_snap_entry *entry;
> +     unsigned long idx;
> +
> +     xa_for_each(xa, idx, entry) {
> +             xa_erase(xa, idx);
> +             kfree(entry->data);
> +             kfree(entry);
> +     }
> +     xa_destroy(xa);
> +}
> +
> +static bool amdgpu_pm_gpu_metrics_bin_visible(struct amdgpu_device *adev,
> +                                           uint32_t mask)
> +{
> +     uint32_t gc_ver = amdgpu_ip_version(adev, GC_HWIP, 0);
> +
> +     if (!((ATTR_FLAG_BASIC | ATTR_FLAG_ONEVF) & mask))
> +             return false;
> +
> +     return gc_ver >= IP_VERSION(9, 1, 0);
> +}
> +
> +static ssize_t amdgpu_sysfs_gpu_metrics_read(struct file *f,
> +                                           struct kobject *kobj,
> +                                           const struct bin_attribute *attr,
> +                                           char *buf, loff_t off,
> +                                           size_t count)
> +{
> +     struct device *dev = kobj_to_dev(kobj);
>       struct drm_device *ddev = dev_get_drvdata(dev);
>       struct amdgpu_device *adev = drm_to_adev(ddev);
> -     void *gpu_metrics;
> -     ssize_t size = 0;
> -     int ret;
> +     unsigned long key = (unsigned long)f;
> +     struct gpu_metrics_snap_entry *entry;
> +     ssize_t ret;
>
> -     ret = amdgpu_pm_get_access_if_active(adev);
> -     if (ret)
> -             return ret;
> +     mutex_lock(&adev->pm.gpu_metrics_lock);
>
> -     size = amdgpu_dpm_get_gpu_metrics(adev, &gpu_metrics);
> -     if (size <= 0)
> -             goto out;
> +     if (off == 0) {
> +             void *metrics;
> +             void *old;
> +             int len;
>
> -     if (size >= PAGE_SIZE)
> -             size = PAGE_SIZE - 1;
> +             /* Evict stale entries from readers that never hit EOF */
> +             gpu_metrics_evict_stale_locked(&adev->pm.gpu_metrics_readers,
> +                                            key);
>
> -     memcpy(buf, gpu_metrics, size);
> +             ret = amdgpu_pm_get_access(adev);
> +             if (ret)
> +                     goto out_unlock;
>
> -out:
> -     amdgpu_pm_put_access(adev);
> +             len = amdgpu_dpm_get_gpu_metrics(adev, &metrics);
> +             amdgpu_pm_put_access(adev);
>
> -     return size;
> +             if (len <= 0) {
> +                     ret = len;
> +                     goto out_unlock;
> +             }
> +
> +             if (WARN_ON_ONCE(len > AMDGPU_GPU_METRICS_BIN_ATTR_MAX_SZ)) {
> +                     ret = -EOVERFLOW;
> +                     goto out_unlock;
> +             }
> +
> +             entry = xa_load(&adev->pm.gpu_metrics_readers, key);
> +             if (!entry) {
> +                     entry = kzalloc(sizeof(*entry), GFP_KERNEL);
> +                     if (!entry) {
> +                             ret = -ENOMEM;
> +                             goto out_unlock;
> +                     }
> +                     old = xa_store(&adev->pm.gpu_metrics_readers, key,
> +                                    entry, GFP_KERNEL);
> +                     if (xa_is_err(old)) {
> +                             kfree(entry);
> +                             ret = xa_err(old);
> +                             goto out_unlock;
> +                     }
> +             }
> +
> +             /* (Re-)allocate snapshot buffer sized to actual payload */
> +             kfree(entry->data);
> +             entry->data = kmemdup(metrics, len, GFP_KERNEL);
> +             if (!entry->data) {
> +                     xa_erase(&adev->pm.gpu_metrics_readers, key);
> +                     kfree(entry);
> +                     ret = -ENOMEM;
> +                     goto out_unlock;
> +             }
> +             entry->size = len;
> +             entry->timestamp = ktime_get();
> +     } else {
> +             entry = xa_load(&adev->pm.gpu_metrics_readers, key);
> +             if (!entry) {
> +                     ret = -EIO;
> +                     goto out_unlock;
> +             }
> +     }
> +
> +     if (off >= entry->size) {
> +             xa_erase(&adev->pm.gpu_metrics_readers, key);
> +             kfree(entry->data);
> +             kfree(entry);
> +             ret = 0;
> +             goto out_unlock;
> +     }
> +
> +     count = min_t(size_t, count, entry->size - off);
> +     memcpy(buf, (u8 *)entry->data + off, count);
> +     ret = count;
> +
> +out_unlock:
> +     mutex_unlock(&adev->pm.gpu_metrics_lock);
> +     return ret;
>  }
>
> +static const BIN_ATTR(gpu_metrics, 0444, amdgpu_sysfs_gpu_metrics_read, NULL,
> +                   AMDGPU_GPU_METRICS_BIN_ATTR_MAX_SZ);
> +
>  static int amdgpu_show_powershift_percent(struct device *dev,
>                                       char *buf, enum amd_pp_sensors sensor)
>  {
> @@ -2579,7 +2717,6 @@ static struct amdgpu_device_attr amdgpu_device_attrs[] 
> = {
>       AMDGPU_DEVICE_ATTR_RO(unique_id,                                
> ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
>       AMDGPU_DEVICE_ATTR_RW(thermal_throttling_logging,               
> ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
>       AMDGPU_DEVICE_ATTR_RW(apu_thermal_cap,                          
> ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
> -     AMDGPU_DEVICE_ATTR_RO(gpu_metrics,                              
> ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
>       AMDGPU_DEVICE_ATTR_RO(smartshift_apu_power,                     
> ATTR_FLAG_BASIC,
>                             .attr_update = ss_power_attr_update),
>       AMDGPU_DEVICE_ATTR_RO(smartshift_dgpu_power,                    
> ATTR_FLAG_BASIC,
> @@ -2657,9 +2794,6 @@ static int default_attr_update(struct amdgpu_device 
> *adev, struct amdgpu_device_
>                    gc_ver != IP_VERSION(9, 4, 3)) ||
>                   gc_ver < IP_VERSION(9, 0, 0))
>                       *states = ATTR_STATE_UNSUPPORTED;
> -     } else if (DEVICE_ATTR_IS(gpu_metrics)) {
> -             if (gc_ver < IP_VERSION(9, 1, 0))
> -                     *states = ATTR_STATE_UNSUPPORTED;
>       } else if (DEVICE_ATTR_IS(pp_power_profile_mode)) {
>               if (amdgpu_dpm_get_power_profile_mode(adev, NULL) == 
> -EOPNOTSUPP)
>                       *states = ATTR_STATE_UNSUPPORTED;
> @@ -4755,6 +4889,19 @@ int amdgpu_pm_sysfs_init(struct amdgpu_device *adev)
>       if (ret)
>               goto err_out0;
>
> +     if (amdgpu_pm_gpu_metrics_bin_visible(adev, mask)) {
> +             mutex_init(&adev->pm.gpu_metrics_lock);
> +             xa_init(&adev->pm.gpu_metrics_readers);
> +             ret = sysfs_create_bin_file(&adev->dev->kobj,
> +                                         &bin_attr_gpu_metrics);
> +             if (ret) {
> +                     xa_destroy(&adev->pm.gpu_metrics_readers);
> +                     mutex_destroy(&adev->pm.gpu_metrics_lock);
> +                     goto err_out1;
> +             }
> +             adev->pm.gpu_metrics_bin_registered = true;
> +     }
> +
>       if (amdgpu_dpm_is_overdrive_supported(adev)) {
>               ret = amdgpu_od_set_init(adev);
>               if (ret)
> @@ -4806,6 +4953,12 @@ int amdgpu_pm_sysfs_init(struct amdgpu_device *adev)
>       return 0;
>
>  err_out1:
> +     if (adev->pm.gpu_metrics_bin_registered) {
> +             sysfs_remove_bin_file(&adev->dev->kobj, &bin_attr_gpu_metrics);
> +             gpu_metrics_free_all(&adev->pm.gpu_metrics_readers);
> +             mutex_destroy(&adev->pm.gpu_metrics_lock);
> +             adev->pm.gpu_metrics_bin_registered = false;
> +     }
>       amdgpu_device_attr_remove_groups(adev, &adev->pm.pm_attr_list);
>  err_out0:
>       if (adev->pm.int_hwmon_dev)
> @@ -4821,6 +4974,13 @@ void amdgpu_pm_sysfs_fini(struct amdgpu_device *adev)
>       if (adev->pm.int_hwmon_dev)
>               hwmon_device_unregister(adev->pm.int_hwmon_dev);
>
> +     if (adev->pm.gpu_metrics_bin_registered) {
> +             sysfs_remove_bin_file(&adev->dev->kobj, &bin_attr_gpu_metrics);
> +             gpu_metrics_free_all(&adev->pm.gpu_metrics_readers);
> +             mutex_destroy(&adev->pm.gpu_metrics_lock);
> +             adev->pm.gpu_metrics_bin_registered = false;
> +     }
> +
>       amdgpu_device_attr_remove_groups(adev, &adev->pm.pm_attr_list);
>  }
>
> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h 
> b/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h
> index aa3f427819a0..3677a4f543fb 100644
> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h
> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h
> @@ -349,6 +349,10 @@ struct amdgpu_pm {
>       /* dpm */
>       bool                    dpm_enabled;
>       bool                    sysfs_initialized;
> +     bool                    gpu_metrics_bin_registered;
> +     struct mutex            gpu_metrics_lock;
> +     /* per-reader snapshot entries, keyed by (unsigned long)struct file * */
> +     struct xarray           gpu_metrics_readers;

Pretty clear NAK on that approach. What the heck are you doing here?

Regards,
Christian.

>       struct amdgpu_dpm       dpm;
>       const struct firmware   *fw;    /* SMC firmware */
>       uint32_t                fw_version;
> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_pm.h 
> b/drivers/gpu/drm/amd/pm/inc/amdgpu_pm.h
> index c12ced32f780..dc6875871f1d 100644
> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_pm.h
> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_pm.h
> @@ -73,7 +73,6 @@ enum amdgpu_device_attr_id {
>       device_attr_id__unique_id,
>       device_attr_id__thermal_throttling_logging,
>       device_attr_id__apu_thermal_cap,
> -     device_attr_id__gpu_metrics,
>       device_attr_id__smartshift_apu_power,
>       device_attr_id__smartshift_dgpu_power,
>       device_attr_id__smartshift_bias,
