+ Hawking, Lijo to help review.

On Fri, Dec 5, 2025 at 3:19 AM Riana Tauro <[email protected]> wrote:
>
> This work is a continuation of the great work started by Aravind ([1] and [2])
> in order to fulfill the RAS requirements and proposal as previously discussed
> and agreed in the Linux Plumbers accelerators BoF of 2022 [3].
>
> [1]: https://lore.kernel.org/dri-devel/[email protected]/
> [2]: https://lore.kernel.org/all/[email protected]/
> [3]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
>
> During the past review round, Lukas pointed out that netlink had evolved
> in parallel during these years and that now, any new usage of netlink families
> would require the usage of the YAML description and scripts.
>
> With this new requirement in place, the family name is hardcoded in the YAML
> file, so we are forced to have a single family name for the entire DRM, and
> we are now forced to have a registration.
>
> So, while doing the registration, we created the concept of drm-ras-node.
> For now the only node type supported is the agreed error-counter. But that
> could be expanded for other cases like telemetry, requested by Zack for the
> Qualcomm accel driver.
>
> In this first version, only querying counters is supported. But this is also
> expandable to the future introduction of multicast notifications and of
> clearing the counters.
>
> This design with multiple nodes per device is already flexible enough for a
> driver to decide whether it wants to handle errors per device, per IP block,
> or per error category. I believe this fully addresses the AMD feedback from
> the earlier reviews.
>
> So, my proposal is to start simple with this case as is, and then iterate
> with drm-ras in tree so we evolve together according to the various drivers'
> RAS needs.
>
> I have provided documentation and the first Xe implementation of the counter
> as a reference.
>
> Also, it is worth mentioning that we have an in-tree pyynl/cli.py tool that
> fully exercises this new API, hence I hope this can be the reference code for
> the uAPI usage, while we continue with the plan of introducing IGT tests and
> tools for this, and of opening up the internal vendor tools to open source
> development and changing them to support these flows.
>
> Example:
>
> $ sudo ynl --family drm_ras  --dump list-nodes
> [{'device-name': '0000:03:00.0',
>   'node-id': 0,
>   'node-name': 'correctable-errors',
>   'node-type': 'error-counter'},
>  {'device-name': '0000:03:00.0',
>   'node-id': 1,
>   'node-name': 'nonfatal-errors',
>   'node-type': 'error-counter'},
>  {'device-name': '0000:03:00.0',
>   'node-id': 2,
>   'node-name': 'fatal-errors',
>   'node-type': 'error-counter'}]
>
> $ sudo ynl --family drm_ras  --dump get-error-counters --json '{"node-id":1}'
> [{'error-id': 1, 'error-name': 'Core Compute Error', 'error-value': 0},
>  {'error-id': 2, 'error-name': 'SOC Internal Error', 'error-value': 0}]
>
> $ sudo ynl --family drm_ras --do query-error-counter  --json '{"node-id":1, "error-id":1}'
> {'error-id': 1, 'error-name': 'Core Compute Error', 'error-value': 0}
>
> IGT : https://patchwork.freedesktop.org/patch/689729/?series=157409&rev=3
>
> Rev2: Fix review comments
>       Add support for GT and SOC errors
>
> Rev3: Add uAPI for errors and nodes
>       Update documentation
>
>
> Riana Tauro (3):
>   drm/xe/xe_drm_ras: Add support for drm ras
>   drm/xe/xe_hw_error: Add support for GT hardware errors
>   drm/xe/xe_hw_error: Add support for PVC SOC errors
>
> Rodrigo Vivi (1):
>   drm/ras: Introduce the DRM RAS infrastructure over generic netlink
>
>  Documentation/gpu/drm-ras.rst              | 109 +++++
>  Documentation/gpu/index.rst                |   1 +
>  Documentation/netlink/specs/drm_ras.yaml   | 130 ++++++
>  drivers/gpu/drm/Kconfig                    |   9 +
>  drivers/gpu/drm/Makefile                   |   1 +
>  drivers/gpu/drm/drm_drv.c                  |   6 +
>  drivers/gpu/drm/drm_ras.c                  | 351 ++++++++++++++++
>  drivers/gpu/drm/drm_ras_genl_family.c      |  42 ++
>  drivers/gpu/drm/drm_ras_nl.c               |  54 +++
>  drivers/gpu/drm/xe/Makefile                |   1 +
>  drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  68 ++++
>  drivers/gpu/drm/xe/xe_device_types.h       |   4 +
>  drivers/gpu/drm/xe/xe_drm_ras.c            | 199 +++++++++
>  drivers/gpu/drm/xe/xe_drm_ras.h            |  12 +
>  drivers/gpu/drm/xe/xe_drm_ras_types.h      |  40 ++
>  drivers/gpu/drm/xe/xe_hw_error.c           | 444 +++++++++++++++++++--
>  include/drm/drm_ras.h                      |  76 ++++
>  include/drm/drm_ras_genl_family.h          |  17 +
>  include/drm/drm_ras_nl.h                   |  24 ++
>  include/uapi/drm/drm_ras.h                 |  49 +++
>  include/uapi/drm/xe_drm.h                  |  82 ++++
>  21 files changed, 1682 insertions(+), 37 deletions(-)
>  create mode 100644 Documentation/gpu/drm-ras.rst
>  create mode 100644 Documentation/netlink/specs/drm_ras.yaml
>  create mode 100644 drivers/gpu/drm/drm_ras.c
>  create mode 100644 drivers/gpu/drm/drm_ras_genl_family.c
>  create mode 100644 drivers/gpu/drm/drm_ras_nl.c
>  create mode 100644 drivers/gpu/drm/xe/xe_drm_ras.c
>  create mode 100644 drivers/gpu/drm/xe/xe_drm_ras.h
>  create mode 100644 drivers/gpu/drm/xe/xe_drm_ras_types.h
>  create mode 100644 include/drm/drm_ras.h
>  create mode 100644 include/drm/drm_ras_genl_family.h
>  create mode 100644 include/drm/drm_ras_nl.h
>  create mode 100644 include/uapi/drm/drm_ras.h
>
> --
> 2.47.1
>
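
For reviewers who want to poke at the example output quoted above, here is a
minimal Python sketch of how a userspace script might group the list-nodes
reply per device and build the --json argument for the counter queries. It
does not call ynl or the kernel; the NODES data is copied from the example in
the cover letter, and the helper names nodes_by_device() and counter_request()
are hypothetical, not part of the series.

```python
import json
from collections import defaultdict

# Reply of `list-nodes`, copied from the example in the cover letter.
NODES = [
    {'device-name': '0000:03:00.0', 'node-id': 0,
     'node-name': 'correctable-errors', 'node-type': 'error-counter'},
    {'device-name': '0000:03:00.0', 'node-id': 1,
     'node-name': 'nonfatal-errors', 'node-type': 'error-counter'},
    {'device-name': '0000:03:00.0', 'node-id': 2,
     'node-name': 'fatal-errors', 'node-type': 'error-counter'},
]


def nodes_by_device(nodes, node_type='error-counter'):
    """Map device-name -> {node-name: node-id} for one node type.

    Hypothetical helper: it only reshapes the dictionaries shown in the
    example output and assumes nothing else about the uAPI.
    """
    grouped = defaultdict(dict)
    for node in nodes:
        if node['node-type'] == node_type:
            grouped[node['device-name']][node['node-name']] = node['node-id']
    return dict(grouped)


def counter_request(node_id, error_id=None):
    """Build the JSON string passed via --json in the examples.

    With only node_id it matches the get-error-counters example; adding
    error_id matches the query-error-counter example.
    """
    request = {'node-id': node_id}
    if error_id is not None:
        request['error-id'] = error_id
    return json.dumps(request)


if __name__ == '__main__':
    for device, nodes in nodes_by_device(NODES).items():
        for name, node_id in nodes.items():
            print(device, name, counter_request(node_id))
```

The node-id values are looked up by node-name rather than hardcoded, on the
assumption that ids are assigned at registration time and may differ between
devices or drivers.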
