+ Hawking, Lijo to help review.
On Fri, Dec 5, 2025 at 3:19 AM Riana Tauro <[email protected]> wrote:
>
> This work is a continuation of the great work started by Aravind ([1] and
> [2]) in order to fulfill the RAS requirements and proposal as previously
> discussed and agreed in the Linux Plumbers accelerators BoF of 2022 [3].
>
> [1]: https://lore.kernel.org/dri-devel/[email protected]/
> [2]: https://lore.kernel.org/all/[email protected]/
> [3]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
>
> During the past review round, Lukas pointed out that netlink had evolved
> in parallel during these years and that any new netlink family now
> requires the YAML description and scripts.
>
> With this new requirement in place, the family name is hardcoded in the
> YAML file, so we are forced to have a single family name for the entire
> drm, and that in turn forces us to have a registration.
>
> So, while doing the registration, we created the concept of drm-ras-node.
> For now the only node type supported is the agreed error-counter. But that
> could be expanded for other cases like telemetry, requested by Zack for
> the Qualcomm accel driver.
>
> In this first version, only querying counters is supported. But this is
> also expandable to the future introduction of multicast notifications and
> of clearing the counters.
>
> This design with multiple nodes per device is already flexible enough for
> a driver to decide whether it wants to handle errors per device, per IP
> block, or per error category. I believe this fully addresses the AMD
> feedback from the earlier reviews.
>
> So, my proposal is to start simple with this case as is, and then iterate
> with drm-ras in tree so we evolve together according to the various
> drivers' RAS needs.
>
> I have provided documentation and the first Xe implementation of the
> counter as a reference.
>
> Also, it is worth mentioning that we have an in-tree pyynl/cli.py tool
> that fully exercises this new API, hence I hope this can be the reference
> code for the uAPI usage, while we continue with the plan of introducing
> IGT tests and tools for this, aligning the internal vendor tools with
> open source developments, and changing them to support these flows.
>
> Example:
>
> $ sudo ynl --family drm_ras --dump list-nodes
> [{'device-name': '0000:03:00.0',
>   'node-id': 0,
>   'node-name': 'correctable-errors',
>   'node-type': 'error-counter'},
>  {'device-name': '0000:03:00.0',
>   'node-id': 1,
>   'node-name': 'nonfatal-errors',
>   'node-type': 'error-counter'},
>  {'device-name': '0000:03:00.0',
>   'node-id': 2,
>   'node-name': 'fatal-errors',
>   'node-type': 'error-counter'}]
>
> $ sudo ynl --family drm_ras --dump get-error-counters --json '{"node-id":1}'
> [{'error-id': 1, 'error-name': 'Core Compute Error', 'error-value': 0},
>  {'error-id': 2, 'error-name': 'SOC Internal Error', 'error-value': 0}]
>
> $ sudo ynl --family drm_ras --do query-error-counter --json '{"node-id":1, "error-id":1}'
> {'error-id': 1, 'error-name': 'Core Compute Error', 'error-value': 0}
>
> IGT : https://patchwork.freedesktop.org/patch/689729/?series=157409&rev=3
>
> Rev2: Fix review comments
>       Add support for GT and SOC errors
>
> Rev3: Add uAPI for errors and nodes
>       Update documentation
>
>
> Riana Tauro (3):
>   drm/xe/xe_drm_ras: Add support for drm ras
>   drm/xe/xe_hw_error: Add support for GT hardware errors
>   drm/xe/xe_hw_error: Add support for PVC SOC errors
>
> Rodrigo Vivi (1):
>   drm/ras: Introduce the DRM RAS infrastructure over generic netlink
>
>  Documentation/gpu/drm-ras.rst              | 109 +++++
>  Documentation/gpu/index.rst                |   1 +
>  Documentation/netlink/specs/drm_ras.yaml   | 130 ++++++
>  drivers/gpu/drm/Kconfig                    |   9 +
>  drivers/gpu/drm/Makefile                   |   1 +
>  drivers/gpu/drm/drm_drv.c                  |   6 +
>  drivers/gpu/drm/drm_ras.c                  | 351 ++++++++++++++++
>  drivers/gpu/drm/drm_ras_genl_family.c      |  42 ++
>  drivers/gpu/drm/drm_ras_nl.c               |  54 +++
>  drivers/gpu/drm/xe/Makefile                |   1 +
>  drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  68 ++++
>  drivers/gpu/drm/xe/xe_device_types.h       |   4 +
>  drivers/gpu/drm/xe/xe_drm_ras.c            | 199 +++++++++
>  drivers/gpu/drm/xe/xe_drm_ras.h            |  12 +
>  drivers/gpu/drm/xe/xe_drm_ras_types.h      |  40 ++
>  drivers/gpu/drm/xe/xe_hw_error.c           | 444 +++++++++++++++++++--
>  include/drm/drm_ras.h                      |  76 ++++
>  include/drm/drm_ras_genl_family.h          |  17 +
>  include/drm/drm_ras_nl.h                   |  24 ++
>  include/uapi/drm/drm_ras.h                 |  49 +++
>  include/uapi/drm/xe_drm.h                  |  82 ++++
>  21 files changed, 1682 insertions(+), 37 deletions(-)
>  create mode 100644 Documentation/gpu/drm-ras.rst
>  create mode 100644 Documentation/netlink/specs/drm_ras.yaml
>  create mode 100644 drivers/gpu/drm/drm_ras.c
>  create mode 100644 drivers/gpu/drm/drm_ras_genl_family.c
>  create mode 100644 drivers/gpu/drm/drm_ras_nl.c
>  create mode 100644 drivers/gpu/drm/xe/xe_drm_ras.c
>  create mode 100644 drivers/gpu/drm/xe/xe_drm_ras.h
>  create mode 100644 drivers/gpu/drm/xe/xe_drm_ras_types.h
>  create mode 100644 include/drm/drm_ras.h
>  create mode 100644 include/drm/drm_ras_genl_family.h
>  create mode 100644 include/drm/drm_ras_nl.h
>  create mode 100644 include/uapi/drm/drm_ras.h
>
> --
> 2.47.1
>
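
For anyone reviewing without the series applied, the node and counter model that the quoted ynl example exercises can be sketched roughly as below. This is only an illustrative in-memory model in Python, not the kernel implementation: RasRegistry, register_node and the other class and method names are hypothetical, sequential node ids are an assumption, and giving all three nodes the same two counters is also an assumption (the cover letter only shows the counters of node 1). The attribute names and values are taken from the example output.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ErrorCounter:
    error_id: int
    name: str
    value: int = 0


@dataclass
class RasNode:
    node_id: int
    device_name: str
    node_name: str
    node_type: str = "error-counter"  # only node type in this first version
    counters: Dict[int, ErrorCounter] = field(default_factory=dict)


class RasRegistry:
    """Hypothetical stand-in for the single drm_ras family's node registry."""

    def __init__(self) -> None:
        self._nodes: Dict[int, RasNode] = {}

    def register_node(self, device_name: str, node_name: str,
                      errors: List[Tuple[int, str]]) -> int:
        # Assumption of this sketch: node ids are handed out sequentially.
        node = RasNode(len(self._nodes), device_name, node_name)
        for error_id, name in errors:
            node.counters[error_id] = ErrorCounter(error_id, name)
        self._nodes[node.node_id] = node
        return node.node_id

    @staticmethod
    def _as_attrs(c: ErrorCounter) -> dict:
        return {"error-id": c.error_id, "error-name": c.name,
                "error-value": c.value}

    def list_nodes(self) -> List[dict]:
        # Mirrors the shape of "--dump list-nodes".
        return [{"device-name": n.device_name, "node-id": n.node_id,
                 "node-name": n.node_name, "node-type": n.node_type}
                for n in self._nodes.values()]

    def get_error_counters(self, node_id: int) -> List[dict]:
        # Mirrors "--dump get-error-counters": every counter of one node.
        node = self._nodes[node_id]
        return [self._as_attrs(node.counters[k]) for k in sorted(node.counters)]

    def query_error_counter(self, node_id: int, error_id: int) -> dict:
        # Mirrors "--do query-error-counter": a single counter of one node.
        return self._as_attrs(self._nodes[node_id].counters[error_id])


reg = RasRegistry()
errors = [(1, "Core Compute Error"), (2, "SOC Internal Error")]
for name in ("correctable-errors", "nonfatal-errors", "fatal-errors"):
    reg.register_node("0000:03:00.0", name, errors)

print(reg.list_nodes())
print(reg.get_error_counters(1))
print(reg.query_error_counter(1, 1))
```

A driver that prefers per-IP-block or per-category reporting would simply register more nodes against the same device name; nothing else in the model changes.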
