On 1/9/2026 9:28 PM, Raag Jadav wrote:
On Fri, Jan 09, 2026 at 09:13:31AM -0500, Rodrigo Vivi wrote:
On Fri, Jan 09, 2026 at 01:38:44PM +0530, Riana Tauro wrote:
Hi Raag
Thank you for the review
On 12/9/2025 1:52 PM, Raag Jadav wrote:
On Fri, Dec 05, 2025 at 02:09:34PM +0530, Riana Tauro wrote:
Allocate correctable, nonfatal and fatal nodes per xe device.
Each node contains error classes, counters and respective
query counter functions.
Add basic functionality to create and register drm nodes.
The operations below can be performed using the generic netlink DRM RAS interface
...
Query Error counter:
$ sudo ynl --family drm_ras --do query-error-counter --json '{"node-id":1, "error-id":1}'
{'error-id': 1, 'error-name': 'Core Compute Error', 'error-value': 0}
One more (sorry): So this means graphics will be a different id? Or do they
overlap? How does it work?
Did not get this question.
This gives the impression that it's specific to the compute engine, so I was
hoping for something more generic like "execution unit" or simply "core",
but I couldn't come up with anything better than this, so up to you.
Perhaps just GT. Let me check
Also,
[*] I'm not well informed about the history here, but the 'error' term
seems slapped onto almost everything. We already know it's RAS, so perhaps
we add it only where it makes sense and try to simplify some of the naming?
...
+/**
+ * enum drm_xe_ras_error_class - Supported drm ras error classes.
+ */
+enum drm_xe_ras_error_class {
+ /** @DRM_XE_RAS_ERROR_CORE_COMPUTE: GT and Media Error */
+ DRM_XE_RAS_ERROR_CORE_COMPUTE = 1,
+ /** @DRM_XE_RAS_ERROR_SOC_INTERNAL: SOC Error */
+ DRM_XE_RAS_ERROR_SOC_INTERNAL,
+ /** @DRM_XE_RAS_ERROR_CLASS_MAX: Max Error */
+ DRM_XE_RAS_ERROR_CLASS_MAX, /* non-ABI */
+};
Also, all of the enums share the same DRM_XE_RAS_ERROR_* prefix, so let's try
to have distinguishable naming. Perhaps [*] would be useful here as well ;)
DRM_XE_RAS_ERROR_SEVERITY_* will cause longer names. Any suggestions?
Already mentioned above[*], the key is to not overuse 'error' ;)
DRM_XE_RAS_SEVERITY_*
DRM_XE_RAS_COMPONENT_*
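For illustration only, the two prefixes above might be spelled out roughly as
below. This is a sketch, not the posted patch: the severity members are guessed
from the "correctable, nonfatal and fatal" nodes in the commit message, their
numeric values are assumed, the component members simply mirror the existing
error-class enum, and xe_ras_component_name() with its strings is a
hypothetical helper added only to show how the shorter names read at a call
site.

```c
/* Hypothetical naming sketch; identifiers and values are assumptions. */
enum drm_xe_ras_severity {
	DRM_XE_RAS_SEVERITY_CORRECTABLE = 0,	/* starting value assumed */
	DRM_XE_RAS_SEVERITY_NONFATAL,
	DRM_XE_RAS_SEVERITY_FATAL,
	DRM_XE_RAS_SEVERITY_MAX,		/* non-ABI */
};

enum drm_xe_ras_component {
	/* Mirrors DRM_XE_RAS_ERROR_CORE_COMPUTE = 1 from the quoted patch */
	DRM_XE_RAS_COMPONENT_CORE_COMPUTE = 1,
	DRM_XE_RAS_COMPONENT_SOC_INTERNAL,
	DRM_XE_RAS_COMPONENT_MAX,		/* non-ABI */
};

/* Hypothetical helper: maps a component to a display name. */
static const char *xe_ras_component_name(enum drm_xe_ras_component c)
{
	switch (c) {
	case DRM_XE_RAS_COMPONENT_CORE_COMPUTE:
		return "Core Compute";	/* string assumed, 'Error' suffix dropped */
	case DRM_XE_RAS_COMPONENT_SOC_INTERNAL:
		return "SoC Internal";	/* string assumed */
	default:
		return "Unknown";
	}
}
```

With this spelling the longest identifier is 33 characters, so a doc comment
plus the enumerator should still fit comfortably within the usual line limits.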
There's been an interest expressed to add telemetry nodes as well.
https://patchwork.freedesktop.org/patch/666138/?series=118435&rev=5
I have kept the prefix (DRM_XE_RAS_ERROR) consistent with the first
patch (type - ERROR_COUNTER) for alignment.
From my perspective retaining the prefix ERROR would be beneficial to
differentiate if there are different types.
Can you please have a look at the link and let me know if you still
think the same?
For differentiation, I will add SEVERITY and CLASS/COMPONENT.
Thanks
Riana
and so on ...
Try this full version first and see what the outcome looks like...
If we are still respecting the line limits without ugly cuts, then let's go
with it.
Otherwise try something shorter, ERR_SEV_ ... or something like that...
... which can be further shortened with this idea.
Side note: I'm already using these on my local branch.
Raag