This series builds on top of "Introduce Xe Uncorrectable Error
Handling" [1] and adds support for handling errors that require a
complete device power cycle (cold reset) to recover.

Certain error conditions leave the device in a persistent hardware
error state that cannot be cleared through existing recovery mechanisms
such as driver reload or PCIe reset. In these cases, functionality can
only be restored by performing a cold reset.

To support this, the series introduces a new DRM wedging recovery
method, DRM_WEDGE_RECOVERY_COLD_RESET (BIT(4)). When a device is wedged
with this method, the DRM core notifies userspace via a uevent that a
cold reset is required. This allows userspace to take appropriate
action to power-cycle the device.

Example uevent received:

    SUBSYSTEM=drm
    WEDGED=cold-reset
    DEVPATH=/devices/.../drm/card0

A detailed description is in the commit messages.

[1] https://patchwork.freedesktop.org/series/160482/

This patch series introduces a call to punit_error_handler() from
within handle_soc_internal_errors() when PUNIT errors are detected.

v2:
- Add use case: Handling errors from power management unit, which
  requires a complete power cycle to recover. (Christian)
- Add several instead of number to avoid update. (Jani)

v3:
- Update any scenario that requires cold-reset. (Riana)
- Update document with generic scenario. (Riana)
- Consistent with terminology. (Raag)
- Remove already covered information.
- Use PUNIT instead of PMU. (Riana)
- Use consistent wording.
- Remove log. (Raag)

v4:
- Rename cold reset to power cycle. (Raag)
- Update doc. (Raag/Riana)
- Change commit message. (Raag)
- Make function static. (Raag)

v5:
- Make it consistent with consumer expectations. (Raag)
- Update commit message.
- Remove unbind.
- Simplify cold-reset script.
- Remove kdoc for static function.
- Remove xe_ prefix for static function.
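
For illustration only, the userspace side of the uevent flow described
above could be sketched as follows. This is a minimal sketch and not
part of the series: the helper names (parse_uevent, wedge_methods,
needs_cold_reset) are hypothetical, and it assumes that WEDGED may
carry a comma-separated list of recovery methods, as existing wedged
uevents do. How the payload is received (udev rule, netlink socket)
and how the power cycle is actually performed are platform specific
and left out.

```python
# Hypothetical userspace helpers; none of these names come from the series.

def parse_uevent(payload: str) -> dict:
    """Parse KEY=VALUE uevent properties, one property per line."""
    props = {}
    for line in payload.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key:
            props[key] = value
    return props


def wedge_methods(props: dict) -> list:
    """Return the recovery methods listed in WEDGED for a drm uevent.

    Assumption: WEDGED holds one method or a comma-separated list.
    """
    if props.get("SUBSYSTEM") != "drm":
        return []
    return [m for m in props.get("WEDGED", "").split(",") if m]


def needs_cold_reset(props: dict) -> bool:
    """True when the device asks for a cold reset (power cycle)."""
    return "cold-reset" in wedge_methods(props)


# Payload copied from the example uevent in this cover letter.
EXAMPLE = """SUBSYSTEM=drm
WEDGED=cold-reset
DEVPATH=/devices/.../drm/card0"""

if __name__ == "__main__":
    props = parse_uevent(EXAMPLE)
    if needs_cold_reset(props):
        # A real consumer would trigger a platform-specific power cycle here.
        print(f"{props['DEVPATH']}: cold reset required")
```

A consumer would typically call needs_cold_reset() on each received
drm uevent and fall back to its existing handling for other WEDGED
values.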

Cc: André Almeida <[email protected]>
Cc: Christian König <[email protected]>
Cc: David Airlie <[email protected]>
Cc: Simona Vetter <[email protected]>
Cc: Maxime Ripard <[email protected]>
Cc: Maarten Lankhorst <[email protected]>
Cc: Thomas Zimmermann <[email protected]>

Mallesh Koujalagi (4):
  drm: Add DRM_WEDGE_RECOVERY_COLD_RESET recovery method
  drm/doc: Document DRM_WEDGE_RECOVERY_COLD_RESET recovery method
  drm/xe: Handle PUNIT errors by requesting cold-reset recovery
  drm/xe: Suppress Surprise Link Down on non-hotplug device

Riana Tauro (1):
  Introduce Xe Uncorrectable Error Handling

 Documentation/gpu/drm-uapi.rst                |  64 +-
 drivers/gpu/drm/drm_drv.c                     |   2 +
 drivers/gpu/drm/xe/Makefile                   |   1 +
 drivers/gpu/drm/xe/xe_device.c                |  19 +-
 drivers/gpu/drm/xe/xe_device.h                |  15 +
 drivers/gpu/drm/xe/xe_device_types.h          |   6 +
 drivers/gpu/drm/xe/xe_gt.c                    |  14 +-
 drivers/gpu/drm/xe/xe_guc_submit.c            |   9 +-
 drivers/gpu/drm/xe/xe_pci.c                   |  10 +
 drivers/gpu/drm/xe/xe_pci_error.c             | 138 +++++
 drivers/gpu/drm/xe/xe_ras.c                   | 552 ++++++++++++++++++
 drivers/gpu/drm/xe/xe_ras.h                   |   5 +-
 drivers/gpu/drm/xe/xe_ras_types.h             | 215 +++++++
 drivers/gpu/drm/xe/xe_survivability_mode.c    |  13 +-
 drivers/gpu/drm/xe/xe_sysctrl_event.c         |   2 +-
 drivers/gpu/drm/xe/xe_sysctrl_event_types.h   |   2 +-
 drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h |  11 +
 include/drm/drm_device.h                      |   1 +
 18 files changed, 1058 insertions(+), 21 deletions(-)
 create mode 100644 drivers/gpu/drm/xe/xe_pci_error.c

--
2.34.1
