Add support for handling firmware-reported errors.

When CSC firmware errors are encountered, the GFX device receives an
error interrupt as an MSI interrupt. The Device Source Control registers
indicate CSC as the source of the error, and the HEC error status
register indicates that the error is firmware reported. Depending on the
type of firmware error, the error cause is written to the HEC Firmware
error register. Once such a CSC firmware error occurs, the device is
unusable and can be recovered only through a firmware update.
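For illustration only, the decode order described above can be sketched
as a small standalone C function. This is not the code from this series:
the struct, enum and function names are hypothetical, and so is the bit
position used for "firmware reported" in the HEC error status register.
Only the 0x20000 source value and the bit[2] cause are taken from the
sample dmesg further down in this letter.

```c
#include <stdint.h>

/* 0x20000 is the NONFATAL source value shown in the sample dmesg. */
#define SRC_CSC_ERROR            (1u << 17)
/* ASSUMPTION: hypothetical bit position, not taken from this series. */
#define HEC_STATUS_FW_REPORTED   (1u << 6)

/* Hypothetical snapshot of the three registers named in the text. */
struct hw_regs {
	uint32_t dev_err_src;     /* Device Source Control register */
	uint32_t hec_err_status;  /* HEC error status register */
	uint32_t hec_fw_err;      /* HEC Firmware error register */
};

enum csc_result {
	CSC_NONE,             /* error source is not CSC */
	CSC_NOT_FW_REPORTED,  /* CSC error, but not firmware reported */
	CSC_FW_ERROR,         /* firmware reported CSC error */
};

/*
 * Walk the registers in the order described above: source first, then
 * the firmware-reported status, then the cause. *cause is set to the
 * raw cause bits only when CSC_FW_ERROR is returned, otherwise to 0.
 */
static enum csc_result decode_csc_error(const struct hw_regs *r,
					uint32_t *cause)
{
	*cause = 0;

	if (!(r->dev_err_src & SRC_CSC_ERROR))
		return CSC_NONE;

	if (!(r->hec_err_status & HEC_STATUS_FW_REPORTED))
		return CSC_NOT_FW_REPORTED;

	*cause = r->hec_fw_err;
	return CSC_FW_ERROR;
}
```

With the values from the sample dmesg (source 0x20000, cause bit[2] set)
and the assumed status bit, decode_csc_error() would return CSC_FW_ERROR
with a cause of 0x4, which is the case where a firmware update is
needed.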

Whenever XE KMD detects such a firmware error, the system
administrator/userspace needs to be notified so that a firmware update
can be triggered. To address this need, the drm wedged uevent with a new
recovery method and runtime survivability mode are used.

The initial proposal to add 'firmware-flash' as a recovery method was not
applicable to other drivers and could have led to multiple
vendor-specific recovery methods being added. A more generic
'vendor-specific' method is introduced in this series instead. It guides
users to vendor specific documentation, system logs and additional
indicators for the detailed vendor specific recovery mechanism.

It is the responsibility of the consumer to refer to the correct vendor
specific documentation and use case before attempting a recovery.

For example: If the driver is XE KMD, the consumer must refer to the
documentation of 'Device Wedging' under 'Documentation/gpu/xe/'.

In Xe KMD, the need for a firmware flash is signalled to the user through
a combination of the vendor-specific wedged uevent, runtime survivability
mode and dmesg logs. The consumer must check both the uevent and the
runtime survivability sysfs before triggering a firmware update.

Udev:

$ udevadm monitor --property --kernel
monitor will print the received events for:
KERNEL - the kernel uevent

KERNEL[754.709341] change   /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0 (drm)
ACTION=change
DEVPATH=/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0
SUBSYSTEM=drm
WEDGED=vendor-specific
DEVNAME=/dev/dri/card0
DEVTYPE=drm_minor
SEQNUM=5973
MAJOR=226
MINOR=0

Dmesg:

xe 0000:03:00.0: [drm] *ERROR* [Hardware Error]: Tile0 reported NONFATAL error 0x20000
xe 0000:03:00.0: [drm] *ERROR* [Hardware Error]: NONFATAL: HEC Uncorrected FW FD Corruption error reported, bit[2] is set
xe 0000:03:00.0: Runtime Survivability mode enabled
xe 0000:03:00.0: [drm] *ERROR* CRITICAL: Xe has declared device 0000:03:00.0 as wedged.
IOCTLs and executions are blocked. Only a rebind may clear the failure
Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/xe/kernel/issues/new
xe 0000:03:00.0: [drm] device wedged, needs recovery
xe 0000:03:00.0: Firmware update required, Please refer to the userspace documentation for more details!

Runtime survivability Sysfs:

/sys/bus/pci/devices/<device>/survivability_mode

Bspec: 50875, 53073, 53074, 53075, 53076

IGT: https://patchwork.freedesktop.org/patch/660122/

fwupd PR: https://github.com/fwupd/fwupd/pull/9024

Rev2: add a fault injection for csc errors
      fix review comments

Rev3: add a vendor-specific recovery method
      add support for runtime survivability mode
      enable runtime survivability mode when csc errors are reported

Rev4: refactor survivability code

Rev5: Add more documentation
      add user friendly logs
      remove checks for BMG if not necessary
      fix other review comments

Riana Tauro (9):
  drm: Add a vendor-specific recovery method to device wedged uevent
  drm/xe: Set GT as wedged before sending wedged uevent
  drm/xe: Add a helper function to set recovery method
  drm/xe/xe_survivability: Refactor survivability mode
  drm/xe/xe_survivability: Add support for Runtime survivability mode
  drm/xe/doc: Document device wedged and runtime survivability
  drm/xe: Add support to handle hardware errors
  drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors
  drm/xe/xe_hw_error: Add fault injection to trigger csc error handler

 Documentation/gpu/drm-uapi.rst               |  41 +++-
 Documentation/gpu/xe/index.rst               |   1 +
 Documentation/gpu/xe/xe_device.rst           |  10 +
 Documentation/gpu/xe/xe_pcode.rst            |   6 +-
 drivers/gpu/drm/drm_drv.c                    |   2 +
 drivers/gpu/drm/xe/Makefile                  |   1 +
 drivers/gpu/drm/xe/regs/xe_gsc_regs.h        |   2 +
 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h   |  20 ++
 drivers/gpu/drm/xe/regs/xe_irq_regs.h        |   1 +
 drivers/gpu/drm/xe/xe_debugfs.c              |   3 +
 drivers/gpu/drm/xe/xe_device.c               |  58 +++++-
 drivers/gpu/drm/xe/xe_device.h               |   1 +
 drivers/gpu/drm/xe/xe_device_types.h         |   5 +
 drivers/gpu/drm/xe/xe_heci_gsc.c             |   2 +-
 drivers/gpu/drm/xe/xe_hw_error.c             | 181 ++++++++++++++++++
 drivers/gpu/drm/xe/xe_hw_error.h             |  15 ++
 drivers/gpu/drm/xe/xe_irq.c                  |   4 +
 drivers/gpu/drm/xe/xe_pci.c                  |   6 +-
 drivers/gpu/drm/xe/xe_survivability_mode.c   | 167 ++++++++++++----
 drivers/gpu/drm/xe/xe_survivability_mode.h   |   5 +-
 .../gpu/drm/xe/xe_survivability_mode_types.h |   8 +
 include/drm/drm_device.h                     |   4 +
 22 files changed, 488 insertions(+), 55 deletions(-)
 create mode 100644 Documentation/gpu/xe/xe_device.rst
 create mode 100644 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
 create mode 100644 drivers/gpu/drm/xe/xe_hw_error.c
 create mode 100644 drivers/gpu/drm/xe/xe_hw_error.h

--
2.47.1