Re: [Intel-xe] [PATCH 02/14] drm/xe: Introduce the dev_coredump infrastructure.

2023-05-02 Thread Matthew Brost
On Tue, May 02, 2023 at 02:06:55PM -0400, Rodrigo Vivi wrote:
> On Tue, May 02, 2023 at 07:57:02AM +, Matthew Brost wrote:
> > On Thu, Apr 27, 2023 at 10:28:13AM +0200, Thomas Hellström wrote:
> > > 
> > > On 4/26/23 22:57, Rodrigo Vivi wrote:
> > > > The goal is to use devcoredump infrastructure to report error states
> > > > captured at the crash time.
> > > > 
> > > > The error state will contain useful information for GPU hang debug, such
> > > > as INSTDONE registers and the current buffers getting executed, as well
> > > > as any other information that helps user space and allow later replays 
> > > > of
> > > > the error.
> > > > 
> > > > The proposal here is to avoid a Xe only error_state like i915 and use
> > > > a standard dev_coredump infrastructure to expose the error state.
> > > > 
> > > > For our own case, the data is only useful if it is a snapshot of the
> > > > time when the GPU crash has happened, since we reset the GPU immediately
> > > > after and the registers might have changed. So the proposal here is to
> > > > have an internal snapshot to be printed out later.
> > > > 
> > > > Also, usually a subsequent GPU hang can be only a cause of the initial
> > > > one. So we only save the 'first' hang. The dev_coredump has a delayed
> > > > work queue where it remove the coredump and free all the data withing a
> > > > few moments of the error. When that happens we also reset our capture
> > > > state and allow further snapshots.
> > > > 
> > > > Right now this infra only print out the time of the hang. More 
> > > > information
> > > > will be migrated here on subsequent work. Also, in order to organize the
> > > > dump better, the goal is to propose dev_coredump changes itself to allow
> > > > multiple files and different controls. But for now we start Xe usage of
> > > > it without any dependency on dev_coredump core changes.
> > > > 
> > > > Cc: Daniel Vetter 
> > > > Signed-off-by: Rodrigo Vivi 
> > > > ---
> > > >   drivers/gpu/drm/xe/Kconfig|   1 +
> > > >   drivers/gpu/drm/xe/Makefile   |   1 +
> > > >   drivers/gpu/drm/xe/xe_devcoredump.c   | 144 ++
> > > >   drivers/gpu/drm/xe/xe_devcoredump.h   |  22 
> > > >   drivers/gpu/drm/xe/xe_devcoredump_types.h |  47 +++
> > > >   drivers/gpu/drm/xe/xe_device_types.h  |   4 +
> > > >   drivers/gpu/drm/xe/xe_guc_submit.c|   2 +
> > > >   drivers/gpu/drm/xe/xe_pci.c   |   2 +
> > > >   8 files changed, 223 insertions(+)
> > > >   create mode 100644 drivers/gpu/drm/xe/xe_devcoredump.c
> > > >   create mode 100644 drivers/gpu/drm/xe/xe_devcoredump.h
> > > >   create mode 100644 drivers/gpu/drm/xe/xe_devcoredump_types.h
> > > > 
> > > > diff --git a/drivers/gpu/drm/xe/Kconfig b/drivers/gpu/drm/xe/Kconfig
> > > > index f6f3b491d162..d44794f99338 100644
> > > > --- a/drivers/gpu/drm/xe/Kconfig
> > > > +++ b/drivers/gpu/drm/xe/Kconfig
> > > > @@ -35,6 +35,7 @@ config DRM_XE
> > > > select DRM_TTM_HELPER
> > > > select DRM_SCHED
> > > > select MMU_NOTIFIER
> > > > +   select WANT_DEV_COREDUMP
> > > > help
> > > >   Experimental driver for Intel Xe series GPUs
> > > > diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> > > > index ee4a95beec20..9d675f7c77aa 100644
> > > > --- a/drivers/gpu/drm/xe/Makefile
> > > > +++ b/drivers/gpu/drm/xe/Makefile
> > > > @@ -34,6 +34,7 @@ xe-y += xe_bb.o \
> > > > xe_bo.o \
> > > > xe_bo_evict.o \
> > > > xe_debugfs.o \
> > > > +   xe_devcoredump.o \
> > > > xe_device.o \
> > > > xe_dma_buf.o \
> > > > xe_engine.o \
> > > > diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c 
> > > > b/drivers/gpu/drm/xe/xe_devcoredump.c
> > > > new file mode 100644
> > > > index ..d9531183f03a
> > > > --- /dev/null
> > > > +++ b/drivers/gpu/drm/xe/xe_devcoredump.c
> > > > @@ -0,0 +1,144 @@
> > > > +// SPDX-License-Identifier: MIT
> > > > +/*
> > > > + * Copyright © 2023 Intel Corporation
> > > > + */
> > > > +
> > > > +#include "xe_devcoredump.h"
> > > > +#include "xe_devcoredump_types.h"
> > > > +
> > > > +#include 
> > > > +#include 
> > > > +
> > > > +#include "xe_engine.h"
> > > > +#include "xe_gt.h"
> > > > +
> > > > +/**
> > > > + * DOC: Xe device coredump
> > > > + *
> > > > + * Devices overview:
> > > > + * Xe uses dev_coredump infrastructure for exposing the crash errors 
> > > > in a
> > > > + * standardized way.
> > > > + * devcoredump exposes a temporary device under /sys/class/devcoredump/
> > > > + * which is linked with our card device directly.
> > > > + * The core dump can be accessed either from
> > > > + * /sys/class/drm/card/device/devcoredump/ or from
> > > > + * /sys/class/devcoredump/devcd where
> > > > + * /sys/class/devcoredump/devcd/failing_device is a link to
> > > > + * /sys/class/drm/card/device/.
> > > > + *
> > > > + * Snapshot at hang:
> > > > + * The 'data' file is printed with a 

Re: [Intel-xe] [PATCH 02/14] drm/xe: Introduce the dev_coredump infrastructure.

2023-05-02 Thread Rodrigo Vivi
On Tue, May 02, 2023 at 07:57:02AM +, Matthew Brost wrote:
> On Thu, Apr 27, 2023 at 10:28:13AM +0200, Thomas Hellström wrote:
> > 
> > On 4/26/23 22:57, Rodrigo Vivi wrote:
> > > The goal is to use devcoredump infrastructure to report error states
> > > captured at the crash time.
> > > 
> > > The error state will contain useful information for GPU hang debug, such
> > > as INSTDONE registers and the current buffers getting executed, as well
> > > as any other information that helps user space and allow later replays of
> > > the error.
> > > 
> > > The proposal here is to avoid a Xe only error_state like i915 and use
> > > a standard dev_coredump infrastructure to expose the error state.
> > > 
> > > For our own case, the data is only useful if it is a snapshot of the
> > > time when the GPU crash has happened, since we reset the GPU immediately
> > > after and the registers might have changed. So the proposal here is to
> > > have an internal snapshot to be printed out later.
> > > 
> > > Also, usually a subsequent GPU hang can be only a cause of the initial
> > > one. So we only save the 'first' hang. The dev_coredump has a delayed
> > > work queue where it remove the coredump and free all the data withing a
> > > few moments of the error. When that happens we also reset our capture
> > > state and allow further snapshots.
> > > 
> > > Right now this infra only print out the time of the hang. More information
> > > will be migrated here on subsequent work. Also, in order to organize the
> > > dump better, the goal is to propose dev_coredump changes itself to allow
> > > multiple files and different controls. But for now we start Xe usage of
> > > it without any dependency on dev_coredump core changes.
> > > 
> > > Cc: Daniel Vetter 
> > > Signed-off-by: Rodrigo Vivi 
> > > ---
> > >   drivers/gpu/drm/xe/Kconfig|   1 +
> > >   drivers/gpu/drm/xe/Makefile   |   1 +
> > >   drivers/gpu/drm/xe/xe_devcoredump.c   | 144 ++
> > >   drivers/gpu/drm/xe/xe_devcoredump.h   |  22 
> > >   drivers/gpu/drm/xe/xe_devcoredump_types.h |  47 +++
> > >   drivers/gpu/drm/xe/xe_device_types.h  |   4 +
> > >   drivers/gpu/drm/xe/xe_guc_submit.c|   2 +
> > >   drivers/gpu/drm/xe/xe_pci.c   |   2 +
> > >   8 files changed, 223 insertions(+)
> > >   create mode 100644 drivers/gpu/drm/xe/xe_devcoredump.c
> > >   create mode 100644 drivers/gpu/drm/xe/xe_devcoredump.h
> > >   create mode 100644 drivers/gpu/drm/xe/xe_devcoredump_types.h
> > > 
> > > diff --git a/drivers/gpu/drm/xe/Kconfig b/drivers/gpu/drm/xe/Kconfig
> > > index f6f3b491d162..d44794f99338 100644
> > > --- a/drivers/gpu/drm/xe/Kconfig
> > > +++ b/drivers/gpu/drm/xe/Kconfig
> > > @@ -35,6 +35,7 @@ config DRM_XE
> > >   select DRM_TTM_HELPER
> > >   select DRM_SCHED
> > >   select MMU_NOTIFIER
> > > + select WANT_DEV_COREDUMP
> > >   help
> > > Experimental driver for Intel Xe series GPUs
> > > diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> > > index ee4a95beec20..9d675f7c77aa 100644
> > > --- a/drivers/gpu/drm/xe/Makefile
> > > +++ b/drivers/gpu/drm/xe/Makefile
> > > @@ -34,6 +34,7 @@ xe-y += xe_bb.o \
> > >   xe_bo.o \
> > >   xe_bo_evict.o \
> > >   xe_debugfs.o \
> > > + xe_devcoredump.o \
> > >   xe_device.o \
> > >   xe_dma_buf.o \
> > >   xe_engine.o \
> > > diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c 
> > > b/drivers/gpu/drm/xe/xe_devcoredump.c
> > > new file mode 100644
> > > index ..d9531183f03a
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/xe/xe_devcoredump.c
> > > @@ -0,0 +1,144 @@
> > > +// SPDX-License-Identifier: MIT
> > > +/*
> > > + * Copyright © 2023 Intel Corporation
> > > + */
> > > +
> > > +#include "xe_devcoredump.h"
> > > +#include "xe_devcoredump_types.h"
> > > +
> > > +#include 
> > > +#include 
> > > +
> > > +#include "xe_engine.h"
> > > +#include "xe_gt.h"
> > > +
> > > +/**
> > > + * DOC: Xe device coredump
> > > + *
> > > + * Devices overview:
> > > + * Xe uses dev_coredump infrastructure for exposing the crash errors in a
> > > + * standardized way.
> > > + * devcoredump exposes a temporary device under /sys/class/devcoredump/
> > > + * which is linked with our card device directly.
> > > + * The core dump can be accessed either from
> > > + * /sys/class/drm/card/device/devcoredump/ or from
> > > + * /sys/class/devcoredump/devcd where
> > > + * /sys/class/devcoredump/devcd/failing_device is a link to
> > > + * /sys/class/drm/card/device/.
> > > + *
> > > + * Snapshot at hang:
> > > + * The 'data' file is printed with a drm_printer pointer at devcoredump 
> > > read
> > > + * time. For this reason, we need to take snapshots from when the hang 
> > > has
> > > + * happened, and not only when the user is reading the file. Otherwise 
> > > the
> > > + * information is outdated since the resets might have happened in 

Re: [Intel-xe] [PATCH 02/14] drm/xe: Introduce the dev_coredump infrastructure.

2023-05-02 Thread Rodrigo Vivi
On Tue, May 02, 2023 at 10:55:14AM +0300, Jani Nikula wrote:
> On Wed, 26 Apr 2023, Rodrigo Vivi  wrote:
> > +   drm_info(>drm, "Check your 
> > /sys/class/drm/card/device/devcoredump/data\n");
> 
> Drive-by comment, could use %d and xe->drm.primary->index instead of

yeap, I wondered about that. I guess I will just add this for now, but
then work on all driver to stop using the drm.primary directly or update
the documentation that states the drivers shouldn't use that directly.

> .
> 
> BR,
> Jani.
> 
> 
> -- 
> Jani Nikula, Intel Open Source Graphics Center


Re: [Intel-xe] [PATCH 02/14] drm/xe: Introduce the dev_coredump infrastructure.

2023-05-02 Thread Matthew Brost
On Thu, Apr 27, 2023 at 10:28:13AM +0200, Thomas Hellström wrote:
> 
> On 4/26/23 22:57, Rodrigo Vivi wrote:
> > The goal is to use devcoredump infrastructure to report error states
> > captured at the crash time.
> > 
> > The error state will contain useful information for GPU hang debug, such
> > as INSTDONE registers and the current buffers getting executed, as well
> > as any other information that helps user space and allow later replays of
> > the error.
> > 
> > The proposal here is to avoid a Xe only error_state like i915 and use
> > a standard dev_coredump infrastructure to expose the error state.
> > 
> > For our own case, the data is only useful if it is a snapshot of the
> > time when the GPU crash has happened, since we reset the GPU immediately
> > after and the registers might have changed. So the proposal here is to
> > have an internal snapshot to be printed out later.
> > 
> > Also, usually a subsequent GPU hang can be only a cause of the initial
> > one. So we only save the 'first' hang. The dev_coredump has a delayed
> > work queue where it remove the coredump and free all the data withing a
> > few moments of the error. When that happens we also reset our capture
> > state and allow further snapshots.
> > 
> > Right now this infra only print out the time of the hang. More information
> > will be migrated here on subsequent work. Also, in order to organize the
> > dump better, the goal is to propose dev_coredump changes itself to allow
> > multiple files and different controls. But for now we start Xe usage of
> > it without any dependency on dev_coredump core changes.
> > 
> > Cc: Daniel Vetter 
> > Signed-off-by: Rodrigo Vivi 
> > ---
> >   drivers/gpu/drm/xe/Kconfig|   1 +
> >   drivers/gpu/drm/xe/Makefile   |   1 +
> >   drivers/gpu/drm/xe/xe_devcoredump.c   | 144 ++
> >   drivers/gpu/drm/xe/xe_devcoredump.h   |  22 
> >   drivers/gpu/drm/xe/xe_devcoredump_types.h |  47 +++
> >   drivers/gpu/drm/xe/xe_device_types.h  |   4 +
> >   drivers/gpu/drm/xe/xe_guc_submit.c|   2 +
> >   drivers/gpu/drm/xe/xe_pci.c   |   2 +
> >   8 files changed, 223 insertions(+)
> >   create mode 100644 drivers/gpu/drm/xe/xe_devcoredump.c
> >   create mode 100644 drivers/gpu/drm/xe/xe_devcoredump.h
> >   create mode 100644 drivers/gpu/drm/xe/xe_devcoredump_types.h
> > 
> > diff --git a/drivers/gpu/drm/xe/Kconfig b/drivers/gpu/drm/xe/Kconfig
> > index f6f3b491d162..d44794f99338 100644
> > --- a/drivers/gpu/drm/xe/Kconfig
> > +++ b/drivers/gpu/drm/xe/Kconfig
> > @@ -35,6 +35,7 @@ config DRM_XE
> > select DRM_TTM_HELPER
> > select DRM_SCHED
> > select MMU_NOTIFIER
> > +   select WANT_DEV_COREDUMP
> > help
> >   Experimental driver for Intel Xe series GPUs
> > diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> > index ee4a95beec20..9d675f7c77aa 100644
> > --- a/drivers/gpu/drm/xe/Makefile
> > +++ b/drivers/gpu/drm/xe/Makefile
> > @@ -34,6 +34,7 @@ xe-y += xe_bb.o \
> > xe_bo.o \
> > xe_bo_evict.o \
> > xe_debugfs.o \
> > +   xe_devcoredump.o \
> > xe_device.o \
> > xe_dma_buf.o \
> > xe_engine.o \
> > diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c 
> > b/drivers/gpu/drm/xe/xe_devcoredump.c
> > new file mode 100644
> > index ..d9531183f03a
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/xe_devcoredump.c
> > @@ -0,0 +1,144 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2023 Intel Corporation
> > + */
> > +
> > +#include "xe_devcoredump.h"
> > +#include "xe_devcoredump_types.h"
> > +
> > +#include 
> > +#include 
> > +
> > +#include "xe_engine.h"
> > +#include "xe_gt.h"
> > +
> > +/**
> > + * DOC: Xe device coredump
> > + *
> > + * Devices overview:
> > + * Xe uses dev_coredump infrastructure for exposing the crash errors in a
> > + * standardized way.
> > + * devcoredump exposes a temporary device under /sys/class/devcoredump/
> > + * which is linked with our card device directly.
> > + * The core dump can be accessed either from
> > + * /sys/class/drm/card/device/devcoredump/ or from
> > + * /sys/class/devcoredump/devcd where
> > + * /sys/class/devcoredump/devcd/failing_device is a link to
> > + * /sys/class/drm/card/device/.
> > + *
> > + * Snapshot at hang:
> > + * The 'data' file is printed with a drm_printer pointer at devcoredump 
> > read
> > + * time. For this reason, we need to take snapshots from when the hang has
> > + * happened, and not only when the user is reading the file. Otherwise the
> > + * information is outdated since the resets might have happened in between.
> > + *
> > + * 'First' failure snapshot:
> > + * In general, the first hang is the most critical one since the following 
> > hangs
> > + * can be a consequence of the initial hang. For this reason we only take 
> > the
> > + * snapshot of the 'first' failure and ignore subsequent calls of this 
> > function,
> > + * at least while the coredump 

Re: [Intel-xe] [PATCH 02/14] drm/xe: Introduce the dev_coredump infrastructure.

2023-05-02 Thread Jani Nikula
On Wed, 26 Apr 2023, Rodrigo Vivi  wrote:
> + drm_info(>drm, "Check your 
> /sys/class/drm/card/device/devcoredump/data\n");

Drive-by comment, could use %d and xe->drm.primary->index instead of
.

BR,
Jani.


-- 
Jani Nikula, Intel Open Source Graphics Center


Re: [Intel-xe] [PATCH 02/14] drm/xe: Introduce the dev_coredump infrastructure.

2023-04-27 Thread Thomas Hellström



On 4/26/23 22:57, Rodrigo Vivi wrote:

The goal is to use devcoredump infrastructure to report error states
captured at the crash time.

The error state will contain useful information for GPU hang debug, such
as INSTDONE registers and the current buffers getting executed, as well
as any other information that helps user space and allow later replays of
the error.

The proposal here is to avoid a Xe only error_state like i915 and use
a standard dev_coredump infrastructure to expose the error state.

For our own case, the data is only useful if it is a snapshot of the
time when the GPU crash has happened, since we reset the GPU immediately
after and the registers might have changed. So the proposal here is to
have an internal snapshot to be printed out later.

Also, usually a subsequent GPU hang can be only a cause of the initial
one. So we only save the 'first' hang. The dev_coredump has a delayed
work queue where it remove the coredump and free all the data withing a
few moments of the error. When that happens we also reset our capture
state and allow further snapshots.

Right now this infra only print out the time of the hang. More information
will be migrated here on subsequent work. Also, in order to organize the
dump better, the goal is to propose dev_coredump changes itself to allow
multiple files and different controls. But for now we start Xe usage of
it without any dependency on dev_coredump core changes.

Cc: Daniel Vetter 
Signed-off-by: Rodrigo Vivi 
---
  drivers/gpu/drm/xe/Kconfig|   1 +
  drivers/gpu/drm/xe/Makefile   |   1 +
  drivers/gpu/drm/xe/xe_devcoredump.c   | 144 ++
  drivers/gpu/drm/xe/xe_devcoredump.h   |  22 
  drivers/gpu/drm/xe/xe_devcoredump_types.h |  47 +++
  drivers/gpu/drm/xe/xe_device_types.h  |   4 +
  drivers/gpu/drm/xe/xe_guc_submit.c|   2 +
  drivers/gpu/drm/xe/xe_pci.c   |   2 +
  8 files changed, 223 insertions(+)
  create mode 100644 drivers/gpu/drm/xe/xe_devcoredump.c
  create mode 100644 drivers/gpu/drm/xe/xe_devcoredump.h
  create mode 100644 drivers/gpu/drm/xe/xe_devcoredump_types.h

diff --git a/drivers/gpu/drm/xe/Kconfig b/drivers/gpu/drm/xe/Kconfig
index f6f3b491d162..d44794f99338 100644
--- a/drivers/gpu/drm/xe/Kconfig
+++ b/drivers/gpu/drm/xe/Kconfig
@@ -35,6 +35,7 @@ config DRM_XE
select DRM_TTM_HELPER
select DRM_SCHED
select MMU_NOTIFIER
+   select WANT_DEV_COREDUMP
help
  Experimental driver for Intel Xe series GPUs
  
diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile

index ee4a95beec20..9d675f7c77aa 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -34,6 +34,7 @@ xe-y += xe_bb.o \
xe_bo.o \
xe_bo_evict.o \
xe_debugfs.o \
+   xe_devcoredump.o \
xe_device.o \
xe_dma_buf.o \
xe_engine.o \
diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c 
b/drivers/gpu/drm/xe/xe_devcoredump.c
new file mode 100644
index ..d9531183f03a
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_devcoredump.c
@@ -0,0 +1,144 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+
+#include "xe_devcoredump.h"
+#include "xe_devcoredump_types.h"
+
+#include 
+#include 
+
+#include "xe_engine.h"
+#include "xe_gt.h"
+
+/**
+ * DOC: Xe device coredump
+ *
+ * Devices overview:
+ * Xe uses dev_coredump infrastructure for exposing the crash errors in a
+ * standardized way.
+ * devcoredump exposes a temporary device under /sys/class/devcoredump/
+ * which is linked with our card device directly.
+ * The core dump can be accessed either from
+ * /sys/class/drm/card/device/devcoredump/ or from
+ * /sys/class/devcoredump/devcd where
+ * /sys/class/devcoredump/devcd/failing_device is a link to
+ * /sys/class/drm/card/device/.
+ *
+ * Snapshot at hang:
+ * The 'data' file is printed with a drm_printer pointer at devcoredump read
+ * time. For this reason, we need to take snapshots from when the hang has
+ * happened, and not only when the user is reading the file. Otherwise the
+ * information is outdated since the resets might have happened in between.
+ *
+ * 'First' failure snapshot:
+ * In general, the first hang is the most critical one since the following 
hangs
+ * can be a consequence of the initial hang. For this reason we only take the
+ * snapshot of the 'first' failure and ignore subsequent calls of this 
function,
+ * at least while the coredump device is alive. Dev_coredump has a delayed work
+ * queue that will eventually delete the device and free all the dump
+ * information. At this time we also clear the faulty_engine and allow the next
+ * hang capture.
+ */
+
+static ssize_t xe_devcoredump_read(char *buffer, loff_t offset,
+  size_t count, void *data, size_t datalen)
+{
+   struct xe_devcoredump *coredump = data;
+   struct xe_devcoredump_snapshot *ss;
+   struct