On 3/10/2022 8:56 PM, Rob Clark wrote:
On Thu, Mar 10, 2022 at 11:44 AM Sharma, Shashank
<shashank.sha...@amd.com> wrote:



On 3/10/2022 8:35 PM, Rob Clark wrote:
On Thu, Mar 10, 2022 at 11:14 AM Sharma, Shashank
<shashank.sha...@amd.com> wrote:



On 3/10/2022 7:33 PM, Abhinav Kumar wrote:


On 3/10/2022 9:40 AM, Rob Clark wrote:
On Thu, Mar 10, 2022 at 9:19 AM Sharma, Shashank
<shashank.sha...@amd.com> wrote:



On 3/10/2022 6:10 PM, Rob Clark wrote:
On Thu, Mar 10, 2022 at 8:21 AM Sharma, Shashank
<shashank.sha...@amd.com> wrote:



On 3/10/2022 4:24 PM, Rob Clark wrote:
On Thu, Mar 10, 2022 at 1:55 AM Christian König
<ckoenig.leichtzumer...@gmail.com> wrote:



Am 09.03.22 um 19:12 schrieb Rob Clark:
On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
<contactshashanksha...@gmail.com> wrote:
From: Shashank Sharma <shashank.sha...@amd.com>

This patch adds a new sysfs event, which will notify
userland about a GPU reset, and can also provide
some information like:
- process ID of the process involved with the GPU reset
- process name of the involved process
- the GPU status info (using flags)

This patch also introduces the first flag of the flags
bitmap, which can be extended as and when required.
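
For reference, a rough sketch of how such an event could be raised as a
uevent on the DRM device; the drm_sysfs_reset_event() helper name, the env
keys and the drm_reset_event struct are illustrative assumptions, not
necessarily what the patch implements:

#include <linux/kobject.h>
#include <linux/sched.h>        /* TASK_COMM_LEN */
#include <drm/drm_device.h>
#include <drm/drm_file.h>       /* struct drm_minor, dev->primary->kdev */

/* Hypothetical payload describing the reset; layout is made up here */
struct drm_reset_event {
	u32 pid;
	char pname[TASK_COMM_LEN];
	u32 flags;
};

/* Hypothetical helper: emit a GPU reset uevent with some context */
void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event *info)
{
	char pid_str[32], name_str[TASK_COMM_LEN + 8], flags_str[32];
	char *envp[] = { pid_str, name_str, flags_str, NULL };

	snprintf(pid_str, sizeof(pid_str), "PID=%u", info->pid);
	snprintf(name_str, sizeof(name_str), "NAME=%s", info->pname);
	snprintf(flags_str, sizeof(flags_str), "FLAGS=%u", info->flags);

	/* KOBJ_CHANGE uevent on the primary node's device */
	kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
}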
Why invent something new, rather than using the already existing
devcoredump?

Yeah, that's a really valid question.

I don't think we need (or should encourage/allow) something drm-specific
when there is already an existing solution used by both drm and non-drm
drivers.  Userspace should not have to learn to support yet another
mechanism to do the same thing.

The question is: how is userspace notified about newly available core
dumps?

I haven't looked into it too closely, as the CrOS userspace
crash-reporter already had support for devcoredump, so it "just
worked" out of the box[1].  I believe a udev event is what triggers
the crash-reporter to go read the devcore dump out of sysfs.
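
For illustration, a minimal libudev sketch of how a userspace collector
could subscribe to those devcoredump uevents; this is not the actual
crash-reporter code, just the general shape (build with -ludev):

/* Minimal sketch: block until a devcoredump uevent arrives */
#include <poll.h>
#include <stdio.h>
#include <libudev.h>

int main(void)
{
	struct udev *udev = udev_new();
	struct udev_monitor *mon = udev_monitor_new_from_netlink(udev, "udev");

	/* Only interested in devcoredump class devices */
	udev_monitor_filter_add_match_subsystem_devtype(mon, "devcoredump", NULL);
	udev_monitor_enable_receiving(mon);

	for (;;) {
		struct pollfd pfd = { .fd = udev_monitor_get_fd(mon), .events = POLLIN };

		if (poll(&pfd, 1, -1) <= 0)
			continue;

		struct udev_device *dev = udev_monitor_receive_device(mon);
		if (!dev)
			continue;

		/* e.g. action=add sysname=devcd1 -> dump is readable under
		 * /sys/class/devcoredump/devcd1/data */
		printf("action=%s sysname=%s\n",
		       udev_device_get_action(dev),
		       udev_device_get_sysname(dev));
		udev_device_unref(dev);
	}
}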

I had a quick look at the devcoredump code, and it doesn't look like
it sends an event to userspace, so we still need an event to
indicate a GPU reset.

There definitely is an event to userspace, I suspect somewhere down
the device_add() path?


Let me check that out as well; I hope that is not due to a driver-private
event for GPU reset, because I think I have seen some of those in a few DRM
drivers.


There is definitely no driver-private event for drm/msm.  I haven't dug
through it all, but this is the collector for devcoredump, triggered
somehow via udev, most likely from the event triggered by device_add():

https://chromium.googlesource.com/chromiumos/platform2/+/HEAD/crash-reporter/udev_collector.cc


Yes, that is correct. The uevent for devcoredump is from device_add().

Yes, I could confirm in the code that device_add() sends a uevent.

kobject_uevent(&dev->kobj, KOBJ_ADD);
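
For reference, the typical driver-side entry point is dev_coredumpv() (or
dev_coredumpm()); the devcoredump core then creates a devcd<N> child device
via device_add(), which is where that KOBJ_ADD uevent comes from.  A minimal
sketch, where the driver structure and the crash-state capture helper are
hypothetical:

#include <linux/devcoredump.h>

static void mydrv_gpu_crashed(struct mydrv_gpu *gpu)
{
	size_t len;
	void *state;

	/* mydrv_capture_crash_state() is a hypothetical driver helper
	 * that vmalloc()s a snapshot of the GPU state */
	state = mydrv_capture_crash_state(gpu, &len);
	if (!state)
		return;

	/* devcoredump takes ownership of 'state' and frees it with vfree() */
	dev_coredumpv(gpu->dev, state, len, GFP_KERNEL);
}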

I was trying to map ChromiumOS's udev rules to the event being sent from
device_add().  What I could see is that there is only one udev rule for DRM
subsystem events in ChromiumOS's 99-crash-reporter.rules:

ACTION=="change", SUBSYSTEM=="drm", KERNEL=="card0", ENV{ERROR}=="1", \
     RUN+="/sbin/crash_reporter
--udev=KERNEL=card0:SUBSYSTEM=drm:ACTION=change"

Can someone confirm that this is the rule which gets triggered when a
devcoredump is generated?  I could not find an ERROR=1 string in the
env[] while sending this event from device_add().

I think it is actually this rule:

ACTION=="add", SUBSYSTEM=="devcoredump", \
    RUN+="/sbin/crash_reporter
--udev=SUBSYSTEM=devcoredump:ACTION=add:KERNEL_NUMBER=%n"

It is not drm-specific, because it supports devcore dumps from non-drm
drivers.  I know at least some of the wifi and remoteproc drivers use it.


Ah, this seems like a problem to me. I understand it will work well for a
reset/recovery app, but for a DRM client (like a compositor) which wants to
listen only to DRM events (like a GPU reset), wouldn't this create a lot of
noise?  Every time any subsystem produces a coredump, there will be a change
in the devcoredump subsystem, and the client will have to parse the core
file and then decide whether it wants to react to it or ignore it.

Wouldn't a GPU reset event, specific to the DRM subsystem, serve better in
such a case?


So, I suppose there are two different use-cases.. for something like a
distro which has generic crash telemetry (i.e. when users opt in to
automated crash reporting), and in general for debugging gpu crashes,
you want devcoredump, preferably with plenty of information about gpu
state, etc, so you actually have a chance of debugging problems you
can't necessarily reproduce locally.  Note also that Mesa CI has some
limited support for collecting devcore dumps if a CI run triggers a
GPU fault.

For something like just notifying a compositor that a gpu crash
happened, perhaps drm_event is more suitable.  See
virtio_gpu_fence_event_create() for an example of adding new event
types.  Although maybe you want it to be an event which is not device
specific.  This isn't so much of a debugging use-case as simply
notification.
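
For illustration, a rough sketch of what adding such an event type could
look like, loosely following the virtio_gpu_fence_event_create() pattern;
the DRM_EVENT_GPU_RESET value and the helper name are hypothetical, not an
existing uapi:

#include <linux/slab.h>
#include <drm/drm_device.h>
#include <drm/drm_file.h>

/* Hypothetical event type; a real one would be defined in uapi, in the
 * driver-private range like VIRTGPU_EVENT_FENCE_SIGNALED is */
#define DRM_EVENT_GPU_RESET 0x80000002

struct drm_gpu_reset_event {
	struct drm_pending_event base;
	struct drm_event event;
};

static int drm_gpu_reset_event_create(struct drm_device *dev,
				      struct drm_file *file)
{
	struct drm_gpu_reset_event *e;
	int ret;

	e = kzalloc(sizeof(*e), GFP_KERNEL);
	if (!e)
		return -ENOMEM;

	e->event.type = DRM_EVENT_GPU_RESET;
	e->event.length = sizeof(e->event);

	/* Reserve space on the file's event queue, then deliver it */
	ret = drm_event_reserve_init(dev, file, &e->base, &e->event);
	if (ret) {
		kfree(e);
		return ret;
	}

	drm_send_event(dev, &e->base);
	return 0;
}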

Exactly, in fact that should have been a part of my cover letter. Maybe I will send a V2 with better documentation.

- Shashank


BR,
-R
