Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-30 Thread Daniel Vetter
On Tue, Mar 29, 2022 at 12:25:55PM -0400, Marek Olšák wrote:
> I don't know what iris does, but I would guess that the same problems as
> with AMD GPUs apply, making GPU resets very fragile.

iris_batch_check_for_reset -> replace_kernel_ctx -> iris_lost_context_state

is I think the main call chain of how this is handled/detected. There's
also a side-chain which handles -EIO from execbuf.

Also this is using non-recoverable contexts, i.e. any time a context suffers
a gpu reset (either because it was guilty itself, or as collateral damage of
a reset that shot more than just the guilty context) it stops entirely and
refuses any further execbuf with -EIO.
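
As a rough illustration of that uapi contract (a minimal sketch assuming the
standard i915 context-param and execbuf ioctls; this is not the actual iris
code), userspace opts out of recovery at context creation and treats -EIO
from execbuf as "context lost, rebuild everything":

#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

/* Opt the context out of post-reset recovery, like iris/anv do. */
static void make_context_non_recoverable(int fd, uint32_t ctx_id)
{
        struct drm_i915_gem_context_param p = {
                .ctx_id = ctx_id,
                .param  = I915_CONTEXT_PARAM_RECOVERABLE,
                .value  = 0,
        };

        ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &p);
}

/* Once a reset has victimized the context, every execbuf fails with -EIO
 * and the userspace driver marks the context as lost. */
static int submit(int fd, struct drm_i915_gem_execbuffer2 *execbuf,
                  bool *context_lost)
{
        if (ioctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, execbuf) == 0)
                return 0;

        if (errno == EIO)
                *context_lost = true;   /* recreate context and state */

        return -errno;
}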

Cheers, Daniel

> 
> Marek
> 
> On Tue., Mar. 29, 2022, 08:14 Christian König wrote:
> 
> > My main question is what does the iris driver do better than radeonsi when
> > the client doesn't support the robustness extension?
> >
> > From Daniel's description it sounds like they have at least a partial
> > recovery mechanism in place.
> >
> > Apart from that I completely agree to what you said below.
> >
> > Christian.
> >
> > Am 26.03.22 um 01:53 schrieb Olsak, Marek:
> >
> > [AMD Official Use Only]
> >
> > amdgpu has 2 resets: soft reset and hard reset.
> >
> > The soft reset is able to recover from an infinite loop and even some GPU
> > hangs due to bad shaders or bad states. The soft reset uses a signal that
> > kills all currently-running shaders of a certain process (VM context),
> > which unblocks the graphics pipeline, so draws and command buffers finish
> > but not correctly. This can then cause a hard hang if the shader was
> > supposed to signal work completion through a shader store instruction and a
> > non-shader consumer is waiting for it (skipping the store instruction by
> > killing the shader won't signal the work, and thus the consumer will be
> > stuck, requiring a hard reset).
> >
> > The hard reset can recover from other hangs, which is great, but it may
> > use a PCI reset, which erases VRAM on dGPUs. APUs don't lose memory
> > contents, but we should assume that any process that had running jobs on
> > the GPU during a GPU reset has its memory resources in an inconsistent
> > state, and thus following command buffers can cause another GPU hang. The
> > shader store example above is enough to cause another hard hang due to
> > incorrect content in memory resources, which can contain synchronization
> > primitives that are used internally by the hardware.
> >
> > Asking the driver to replay a command buffer that caused a hang is a sure
> > way to hang it again. Unrelated processes can be affected due to lost VRAM
> > or the misfortune of using the GPU while the GPU hang occurred. The window
> > system should recreate GPU resources and redraw everything without
> > affecting applications. If apps use GL, they should do the same. Processes
> > that can't recover by redrawing content can be terminated or left alone,
> > but they shouldn't be allowed to submit work to the GPU anymore.
> >
> > dEQP only exercises the soft reset. I think WebGL is only able to trigger
> > a soft reset at this point, but Vulkan can also trigger a hard reset.
> >
> > Marek
> > --
> > *From:* Koenig, Christian
> > *Sent:* March 23, 2022 11:25
> > *To:* Daniel Vetter; Daniel Stone; Olsak, Marek; Grodzovsky, Andrey
> > *Cc:* Rob Clark; Rob Clark; Sharma, Shashank; Christian König; Somalapuram, Amaranath; Abhinav Kumar; dri-devel; amd-gfx list; Deucher, Alexander; Shashank Sharma
> > *Subject:* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
> >
> > [Adding Marek and Andrey as well]
> >
> > Am 23.03.22 um 16:14 schrieb Daniel Vetter:
> > > On Wed, 23 Mar 2022 at 15:07, Daniel Stone 
> >  wrote:
> > >> Hi,
> > >>
> > >> On Mon, 21 Mar 2022 at 16:02, Rob Clark 
> >  wrote:
> > >>> On Mon, Mar 21, 2022 at 2:30 AM Christian König
> > >>>   wrote:
> > >>>> Well you can, it just means that their contexts are lost as well.
> > >>> Which is rather inconvenient when deqp-egl reset tests, for example,
> > >>> take down your compositor ;-)
> > >> Yeah. Or anything WebGL.
> > >>
> > >> System-wide collateral damage is definitely a non-starter. If that
> > >> means that the userspace driver has to do what iris does and ensure
> > >> everything's recreated and resubmitted, that works too, just as long
> > >> as the response to 'my adblocker didn't detect a crypto miner ad' is
> > >> something better than 'shoot the entire user session'.

Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-29 Thread Marek Olšák
I don't know what iris does, but I would guess that the same problems as
with AMD GPUs apply, making GPU resets very fragile.

Marek

On Tue., Mar. 29, 2022, 08:14 Christian König wrote:

> My main question is what does the iris driver do better than radeonsi when
> the client doesn't support the robustness extension?
>
> From Daniel's description it sounds like they have at least a partial
> recovery mechanism in place.
>
> Apart from that I completely agree to what you said below.
>
> Christian.
>
> Am 26.03.22 um 01:53 schrieb Olsak, Marek:
>
> [AMD Official Use Only]
>
> amdgpu has 2 resets: soft reset and hard reset.
>
> The soft reset is able to recover from an infinite loop and even some GPU
> hangs due to bad shaders or bad states. The soft reset uses a signal that
> kills all currently-running shaders of a certain process (VM context),
> which unblocks the graphics pipeline, so draws and command buffers finish
> but not correctly. This can then cause a hard hang if the shader was
> supposed to signal work completion through a shader store instruction and a
> non-shader consumer is waiting for it (skipping the store instruction by
> killing the shader won't signal the work, and thus the consumer will be
> stuck, requiring a hard reset).
>
> The hard reset can recover from other hangs, which is great, but it may
> use a PCI reset, which erases VRAM on dGPUs. APUs don't lose memory
> contents, but we should assume that any process that had running jobs on
> the GPU during a GPU reset has its memory resources in an inconsistent
> state, and thus following command buffers can cause another GPU hang. The
> shader store example above is enough to cause another hard hang due to
> incorrect content in memory resources, which can contain synchronization
> primitives that are used internally by the hardware.
>
> Asking the driver to replay a command buffer that caused a hang is a sure
> way to hang it again. Unrelated processes can be affected due to lost VRAM
> or the misfortune of using the GPU while the GPU hang occurred. The window
> system should recreate GPU resources and redraw everything without
> affecting applications. If apps use GL, they should do the same. Processes
> that can't recover by redrawing content can be terminated or left alone,
> but they shouldn't be allowed to submit work to the GPU anymore.
>
> dEQP only exercises the soft reset. I think WebGL is only able to trigger
> a soft reset at this point, but Vulkan can also trigger a hard reset.
>
> Marek
> --
> *From:* Koenig, Christian
> *Sent:* March 23, 2022 11:25
> *To:* Daniel Vetter; Daniel Stone; Olsak, Marek; Grodzovsky, Andrey
> *Cc:* Rob Clark; Rob Clark; Sharma, Shashank; Christian König; Somalapuram, Amaranath; Abhinav Kumar; dri-devel; amd-gfx list; Deucher, Alexander; Shashank Sharma
> *Subject:* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
>
> [Adding Marek and Andrey as well]
>
> Am 23.03.22 um 16:14 schrieb Daniel Vetter:
> > On Wed, 23 Mar 2022 at 15:07, Daniel Stone 
>  wrote:
> >> Hi,
> >>
> >> On Mon, 21 Mar 2022 at 16:02, Rob Clark 
>  wrote:
> >>> On Mon, Mar 21, 2022 at 2:30 AM Christian König
> >>>   wrote:
> >>>> Well you can, it just means that their contexts are lost as well.
> >>> Which is rather inconvenient when deqp-egl reset tests, for example,
> >>> take down your compositor ;-)
> >> Yeah. Or anything WebGL.
> >>
> >> System-wide collateral damage is definitely a non-starter. If that
> >> means that the userspace driver has to do what iris does and ensure
> >> everything's recreated and resubmitted, that works too, just as long
> >> as the response to 'my adblocker didn't detect a crypto miner ad'  is
> >> something better than 'shoot the entire user session'.
> > Not sure where that idea came from, I thought at least I made it clear
> > that legacy gl _has_ to recover. It's only vk and arb_robustness gl
> > which should die without recovery attempt.
> >
> > The entire discussion here is who should be responsible for replay and
> > at least if you can decide the uapi, then punting that entirely to
> > userspace is a good approach.
>
> Yes, completely agree. We have the approach of re-submitting things in
> the kernel and that failed quite miserably.
>
> In other words currently a GPU reset has something like a 99% chance to
> take down your whole desktop.
>
> Daniel can you briefly explain what exactly iris does when a lost
> context is detected without gl robustness?

Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-29 Thread Christian König
My main question is what does the iris driver do better than radeonsi when
the client doesn't support the robustness extension?


From Daniel's description it sounds like they have at least a partial
recovery mechanism in place.


Apart from that I completely agree to what you said below.

Christian.

Am 26.03.22 um 01:53 schrieb Olsak, Marek:


[AMD Official Use Only]


amdgpu has 2 resets: soft reset and hard reset.

The soft reset is able to recover from an infinite loop and even some 
GPU hangs due to bad shaders or bad states. The soft reset uses a 
signal that kills all currently-running shaders of a certain process 
(VM context), which unblocks the graphics pipeline, so draws and 
command buffers finish but not correctly. This can then cause a
hard hang if the shader was supposed to signal work completion through 
a shader store instruction and a non-shader consumer is waiting for it 
(skipping the store instruction by killing the shader won't signal the 
work, and thus the consumer will be stuck, requiring a hard reset).


The hard reset can recover from other hangs, which is great, but it 
may use a PCI reset, which erases VRAM on dGPUs. APUs don't lose 
memory contents, but we should assume that any process that had 
running jobs on the GPU during a GPU reset has its memory resources in 
an inconsistent state, and thus following command buffers can cause 
another GPU hang. The shader store example above is enough to cause 
another hard hang due to incorrect content in memory resources, which 
can contain synchronization primitives that are used internally by the 
hardware.


Asking the driver to replay a command buffer that caused a hang is a 
sure way to hang it again. Unrelated processes can be affected due to 
lost VRAM or the misfortune of using the GPU while the GPU hang 
occurred. The window system should recreate GPU resources and redraw 
everything without affecting applications. If apps use GL, they should 
do the same. Processes that can't recover by redrawing content can be 
terminated or left alone, but they shouldn't be allowed to submit work 
to the GPU anymore.


dEQP only exercises the soft reset. I think WebGL is only able to 
trigger a soft reset at this point, but Vulkan can also trigger a hard 
reset.


Marek

*From:* Koenig, Christian
*Sent:* March 23, 2022 11:25
*To:* Daniel Vetter; Daniel Stone; Olsak, Marek; Grodzovsky, Andrey
*Cc:* Rob Clark; Rob Clark; Sharma, Shashank; Christian König; Somalapuram, Amaranath; Abhinav Kumar; dri-devel; amd-gfx list; Deucher, Alexander; Shashank Sharma


*Subject:* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
[Adding Marek and Andrey as well]

Am 23.03.22 um 16:14 schrieb Daniel Vetter:
> On Wed, 23 Mar 2022 at 15:07, Daniel Stone  wrote:
>> Hi,
>>
>> On Mon, 21 Mar 2022 at 16:02, Rob Clark  wrote:
>>> On Mon, Mar 21, 2022 at 2:30 AM Christian König
>>>  wrote:
>>>> Well you can, it just means that their contexts are lost as well.
>>> Which is rather inconvenient when deqp-egl reset tests, for example,
>>> take down your compositor ;-)
>> Yeah. Or anything WebGL.
>>
>> System-wide collateral damage is definitely a non-starter. If that
>> means that the userspace driver has to do what iris does and ensure
>> everything's recreated and resubmitted, that works too, just as long
>> as the response to 'my adblocker didn't detect a crypto miner ad'  is
>> something better than 'shoot the entire user session'.
> Not sure where that idea came from, I thought at least I made it clear
> that legacy gl _has_ to recover. It's only vk and arb_robustness gl
> which should die without recovery attempt.
>
> The entire discussion here is who should be responsible for replay and
> at least if you can decide the uapi, then punting that entirely to
> userspace is a good approach.

Yes, completely agree. We have the approach of re-submitting things in
the kernel and that failed quite miserably.

In other words currently a GPU reset has something like a 99% chance to
take down your whole desktop.

Daniel can you briefly explain what exactly iris does when a lost
context is detected without gl robustness?

It sounds like you guys got that working quite well.

Thanks,
Christian.

>
> Ofc it'd be nice if the collateral damage is limited, i.e. requests
> not currently on the gpu, or on different engines and all that
> shouldn't be nuked, if possible.
>
> Also ofc since msm uapi is that the kernel tries to recover there's
> not much we can do there, contexts cannot be shot. But still trying to
> replay them as much as possible feels a bit like overkill.
> -Daniel
>
>> Cheers,
>> Daniel
>
>



Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-25 Thread Olsak, Marek
[AMD Official Use Only]

amdgpu has 2 resets: soft reset and hard reset.

The soft reset is able to recover from an infinite loop and even some GPU hangs 
due to bad shaders or bad states. The soft reset uses a signal that kills all 
currently-running shaders of a certain process (VM context), which unblocks the 
graphics pipeline, so draws and command buffers finish but not correctly.
This can then cause a hard hang if the shader was supposed to signal work 
completion through a shader store instruction and a non-shader consumer is 
waiting for it (skipping the store instruction by killing the shader won't 
signal the work, and thus the consumer will be stuck, requiring a hard reset).

The hard reset can recover from other hangs, which is great, but it may use a 
PCI reset, which erases VRAM on dGPUs. APUs don't lose memory contents, but we 
should assume that any process that had running jobs on the GPU during a GPU 
reset has its memory resources in an inconsistent state, and thus following 
command buffers can cause another GPU hang. The shader store example above is 
enough to cause another hard hang due to incorrect content in memory resources, 
which can contain synchronization primitives that are used internally by the 
hardware.

Asking the driver to replay a command buffer that caused a hang is a sure way 
to hang it again. Unrelated processes can be affected due to lost VRAM or the 
misfortune of using the GPU while the GPU hang occurred. The window system 
should recreate GPU resources and redraw everything without affecting 
applications. If apps use GL, they should do the same. Processes that can't 
recover by redrawing content can be terminated or left alone, but they 
shouldn't be allowed to submit work to the GPU anymore.
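
As a sketch of the client side of that recovery (assuming GL_ARB_robustness /
GL_KHR_robustness; recreate_context_and_resources() is a hypothetical
application helper, not something from this thread):

#include <GL/gl.h>
#include <GL/glext.h>

/* Assumed to be resolved via eglGetProcAddress()/glXGetProcAddress(). */
extern PFNGLGETGRAPHICSRESETSTATUSARBPROC glGetGraphicsResetStatusARB;

/* Hypothetical app helper: tear down the dead context, create a fresh
 * robust context and re-upload all textures, buffers and shaders. */
void recreate_context_and_resources(void);

/* Called once per frame: GL_NO_ERROR means the context is still alive;
 * any *_CONTEXT_RESET status means it was lost (guilty or innocent) and
 * must not be used for further rendering. */
static void check_for_gpu_reset(void)
{
        if (glGetGraphicsResetStatusARB() != GL_NO_ERROR)
                recreate_context_and_resources();
}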

dEQP only exercises the soft reset. I think WebGL is only able to trigger a 
soft reset at this point, but Vulkan can also trigger a hard reset.

Marek

From: Koenig, Christian
Sent: March 23, 2022 11:25
To: Daniel Vetter; Daniel Stone; Olsak, Marek; Grodzovsky, Andrey
Cc: Rob Clark; Rob Clark; Sharma, Shashank; Christian König; Somalapuram, Amaranath; Abhinav Kumar; dri-devel; amd-gfx list; Deucher, Alexander; Shashank Sharma
Subject: Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

[Adding Marek and Andrey as well]

Am 23.03.22 um 16:14 schrieb Daniel Vetter:
> On Wed, 23 Mar 2022 at 15:07, Daniel Stone  wrote:
>> Hi,
>>
>> On Mon, 21 Mar 2022 at 16:02, Rob Clark  wrote:
>>> On Mon, Mar 21, 2022 at 2:30 AM Christian König
>>>  wrote:
>>>> Well you can, it just means that their contexts are lost as well.
>>> Which is rather inconvenient when deqp-egl reset tests, for example,
>>> take down your compositor ;-)
>> Yeah. Or anything WebGL.
>>
>> System-wide collateral damage is definitely a non-starter. If that
>> means that the userspace driver has to do what iris does and ensure
>> everything's recreated and resubmitted, that works too, just as long
>> as the response to 'my adblocker didn't detect a crypto miner ad'  is
>> something better than 'shoot the entire user session'.
> Not sure where that idea came from, I thought at least I made it clear
> that legacy gl _has_ to recover. It's only vk and arb_robustness gl
> which should die without recovery attempt.
>
> The entire discussion here is who should be responsible for replay and
> at least if you can decide the uapi, then punting that entirely to
> userspace is a good approach.

Yes, completely agree. We have the approach of re-submitting things in
the kernel and that failed quite miserably.

In other words currently a GPU reset has something like a 99% chance to
take down your whole desktop.

Daniel can you briefly explain what exactly iris does when a lost
context is detected without gl robustness?

It sounds like you guys got that working quite well.

Thanks,
Christian.

>
> Ofc it'd be nice if the collateral damage is limited, i.e. requests
> not currently on the gpu, or on different engines and all that
> shouldn't be nuked, if possible.
>
> Also ofc since msm uapi is that the kernel tries to recover there's
> not much we can do there, contexts cannot be shot. But still trying to
> replay them as much as possible feels a bit like overkill.
> -Daniel
>
>> Cheers,
>> Daniel
>
>



Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-23 Thread Rob Clark
On Wed, Mar 23, 2022 at 8:14 AM Daniel Vetter  wrote:
>
> On Wed, 23 Mar 2022 at 15:07, Daniel Stone  wrote:
> >
> > Hi,
> >
> > On Mon, 21 Mar 2022 at 16:02, Rob Clark  wrote:
> > > On Mon, Mar 21, 2022 at 2:30 AM Christian König
> > >  wrote:
> > > > Well you can, it just means that their contexts are lost as well.
> > >
> > > Which is rather inconvenient when deqp-egl reset tests, for example,
> > > take down your compositor ;-)
> >
> > Yeah. Or anything WebGL.
> >
> > System-wide collateral damage is definitely a non-starter. If that
> > means that the userspace driver has to do what iris does and ensure
> > everything's recreated and resubmitted, that works too, just as long
> > as the response to 'my adblocker didn't detect a crypto miner ad'  is
> > something better than 'shoot the entire user session'.
>
> Not sure where that idea came from, I thought at least I made it clear
> that legacy gl _has_ to recover. It's only vk and arb_robustness gl
> which should die without recovery attempt.
>
> The entire discussion here is who should be responsible for replay and
> at least if you can decide the uapi, then punting that entirely to
> userspace is a good approach.
>
> Ofc it'd be nice if the collateral damage is limited, i.e. requests
> not currently on the gpu, or on different engines and all that
> shouldn't be nuked, if possible.
>
> Also ofc since msm uapi is that the kernel tries to recover there's
> not much we can do there, contexts cannot be shot. But still trying to
> replay them as much as possible feels a bit like overkill.

It would perhaps have been nice for older gens, which don't (yet) have
per-process pgtables, to have gone with userspace replays (although
that would require a lot more tracking in userspace than what is done
currently).. but fortunately those older gens don't use "state
objects" which could potentially be corrupted; instead they re-emit
state in the cmdstream, so there is a lot less possibility for bad
collateral damage.  (On all the gens we also use gpu read-only buffers
whenever the gpu does not need to be able to write them.)

For newer stuff, the process isolation works pretty well.  In fact we
recently changed MSM_PARAM_FAULTS to only report faults/hangs in the
same address space, so the compositor is not even aware (and doesn't
need to be aware).
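
For reference, a minimal sketch of how userspace can read that
per-address-space fault count through the msm uapi (error handling trimmed):

#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/msm_drm.h>

/* A change in the value between two submissions tells the userspace driver
 * that one of its own jobs faulted; other processes are never involved. */
static int get_fault_count(int fd, uint64_t *faults)
{
        struct drm_msm_param req = {
                .pipe  = MSM_PIPE_3D0,
                .param = MSM_PARAM_FAULTS,
        };
        int ret = ioctl(fd, DRM_IOCTL_MSM_GET_PARAM, &req);

        if (ret == 0)
                *faults = req.value;
        return ret;
}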

BR,
-R

> -Daniel
>
> > Cheers,
> > Daniel
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-23 Thread Christian König

[Adding Marek and Andrey as well]

Am 23.03.22 um 16:14 schrieb Daniel Vetter:

On Wed, 23 Mar 2022 at 15:07, Daniel Stone  wrote:

Hi,

On Mon, 21 Mar 2022 at 16:02, Rob Clark  wrote:

On Mon, Mar 21, 2022 at 2:30 AM Christian König
 wrote:

Well you can, it just means that their contexts are lost as well.

Which is rather inconvenient when deqp-egl reset tests, for example,
take down your compositor ;-)

Yeah. Or anything WebGL.

System-wide collateral damage is definitely a non-starter. If that
means that the userspace driver has to do what iris does and ensure
everything's recreated and resubmitted, that works too, just as long
as the response to 'my adblocker didn't detect a crypto miner ad'  is
something better than 'shoot the entire user session'.

Not sure where that idea came from, I thought at least I made it clear
that legacy gl _has_ to recover. It's only vk and arb_robustness gl
which should die without recovery attempt.

The entire discussion here is who should be responsible for replay and
at least if you can decide the uapi, then punting that entirely to
userspace is a good approach.


Yes, completely agree. We have the approach of re-submitting things in 
the kernel and that failed quite miserably.


In other words currently a GPU reset has something like a 99% chance to 
take down your whole desktop.


Daniel can you briefly explain what exactly iris does when a lost 
context is detected without gl robustness?


It sounds like you guys got that working quite well.

Thanks,
Christian.



Ofc it'd be nice if the collateral damage is limited, i.e. requests
not currently on the gpu, or on different engines and all that
shouldn't be nuked, if possible.

Also ofc since msm uapi is that the kernel tries to recover there's
not much we can do there, contexts cannot be shot. But still trying to
replay them as much as possible feels a bit like overkill.
-Daniel


Cheers,
Daniel







Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-23 Thread Daniel Vetter
On Wed, 23 Mar 2022 at 15:07, Daniel Stone  wrote:
>
> Hi,
>
> On Mon, 21 Mar 2022 at 16:02, Rob Clark  wrote:
> > On Mon, Mar 21, 2022 at 2:30 AM Christian König
> >  wrote:
> > > Well you can, it just means that their contexts are lost as well.
> >
> > Which is rather inconvenient when deqp-egl reset tests, for example,
> > take down your compositor ;-)
>
> Yeah. Or anything WebGL.
>
> System-wide collateral damage is definitely a non-starter. If that
> means that the userspace driver has to do what iris does and ensure
> everything's recreated and resubmitted, that works too, just as long
> as the response to 'my adblocker didn't detect a crypto miner ad'  is
> something better than 'shoot the entire user session'.

Not sure where that idea came from, I thought at least I made it clear
that legacy gl _has_ to recover. It's only vk and arb_robustness gl
which should die without recovery attempt.

The entire discussion here is who should be responsible for replay and
at least if you can decide the uapi, then punting that entirely to
userspace is a good approach.

Ofc it'd be nice if the collateral damage is limited, i.e. requests
not currently on the gpu, or on different engines and all that
shouldn't be nuked, if possible.

Also ofc since msm uapi is that the kernel tries to recover there's
not much we can do there, contexts cannot be shot. But still trying to
replay them as much as possible feels a bit like overkill.
-Daniel

> Cheers,
> Daniel



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-23 Thread Daniel Stone
Hi,

On Mon, 21 Mar 2022 at 16:02, Rob Clark  wrote:
> On Mon, Mar 21, 2022 at 2:30 AM Christian König
>  wrote:
> > Well you can, it just means that their contexts are lost as well.
>
> Which is rather inconvenient when deqp-egl reset tests, for example,
> take down your compositor ;-)

Yeah. Or anything WebGL.

System-wide collateral damage is definitely a non-starter. If that
means that the userspace driver has to do what iris does and ensure
everything's recreated and resubmitted, that works too, just as long
as the response to 'my adblocker didn't detect a crypto miner ad'  is
something better than 'shoot the entire user session'.

Cheers,
Daniel


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-21 Thread Rob Clark
On Mon, Mar 21, 2022 at 2:30 AM Christian König
 wrote:
>
> Am 18.03.22 um 16:12 schrieb Rob Clark:
> > On Fri, Mar 18, 2022 at 12:42 AM Christian König
> >  wrote:
> >> Am 17.03.22 um 18:31 schrieb Rob Clark:
> >>> On Thu, Mar 17, 2022 at 10:27 AM Daniel Vetter  wrote:
>  [SNIP]
> > (At some point, I'd like to use scheduler for the replay, and actually
> > use drm_sched_stop()/etc.. but last time I looked there were still
> > some sched bugs in that area which prevented me from deleting a bunch
> > of code ;-))
>  Not sure about your hw, but at least on intel replaying tends to just
>  result in follow-on fun. And that holds even more so the more complex a
>  workload is. This is why vk just dies immediately and does not try to
>  replay anything, offloading it to the app. Same with arb robustness.
>  Afaik it's really only media and classic gl which insist that the driver
>  stack somehow recover.
> >>> At least for us, each submit must be self-contained (ie. not rely on
> >>> previous GPU hw state), so in practice replay works out pretty well.
> >>> The worst case is subsequent submits from same process fail as well
> >>> (if they depended on something that crashing submit failed to write
> >>> back to memory.. but in that case they just crash as well and we move
> >>> on to the next one.. the recent gens (a5xx+ at least) are pretty good
> >>> about quickly detecting problems and giving us an error irq.
> >> Well I absolutely agree with Daniel.
> >>
> >> The whole replay thing AMD did in the scheduler is an absolute mess
> >> and should probably be killed with fire.
> >>
> >> I strongly recommend not to make the same mistake in other drivers.
> >>
> >> If you want to have some replay feature then please make it driver
> >> specific and don't use anything from the infrastructure in the DRM
> >> scheduler.
> > hmm, perhaps I was not clear, but I'm only talking about re-emitting
> > jobs *following* the faulting one (which could be from other contexts,
> > etc).. not trying to restart the faulting job.
> >
> > You *absolutely* need to replay jobs following the faulting one, they
> > could be from unrelated contexts/processes.  You can't just drop them
> > on the floor.
>
> Well you can, it just means that their contexts are lost as well.

Which is rather inconvenient when deqp-egl reset tests, for example,
take down your compositor ;-)

(Which for even more lolz, in CrOS restarts the android container or
vm.. which makes running android-cts deqp kinda funny)

> If you re-submit jobs which were already pushed to the hardware you
> absolutely need to make a couple of things sure:
>
> 1. Don't race with your hardware. E.g. you need a way to stop processing
> in case of a timeout and then double check once more if things haven't
> finished in the meantime.
>
> 2. Make absolutely sure you never re-submit an operation when its
> dma-fence is already signaled. Otherwise you run into memory corruption.
>
> 3. When you have multiple engines it becomes really tricky because then
> even innocent jobs might have already been started on different queues
> which now hang.

We force power-off/on the GPU to reset it which is a pretty good way
to make sure we aren't racing with the GPU.

It's worked like this since pretty much the beginning, and in the
early days of bringing up mesa support for a new gen we tend to
exercise the gpu hang/recovery path quite a lot.. so it at least seems
pretty robust ;-)

BR,
-R

>
> > Currently it is all driver specific, but I wanted to delete a lot of
> > code and move to using scheduler to handle faults/timeouts (but
> > blocked on that until [1] is resolved)
>
> Please don't.
>
> Especially don't use the pending_list or any of the scheduler
> infrastructure for GPU reset. We need to get rid of that again sooner or
> later.
>
> This is extremely hardware dependent and pushing the amdgpu specific
> handling into the GPU scheduler was a mistake we shouldn't repeat for
> other drivers.
>
> Regards,
> Christian.
>
> >
> > [1] 
> > https://patchwork.kernel.org/project/dri-devel/patch/1630457207-13107-2-git-send-email-Monk.Liu@amd.com/
> >
> > BR,
> > -R
> >
> >> Thanks,
> >> Christian.
> >>
> >>> BR,
> >>> -R
> >>>
>  And recovering from a mess in userspace is a lot simpler than trying to
>  pull off the same magic in the kernel. Plus it also helps with a few of
>  the
>  dma_fence rules, which is a nice bonus.
>  -Daniel
> 
>


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-21 Thread Daniel Vetter
On Fri, Mar 18, 2022 at 08:12:54AM -0700, Rob Clark wrote:
> On Fri, Mar 18, 2022 at 12:42 AM Christian König
>  wrote:
> >
> > Am 17.03.22 um 18:31 schrieb Rob Clark:
> > > On Thu, Mar 17, 2022 at 10:27 AM Daniel Vetter  wrote:
> > >> [SNIP]
> > >>> (At some point, I'd like to use scheduler for the replay, and actually
> > >>> use drm_sched_stop()/etc.. but last time I looked there were still
> > >>> some sched bugs in that area which prevented me from deleting a bunch
> > >>> of code ;-))
> > >> Not sure about your hw, but at least on intel replaying tends to just
> > >> result in follow-on fun. And that holds even more so the more complex a
> > >> workload is. This is why vk just dies immediately and does not try to
> > >> replay anything, offloading it to the app. Same with arb robustness.
> > >> Afaik it's really only media and classic gl which insist that the driver
> > >> stack somehow recover.
> > > At least for us, each submit must be self-contained (ie. not rely on
> > > previous GPU hw state), so in practice replay works out pretty well.
> > > The worst case is subsequent submits from same process fail as well
> > > (if they depended on something that crashing submit failed to write
> > > back to memory.. but in that case they just crash as well and we move
> > > on to the next one.. the recent gens (a5xx+ at least) are pretty good
> > > about quickly detecting problems and giving us an error irq.
> >
> > Well I absolutely agree with Daniel.
> >
> > The whole replay thing AMD did in the scheduler is an absolute mess
> > and should probably be killed with fire.
> >
> > I strongly recommend not to make the same mistake in other drivers.
> >
> > If you want to have some replay feature then please make it driver
> > specific and don't use anything from the infrastructure in the DRM
> > scheduler.
> 
> hmm, perhaps I was not clear, but I'm only talking about re-emitting
> jobs *following* the faulting one (which could be from other contexts,
> etc).. not trying to restart the faulting job.

You absolutely can drop jobs on the floor, this is what both anv and iris
expect. They use what we call non-recoverable context, meaning when any
gpu hang happens and the context is affected (whether as the guilty one, or
because it was a multi-engine reset and it was victimized) we kill it
entirely. No replaying, and any further execbuf ioctl fails with -EIO.

Userspace then gets to sort out the mess, which for vk is
VK_ERROR_DEVICE_LOST, for robust gl it's the same, and for non-robust gl
iris re-creates a pile of things.

Anything in-between _is_ dropped on the floor completely.
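
For the Vulkan side that contract boils down to something like the following
sketch (recreate_device() is a hypothetical application helper, not part of
any driver):

#include <vulkan/vulkan.h>

/* Hypothetical app helper: destroy the lost VkDevice and rebuild device,
 * pipelines, resources and swapchain from scratch. */
void recreate_device(void);

static void submit_and_check(VkQueue queue, const VkSubmitInfo *info,
                             VkFence fence)
{
        VkResult res = vkQueueSubmit(queue, 1, info, fence);

        if (res == VK_ERROR_DEVICE_LOST) {
                /* Nothing in the stack replays anything; the only valid
                 * reaction is to throw the device away and start over. */
                recreate_device();
        }
}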

Also note that this is obviously uapi; if you have a userspace which
expects contexts to survive, then replaying makes some sense.

> You *absolutely* need to replay jobs following the faulting one, they
> could be from unrelated contexts/processes.  You can't just drop them
> on the floor.
> 
> Currently it is all driver specific, but I wanted to delete a lot of
> code and move to using scheduler to handle faults/timeouts (but
> blocked on that until [1] is resolved)

Yeah for the drivers where the uapi is "you can safely replay after a
hang, and you're supposed to", then sharing the code is ofc a good idea.

Just wanted to make it clear that this is only one of many uapi flavours
you can pick from, dropping it all on the floor is a perfectly legit
approach :-) And imo it's the more robust one, and also better fits with
latest apis like gl_arb_robustness or vk.

Cheers, Daniel


> 
> [1] 
> https://patchwork.kernel.org/project/dri-devel/patch/1630457207-13107-2-git-send-email-monk@amd.com/
> 
> BR,
> -R
> 
> > Thanks,
> > Christian.
> >
> > >
> > > BR,
> > > -R
> > >
> > >> And recovering from a mess in userspace is a lot simpler than trying to
> > >> pull off the same magic in the kernel. Plus it also helps with a few of
> > >> the
> > >> dma_fence rules, which is a nice bonus.
> > >> -Daniel
> > >>
> >

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-21 Thread Christian König

Am 18.03.22 um 16:12 schrieb Rob Clark:

On Fri, Mar 18, 2022 at 12:42 AM Christian König
 wrote:

Am 17.03.22 um 18:31 schrieb Rob Clark:

On Thu, Mar 17, 2022 at 10:27 AM Daniel Vetter  wrote:

[SNIP]

(At some point, I'd like to use scheduler for the replay, and actually
use drm_sched_stop()/etc.. but last time I looked there were still
some sched bugs in that area which prevented me from deleting a bunch
of code ;-))

Not sure about your hw, but at least on intel replaying tends to just
result in follow-on fun. And that holds even more so the more complex a
workload is. This is why vk just dies immediately and does not try to
replay anything, offloading it to the app. Same with arb robustness.
Afaik it's really only media and classic gl which insist that the driver
stack somehow recover.

At least for us, each submit must be self-contained (ie. not rely on
previous GPU hw state), so in practice replay works out pretty well.
The worst case is subsequent submits from same process fail as well
(if they depended on something that crashing submit failed to write
back to memory.. but in that case they just crash as well and we move
on to the next one.. the recent gens (a5xx+ at least) are pretty good
about quickly detecting problems and giving us an error irq.

Well I absolutely agree with Daniel.

The whole replay thing AMD did in the scheduler is an absolute mess
and should probably be killed with fire.

I strongly recommend not to make the same mistake in other drivers.

If you want to have some replay feature then please make it driver
specific and don't use anything from the infrastructure in the DRM
scheduler.

hmm, perhaps I was not clear, but I'm only talking about re-emitting
jobs *following* the faulting one (which could be from other contexts,
etc).. not trying to restart the faulting job.

You *absolutely* need to replay jobs following the faulting one, they
could be from unrelated contexts/processes.  You can't just drop them
on the floor.


Well you can, it just means that their contexts are lost as well.

If you re-submit jobs which were already pushed to the hardware you 
absolutely need to make a couple of things sure:


1. Don't race with your hardware. E.g. you need a way to stop processing 
in case of a timeout and then double check once more if things haven't 
finished in the meantime.


2. Make absolutely sure you never re-submit an operation when its
dma-fence is already signaled. Otherwise you run into memory corruption.


3. When you have multiple engines it becomes really tricky because then 
even innocent jobs might have already been started on different queues 
which now hang.
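
A kernel-side sketch of rule 2 (hypothetical my_job/emit types, not amdgpu or
drm/sched code): before re-emitting anything that was already pushed to the
hardware, skip every job whose hardware fence signaled in the meantime.

#include <linux/dma-fence.h>
#include <linux/list.h>

struct my_job {
        struct list_head node;
        struct dma_fence *hw_fence;     /* fence of the original submission */
};

static void my_ring_resubmit(struct list_head *pending,
                             void (*emit)(struct my_job *job))
{
        struct my_job *job;

        list_for_each_entry(job, pending, node) {
                /* A signaled fence means this job already completed before
                 * the reset; running it again would corrupt memory. */
                if (dma_fence_is_signaled(job->hw_fence))
                        continue;

                emit(job);
        }
}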



Currently it is all driver specific, but I wanted to delete a lot of
code and move to using scheduler to handle faults/timeouts (but
blocked on that until [1] is resolved)


Please don't.

Especially don't use the pending_list or any of the scheduler 
infrastructure for GPU reset. We need to get rid of that again sooner or 
later.


This is extremely hardware dependent and pushing the amdgpu specific 
handling into the GPU scheduler was a mistake we shouldn't repeat for 
other drivers.


Regards,
Christian.



[1] 
https://patchwork.kernel.org/project/dri-devel/patch/1630457207-13107-2-git-send-email-Monk.Liu@amd.com/

BR,
-R


Thanks,
Christian.


BR,
-R


And recovering from a mess in userspace is a lot simpler than trying to
pull off the same magic in the kernel. Plus it also helps with a few of the
dma_fence rules, which is a nice bonus.
-Daniel





Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-18 Thread Rob Clark
On Fri, Mar 18, 2022 at 12:42 AM Christian König
 wrote:
>
> Am 17.03.22 um 18:31 schrieb Rob Clark:
> > On Thu, Mar 17, 2022 at 10:27 AM Daniel Vetter  wrote:
> >> [SNIP]
> >>> (At some point, I'd like to use scheduler for the replay, and actually
> >>> use drm_sched_stop()/etc.. but last time I looked there were still
> >>> some sched bugs in that area which prevented me from deleting a bunch
> >>> of code ;-))
> >> Not sure about your hw, but at least on intel replaying tends to just
> >> result in follow-on fun. And that holds even more so the more complex a
> >> workload is. This is why vk just dies immediately and does not try to
> > >> replay anything, offloading it to the app. Same with arb robustness.
> >> Afaik it's really only media and classic gl which insist that the driver
> >> stack somehow recover.
> > At least for us, each submit must be self-contained (ie. not rely on
> > previous GPU hw state), so in practice replay works out pretty well.
> > The worst case is subsequent submits from same process fail as well
> > (if they depended on something that crashing submit failed to write
> > back to memory.. but in that case they just crash as well and we move
> > on to the next one.. the recent gens (a5xx+ at least) are pretty good
> > about quickly detecting problems and giving us an error irq.
>
> Well I absolutely agree with Daniel.
>
> The whole replay thing AMD did in the scheduler is an absolute mess
> and should probably be killed with fire.
>
> I strongly recommend not to make the same mistake in other drivers.
>
> If you want to have some replay feature then please make it driver
> specific and don't use anything from the infrastructure in the DRM
> scheduler.

hmm, perhaps I was not clear, but I'm only talking about re-emitting
jobs *following* the faulting one (which could be from other contexts,
etc).. not trying to restart the faulting job.

You *absolutely* need to replay jobs following the faulting one, they
could be from unrelated contexts/processes.  You can't just drop them
on the floor.

Currently it is all driver specific, but I wanted to delete a lot of
code and move to using scheduler to handle faults/timeouts (but
blocked on that until [1] is resolved)

[1] 
https://patchwork.kernel.org/project/dri-devel/patch/1630457207-13107-2-git-send-email-monk@amd.com/

BR,
-R

> Thanks,
> Christian.
>
> >
> > BR,
> > -R
> >
> >> And recovering from a mess in userspace is a lot simpler than trying to
> > >> pull off the same magic in the kernel. Plus it also helps with a few of the
> >> dma_fence rules, which is a nice bonus.
> >> -Daniel
> >>
>


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-18 Thread Christian König

Am 17.03.22 um 18:31 schrieb Rob Clark:

On Thu, Mar 17, 2022 at 10:27 AM Daniel Vetter  wrote:

[SNIP]

(At some point, I'd like to use scheduler for the replay, and actually
use drm_sched_stop()/etc.. but last time I looked there were still
some sched bugs in that area which prevented me from deleting a bunch
of code ;-))

Not sure about your hw, but at least on intel replaying tends to just
result in follow-on fun. And that holds even more so the more complex a
workload is. This is why vk just dies immediately and does not try to
replay anything, offloading it to the app. Same with arb robustness.
Afaik it's really only media and classic gl which insist that the driver
stack somehow recover.

At least for us, each submit must be self-contained (ie. not rely on
previous GPU hw state), so in practice replay works out pretty well.
The worst case is subsequent submits from same process fail as well
(if they depended on something that crashing submit failed to write
back to memory.. but in that case they just crash as well and we move
on to the next one.. the recent gens (a5xx+ at least) are pretty good
about quickly detecting problems and giving us an error irq.


Well I absolutely agree with Daniel.

The whole replay thing AMD did in the scheduler is an absolute mess
and should probably be killed with fire.


I strongly recommend not to make the same mistake in other drivers.

If you want to have some replay feature then please make it driver 
specific and don't use anything from the infrastructure in the DRM 
scheduler.


Thanks,
Christian.



BR,
-R


And recovering from a mess in userspace is a lot simpler than trying to
pull off the same magic in the kernel. Plus it also helps with a few of the
dma_fence rules, which is a nice bonus.
-Daniel





Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-17 Thread Rob Clark
On Thu, Mar 17, 2022 at 10:27 AM Daniel Vetter  wrote:
>
> On Thu, Mar 17, 2022 at 08:40:51AM -0700, Rob Clark wrote:
> > On Thu, Mar 17, 2022 at 2:29 AM Daniel Vetter  wrote:
> > >
> > > On Thu, Mar 17, 2022 at 08:03:27AM +0100, Christian König wrote:
> > > > Am 16.03.22 um 16:36 schrieb Rob Clark:
> > > > > [SNIP]
> > > > > just one point of clarification.. in the msm and i915 case it is
> > > > > purely for debugging and telemetry (ie. sending crash logs back to
> > > > > distro for analysis if user has crash reporting enabled).. it isn't
> > > > > used for triggering any action like killing app or compositor.
> > > >
> > > > By the way, how does msm do its memory management for the devcoredumps?
> > >
> > > GFP_NORECLAIM all the way. It's purely best effort.
> > >
> > > Note that the fancy new plan for i915 discrete gpu is to only support gpu
> > > crash dumps on non-recoverable gpu contexts, i.e. those that do not
> > > continue to the next batch when something bad happens. This is what vk
> > > wants and also what iris now uses (we do context recovery in userspace in
> > > all cases), and non-recoverable contexts greatly simplify the crash dump
> > > gather: Only thing you need to gather is the register state from hw
> > > (before you reset it), all the batchbuffer bo and indirect state bo (in
> > > i915 you can mark which bo to capture in the CS ioctl) can be captured in
> > > a worker later on. Which for non-recoverable context is no issue, since
> > > subsequent batchbuffers won't trample over any of these things.
> >
> > fwiw, we snapshot everything (cmdstream and bo's marked with dump
> > flag, in addition to hw state) before resuming the GPU, so there is no
> > danger of things being trampled.  After state is captured and GPU
> > reset, we "replay" the submits that were written into the ringbuffer
> > after the faulting submit.  GPU crashes should be a thing you don't
> > need to try to optimize.
>
> Not sure why you think we optimize anything here?
>
> > (At some point, I'd like to use scheduler for the replay, and actually
> > use drm_sched_stop()/etc.. but last time I looked there were still
> > some sched bugs in that area which prevented me from deleting a bunch
> > of code ;-))
>
> Not sure about your hw, but at least on intel replaying tends to just
> result in follow-on fun. And that holds even more so the more complex a
> workload is. This is why vk just dies immediately and does not try to
> replay anything, offloading it to the app. Same with arb robustness.
> Afaik it's really only media and classic gl which insist that the driver
> stack somehow recover.

At least for us, each submit must be self-contained (ie. not rely on
previous GPU hw state), so in practice replay works out pretty well.
The worst case is subsequent submits from same process fail as well
(if they depended on something that crashing submit failed to write
back to memory.. but in that case they just crash as well and we move
on to the next one.. the recent gens (a5xx+ at least) are pretty good
about quickly detecting problems and giving us an error irq.

BR,
-R

> And recovering from a mess in userspace is a lot simpler than trying to
> pull off the same magic in the kernel. Plus it also helps with a few of the
> dma_fence rules, which is a nice bonus.
> -Daniel
>
> >
> > BR,
> > -R
> >
> > >
> > > And that way you can record the crashdump (or at least the big pieces like
> > > all the indirect state stuff) with GFP_KERNEL.
> > >
> > > msm probably gets it wrong since embedded drivers have much less shrinker
> > > and generally no mmu notifiers going on :-)
> > >
> > > > I mean it is strictly forbidden to allocate any memory in the GPU reset
> > > > path.
> > > >
> > > > > I would however *strongly* recommend devcoredump support in other GPU
> > > > > drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
> > > > > to debug and fix a couple obscure issues that I was not able to
> > > > > reproduce by myself.
> > > >
> > > > Yes, completely agree as well.
> > >
> > > +1
> > >
> > > Cheers, Daniel
> > > --
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-17 Thread Daniel Vetter
On Thu, Mar 17, 2022 at 08:40:51AM -0700, Rob Clark wrote:
> On Thu, Mar 17, 2022 at 2:29 AM Daniel Vetter  wrote:
> >
> > On Thu, Mar 17, 2022 at 08:03:27AM +0100, Christian König wrote:
> > > Am 16.03.22 um 16:36 schrieb Rob Clark:
> > > > [SNIP]
> > > > just one point of clarification.. in the msm and i915 case it is
> > > > purely for debugging and telemetry (ie. sending crash logs back to
> > > > distro for analysis if user has crash reporting enabled).. it isn't
> > > > used for triggering any action like killing app or compositor.
> > >
> > > By the way, how does msm do its memory management for the devcoredumps?
> >
> > GFP_NORECLAIM all the way. It's purely best effort.
> >
> > Note that the fancy new plan for i915 discrete gpu is to only support gpu
> > crash dumps on non-recoverable gpu contexts, i.e. those that do not
> > continue to the next batch when something bad happens. This is what vk
> > wants and also what iris now uses (we do context recovery in userspace in
> > all cases), and non-recoverable contexts greatly simplify the crash dump
> > gather: Only thing you need to gather is the register state from hw
> > (before you reset it), all the batchbuffer bo and indirect state bo (in
> > i915 you can mark which bo to capture in the CS ioctl) can be captured in
> > a worker later on. Which for non-recoverable context is no issue, since
> > subsequent batchbuffers won't trample over any of these things.
> 
> fwiw, we snapshot everything (cmdstream and bo's marked with dump
> flag, in addition to hw state) before resuming the GPU, so there is no
> danger of things being trampled.  After state is captured and GPU
> reset, we "replay" the submits that were written into the ringbuffer
> after the faulting submit.  GPU crashes should be a thing you don't
> need to try to optimize.

Not sure why you think we optimize anything here?

> (At some point, I'd like to use scheduler for the replay, and actually
> use drm_sched_stop()/etc.. but last time I looked there were still
> some sched bugs in that area which prevented me from deleting a bunch
> of code ;-))

Not sure about your hw, but at least on intel replaying tends to just
result in follow-on fun. And that holds even more so the more complex a
workload is. This is why vk just dies immediately and does not try to
replay anything, offloading it to the app. Same with arb robustness.
Afaik it's really only media and classic gl which insist that the driver
stack somehow recover.

And recovering from a mess in userspace is a lot simpler than trying to
pull off the same magic in the kernel. Plus it also helps with a few of the
dma_fence rules, which is a nice bonus.
-Daniel

> 
> BR,
> -R
> 
> >
> > And that way you can record the crashdump (or at least the big pieces like
> > all the indirect state stuff) with GFP_KERNEL.
> >
> > msm probably gets it wrong since embedded drivers have much less shrinker
> > and generally no mmu notifiers going on :-)
> >
> > > I mean it is strictly forbidden to allocate any memory in the GPU reset
> > > path.
> > >
> > > > I would however *strongly* recommend devcoredump support in other GPU
> > > > drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
> > > > to debug and fix a couple obscure issues that I was not able to
> > > > reproduce by myself.
> > >
> > > Yes, completely agree as well.
> >
> > +1
> >
> > Cheers, Daniel
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-17 Thread Daniel Vetter
On Thu, Mar 17, 2022 at 08:34:21AM -0700, Rob Clark wrote:
> On Thu, Mar 17, 2022 at 2:29 AM Daniel Vetter  wrote:
> >
> > On Thu, Mar 17, 2022 at 08:03:27AM +0100, Christian König wrote:
> > > Am 16.03.22 um 16:36 schrieb Rob Clark:
> > > > [SNIP]
> > > > just one point of clarification.. in the msm and i915 case it is
> > > > purely for debugging and telemetry (ie. sending crash logs back to
> > > > distro for analysis if user has crash reporting enabled).. it isn't
> > > > used for triggering any action like killing app or compositor.
> > >
> > > By the way, how does msm do its memory management for the devcoredumps?
> >
> > GFP_NORECLAIM all the way. It's purely best effort.
> 
> We do one GEM obj allocation in the snapshot path (the hw has a
> mechanism to snapshot its own state into a gpu buffer.. not sure if
> nice debugging functionality like that is a commentary on the blob
> driver quality, but I'm not complaining)
> 
> I suppose we could pre-allocate this buffer up-front.. but it doesn't
> seem like a problem, ie. if allocation fails we just skip snapshotting
> stuff that needs the hw crashdumper.  I guess since vram is not
> involved, perhaps that makes the situation a bit more straightforward.

The problem is that you need to allocate with GFP_ATOMIC, instead of
GFP_KERNEL, or things go very bad.
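
A kernel-side sketch of what that means for a devcoredump taken from the reset
path (the my_* names are hypothetical): capture the state with GFP_ATOMIC and
hand it to dev_coredumpm(), which exposes it to userspace and later frees it
through the callback.

#include <linux/devcoredump.h>
#include <linux/fs.h>
#include <linux/module.h>
#include <linux/slab.h>

static ssize_t my_coredump_read(char *buffer, loff_t offset, size_t count,
                                void *data, size_t datalen)
{
        return memory_read_from_buffer(buffer, count, &offset, data, datalen);
}

static void my_coredump_free(void *data)
{
        kfree(data);
}

static void my_gpu_capture_crash_state(struct device *dev, size_t len)
{
        /* No GFP_KERNEL here: this runs in a path that has to make forward
         * progress to signal fences, so it must not wait on reclaim.  If the
         * allocation fails, the dump is simply skipped (best effort). */
        void *buf = kzalloc(len, GFP_ATOMIC);

        if (!buf)
                return;

        /* ... snapshot registers / ring contents into buf ... */

        dev_coredumpm(dev, THIS_MODULE, buf, len, GFP_ATOMIC,
                      my_coredump_read, my_coredump_free);
}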

The scheduler dma-fence annotations I've had (well still have them here)
would catch this stuff, but thus far they got nowhere.

> > Note that the fancy new plan for i915 discrete gpu is to only support gpu
> > crash dumps on non-recoverable gpu contexts, i.e. those that do not
> > continue to the next batch when something bad happens. This is what vk
> > wants and also what iris now uses (we do context recovery in userspace in
> > all cases), and non-recoverable contexts greatly simplify the crash dump
> > gather: Only thing you need to gather is the register state from hw
> > (before you reset it), all the batchbuffer bo and indirect state bo (in
> > i915 you can mark which bo to capture in the CS ioctl) can be captured in
> > a worker later on. Which for non-recoverable context is no issue, since
> > subsequent batchbuffers won't trample over any of these things.
> >
> > And that way you can record the crashdump (or at least the big pieces like
> > all the indirect state stuff) with GFP_KERNEL.
> >
> > msm probably gets it wrong since embedded drivers have much less shrinker
> > and generally no mmu notifiers going on :-)
> 
> Note that the bo's associated with the batch are still pinned at this
> point, from the bo lifecycle the batch is still active.  So from the
> point of view of shrinker, there should be no interaction.  We aren't
> doing anything with mmu notifiers (yet), so not entirely sure offhand
> the concern there.
> 
> Currently we just use GFP_KERNEL and bail if allocation fails.

Yeah you have a simple enough shrinker for this not to be a problem. The
issue is that sooner or later things tend to not stay like that, and we're
trying to have common rules for dma_fence to make sure everyone follows
the same rules.
-Daniel

> 
> BR,
> -R
> 
> > > I mean it is strictly forbidden to allocate any memory in the GPU reset
> > > path.
> > >
> > > > I would however *strongly* recommend devcoredump support in other GPU
> > > > drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
> > > > to debug and fix a couple obscure issues that I was not able to
> > > > reproduce by myself.
> > >
> > > Yes, completely agree as well.
> >
> > +1
> >
> > Cheers, Daniel
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-17 Thread Rob Clark
On Thu, Mar 17, 2022 at 2:29 AM Daniel Vetter  wrote:
>
> On Thu, Mar 17, 2022 at 08:03:27AM +0100, Christian König wrote:
> > Am 16.03.22 um 16:36 schrieb Rob Clark:
> > > [SNIP]
> > > just one point of clarification.. in the msm and i915 case it is
> > > purely for debugging and telemetry (ie. sending crash logs back to
> > > distro for analysis if user has crash reporting enabled).. it isn't
> > > used for triggering any action like killing app or compositor.
> >
> > By the way, how does msm do its memory management for the devcoredumps?
>
> GFP_NORECLAIM all the way. It's purely best effort.
>
> Note that the fancy new plan for i915 discrete gpu is to only support gpu
> crash dumps on non-recoverable gpu contexts, i.e. those that do not
> continue to the next batch when something bad happens. This is what vk
> wants and also what iris now uses (we do context recovery in userspace in
> all cases), and non-recoverable contexts greatly simplify the crash dump
> gather: Only thing you need to gather is the register state from hw
> (before you reset it), all the batchbuffer bo and indirect state bo (in
> i915 you can mark which bo to capture in the CS ioctl) can be captured in
> a worker later on. Which for non-recoverable context is no issue, since
> subsequent batchbuffers won't trample over any of these things.

fwiw, we snapshot everything (cmdstream and bo's marked with dump
flag, in addition to hw state) before resuming the GPU, so there is no
danger of things being trampled.  After state is captured and GPU
reset, we "replay" the submits that were written into the ringbuffer
after the faulting submit.  GPU crashes should be a thing you don't
need to try to optimize.

(At some point, I'd like to use scheduler for the replay, and actually
use drm_sched_stop()/etc.. but last time I looked there were still
some sched bugs in that area which prevented me from deleting a bunch
of code ;-))

BR,
-R

>
> And that way you can record the crashdump (or at least the big pieces like
> all the indirect state stuff) with GFP_KERNEL.
>
> msm probably gets it wrong since embedded drivers have much less shrinker
> and generally no mmu notifiers going on :-)
>
> > I mean it is strictly forbidden to allocate any memory in the GPU reset
> > path.
> >
> > > I would however *strongly* recommend devcoredump support in other GPU
> > > drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
> > > to debug and fix a couple obscure issues that I was not able to
> > > reproduce by myself.
> >
> > Yes, completely agree as well.
>
> +1
>
> Cheers, Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-17 Thread Rob Clark
On Thu, Mar 17, 2022 at 2:29 AM Daniel Vetter  wrote:
>
> On Thu, Mar 17, 2022 at 08:03:27AM +0100, Christian König wrote:
> > Am 16.03.22 um 16:36 schrieb Rob Clark:
> > > [SNIP]
> > > just one point of clarification.. in the msm and i915 case it is
> > > purely for debugging and telemetry (ie. sending crash logs back to
> > > distro for analysis if user has crash reporting enabled).. it isn't
> > > used for triggering any action like killing app or compositor.
> >
> > By the way, how does msm handle its memory management for the devcoredumps?
>
> GFP_NORECLAIM all the way. It's purely best effort.

We do one GEM obj allocation in the snapshot path (the hw has a
mechanism to snapshot its own state into a gpu buffer.. not sure if
nice debugging functionality like that is a commentary on the blob
driver quality, but I'm not complaining)

I suppose we could pre-allocate this buffer up-front.. but it doesn't
seem like a problem, ie. if allocation fails we just skip snapshotting
stuff that needs the hw crashdumper.  I guess since vram is not
involved, perhaps that makes the situation a bit more straightforward.
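
Something like this, roughly (illustrative only; the helper and field names
are made up, the point is just "allocate, and skip that part on failure"):

/* GEM BO for the hw crashdumper; may fail under memory pressure */
state->hw_dump_bo = my_gem_bo_new(gpu, HW_CRASHDUMP_SIZE);
if (state->hw_dump_bo)
	my_hw_crashdumper_run(gpu, state->hw_dump_bo);
/* on failure we simply skip the hw-assisted part of the snapshot */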

> Note that the fancy new plan for i915 discrete gpu is to only support gpu
> crash dumps on non-recoverable gpu contexts, i.e. those that do not
> continue to the next batch when something bad happens. This is what vk
> wants and also what iris now uses (we do context recovery in userspace in
> all cases), and non-recoverable contexts greatly simplify the crash dump
> gather: Only thing you need to gather is the register state from hw
> (before you reset it), all the batchbuffer bo and indirect state bo (in
> i915 you can mark which bo to capture in the CS ioctl) can be captured in
> a worker later on. Which for non-recoverable context is no issue, since
> subsequent batchbuffers won't trample over any of these things.
>
> And that way you can record the crashdump (or at least the big pieces like
> all the indirect state stuff) with GFP_KERNEL.
>
> msm probably gets it wrong since embedded drivers have much less shrinker
> and generally no mmu notifiers going on :-)

Note that the bo's associated with the batch are still pinned at this
point; from the bo lifecycle's point of view the batch is still active.  So
from the point of view of the shrinker, there should be no interaction.  We
aren't doing anything with mmu notifiers (yet), so I'm not entirely sure
offhand what the concern is there.

Currently we just use GFP_KERNEL and bail if allocation fails.

BR,
-R

> > I mean it is strictly forbidden to allocate any memory in the GPU reset
> > path.
> >
> > > I would however *strongly* recommend devcoredump support in other GPU
> > > drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
> > > to debug and fix a couple obscure issues that I was not able to
> > > reproduce by myself.
> >
> > Yes, completely agree as well.
>
> +1
>
> Cheers, Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-17 Thread Daniel Stone
Hi,

On Thu, 17 Mar 2022 at 09:21, Christian König  wrote:
> Am 17.03.22 um 09:42 schrieb Sharma, Shashank:
> >> AFAIU you probably want to be passing around a `struct pid *`, and
> >> then somehow use pid_vnr() in the context of the process reading the
> >> event to get the numeric pid.  Otherwise things will not do what you
> >> expect if the process triggering the crash is in a different pid
> >> namespace from the compositor.
> >
> > I am not sure if it is a good idea to add the pid extraction
> > complexity in here; it is left up to the driver to extract this
> > information and pass it to the work queue. In case of AMDGPU, it's
> > extracted from the GPU VM. It would then be more flexible for the drivers
> > as well.
>
> Yeah, but that is just used for debugging.
>
> If we want to use the pid for housekeeping, like for a daemon which
> kills/restarts processes, we absolutely need that or otherwise won't be
> able to work with containers.

100% this.

Pushing back to the compositor is a red herring. The compositor is
just a service which tries to handle window management and input. If
you're looking to kill the offending process or whatever, then that
should go through the session manager - be it systemd or something
container-centric or whatever. At least that way it can deal with
cgroups at the same time, unlike the compositor which is not really
aware of what the thing on the other end of the socket is doing. This
ties in with the support they already have for things like coredump
analysis, and would also be useful for other devices.

Some environments combine compositor and session manager, and a lot of
them have them strongly related, but they're very definitely not the
same thing ...

Cheers,
Daniel


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-17 Thread Christian König

On 17.03.22 at 10:29, Daniel Vetter wrote:

On Thu, Mar 17, 2022 at 08:03:27AM +0100, Christian König wrote:

On 16.03.22 at 16:36, Rob Clark wrote:

[SNIP]
just one point of clarification.. in the msm and i915 case it is
purely for debugging and telemetry (ie. sending crash logs back to
distro for analysis if user has crash reporting enabled).. it isn't
used for triggering any action like killing app or compositor.

By the way, how does msm handle its memory management for the devcoredumps?

GFP_NORECLAIM all the way. It's purely best effort.


Ok, good to know that it's as simple as that.


Note that the fancy new plan for i915 discrete gpu is to only support gpu
crash dumps on non-recoverable gpu contexts, i.e. those that do not
continue to the next batch when something bad happens.



This is what vk wants


That's exactly what I've been telling an internal team for a couple of years 
now as well. Good to know that this is not totally crazy.



  and also what iris now uses (we do context recovery in userspace in
all cases), and non-recoverable contexts greatly simplify the crash dump
gather: Only thing you need to gather is the register state from hw
(before you reset it), all the batchbuffer bo and indirect state bo (in
i915 you can mark which bo to capture in the CS ioctl) can be captured in
a worker later on. Which for non-recoverable context is no issue, since
subsequent batchbuffers won't trample over any of these things.

And that way you can record the crashdump (or at least the big pieces like
all the indirect state stuff) with GFP_KERNEL.


Interesting idea, so basically we initially capture only the state we need 
for the reset, and grab a reference on the killed application to gather the 
rest before we clean it up.


Going to keep that in mind as well.

Thanks,
Christian.



msm probably gets it wrong since embedded drivers have much less shrinker
and generally no mmu notifiers going on :-)


I mean it is strictly forbidden to allocate any memory in the GPU reset
path.


I would however *strongly* recommend devcoredump support in other GPU
drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
to debug and fix a couple obscure issues that I was not able to
reproduce by myself.

Yes, completely agree as well.

+1

Cheers, Daniel




Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-17 Thread Daniel Vetter
On Thu, Mar 17, 2022 at 08:03:27AM +0100, Christian König wrote:
> Am 16.03.22 um 16:36 schrieb Rob Clark:
> > [SNIP]
> > just one point of clarification.. in the msm and i915 case it is
> > purely for debugging and telemetry (ie. sending crash logs back to
> > distro for analysis if user has crash reporting enabled).. it isn't
> > used for triggering any action like killing app or compositor.
> 
> By the way, how does msm handle its memory management for the devcoredumps?

GFP_NORECLAIM all the way. It's purely best effort.

Note that the fancy new plan for i915 discrete gpu is to only support gpu
crash dumps on non-recoverable gpu contexts, i.e. those that do not
continue to the next batch when something bad happens. This is what vk
wants and also what iris now uses (we do context recovery in userspace in
all cases), and non-recoverable contexts greatly simplify the crash dump
gather: Only thing you need to gather is the register state from hw
(before you reset it), all the batchbuffer bo and indirect state bo (in
i915 you can mark which bo to capture in the CS ioctl) can be captured in
a worker later on. Which for non-recoverable context is no issue, since
subsequent batchbuffers won't trample over any of these things.

And that way you can record the crashdump (or at least the big pieces like
all the indirect state stuff) with GFP_KERNEL.
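
For reference, on the userspace side that marking is just a flag on the exec
object (a sketch; EXEC_OBJECT_CAPTURE is the real flag from
include/uapi/drm/i915_drm.h, the BO handle is hypothetical):

#include <drm/i915_drm.h>

/* ask i915 to capture this BO's contents in the error dump if the
 * batch referencing it hangs */
struct drm_i915_gem_exec_object2 exec_obj = {
	.handle = indirect_state_bo,      /* hypothetical GEM handle */
	.flags  = EXEC_OBJECT_CAPTURE,
};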

msm probably gets it wrong since embedded drivers have much less shrinker
and generally no mmu notifiers going on :-)

> I mean it is strictly forbidden to allocate any memory in the GPU reset
> path.
> 
> > I would however *strongly* recommend devcoredump support in other GPU
> > drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
> > to debug and fix a couple obscure issues that I was not able to
> > reproduce by myself.
> 
> Yes, completely agree as well.

+1

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-17 Thread Daniel Vetter
On Mon, Mar 14, 2022 at 10:23:27AM -0400, Alex Deucher wrote:
> On Fri, Mar 11, 2022 at 3:30 AM Pekka Paalanen  wrote:
> >
> > On Thu, 10 Mar 2022 11:56:41 -0800
> > Rob Clark  wrote:
> >
> > > For something like just notifying a compositor that a gpu crash
> > > happened, perhaps drm_event is more suitable.  See
> > > virtio_gpu_fence_event_create() for an example of adding new event
> > > types.  Although maybe you want it to be an event which is not device
> > > specific.  This isn't so much of a debugging use-case as simply
> > > notification.
> >
> > Hi,
> >
> > for this particular use case, are we now talking about the display
> > device (KMS) crashing or the rendering device (OpenGL/Vulkan) crashing?
> >
> > If the former, I wasn't aware that display device crashes are a thing.
> > How should a userspace display server react to those?
> >
> > If the latter, don't we have EGL extensions or Vulkan API already to
> > deliver that?
> >
> > The above would be about device crashes that directly affect the
> > display server. Is that the use case in mind here, or is it instead
> > about notifying the display server that some application has caused a
> > driver/hardware crash? If the latter, how should a display server react
> > to that? Disconnect the application?
> >
> > Shashank, what is the actual use case you are developing this for?
> >
> > I've read all the emails here so far, and I don't recall seeing it
> > explained.
> >
> 
> The idea is that a support daemon or compositor would listen for GPU
> reset notifications and do something useful with them (kill the guilty
> app, restart the desktop environment, etc.).  Today when the GPU
> resets, most applications just continue assuming nothing is wrong,
> meanwhile the GPU has stopped accepting work until the apps re-init
> their context so all of their command submissions just get rejected.
> 
> > Btw. somewhat relatedly, there has been work aiming to allow
> > graceful hot-unplug of DRM devices. There is a kernel doc outlining how
> > the various APIs should react towards userspace when a DRM device
> > suddenly disappears. That seems to have some overlap here IMO.
> >
> > See 
> > https://www.kernel.org/doc/html/latest/gpu/drm-uapi.html#device-hot-unplug
> > which also has a couple pointers to EGL and Vulkan APIs.
> 
> The problem is most applications don't use the GL or VK robustness
> APIs.  You could use something like that in the compositor, but those
> APIs tend to be focused more on the application itself rather than the
> GPU in general.  E.g., Is my context lost.  Which is fine for
> restarting your context, but doesn't really help if you want to try
> and do something with another application (i.e., the likely guilty
> app).  Also, on dGPU at least, when you reset the GPU, vram is usually
> lost (either due to the memory controller being reset, or vram being
> zero'd on init due to ECC support), so even if you are not the guilty
> process, in that case you'd need to re-init your context anyway.

Isn't that what arb robustness and all that stuff is for? Doing that
through a sysfs event sounds very wrong, since in general apps just don't
have access to that. Also the vk equivalent is VK_ERROR_DEVICE_LOST. Iirc both
have information like whether the app was the guilty one causing the hang,
or whether it was just victimized because the gpu can't do anything else
than a full gpu reset which nukes everything (like amdgpu currently has,
aside from the thread unblock trick in the first attempt).

And if your app/compositor doesn't use robust contexts then the userspace
driver gets to do a best effort attempt at recovery, or exit(). Whatever
you can do really.

Also note that you don't actually want an event, but a query ioctl (plus
maybe a specific errno on your CS ioctl). Neither of the above flows
supports events for gpu resets. RESET_STATS ioctl is the i915
implementation of this stuff.
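
For reference, roughly what that query looks like from userspace (a sketch
against the i915 uapi; fd and ctx_id are assumed to exist already):

#include <string.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

/* ask the kernel whether this context lost batches to a GPU reset */
static int context_was_reset(int fd, unsigned int ctx_id)
{
	struct drm_i915_reset_stats stats;

	memset(&stats, 0, sizeof(stats));
	stats.ctx_id = ctx_id;

	if (ioctl(fd, DRM_IOCTL_I915_GET_RESET_STATS, &stats))
		return -1;

	/* batch_active: hung while running; batch_pending: collateral damage */
	return stats.batch_active > 0 || stats.batch_pending > 0;
}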

For the core dump aspect yes pls devcoredump and not reinvented wheels
(and i915 is a bad example here, but in defence the i915 sysfs hang event
predates devcoredump).

Cheers, Daniel


> 
> Alex
> 
> >
> >
> > Thanks,
> > pq

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-17 Thread Christian König

On 17.03.22 at 09:42, Sharma, Shashank wrote:

On 3/16/2022 10:50 PM, Rob Clark wrote:

On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
 wrote:


From: Shashank Sharma 

This patch adds a new sysfs event, which will inform
userland about a GPU reset, and can also provide
some information like:
- process ID of the process involved with the GPU reset
- process name of the involved process
- the GPU status info (using flags)

This patch also introduces the first flag of the flags
bitmap, which can be appended as and when required.

V2: Addressed review comments from Christian and Amar
    - move the reset information structure to DRM layer
    - drop _ctx from struct name
    - make pid 32 bit(than 64)
    - set flag when VRAM invalid (than valid)
    - add process name as well (Amar)

Cc: Alexandar Deucher 
Cc: Christian Koenig 
Cc: Amaranath Somalapuram 
Signed-off-by: Shashank Sharma 
---
  drivers/gpu/drm/drm_sysfs.c | 31 +++
  include/drm/drm_sysfs.h | 10 ++
  2 files changed, 41 insertions(+)

diff --git a/drivers/gpu/drm/drm_sysfs.c b/drivers/gpu/drm/drm_sysfs.c
index 430e00b16eec..840994810910 100644
--- a/drivers/gpu/drm/drm_sysfs.c
+++ b/drivers/gpu/drm/drm_sysfs.c
@@ -409,6 +409,37 @@ void drm_sysfs_hotplug_event(struct drm_device *dev)
  }
  EXPORT_SYMBOL(drm_sysfs_hotplug_event);

+/**
+ * drm_sysfs_reset_event - generate a DRM uevent to indicate GPU reset
+ * @dev: DRM device
+ * @reset_info: The contextual information about the reset (like PID, flags)
+ *
+ * Send a uevent for the DRM device specified by @dev. This informs
+ * user that a GPU reset has occurred, so that an interested client
+ * can take any recovery or profiling measure.
+ */
+void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event *reset_info)
+{
+   unsigned char pid_str[13];
+   unsigned char flags_str[15];
+   unsigned char pname_str[TASK_COMM_LEN + 6];
+   unsigned char reset_str[] = "RESET=1";
+   char *envp[] = { reset_str, pid_str, pname_str, flags_str, NULL };
+
+   if (!reset_info) {
+   DRM_WARN("No reset info, not sending the event\n");
+   return;
+   }
+
+   DRM_DEBUG("generating reset event\n");
+
+   snprintf(pid_str, ARRAY_SIZE(pid_str), "PID=%u", reset_info->pid);
+   snprintf(pname_str, ARRAY_SIZE(pname_str), "NAME=%s", reset_info->pname);
+   snprintf(flags_str, ARRAY_SIZE(flags_str), "FLAGS=%u", reset_info->flags);
+   kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
+}
+EXPORT_SYMBOL(drm_sysfs_reset_event);
+
  /**
   * drm_sysfs_connector_hotplug_event - generate a DRM uevent for any connector
   * change
diff --git a/include/drm/drm_sysfs.h b/include/drm/drm_sysfs.h
index 6273cac44e47..5ba11c760619 100644
--- a/include/drm/drm_sysfs.h
+++ b/include/drm/drm_sysfs.h
@@ -1,16 +1,26 @@
  /* SPDX-License-Identifier: GPL-2.0 */
  #ifndef _DRM_SYSFS_H_
  #define _DRM_SYSFS_H_
+#include 
+
+#define DRM_GPU_RESET_FLAG_VRAM_INVALID (1 << 0)

  struct drm_device;
  struct device;
  struct drm_connector;
  struct drm_property;

+struct drm_reset_event {
+   uint32_t pid;


One side note, unrelated to devcoredump vs this..

AFAIU you probably want to be passing around a `struct pid *`, and
then somehow use pid_vnr() in the context of the process reading the
event to get the numeric pid.  Otherwise things will not do what you
expect if the process triggering the crash is in a different pid
namespace from the compositor.



I am not sure if it is a good idea to add the pid extraction 
complexity in here; it is left up to the driver to extract this 
information and pass it to the work queue. In case of AMDGPU, it's 
extracted from the GPU VM. It would then be more flexible for the drivers 
as well.


Yeah, but that is just used for debugging.

If we want to use the pid for housekeeping, like for a daemon which 
kills/restarts processes, we absolutely need that or otherwise won't be 
able to work with containers.


Regards,
Christian.



- Shashank


BR,
-R


+   uint32_t flags;
+   char pname[TASK_COMM_LEN];
+};
+
  int drm_class_device_register(struct device *dev);
  void drm_class_device_unregister(struct device *dev);

  void drm_sysfs_hotplug_event(struct drm_device *dev);
+void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event *reset_info);
  void drm_sysfs_connector_hotplug_event(struct drm_connector *connector);
  void drm_sysfs_connector_status_event(struct drm_connector *connector,
   struct drm_property *property);
--
2.32.0





Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-17 Thread Sharma, Shashank




On 3/16/2022 10:50 PM, Rob Clark wrote:

On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
 wrote:


From: Shashank Sharma 

This patch adds a new sysfs event, which will inform
userland about a GPU reset, and can also provide
some information like:
- process ID of the process involved with the GPU reset
- process name of the involved process
- the GPU status info (using flags)

This patch also introduces the first flag of the flags
bitmap, which can be appended as and when required.

V2: Addressed review comments from Christian and Amar
- move the reset information structure to DRM layer
- drop _ctx from struct name
- make pid 32 bit(than 64)
- set flag when VRAM invalid (than valid)
- add process name as well (Amar)

Cc: Alexandar Deucher 
Cc: Christian Koenig 
Cc: Amaranath Somalapuram 
Signed-off-by: Shashank Sharma 
---
  drivers/gpu/drm/drm_sysfs.c | 31 +++
  include/drm/drm_sysfs.h | 10 ++
  2 files changed, 41 insertions(+)

diff --git a/drivers/gpu/drm/drm_sysfs.c b/drivers/gpu/drm/drm_sysfs.c
index 430e00b16eec..840994810910 100644
--- a/drivers/gpu/drm/drm_sysfs.c
+++ b/drivers/gpu/drm/drm_sysfs.c
@@ -409,6 +409,37 @@ void drm_sysfs_hotplug_event(struct drm_device *dev)
  }
  EXPORT_SYMBOL(drm_sysfs_hotplug_event);

+/**
+ * drm_sysfs_reset_event - generate a DRM uevent to indicate GPU reset
+ * @dev: DRM device
+ * @reset_info: The contextual information about the reset (like PID, flags)
+ *
+ * Send a uevent for the DRM device specified by @dev. This informs
+ * user that a GPU reset has occurred, so that an interested client
+ * can take any recovery or profiling measure.
+ */
+void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event 
*reset_info)
+{
+   unsigned char pid_str[13];
+   unsigned char flags_str[15];
+   unsigned char pname_str[TASK_COMM_LEN + 6];
+   unsigned char reset_str[] = "RESET=1";
+   char *envp[] = { reset_str, pid_str, pname_str, flags_str, NULL };
+
+   if (!reset_info) {
+   DRM_WARN("No reset info, not sending the event\n");
+   return;
+   }
+
+   DRM_DEBUG("generating reset event\n");
+
+   snprintf(pid_str, ARRAY_SIZE(pid_str), "PID=%u", reset_info->pid);
+   snprintf(pname_str, ARRAY_SIZE(pname_str), "NAME=%s", 
reset_info->pname);
+   snprintf(flags_str, ARRAY_SIZE(flags_str), "FLAGS=%u", 
reset_info->flags);
+   kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
+}
+EXPORT_SYMBOL(drm_sysfs_reset_event);
+
  /**
   * drm_sysfs_connector_hotplug_event - generate a DRM uevent for any connector
   * change
diff --git a/include/drm/drm_sysfs.h b/include/drm/drm_sysfs.h
index 6273cac44e47..5ba11c760619 100644
--- a/include/drm/drm_sysfs.h
+++ b/include/drm/drm_sysfs.h
@@ -1,16 +1,26 @@
  /* SPDX-License-Identifier: GPL-2.0 */
  #ifndef _DRM_SYSFS_H_
  #define _DRM_SYSFS_H_
+#include 
+
+#define DRM_GPU_RESET_FLAG_VRAM_INVALID (1 << 0)

  struct drm_device;
  struct device;
  struct drm_connector;
  struct drm_property;

+struct drm_reset_event {
+   uint32_t pid;


One side note, unrelated to devcoredump vs this..

AFAIU you probably want to be passing around a `struct pid *`, and
then somehow use pid_vnr() in the context of the process reading the
event to get the numeric pid.  Otherwise things will not do what you
expect if the process triggering the crash is in a different pid
namespace from the compositor.



I am not sure if it is a good idea to add the pid extraction complexity 
in here; it is left up to the driver to extract this information and pass 
it to the work queue. In case of AMDGPU, it's extracted from the GPU VM. It 
would then be more flexible for the drivers as well.


- Shashank


BR,
-R


+   uint32_t flags;
+   char pname[TASK_COMM_LEN];
+};
+
  int drm_class_device_register(struct device *dev);
  void drm_class_device_unregister(struct device *dev);

  void drm_sysfs_hotplug_event(struct drm_device *dev);
+void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event 
*reset_info);
  void drm_sysfs_connector_hotplug_event(struct drm_connector *connector);
  void drm_sysfs_connector_status_event(struct drm_connector *connector,
   struct drm_property *property);
--
2.32.0



Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-17 Thread Christian König

On 16.03.22 at 16:36, Rob Clark wrote:

[SNIP]
just one point of clarification.. in the msm and i915 case it is
purely for debugging and telemetry (ie. sending crash logs back to
distro for analysis if user has crash reporting enabled).. it isn't
used for triggering any action like killing app or compositor.


By the way, how does msm handle its memory management for the devcoredumps?

I mean it is strictly forbidden to allocate any memory in the GPU reset 
path.



I would however *strongly* recommend devcoredump support in other GPU
drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
to debug and fix a couple obscure issues that I was not able to
reproduce by myself.


Yes, completely agree as well.

Thanks,
Christian.



BR,
-R




Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-16 Thread Rob Clark
On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
 wrote:
>
> From: Shashank Sharma 
>
> This patch adds a new sysfs event, which will inform
> userland about a GPU reset, and can also provide
> some information like:
> - process ID of the process involved with the GPU reset
> - process name of the involved process
> - the GPU status info (using flags)
>
> This patch also introduces the first flag of the flags
> bitmap, which can be appended as and when required.
>
> V2: Addressed review comments from Christian and Amar
>- move the reset information structure to DRM layer
>- drop _ctx from struct name
>- make pid 32 bit(than 64)
>- set flag when VRAM invalid (than valid)
>- add process name as well (Amar)
>
> Cc: Alexandar Deucher 
> Cc: Christian Koenig 
> Cc: Amaranath Somalapuram 
> Signed-off-by: Shashank Sharma 
> ---
>  drivers/gpu/drm/drm_sysfs.c | 31 +++
>  include/drm/drm_sysfs.h | 10 ++
>  2 files changed, 41 insertions(+)
>
> diff --git a/drivers/gpu/drm/drm_sysfs.c b/drivers/gpu/drm/drm_sysfs.c
> index 430e00b16eec..840994810910 100644
> --- a/drivers/gpu/drm/drm_sysfs.c
> +++ b/drivers/gpu/drm/drm_sysfs.c
> @@ -409,6 +409,37 @@ void drm_sysfs_hotplug_event(struct drm_device *dev)
>  }
>  EXPORT_SYMBOL(drm_sysfs_hotplug_event);
>
> +/**
> + * drm_sysfs_reset_event - generate a DRM uevent to indicate GPU reset
> + * @dev: DRM device
> + * @reset_info: The contextual information about the reset (like PID, flags)
> + *
> + * Send a uevent for the DRM device specified by @dev. This informs
> + * user that a GPU reset has occurred, so that an interested client
> + * can take any recovery or profiling measure.
> + */
> +void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event 
> *reset_info)
> +{
> +   unsigned char pid_str[13];
> +   unsigned char flags_str[15];
> +   unsigned char pname_str[TASK_COMM_LEN + 6];
> +   unsigned char reset_str[] = "RESET=1";
> +   char *envp[] = { reset_str, pid_str, pname_str, flags_str, NULL };
> +
> +   if (!reset_info) {
> +   DRM_WARN("No reset info, not sending the event\n");
> +   return;
> +   }
> +
> +   DRM_DEBUG("generating reset event\n");
> +
> +   snprintf(pid_str, ARRAY_SIZE(pid_str), "PID=%u", reset_info->pid);
> +   snprintf(pname_str, ARRAY_SIZE(pname_str), "NAME=%s", 
> reset_info->pname);
> +   snprintf(flags_str, ARRAY_SIZE(flags_str), "FLAGS=%u", 
> reset_info->flags);
> +   kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
> +}
> +EXPORT_SYMBOL(drm_sysfs_reset_event);
> +
>  /**
>   * drm_sysfs_connector_hotplug_event - generate a DRM uevent for any 
> connector
>   * change
> diff --git a/include/drm/drm_sysfs.h b/include/drm/drm_sysfs.h
> index 6273cac44e47..5ba11c760619 100644
> --- a/include/drm/drm_sysfs.h
> +++ b/include/drm/drm_sysfs.h
> @@ -1,16 +1,26 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
>  #ifndef _DRM_SYSFS_H_
>  #define _DRM_SYSFS_H_
> +#include 
> +
> +#define DRM_GPU_RESET_FLAG_VRAM_INVALID (1 << 0)
>
>  struct drm_device;
>  struct device;
>  struct drm_connector;
>  struct drm_property;
>
> +struct drm_reset_event {
> +   uint32_t pid;

One side note, unrelated to devcoredump vs this..

AFAIU you probably want to be passing around a `struct pid *`, and
then somehow use pid_vnr() in the context of the process reading the
event to get the numeric pid.  Otherwise things will not do what you
expect if the process triggering the crash is in a different pid
namespace from the compositor.
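
A sketch of that idea on the kernel side (not a full patch; struct my_job is
hypothetical, get_pid()/task_pid()/pid_vnr() are the real helpers):

#include <linux/pid.h>
#include <linux/sched.h>

struct my_job {
	struct pid *pid;                /* hypothetical per-job bookkeeping */
};

/* at submit time: take a reference instead of storing a raw number */
static void my_job_set_owner(struct my_job *job)
{
	job->pid = get_pid(task_pid(current));
}

/* in the context of whoever reads the event: translate into that
 * process' pid namespace (drop with put_pid() when the job is freed) */
static pid_t my_job_owner_pid(struct my_job *job)
{
	return pid_vnr(job->pid);
}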

BR,
-R

> +   uint32_t flags;
> +   char pname[TASK_COMM_LEN];
> +};
> +
>  int drm_class_device_register(struct device *dev);
>  void drm_class_device_unregister(struct device *dev);
>
>  void drm_sysfs_hotplug_event(struct drm_device *dev);
> +void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event 
> *reset_info);
>  void drm_sysfs_connector_hotplug_event(struct drm_connector *connector);
>  void drm_sysfs_connector_status_event(struct drm_connector *connector,
>   struct drm_property *property);
> --
> 2.32.0
>


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-16 Thread Rob Clark
On Wed, Mar 16, 2022 at 8:48 AM Alex Deucher  wrote:
>
> On Wed, Mar 16, 2022 at 11:35 AM Rob Clark  wrote:
> >
> > On Wed, Mar 16, 2022 at 7:12 AM Alex Deucher  wrote:
> > >
> > > On Wed, Mar 16, 2022 at 4:48 AM Pekka Paalanen  
> > > wrote:
> > > >
> > [snip]
> > > > With new UAPI comes the demand of userspace proof, not hand-waving. You
> > > > would not be proposing this new interface if you didn't have use cases
> > > > in mind, even just one. You have to document what you imagine the new
> > > > thing to be used for, so that the appropriateness can be evaluated. If
> > > > the use case is deemed inappropriate for the proposed UAPI, you need to
> > > > find another use case to justify adding the new UAPI. If there is no
> > > > use for the UAPI, it shouldn't be added, right? Adding UAPI and hoping
> > > > someone finds use for it seems backwards to me.
> > >
> > > We do have a use case.  It's what I described originally.  There is a
> > > user space daemon (could be a compositor, could be something else)
> > > that runs and listens for GPU reset notifications.  When it receives a
> > > notification, it takes action and kills the guilty app and restarts
> > the compositor and gathers any relevant data related to the GPU hang
> > > (if possible).  We can revisit this discussion once we have the whole
> > > implementation complete.  Other drivers seem to do similar things
> > > already today via different means (msm using devcoredump, i915 seems
> > > to have its own GPU reset notification mechanism, etc.).  It just
> > > seemed like there was value in having a generic drm GPU reset
> > > notification, but maybe not yet.
> >
> > just one point of clarification.. in the msm and i915 case it is
> > purely for debugging and telemetry (ie. sending crash logs back to
> > distro for analysis if user has crash reporting enabled).. it isn't
> > used for triggering any action like killing app or compositor.
>
> Sure.  you could use this proposed event for telemetry gathering as
> well.  The important part is the event.  What you do with it is up to
> the user.

True, but for that purpose (telemetry) let's leverage something that
userspace already has support for

> > I would however *strongly* recommend devcoredump support in other GPU
> > drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
> > to debug and fix a couple obscure issues that I was not able to
> > reproduce by myself.
>
> Agreed.  I think devcoredump makes sense and we'll ultimately enable
> support for that in amdgpu.  I think there is value in a GPU specific
> reset event as well for the use case I outlined, but maybe the
> devcoredump one is enough.  We'd just need to rely on the userspace
> side to filter as appropriate for events it cares about.  I'm not sure
> how many other devcore dump events are commonly seen.

They are, like CPU coredumps, not expected to be a frequent thing.
There are a number of other non-GPU drivers which also use devcoredump
(mostly wifi and remoteproc, ie. things like DSPs).  We did also
recently add msm devcoredump support for the display side of things,
in addition to gpu, to dump display state if we hit underruns, link
training fail, etc.  Each "crash" creates a new transient devcore
device so crashes from different devices can be reported in parallel.
But we do rate-limit the GPU crash reports by not reporting another
one until the previous one is read out and cleared by userspace.  (You
don't strictly need to do that, but it is probably a good idea..
usually the first crash is the interesting one.)
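
For drivers wanting to pick this up, the core of it is pretty small. A sketch
(the snapshot helper is hypothetical, dev_coredumpv() is the real API from
<linux/devcoredump.h>):

#include <linux/devcoredump.h>

/* hand a vmalloc'ed snapshot to devcoredump; the core creates a transient
 * devcd device and vfree()s the buffer once userspace is done with it (or
 * after a timeout) */
static void my_gpu_coredump(struct device *dev)
{
	size_t len = 0;
	void *buf = my_gpu_snapshot(&len);      /* hypothetical, vmalloc'ed */

	if (!buf)
		return;

	dev_coredumpv(dev, buf, len, GFP_KERNEL);
}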

The ChromeOS crash-reporter does have an allow-list for which devcore
dumps are sent back, mainly because in the case of wifi drivers it is
hard to ensure that the dump does not include PII (like network names,
etc).  It would be pretty trivial to add amdgpu to the allow-list and
setup a "magic signature" so that queries could be run on amdgpu
devcore dumps.  I can help out with that, since I went through the
same process already for msm.

I'm not entirely sure what other distros do in this area.

BR,
-R


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-16 Thread Alex Deucher
On Wed, Mar 16, 2022 at 11:35 AM Rob Clark  wrote:
>
> On Wed, Mar 16, 2022 at 7:12 AM Alex Deucher  wrote:
> >
> > On Wed, Mar 16, 2022 at 4:48 AM Pekka Paalanen  wrote:
> > >
> [snip]
> > > With new UAPI comes the demand of userspace proof, not hand-waving. You
> > > would not be proposing this new interface if you didn't have use cases
> > > in mind, even just one. You have to document what you imagine the new
> > > thing to be used for, so that the appropriateness can be evaluated. If
> > > the use case is deemed inappropriate for the proposed UAPI, you need to
> > > find another use case to justify adding the new UAPI. If there is no
> > > use for the UAPI, it shouldn't be added, right? Adding UAPI and hoping
> > > someone finds use for it seems backwards to me.
> >
> > We do have a use case.  It's what I described originally.  There is a
> > user space daemon (could be a compositor, could be something else)
> > that runs and listens for GPU reset notifications.  When it receives a
> > notification, it takes action and kills the guilty app and restarts
> > the compositor and gathers any relevant data related to the GPU hang
> > (if possible).  We can revisit this discussion once we have the whole
> > implementation complete.  Other drivers seem to do similar things
> > already today via different means (msm using devcoredump, i915 seems
> > to have its own GPU reset notification mechanism, etc.).  It just
> > seemed like there was value in having a generic drm GPU reset
> > notification, but maybe not yet.
>
> just one point of clarification.. in the msm and i915 case it is
> purely for debugging and telemetry (ie. sending crash logs back to
> distro for analysis if user has crash reporting enabled).. it isn't
> used for triggering any action like killing app or compositor.

Sure. You could use this proposed event for telemetry gathering as
well.  The important part is the event.  What you do with it is up to
the user.
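
As a sketch of what such a user could look like with the proposed uevent (the
RESET/PID/NAME property names are the ones from this patch; the libudev calls
are standard, error handling trimmed):

#include <poll.h>
#include <stdio.h>
#include <libudev.h>

int main(void)
{
	struct udev *udev = udev_new();
	struct udev_monitor *mon = udev_monitor_new_from_netlink(udev, "udev");
	struct pollfd pfd = { .events = POLLIN };

	udev_monitor_filter_add_match_subsystem_devtype(mon, "drm", NULL);
	udev_monitor_enable_receiving(mon);
	pfd.fd = udev_monitor_get_fd(mon);

	for (;;) {
		struct udev_device *dev;

		if (poll(&pfd, 1, -1) <= 0)
			continue;
		dev = udev_monitor_receive_device(mon);
		if (!dev)
			continue;
		if (udev_device_get_property_value(dev, "RESET"))
			printf("GPU reset, pid %s (%s)\n",
			       udev_device_get_property_value(dev, "PID"),
			       udev_device_get_property_value(dev, "NAME"));
		udev_device_unref(dev);
	}
}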

>
> I would however *strongly* recommend devcoredump support in other GPU
> drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
> to debug and fix a couple obscure issues that I was not able to
> reproduce by myself.

Agreed.  I think devcoredump makes sense and we'll ultimately enable
support for that in amdgpu.  I think there is value in a GPU specific
reset event as well for the use case I outlined, but maybe the
devcoredump one is enough.  We'd just need to rely on the userspace
side to filter as appropriate for events it cares about.  I'm not sure
how many other devcore dump events are commonly seen.

Alex


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-16 Thread Rob Clark
On Wed, Mar 16, 2022 at 7:12 AM Alex Deucher  wrote:
>
> On Wed, Mar 16, 2022 at 4:48 AM Pekka Paalanen  wrote:
> >
[snip]
> > With new UAPI comes the demand of userspace proof, not hand-waving. You
> > would not be proposing this new interface if you didn't have use cases
> > in mind, even just one. You have to document what you imagine the new
> > thing to be used for, so that the appropriateness can be evaluated. If
> > the use case is deemed inappropriate for the proposed UAPI, you need to
> > find another use case to justify adding the new UAPI. If there is no
> > use for the UAPI, it shouldn't be added, right? Adding UAPI and hoping
> > someone finds use for it seems backwards to me.
>
> We do have a use case.  It's what I described originally.  There is a
> user space daemon (could be a compositor, could be something else)
> that runs and listens for GPU reset notifications.  When it receives a
> notification, it takes action and kills the guilty app and restarts
> the compositor and gathers any relevant data related to the GPU hang
> (if possible).  We can revisit this discussion once we have the whole
> implementation complete.  Other drivers seem to do similar things
> already today via different means (msm using devcoredump, i915 seems
> to have its own GPU reset notification mechanism, etc.).  It just
> seemed like there was value in having a generic drm GPU reset
> notification, but maybe not yet.

just one point of clarification.. in the msm and i915 case it is
purely for debugging and telemetry (ie. sending crash logs back to
distro for analysis if user has crash reporting enabled).. it isn't
used for triggering any action like killing app or compositor.

I would however *strongly* recommend devcoredump support in other GPU
drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
to debug and fix a couple obscure issues that I was not able to
reproduce by myself.

BR,
-R


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-16 Thread Alex Deucher
On Wed, Mar 16, 2022 at 4:48 AM Pekka Paalanen  wrote:
>
> On Tue, 15 Mar 2022 10:54:38 -0400
> Alex Deucher  wrote:
>
> > On Mon, Mar 14, 2022 at 11:26 AM Pekka Paalanen  wrote:
> > >
> > > On Mon, 14 Mar 2022 10:23:27 -0400
> > > Alex Deucher  wrote:
> > >
> > > > On Fri, Mar 11, 2022 at 3:30 AM Pekka Paalanen  
> > > > wrote:
> > > > >
> > > > > On Thu, 10 Mar 2022 11:56:41 -0800
> > > > > Rob Clark  wrote:
> > > > >
> > > > > > For something like just notifying a compositor that a gpu crash
> > > > > > happened, perhaps drm_event is more suitable.  See
> > > > > > virtio_gpu_fence_event_create() for an example of adding new event
> > > > > > types.  Although maybe you want it to be an event which is not 
> > > > > > device
> > > > > > specific.  This isn't so much of a debugging use-case as simply
> > > > > > notification.
> > > > >
> > > > > Hi,
> > > > >
> > > > > for this particular use case, are we now talking about the display
> > > > > device (KMS) crashing or the rendering device (OpenGL/Vulkan) 
> > > > > crashing?
> > > > >
> > > > > If the former, I wasn't aware that display device crashes are a thing.
> > > > > How should a userspace display server react to those?
> > > > >
> > > > > If the latter, don't we have EGL extensions or Vulkan API already to
> > > > > deliver that?
> > > > >
> > > > > The above would be about device crashes that directly affect the
> > > > > display server. Is that the use case in mind here, or is it instead
> > > > > about notifying the display server that some application has caused a
> > > > > driver/hardware crash? If the latter, how should a display server 
> > > > > react
> > > > > to that? Disconnect the application?
> > > > >
> > > > > Shashank, what is the actual use case you are developing this for?
> > > > >
> > > > > I've read all the emails here so far, and I don't recall seeing it
> > > > > explained.
> > > > >
> > > >
> > > > The idea is that a support daemon or compositor would listen for GPU
> > > > reset notifications and do something useful with them (kill the guilty
> > > > app, restart the desktop environment, etc.).  Today when the GPU
> > > > resets, most applications just continue assuming nothing is wrong,
> > > > meanwhile the GPU has stopped accepting work until the apps re-init
> > > > their context so all of their command submissions just get rejected.
> > > >
> > > > > Btw. somewhat relatedly, there has been work aiming to allow
> > > > > graceful hot-unplug of DRM devices. There is a kernel doc outlining 
> > > > > how
> > > > > the various APIs should react towards userspace when a DRM device
> > > > > suddenly disappears. That seems to have some overlap here IMO.
> > > > >
> > > > > See 
> > > > > https://www.kernel.org/doc/html/latest/gpu/drm-uapi.html#device-hot-unplug
> > > > > which also has a couple pointers to EGL and Vulkan APIs.
> > > >
> > > > The problem is most applications don't use the GL or VK robustness
> > > > APIs.
> > >
> > > Hi,
> > >
> > > how would this new event help with that?
> >
> > This event would provide notification that a GPU reset occurred.
> >
> > >
> > > I mean, yeah, there could be a daemon that kills those GPU users, but
> > > then what? You still lose any unsaved work, and may need to manually
> > > restart them.
> > >
> > > Is the idea that it is better to have the app crash and disappear than
> > > to look like it froze while it otherwise still runs?
> >
> > Yes.
>
> Ok. That was just a wild guess.
>
> >  The daemon could also send the user some sort of notification
> > that a GPU reset occurred.
> >
> > >
> > > If some daemon or compositor goes killing apps that trigger GPU resets,
> > > then how do we stop that for an app that actually does use the
> > > appropriate EGL or Vulkan APIs to detect and remedy that situation
> > > itself?
> >
> > I guess the daemon could keep some sort of whitelist.  OTOH, very few
> > if any applications, especially games actually support these
> > extensions.
>
> I would think that a white-list is a non-starter due to the maintenance
> nightmare and the "wrong by default" design for well behaving programs.
>
> I am not a fan of optimising for broken software. I understand that
> with games this is routine, but we're talking about everything here,
> not just games. Games cannot be fixed, but the rest could if the
> problem was not swept under the rug. It's like the inverse of the
> platform problem.

Fair enough, but it hasn't happened in the last 15 years or so.

>
> > > >  You could use something like that in the compositor, but those
> > > > APIs tend to be focused more on the application itself rather than the
> > > > GPU in general.  E.g., Is my context lost.  Which is fine for
> > > > restarting your context, but doesn't really help if you want to try
> > > > and do something with another application (i.e., the likely guilty
> > > > app).  Also, on dGPU at least, when you reset the GPU, vram is usually
> > > > lost (either due to the memory controller 

Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-16 Thread Pekka Paalanen
On Tue, 15 Mar 2022 10:54:38 -0400
Alex Deucher  wrote:

> On Mon, Mar 14, 2022 at 11:26 AM Pekka Paalanen  wrote:
> >
> > On Mon, 14 Mar 2022 10:23:27 -0400
> > Alex Deucher  wrote:
> >  
> > > On Fri, Mar 11, 2022 at 3:30 AM Pekka Paalanen  
> > > wrote:  
> > > >
> > > > On Thu, 10 Mar 2022 11:56:41 -0800
> > > > Rob Clark  wrote:
> > > >  
> > > > > For something like just notifying a compositor that a gpu crash
> > > > > happened, perhaps drm_event is more suitable.  See
> > > > > virtio_gpu_fence_event_create() for an example of adding new event
> > > > > types.  Although maybe you want it to be an event which is not device
> > > > > specific.  This isn't so much of a debugging use-case as simply
> > > > > notification.  
> > > >
> > > > Hi,
> > > >
> > > > for this particular use case, are we now talking about the display
> > > > device (KMS) crashing or the rendering device (OpenGL/Vulkan) crashing?
> > > >
> > > > If the former, I wasn't aware that display device crashes are a thing.
> > > > How should a userspace display server react to those?
> > > >
> > > > If the latter, don't we have EGL extensions or Vulkan API already to
> > > > deliver that?
> > > >
> > > > The above would be about device crashes that directly affect the
> > > > display server. Is that the use case in mind here, or is it instead
> > > > about notifying the display server that some application has caused a
> > > > driver/hardware crash? If the latter, how should a display server react
> > > > to that? Disconnect the application?
> > > >
> > > > Shashank, what is the actual use case you are developing this for?
> > > >
> > > > I've read all the emails here so far, and I don't recall seeing it
> > > > explained.
> > > >  
> > >
> > > The idea is that a support daemon or compositor would listen for GPU
> > > reset notifications and do something useful with them (kill the guilty
> > > app, restart the desktop environment, etc.).  Today when the GPU
> > > resets, most applications just continue assuming nothing is wrong,
> > > meanwhile the GPU has stopped accepting work until the apps re-init
> > > their context so all of their command submissions just get rejected.
> > >  
> > > > Btw. somewhat relatedly, there has been work aiming to allow
> > > > graceful hot-unplug of DRM devices. There is a kernel doc outlining how
> > > > the various APIs should react towards userspace when a DRM device
> > > > suddenly disappears. That seems to have some overlap here IMO.
> > > >
> > > > See 
> > > > https://www.kernel.org/doc/html/latest/gpu/drm-uapi.html#device-hot-unplug
> > > > which also has a couple pointers to EGL and Vulkan APIs.  
> > >
> > > The problem is most applications don't use the GL or VK robustness
> > > APIs.  
> >
> > Hi,
> >
> > how would this new event help with that?  
> 
> This event would provide notification that a GPU reset occurred.
> 
> >
> > I mean, yeah, there could be a daemon that kills those GPU users, but
> > then what? You still lose any unsaved work, and may need to manually
> > restart them.
> >
> > Is the idea that it is better to have the app crash and disappear than
> > to look like it froze while it otherwise still runs?  
> 
> Yes.

Ok. That was just a wild guess.

>  The daemon could also send the user some sort of notification
> that a GPU reset occurred.
> 
> >
> > If some daemon or compositor goes killing apps that trigger GPU resets,
> > then how do we stop that for an app that actually does use the
> > appropriate EGL or Vulkan APIs to detect and remedy that situation
> > itself?  
> 
> I guess the daemon could keep some sort of whitelist.  OTOH, very few
> if any applications, especially games actually support these
> extensions.

I would think that a white-list is a non-starter due to the maintenance
nightmare and the "wrong by default" design for well behaving programs.

I am not a fan of optimising for broken software. I understand that
with games this is routine, but we're talking about everything here,
not just games. Games cannot be fixed, but the rest could if the
problem was not swept under the rug. It's like the inverse of the
platform problem.

> > >  You could use something like that in the compositor, but those
> > > APIs tend to be focused more on the application itself rather than the
> > > GPU in general.  E.g., Is my context lost.  Which is fine for
> > > restarting your context, but doesn't really help if you want to try
> > > and do something with another application (i.e., the likely guilty
> > > app).  Also, on dGPU at least, when you reset the GPU, vram is usually
> > > lost (either due to the memory controller being reset, or vram being
> > > zero'd on init due to ECC support), so even if you are not the guilty
> > > process, in that case you'd need to re-init your context anyway.  
> >
> > Why should something like a compositor listen for this and kill apps
> > that triggered GPU resets, instead of e.g. Mesa noticing that in the app
> > and killing itself? Mesa 

Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-15 Thread Alex Deucher
On Mon, Mar 14, 2022 at 11:26 AM Pekka Paalanen  wrote:
>
> On Mon, 14 Mar 2022 10:23:27 -0400
> Alex Deucher  wrote:
>
> > On Fri, Mar 11, 2022 at 3:30 AM Pekka Paalanen  wrote:
> > >
> > > On Thu, 10 Mar 2022 11:56:41 -0800
> > > Rob Clark  wrote:
> > >
> > > > For something like just notifying a compositor that a gpu crash
> > > > happened, perhaps drm_event is more suitable.  See
> > > > virtio_gpu_fence_event_create() for an example of adding new event
> > > > types.  Although maybe you want it to be an event which is not device
> > > > specific.  This isn't so much of a debugging use-case as simply
> > > > notification.
> > >
> > > Hi,
> > >
> > > for this particular use case, are we now talking about the display
> > > device (KMS) crashing or the rendering device (OpenGL/Vulkan) crashing?
> > >
> > > If the former, I wasn't aware that display device crashes are a thing.
> > > How should a userspace display server react to those?
> > >
> > > If the latter, don't we have EGL extensions or Vulkan API already to
> > > deliver that?
> > >
> > > The above would be about device crashes that directly affect the
> > > display server. Is that the use case in mind here, or is it instead
> > > about notifying the display server that some application has caused a
> > > driver/hardware crash? If the latter, how should a display server react
> > > to that? Disconnect the application?
> > >
> > > Shashank, what is the actual use case you are developing this for?
> > >
> > > I've read all the emails here so far, and I don't recall seeing it
> > > explained.
> > >
> >
> > The idea is that a support daemon or compositor would listen for GPU
> > reset notifications and do something useful with them (kill the guilty
> > app, restart the desktop environment, etc.).  Today when the GPU
> > resets, most applications just continue assuming nothing is wrong,
> > meanwhile the GPU has stopped accepting work until the apps re-init
> > their context so all of their command submissions just get rejected.
> >
> > > Btw. somewhat relatedly, there has been work aiming to allow
> > > graceful hot-unplug of DRM devices. There is a kernel doc outlining how
> > > the various APIs should react towards userspace when a DRM device
> > > suddenly disappears. That seems to have some overlap here IMO.
> > >
> > > See 
> > > https://www.kernel.org/doc/html/latest/gpu/drm-uapi.html#device-hot-unplug
> > > which also has a couple pointers to EGL and Vulkan APIs.
> >
> > The problem is most applications don't use the GL or VK robustness
> > APIs.
>
> Hi,
>
> how would this new event help with that?

This event would provide notification that a GPU reset occurred.

>
> I mean, yeah, there could be a daemon that kills those GPU users, but
> then what? You still lose any unsaved work, and may need to manually
> restart them.
>
> Is the idea that it is better to have the app crash and disappear than
> to look like it froze while it otherwise still runs?

Yes.  The daemon could also send the user some sort of notification
that a GPU reset occurred.

>
> If some daemon or compositor goes killing apps that trigger GPU resets,
> then how do we stop that for an app that actually does use the
> appropriate EGL or Vulkan APIs to detect and remedy that situation
> itself?

I guess the daemon could keep some sort of whitelist.  OTOH, very few
applications, if any (especially games), actually support these
extensions.

>
> >  You could use something like that in the compositor, but those
> > APIs tend to be focused more on the application itself rather than the
> > GPU in general.  E.g., Is my context lost.  Which is fine for
> > restarting your context, but doesn't really help if you want to try
> > and do something with another application (i.e., the likely guilty
> > app).  Also, on dGPU at least, when you reset the GPU, vram is usually
> > lost (either due to the memory controller being reset, or vram being
> > zero'd on init due to ECC support), so even if you are not the guilty
> > process, in that case you'd need to re-init your context anyway.
>
> Why should something like a compositor listen for this and kill apps
> that triggered GPU resets, instead of e.g. Mesa noticing that in the app
> and killing itself? Mesa in the app would know if robustness API is
> being used.

That's another possibility, but it doesn't handle the case where the
compositor doesn't support any sort of robustness extension so if the
GPU was reset, you'd lose your desktop anyway even if the app kept
running.
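
For reference, the opt-in on the compositor side is small (untested sketch,
using EGL_EXT_create_context_robustness):

#include <EGL/egl.h>
#include <EGL/eglext.h>

/* ask for a robust context with reset notification; the compositor can
 * then poll glGetGraphicsResetStatus*() and rebuild its GPU state after
 * a reset instead of silently losing the desktop */
static const EGLint robust_ctx_attribs[] = {
	EGL_CONTEXT_CLIENT_VERSION, 2,
	EGL_CONTEXT_OPENGL_ROBUST_ACCESS_EXT, EGL_TRUE,
	EGL_CONTEXT_OPENGL_RESET_NOTIFICATION_STRATEGY_EXT,
	EGL_LOSE_CONTEXT_ON_RESET_EXT,
	EGL_NONE
};

/* ctx = eglCreateContext(dpy, cfg, EGL_NO_CONTEXT, robust_ctx_attribs); */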

>
> Would be really nice to have the answers to all these questions to be
> collected and reiterated in the next version of this proposal.

The idea is to provide the notification of a GPU reset.  What the
various desktop environments or daemons do with it is up to them.  I
still think there is value in a notification even if you don't kill
apps or anything like that.  E.g., you can have a daemon running that
gets notified and logs the error, collects debug info, sends an 

Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-15 Thread Christian König

On 15.03.22 at 08:13, Dave Airlie wrote:

On Tue, 15 Mar 2022 at 00:23, Alex Deucher  wrote:

On Fri, Mar 11, 2022 at 3:30 AM Pekka Paalanen  wrote:

On Thu, 10 Mar 2022 11:56:41 -0800
Rob Clark  wrote:


For something like just notifying a compositor that a gpu crash
happened, perhaps drm_event is more suitable.  See
virtio_gpu_fence_event_create() for an example of adding new event
types.  Although maybe you want it to be an event which is not device
specific.  This isn't so much of a debugging use-case as simply
notification.

Hi,

for this particular use case, are we now talking about the display
device (KMS) crashing or the rendering device (OpenGL/Vulkan) crashing?

If the former, I wasn't aware that display device crashes are a thing.
How should a userspace display server react to those?

If the latter, don't we have EGL extensions or Vulkan API already to
deliver that?

The above would be about device crashes that directly affect the
display server. Is that the use case in mind here, or is it instead
about notifying the display server that some application has caused a
driver/hardware crash? If the latter, how should a display server react
to that? Disconnect the application?

Shashank, what is the actual use case you are developing this for?

I've read all the emails here so far, and I don't recall seeing it
explained.


The idea is that a support daemon or compositor would listen for GPU
reset notifications and do something useful with them (kill the guilty
app, restart the desktop environment, etc.).  Today when the GPU
resets, most applications just continue assuming nothing is wrong,
meanwhile the GPU has stopped accepting work until the apps re-init
their context so all of their command submissions just get rejected.

Just one thing comes to mind reading this, racy PID reuse.

process 1234 does something bad to GPU.
process 1234 dies in parallel to sysfs notification being sent.
other process 1234 reuses the pid
new process 1234 gets destroyed by receiver of sysfs notification.


That's a well-known problem inherent to the use of PIDs.

IIRC because of this the kernel only reuses PIDs when 
/proc/sys/kernel/pid_max is reached and then wraps around.


Regards,
Christian.



Dave.




Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-15 Thread Simon Ser
On Tuesday, March 15th, 2022 at 08:13, Dave Airlie  wrote:

> Just one thing comes to mind reading this, racy PID reuse.
>
> process 1234 does something bad to GPU.
> process 1234 dies in parallel to sysfs notification being sent.
> other process 1234 reuses the pid
> new process 1234 gets destroyed by receiver of sysfs notification.

This is a more general problem with PIDs. One solution is pidfd.
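
A sketch of the consumer side of that (raw pidfd_open(2)/pidfd_send_signal(2)
syscalls since not every libc wraps them; note that opening by numeric pid can
still race, so ideally the kernel would hand out the pidfd or some other
stable identifier itself):

#include <signal.h>
#include <sys/syscall.h>
#include <unistd.h>

/* kill the reported process through a pidfd: once the fd is open, a
 * recycled pid can no longer be signalled by mistake */
static int kill_via_pidfd(pid_t pid)
{
	int ret, pidfd = syscall(SYS_pidfd_open, pid, 0);

	if (pidfd < 0)
		return -1;              /* already gone */

	ret = syscall(SYS_pidfd_send_signal, pidfd, SIGKILL, NULL, 0);
	close(pidfd);
	return ret;
}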


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-15 Thread Dave Airlie
On Tue, 15 Mar 2022 at 00:23, Alex Deucher  wrote:
>
> On Fri, Mar 11, 2022 at 3:30 AM Pekka Paalanen  wrote:
> >
> > On Thu, 10 Mar 2022 11:56:41 -0800
> > Rob Clark  wrote:
> >
> > > For something like just notifying a compositor that a gpu crash
> > > happened, perhaps drm_event is more suitable.  See
> > > virtio_gpu_fence_event_create() for an example of adding new event
> > > types.  Although maybe you want it to be an event which is not device
> > > specific.  This isn't so much of a debugging use-case as simply
> > > notification.
> >
> > Hi,
> >
> > for this particular use case, are we now talking about the display
> > device (KMS) crashing or the rendering device (OpenGL/Vulkan) crashing?
> >
> > If the former, I wasn't aware that display device crashes are a thing.
> > How should a userspace display server react to those?
> >
> > If the latter, don't we have EGL extensions or Vulkan API already to
> > deliver that?
> >
> > The above would be about device crashes that directly affect the
> > display server. Is that the use case in mind here, or is it instead
> > about notifying the display server that some application has caused a
> > driver/hardware crash? If the latter, how should a display server react
> > to that? Disconnect the application?
> >
> > Shashank, what is the actual use case you are developing this for?
> >
> > I've read all the emails here so far, and I don't recall seeing it
> > explained.
> >
>
> The idea is that a support daemon or compositor would listen for GPU
> reset notifications and do something useful with them (kill the guilty
> app, restart the desktop environment, etc.).  Today when the GPU
> resets, most applications just continue assuming nothing is wrong,
> meanwhile the GPU has stopped accepting work until the apps re-init
> their context so all of their command submissions just get rejected.

Just one thing comes to mind reading this, racy PID reuse.

process 1234 does something bad to GPU.
process 1234 dies in parallel to sysfs notification being sent.
other process 1234 reuses the pid
new process 1234 gets destroyed by receiver of sysfs notification.

Dave.


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-14 Thread Pekka Paalanen
On Mon, 14 Mar 2022 10:23:27 -0400
Alex Deucher  wrote:

> On Fri, Mar 11, 2022 at 3:30 AM Pekka Paalanen  wrote:
> >
> > On Thu, 10 Mar 2022 11:56:41 -0800
> > Rob Clark  wrote:
> >  
> > > For something like just notifying a compositor that a gpu crash
> > > happened, perhaps drm_event is more suitable.  See
> > > virtio_gpu_fence_event_create() for an example of adding new event
> > > types.  Although maybe you want it to be an event which is not device
> > > specific.  This isn't so much of a debugging use-case as simply
> > > notification.  
> >
> > Hi,
> >
> > for this particular use case, are we now talking about the display
> > device (KMS) crashing or the rendering device (OpenGL/Vulkan) crashing?
> >
> > If the former, I wasn't aware that display device crashes are a thing.
> > How should a userspace display server react to those?
> >
> > If the latter, don't we have EGL extensions or Vulkan API already to
> > deliver that?
> >
> > The above would be about device crashes that directly affect the
> > display server. Is that the use case in mind here, or is it instead
> > about notifying the display server that some application has caused a
> > driver/hardware crash? If the latter, how should a display server react
> > to that? Disconnect the application?
> >
> > Shashank, what is the actual use case you are developing this for?
> >
> > I've read all the emails here so far, and I don't recall seeing it
> > explained.
> >  
> 
> The idea is that a support daemon or compositor would listen for GPU
> reset notifications and do something useful with them (kill the guilty
> app, restart the desktop environment, etc.).  Today when the GPU
> resets, most applications just continue assuming nothing is wrong,
> meanwhile the GPU has stopped accepting work until the apps re-init
> their context so all of their command submissions just get rejected.
> 
> > Btw. somewhat relatedly, there has been work aiming to allow
> > graceful hot-unplug of DRM devices. There is a kernel doc outlining how
> > the various APIs should react towards userspace when a DRM device
> > suddenly disappears. That seems to have some overlap here IMO.
> >
> > See 
> > https://www.kernel.org/doc/html/latest/gpu/drm-uapi.html#device-hot-unplug
> > which also has a couple pointers to EGL and Vulkan APIs.  
> 
> The problem is most applications don't use the GL or VK robustness
> APIs.

Hi,

how would this new event help with that?

I mean, yeah, there could be a daemon that kills those GPU users, but
then what? You still lose any unsaved work, and may need to manually
restart them.

Is the idea that it is better to have the app crash and disappear than
to look like it froze while it otherwise still runs?

If some daemon or compositor goes killing apps that trigger GPU resets,
then how do we stop that for an app that actually does use the
appropriate EGL or Vulkan APIs to detect and remedy that situation
itself?

>  You could use something like that in the compositor, but those
> APIs tend to be focused more on the application itself rather than the
> GPU in general.  E.g., Is my context lost.  Which is fine for
> restarting your context, but doesn't really help if you want to try
> and do something with another application (i.e., the likely guilty
> app).  Also, on dGPU at least, when you reset the GPU, vram is usually
> lost (either due to the memory controller being reset, or vram being
> zero'd on init due to ECC support), so even if you are not the guilty
> process, in that case you'd need to re-init your context anyway.

Why should something like a compositor listen for this and kill apps
that triggered GPU resets, instead of e.g. Mesa noticing that in the app
and killing itself? Mesa in the app would know if robustness API is
being used.
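
For context, the per-application opt-in being discussed looks roughly
like the sketch below on the GLES side; a rough, untested example using
GL_KHR_robustness, assuming a context created with robust access
requested through EGL. The function name app_handles_its_own_reset is
purely illustrative, not anything Mesa exposes:

#include <EGL/egl.h>
#include <GLES2/gl2.h>
#include <GLES2/gl2ext.h>

static int app_handles_its_own_reset(void)
{
        PFNGLGETGRAPHICSRESETSTATUSKHRPROC get_status =
                (PFNGLGETGRAPHICSRESETSTATUSKHRPROC)
                eglGetProcAddress("glGetGraphicsResetStatusKHR");

        if (!get_status)
                return 0;       /* GL_KHR_robustness not exposed */

        switch (get_status()) {
        case GL_GUILTY_CONTEXT_RESET_KHR:
        case GL_INNOCENT_CONTEXT_RESET_KHR:
        case GL_UNKNOWN_CONTEXT_RESET_KHR:
                /* the app tears down and recreates its context itself */
                return 1;
        default:
                return 0;       /* GL_NO_ERROR: nothing happened */
        }
}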

Would be really nice to have the answers to all these questions to be
collected and reiterated in the next version of this proposal.


Thanks,
pq




Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-14 Thread Alex Deucher
On Fri, Mar 11, 2022 at 3:30 AM Pekka Paalanen  wrote:
>
> On Thu, 10 Mar 2022 11:56:41 -0800
> Rob Clark  wrote:
>
> > For something like just notifying a compositor that a gpu crash
> > happened, perhaps drm_event is more suitable.  See
> > virtio_gpu_fence_event_create() for an example of adding new event
> > types.  Although maybe you want it to be an event which is not device
> > specific.  This isn't so much of a debugging use-case as simply
> > notification.
>
> Hi,
>
> for this particular use case, are we now talking about the display
> device (KMS) crashing or the rendering device (OpenGL/Vulkan) crashing?
>
> If the former, I wasn't aware that display device crashes are a thing.
> How should a userspace display server react to those?
>
> If the latter, don't we have EGL extensions or Vulkan API already to
> deliver that?
>
> The above would be about device crashes that directly affect the
> display server. Is that the use case in mind here, or is it instead
> about notifying the display server that some application has caused a
> driver/hardware crash? If the latter, how should a display server react
> to that? Disconnect the application?
>
> Shashank, what is the actual use case you are developing this for?
>
> I've read all the emails here so far, and I don't recall seeing it
> explained.
>

The idea is that a support daemon or compositor would listen for GPU
reset notifications and do something useful with them (kill the guilty
app, restart the desktop environment, etc.).  Today when the GPU
resets, most applications just continue assuming nothing is wrong,
meanwhile the GPU has stopped accepting work until the apps re-init
their context so all of their command submissions just get rejected.

> Btw. somewhat relatedly, there has been work aiming to allow
> graceful hot-unplug of DRM devices. There is a kernel doc outlining how
> the various APIs should react towards userspace when a DRM device
> suddenly disappears. That seems to have some overlap here IMO.
>
> See https://www.kernel.org/doc/html/latest/gpu/drm-uapi.html#device-hot-unplug
> which also has a couple pointers to EGL and Vulkan APIs.

The problem is most applications don't use the GL or VK robustness
APIs.  You could use something like that in the compositor, but those
APIs tend to be focused more on the application itself rather than the
GPU in general.  E.g., Is my context lost.  Which is fine for
restarting your context, but doesn't really help if you want to try
and do something with another application (i.e., the likely guilty
app).  Also, on dGPU at least, when you reset the GPU, vram is usually
lost (either due to the memory controller being reset, or vram being
zero'd on init due to ECC support), so even if you are not the guilty
process, in that case you'd need to re-init your context anyway.

Alex

>
>
> Thanks,
> pq


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-11 Thread Pekka Paalanen
On Thu, 10 Mar 2022 11:56:41 -0800
Rob Clark  wrote:

> For something like just notifying a compositor that a gpu crash
> happened, perhaps drm_event is more suitable.  See
> virtio_gpu_fence_event_create() for an example of adding new event
> types.  Although maybe you want it to be an event which is not device
> specific.  This isn't so much of a debugging use-case as simply
> notification.

Hi,

for this particular use case, are we now talking about the display
device (KMS) crashing or the rendering device (OpenGL/Vulkan) crashing?

If the former, I wasn't aware that display device crashes are a thing.
How should a userspace display server react to those?

If the latter, don't we have EGL extensions or Vulkan API already to
deliver that?

The above would be about device crashes that directly affect the
display server. Is that the use case in mind here, or is it instead
about notifying the display server that some application has caused a
driver/hardware crash? If the latter, how should a display server react
to that? Disconnect the application?

Shashank, what is the actual use case you are developing this for?

I've read all the emails here so far, and I don't recall seeing it
explained.

Btw. somewhat relatedly, there has been work aiming to allow
graceful hot-unplug of DRM devices. There is a kernel doc outlining how
the various APIs should react towards userspace when a DRM device
suddenly disappears. That seems to have some overlap here IMO.

See https://www.kernel.org/doc/html/latest/gpu/drm-uapi.html#device-hot-unplug
which also has a couple pointers to EGL and Vulkan APIs.


Thanks,
pq




Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-10 Thread Sharma, Shashank




On 3/10/2022 8:56 PM, Rob Clark wrote:

On Thu, Mar 10, 2022 at 11:44 AM Sharma, Shashank
 wrote:




On 3/10/2022 8:35 PM, Rob Clark wrote:

On Thu, Mar 10, 2022 at 11:14 AM Sharma, Shashank
 wrote:




On 3/10/2022 7:33 PM, Abhinav Kumar wrote:



On 3/10/2022 9:40 AM, Rob Clark wrote:

On Thu, Mar 10, 2022 at 9:19 AM Sharma, Shashank
 wrote:




On 3/10/2022 6:10 PM, Rob Clark wrote:

On Thu, Mar 10, 2022 at 8:21 AM Sharma, Shashank
 wrote:




On 3/10/2022 4:24 PM, Rob Clark wrote:

On Thu, Mar 10, 2022 at 1:55 AM Christian König
 wrote:




Am 09.03.22 um 19:12 schrieb Rob Clark:

On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
 wrote:

From: Shashank Sharma 

This patch adds a new sysfs event, which will indicate
the userland about a GPU reset, and can also provide
some information like:
- process ID of the process involved with the GPU reset
- process name of the involved process
- the GPU status info (using flags)

This patch also introduces the first flag of the flags
bitmap, which can be appended as and when required.

Why invent something new, rather than using the already existing
devcoredump?


Yeah, that's a really valid question.


I don't think we need (or should encourage/allow) something drm
specific when there is already an existing solution used by both
drm
and non-drm drivers.  Userspace should not have to learn to support
yet another mechanism to do the same thing.


Question is how is userspace notified about new available core
dumps?


I haven't looked into it too closely, as the CrOS userspace
crash-reporter already had support for devcoredump, so it "just
worked" out of the box[1].  I believe a udev event is what triggers
the crash-reporter to go read the devcore dump out of sysfs.


I had a quick look at the devcoredump code, and it doesn't look like
that is sending an event to the user, so we still need an event to
indicate a GPU reset.


There definitely is an event to userspace, I suspect somewhere down
the device_add() path?



Let me check that out as well, hope that is not due to a driver-private
event for GPU reset, coz I think I have seen some of those in a few DRM
drivers.



Definitely no driver private event for drm/msm .. I haven't dug
through it all but this is the collector for devcoredump, triggered
somehow via udev.  Most likely from event triggered by device_add()

https://chromium.googlesource.com/chromiumos/platform2/+/HEAD/crash-reporter/udev_collector.cc



Yes, that is correct. the uevent for devcoredump is from device_add()


Yes, I could confirm in the code that device_add() sends a uevent.

kobject_uevent(&dev->kobj, KOBJ_ADD);

I was trying to map the ChromiumOs's udev event rules with the event
being sent from device_add(), what I could see is there is only one udev
rule for any DRM subsystem events in ChromiumOs's 99-crash-reporter.rules:

ACTION=="change", SUBSYSTEM=="drm", KERNEL=="card0", ENV{ERROR}=="1", \
 RUN+="/sbin/crash_reporter
--udev=KERNEL=card0:SUBSYSTEM=drm:ACTION=change"

Can someone confirm that this is the rule which gets triggered when a
devcoredump is generated? I could not find an ERROR=1 string in the
env[] while sending this event from device_add();


I think it is actually this rule:

ACTION=="add", SUBSYSTEM=="devcoredump", \
RUN+="/sbin/crash_reporter
--udev=SUBSYSTEM=devcoredump:ACTION=add:KERNEL_NUMBER=%n"

It is something non-drm specific because it supports devcore dumps
from non drm drivers.  I know at least some of the wifi and remoteproc
drivers use it.
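
For reference, a userspace daemon that wants to catch exactly those
events could do something along these lines with libudev; a minimal,
untested sketch, filtering on the same subsystem the rule above matches:

#include <libudev.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
        struct udev *udev = udev_new();
        struct udev_monitor *mon;
        struct pollfd pfd;

        if (!udev)
                return 1;

        mon = udev_monitor_new_from_netlink(udev, "udev");
        udev_monitor_filter_add_match_subsystem_devtype(mon, "devcoredump", NULL);
        udev_monitor_enable_receiving(mon);
        pfd.fd = udev_monitor_get_fd(mon);
        pfd.events = POLLIN;

        for (;;) {
                struct udev_device *dev;

                if (poll(&pfd, 1, -1) <= 0)
                        continue;
                dev = udev_monitor_receive_device(mon);
                if (!dev)
                        continue;
                if (!strcmp(udev_device_get_action(dev), "add"))
                        printf("new devcoredump: %s\n",
                               udev_device_get_syspath(dev));
                udev_device_unref(dev);
        }
}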



Ah, this seems like a problem for me. I understand it will work well for
a reset/recovery app, but for a DRM client (like a compositor) that wants
to listen only to DRM events (like a GPU reset), wouldn't this create a
lot of noise? Every time any subsystem produces such a coredump there
will be a change in the devcoredump subsystem, and the client will have
to parse the core file and then decide whether to react to it or ignore
it.

Wouldn't a GPU reset event, specific to the DRM subsystem, serve better
in such a case?



So, I suppose there are two different use-cases.. for something like
distro which has generic crash telemetry (ie. when users opt in to
automated crash reporting), and in general for debugging gpu crashes,
you want devcoredump, preferably with plenty of information about gpu
state, etc, so you actually have a chance of debugging problems you
can't necessarily reproduce locally.  Note also that mesa CI has some
limited support for collecting devcore dumps if a CI run triggers a
GPU fault.

For something like just notifying a compositor that a gpu crash

Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-10 Thread Rob Clark
On Thu, Mar 10, 2022 at 11:44 AM Sharma, Shashank
 wrote:
>
>
>
> On 3/10/2022 8:35 PM, Rob Clark wrote:
> > On Thu, Mar 10, 2022 at 11:14 AM Sharma, Shashank
> >  wrote:
> >>
> >>
> >>
> >> On 3/10/2022 7:33 PM, Abhinav Kumar wrote:
> >>>
> >>>
> >>> On 3/10/2022 9:40 AM, Rob Clark wrote:
>  On Thu, Mar 10, 2022 at 9:19 AM Sharma, Shashank
>   wrote:
> >
> >
> >
> > On 3/10/2022 6:10 PM, Rob Clark wrote:
> >> On Thu, Mar 10, 2022 at 8:21 AM Sharma, Shashank
> >>  wrote:
> >>>
> >>>
> >>>
> >>> On 3/10/2022 4:24 PM, Rob Clark wrote:
>  On Thu, Mar 10, 2022 at 1:55 AM Christian König
>   wrote:
> >
> >
> >
> > Am 09.03.22 um 19:12 schrieb Rob Clark:
> >> On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
> >>  wrote:
> >>> From: Shashank Sharma 
> >>>
> >>> This patch adds a new sysfs event, which will indicate
> >>> the userland about a GPU reset, and can also provide
> >>> some information like:
> >>> - process ID of the process involved with the GPU reset
> >>> - process name of the involved process
> >>> - the GPU status info (using flags)
> >>>
> >>> This patch also introduces the first flag of the flags
> >>> bitmap, which can be appended as and when required.
> >> Why invent something new, rather than using the already existing
> >> devcoredump?
> >
> > Yeah, that's a really valid question.
> >
> >> I don't think we need (or should encourage/allow) something drm
> >> specific when there is already an existing solution used by both
> >> drm
> >> and non-drm drivers.  Userspace should not have to learn to support
> >> yet another mechanism to do the same thing.
> >
> > Question is how is userspace notified about new available core
> > dumps?
> 
>  I haven't looked into it too closely, as the CrOS userspace
>  crash-reporter already had support for devcoredump, so it "just
>  worked" out of the box[1].  I believe a udev event is what triggers
>  the crash-reporter to go read the devcore dump out of sysfs.
> >>>
> >>> I had a quick look at the devcoredump code, and it doesn't look like
> >>> that is sending an event to the user, so we still need an event to
> >>> indicate a GPU reset.
> >>
> >> There definitely is an event to userspace, I suspect somewhere down
> >> the device_add() path?
> >>
> >
> > Let me check that out as well, hope that is not due to a driver-private
> > event for GPU reset, coz I think I have seen some of those in a few DRM
> > drivers.
> >
> 
>  Definitely no driver private event for drm/msm .. I haven't dug
>  through it all but this is the collector for devcoredump, triggered
>  somehow via udev.  Most likely from event triggered by device_add()
> 
>  https://chromium.googlesource.com/chromiumos/platform2/+/HEAD/crash-reporter/udev_collector.cc
> 
> >>>
> >>> Yes, that is correct. the uevent for devcoredump is from device_add()
> >>>
> >> Yes, I could confirm in the code that device_add() sends a uevent.
> >>
> >> kobject_uevent(&dev->kobj, KOBJ_ADD);
> >>
> >> I was trying to map the ChromiumOs's udev event rules with the event
> >> being sent from device_add(), what I could see is there is only one udev
> >> rule for any DRM subsystem events in ChromiumOs's 99-crash-reporter.rules:
> >>
> >> ACTION=="change", SUBSYSTEM=="drm", KERNEL=="card0", ENV{ERROR}=="1", \
> >> RUN+="/sbin/crash_reporter
> >> --udev=KERNEL=card0:SUBSYSTEM=drm:ACTION=change"
> >>
> >> Can someone confirm that this is the rule which gets triggered when a
> >> devcoredump is generated? I could not find an ERROR=1 string in the
> >> env[] while sending this event from device_add();
> >
> > I think it is actually this rule:
> >
> > ACTION=="add", SUBSYSTEM=="devcoredump", \
> >RUN+="/sbin/crash_reporter
> > --udev=SUBSYSTEM=devcoredump:ACTION=add:KERNEL_NUMBER=%n"
> >
> > It is something non-drm specific because it supports devcore dumps
> > from non drm drivers.  I know at least some of the wifi and remoteproc
> > drivers use it.
> >
>
> Ah, this seems like a problem for me. I understand it will work for a
> reset/recovery app well, but if a DRM client (like a compositor), who
> wants to listen only to DRM events (like a GPU reset), wouldn't this
> create a lot of noise for it ? Like every time any subsystem produces
> this 

Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-10 Thread Sharma, Shashank




On 3/10/2022 8:35 PM, Rob Clark wrote:

On Thu, Mar 10, 2022 at 11:14 AM Sharma, Shashank
 wrote:




On 3/10/2022 7:33 PM, Abhinav Kumar wrote:



On 3/10/2022 9:40 AM, Rob Clark wrote:

On Thu, Mar 10, 2022 at 9:19 AM Sharma, Shashank
 wrote:




On 3/10/2022 6:10 PM, Rob Clark wrote:

On Thu, Mar 10, 2022 at 8:21 AM Sharma, Shashank
 wrote:




On 3/10/2022 4:24 PM, Rob Clark wrote:

On Thu, Mar 10, 2022 at 1:55 AM Christian König
 wrote:




Am 09.03.22 um 19:12 schrieb Rob Clark:

On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
 wrote:

From: Shashank Sharma 

This patch adds a new sysfs event, which will indicate
the userland about a GPU reset, and can also provide
some information like:
- process ID of the process involved with the GPU reset
- process name of the involved process
- the GPU status info (using flags)

This patch also introduces the first flag of the flags
bitmap, which can be appended as and when required.

Why invent something new, rather than using the already existing
devcoredump?


Yeah, that's a really valid question.


I don't think we need (or should encourage/allow) something drm
specific when there is already an existing solution used by both
drm
and non-drm drivers.  Userspace should not have to learn to support
yet another mechanism to do the same thing.


Question is how is userspace notified about new available core
dumps?


I haven't looked into it too closely, as the CrOS userspace
crash-reporter already had support for devcoredump, so it "just
worked" out of the box[1].  I believe a udev event is what triggers
the crash-reporter to go read the devcore dump out of sysfs.


I had a quick look at the devcoredump code, and it doesn't look like
that is sending an event to the user, so we still need an event to
indicate a GPU reset.


There definitely is an event to userspace, I suspect somewhere down
the device_add() path?



Let me check that out as well, hope that is not due to a driver-private
event for GPU reset, coz I think I have seen some of those in a few DRM
drivers.



Definitely no driver private event for drm/msm .. I haven't dug
through it all but this is the collector for devcoredump, triggered
somehow via udev.  Most likely from event triggered by device_add()

https://chromium.googlesource.com/chromiumos/platform2/+/HEAD/crash-reporter/udev_collector.cc



Yes, that is correct. the uevent for devcoredump is from device_add()


Yes, I could confirm in the code that device_add() sends a uevent.

kobject_uevent(&dev->kobj, KOBJ_ADD);

I was trying to map the ChromiumOs's udev event rules with the event
being sent from device_add(), what I could see is there is only one udev
rule for any DRM subsystem events in ChromiumOs's 99-crash-reporter.rules:

ACTION=="change", SUBSYSTEM=="drm", KERNEL=="card0", ENV{ERROR}=="1", \
RUN+="/sbin/crash_reporter
--udev=KERNEL=card0:SUBSYSTEM=drm:ACTION=change"

Can someone confirm that this is the rule which gets triggered when a
devcoredump is generated? I could not find an ERROR=1 string in the
env[] while sending this event from device_add();


I think it is actually this rule:

ACTION=="add", SUBSYSTEM=="devcoredump", \
   RUN+="/sbin/crash_reporter
--udev=SUBSYSTEM=devcoredump:ACTION=add:KERNEL_NUMBER=%n"

It is something non-drm specific because it supports devcore dumps
from non drm drivers.  I know at least some of the wifi and remoteproc
drivers use it.



Ah, this seems like a problem for me. I understand it will work well for
a reset/recovery app, but for a DRM client (like a compositor) that wants
to listen only to DRM events (like a GPU reset), wouldn't this create a
lot of noise? Every time any subsystem produces such a coredump there
will be a change in the devcoredump subsystem, and the client will have
to parse the core file and then decide whether to react to it or ignore
it.


Wouldn't a GPU reset event, specific to the DRM subsystem, serve better
in such a case?


- Shashank


BR,
-R


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-10 Thread Rob Clark
On Thu, Mar 10, 2022 at 11:14 AM Sharma, Shashank
 wrote:
>
>
>
> On 3/10/2022 7:33 PM, Abhinav Kumar wrote:
> >
> >
> > On 3/10/2022 9:40 AM, Rob Clark wrote:
> >> On Thu, Mar 10, 2022 at 9:19 AM Sharma, Shashank
> >>  wrote:
> >>>
> >>>
> >>>
> >>> On 3/10/2022 6:10 PM, Rob Clark wrote:
>  On Thu, Mar 10, 2022 at 8:21 AM Sharma, Shashank
>   wrote:
> >
> >
> >
> > On 3/10/2022 4:24 PM, Rob Clark wrote:
> >> On Thu, Mar 10, 2022 at 1:55 AM Christian König
> >>  wrote:
> >>>
> >>>
> >>>
> >>> Am 09.03.22 um 19:12 schrieb Rob Clark:
>  On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
>   wrote:
> > From: Shashank Sharma 
> >
> > This patch adds a new sysfs event, which will indicate
> > the userland about a GPU reset, and can also provide
> > some information like:
> > - process ID of the process involved with the GPU reset
> > - process name of the involved process
> > - the GPU status info (using flags)
> >
> > This patch also introduces the first flag of the flags
> > bitmap, which can be appended as and when required.
>  Why invent something new, rather than using the already existing
>  devcoredump?
> >>>
> >>> Yeah, that's a really valid question.
> >>>
>  I don't think we need (or should encourage/allow) something drm
>  specific when there is already an existing solution used by both
>  drm
>  and non-drm drivers.  Userspace should not have to learn to support
>  yet another mechanism to do the same thing.
> >>>
> >>> Question is how is userspace notified about new available core
> >>> dumps?
> >>
> >> I haven't looked into it too closely, as the CrOS userspace
> >> crash-reporter already had support for devcoredump, so it "just
> >> worked" out of the box[1].  I believe a udev event is what triggers
> >> the crash-reporter to go read the devcore dump out of sysfs.
> >
> > I had a quick look at the devcoredump code, and it doesn't look like
> > that is sending an event to the user, so we still need an event to
> > indicate a GPU reset.
> 
>  There definitely is an event to userspace, I suspect somewhere down
>  the device_add() path?
> 
> >>>
> >>> Let me check that out as well, hope that is not due to a driver-private
> >>> event for GPU reset, coz I think I have seen some of those in a few DRM
> >>> drivers.
> >>>
> >>
> >> Definitely no driver private event for drm/msm .. I haven't dug
> >> through it all but this is the collector for devcoredump, triggered
> >> somehow via udev.  Most likely from event triggered by device_add()
> >>
> >> https://chromium.googlesource.com/chromiumos/platform2/+/HEAD/crash-reporter/udev_collector.cc
> >>
> >
> > Yes, that is correct. the uevent for devcoredump is from device_add()
> >
> Yes, I could confirm in the code that device_add() sends a uevent.
>
> kobject_uevent(&dev->kobj, KOBJ_ADD);
>
> I was trying to map the ChromiumOs's udev event rules with the event
> being sent from device_add(), what I could see is there is only one udev
> rule for any DRM subsystem events in ChromiumOs's 99-crash-reporter.rules:
>
> ACTION=="change", SUBSYSTEM=="drm", KERNEL=="card0", ENV{ERROR}=="1", \
>RUN+="/sbin/crash_reporter
> --udev=KERNEL=card0:SUBSYSTEM=drm:ACTION=change"
>
> Can someone confirm that this is the rule which gets triggered when a
> devcoredump is generated? I could not find an ERROR=1 string in the
> env[] while sending this event from device_add();

I think it is actually this rule:

ACTION=="add", SUBSYSTEM=="devcoredump", \
  RUN+="/sbin/crash_reporter
--udev=SUBSYSTEM=devcoredump:ACTION=add:KERNEL_NUMBER=%n"

It is something non-drm specific because it supports devcore dumps
from non drm drivers.  I know at least some of the wifi and remoteproc
drivers use it.

BR,
-R


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-10 Thread Sharma, Shashank




On 3/10/2022 7:33 PM, Abhinav Kumar wrote:



On 3/10/2022 9:40 AM, Rob Clark wrote:

On Thu, Mar 10, 2022 at 9:19 AM Sharma, Shashank
 wrote:




On 3/10/2022 6:10 PM, Rob Clark wrote:

On Thu, Mar 10, 2022 at 8:21 AM Sharma, Shashank
 wrote:




On 3/10/2022 4:24 PM, Rob Clark wrote:

On Thu, Mar 10, 2022 at 1:55 AM Christian König
 wrote:




Am 09.03.22 um 19:12 schrieb Rob Clark:

On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
 wrote:

From: Shashank Sharma 

This patch adds a new sysfs event, which will indicate
the userland about a GPU reset, and can also provide
some information like:
- process ID of the process involved with the GPU reset
- process name of the involved process
- the GPU status info (using flags)

This patch also introduces the first flag of the flags
bitmap, which can be appended as and when required.
Why invent something new, rather than using the already existing 
devcoredump?


Yeah, that's a really valid question.


I don't think we need (or should encourage/allow) something drm
specific when there is already an existing solution used by both 
drm

and non-drm drivers.  Userspace should not have to learn to support
yet another mechanism to do the same thing.


Question is how is userspace notified about new available core 
dumps?


I haven't looked into it too closely, as the CrOS userspace
crash-reporter already had support for devcoredump, so it "just
worked" out of the box[1].  I believe a udev event is what triggers
the crash-reporter to go read the devcore dump out of sysfs.


I had a quick look at the devcoredump code, and it doesn't look like
that is sending an event to the user, so we still need an event to
indicate a GPU reset.


There definitely is an event to userspace, I suspect somewhere down
the device_add() path?



Let me check that out as well, hope that is not due to a driver-private
event for GPU reset, coz I think I have seen some of those in a few DRM
drivers.



Definitely no driver private event for drm/msm .. I haven't dug
through it all but this is the collector for devcoredump, triggered
somehow via udev.  Most likely from event triggered by device_add()

https://chromium.googlesource.com/chromiumos/platform2/+/HEAD/crash-reporter/udev_collector.cc



Yes, that is correct. the uevent for devcoredump is from device_add()


Yes, I could confirm in the code that device_add() sends a uevent.

kobject_uevent(&dev->kobj, KOBJ_ADD);

I was trying to map the ChromiumOs's udev event rules with the event 
being sent from device_add(), what I could see is there is only one udev 
rule for any DRM subsystem events in ChromiumOs's 99-crash-reporter.rules:


ACTION=="change", SUBSYSTEM=="drm", KERNEL=="card0", ENV{ERROR}=="1", \
  RUN+="/sbin/crash_reporter 
--udev=KERNEL=card0:SUBSYSTEM=drm:ACTION=change"


Can someone confirm that this is the rule which gets triggered when a 
devcoredump is generated? I could not find an ERROR=1 string in the
env[] while sending this event from device_add();


- Shashank

https://github.com/torvalds/linux/blob/master/drivers/base/core.c#L3340





BR,
-R


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-10 Thread Abhinav Kumar




On 3/10/2022 9:40 AM, Rob Clark wrote:

On Thu, Mar 10, 2022 at 9:19 AM Sharma, Shashank
 wrote:




On 3/10/2022 6:10 PM, Rob Clark wrote:

On Thu, Mar 10, 2022 at 8:21 AM Sharma, Shashank
 wrote:




On 3/10/2022 4:24 PM, Rob Clark wrote:

On Thu, Mar 10, 2022 at 1:55 AM Christian König
 wrote:




Am 09.03.22 um 19:12 schrieb Rob Clark:

On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
 wrote:

From: Shashank Sharma 

This patch adds a new sysfs event, which will indicate
the userland about a GPU reset, and can also provide
some information like:
- process ID of the process involved with the GPU reset
- process name of the involved process
- the GPU status info (using flags)

This patch also introduces the first flag of the flags
bitmap, which can be appended as and when required.

Why invent something new, rather than using the already existing devcoredump?


Yeah, that's a really valid question.


I don't think we need (or should encourage/allow) something drm
specific when there is already an existing solution used by both drm
and non-drm drivers.  Userspace should not have to learn to support
yet another mechanism to do the same thing.


Question is how is userspace notified about new available core dumps?


I haven't looked into it too closely, as the CrOS userspace
crash-reporter already had support for devcoredump, so it "just
worked" out of the box[1].  I believe a udev event is what triggers
the crash-reporter to go read the devcore dump out of sysfs.


I had a quick look at the devcoredump code, and it doesn't look like
that is sending an event to the user, so we still need an event to
indicate a GPU reset.


There definitely is an event to userspace, I suspect somewhere down
the device_add() path?



Let me check that out as well, hope that is not due to a driver-private
event for GPU reset, coz I think I have seen some of those in a few DRM
drivers.



Definitely no driver private event for drm/msm .. I haven't dug
through it all but this is the collector for devcoredump, triggered
somehow via udev.  Most likely from event triggered by device_add()

https://chromium.googlesource.com/chromiumos/platform2/+/HEAD/crash-reporter/udev_collector.cc


Yes, that is correct. the uevent for devcoredump is from device_add()

https://github.com/torvalds/linux/blob/master/drivers/base/core.c#L3340



BR,
-R


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-10 Thread Rob Clark
On Thu, Mar 10, 2022 at 9:19 AM Sharma, Shashank
 wrote:
>
>
>
> On 3/10/2022 6:10 PM, Rob Clark wrote:
> > On Thu, Mar 10, 2022 at 8:21 AM Sharma, Shashank
> >  wrote:
> >>
> >>
> >>
> >> On 3/10/2022 4:24 PM, Rob Clark wrote:
> >>> On Thu, Mar 10, 2022 at 1:55 AM Christian König
> >>>  wrote:
> 
> 
> 
>  Am 09.03.22 um 19:12 schrieb Rob Clark:
> > On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
> >  wrote:
> >> From: Shashank Sharma 
> >>
> >> This patch adds a new sysfs event, which will indicate
> >> the userland about a GPU reset, and can also provide
> >> some information like:
> >> - process ID of the process involved with the GPU reset
> >> - process name of the involved process
> >> - the GPU status info (using flags)
> >>
> >> This patch also introduces the first flag of the flags
> >> bitmap, which can be appended as and when required.
> > Why invent something new, rather than using the already existing 
> > devcoredump?
> 
>  Yeah, that's a really valid question.
> 
> > I don't think we need (or should encourage/allow) something drm
> > specific when there is already an existing solution used by both drm
> > and non-drm drivers.  Userspace should not have to learn to support
> > yet another mechanism to do the same thing.
> 
>  Question is how is userspace notified about new available core dumps?
> >>>
> >>> I haven't looked into it too closely, as the CrOS userspace
> >>> crash-reporter already had support for devcoredump, so it "just
> >>> worked" out of the box[1].  I believe a udev event is what triggers
> >>> the crash-reporter to go read the devcore dump out of sysfs.
> >>
> >> I had a quick look at the devcoredump code, and it doesn't look like
> >> that is sending an event to the user, so we still need an event to
> >> indicate a GPU reset.
> >
> > There definitely is an event to userspace, I suspect somewhere down
> > the device_add() path?
> >
>
> Let me check that out as well, hope that is not due to a driver-private
> event for GPU reset, coz I think I have seen some of those in a few DRM
> drivers.
>

Definitely no driver private event for drm/msm .. I haven't dug
through it all but this is the collector for devcoredump, triggered
somehow via udev.  Most likely from event triggered by device_add()

https://chromium.googlesource.com/chromiumos/platform2/+/HEAD/crash-reporter/udev_collector.cc

BR,
-R


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-10 Thread Sharma, Shashank




On 3/10/2022 6:10 PM, Rob Clark wrote:

On Thu, Mar 10, 2022 at 8:21 AM Sharma, Shashank
 wrote:




On 3/10/2022 4:24 PM, Rob Clark wrote:

On Thu, Mar 10, 2022 at 1:55 AM Christian König
 wrote:




Am 09.03.22 um 19:12 schrieb Rob Clark:

On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
 wrote:

From: Shashank Sharma 

This patch adds a new sysfs event, which will indicate
the userland about a GPU reset, and can also provide
some information like:
- process ID of the process involved with the GPU reset
- process name of the involved process
- the GPU status info (using flags)

This patch also introduces the first flag of the flags
bitmap, which can be appended as and when required.

Why invent something new, rather than using the already existing devcoredump?


Yeah, that's a really valid question.


I don't think we need (or should encourage/allow) something drm
specific when there is already an existing solution used by both drm
and non-drm drivers.  Userspace should not have to learn to support
yet another mechanism to do the same thing.


Question is how is userspace notified about new available core dumps?


I haven't looked into it too closely, as the CrOS userspace
crash-reporter already had support for devcoredump, so it "just
worked" out of the box[1].  I believe a udev event is what triggers
the crash-reporter to go read the devcore dump out of sysfs.


I had a quick look at the devcoredump code, and it doesn't look like
that is sending an event to the user, so we still need an event to
indicate a GPU reset.


There definitely is an event to userspace, I suspect somewhere down
the device_add() path?



Let me check that out as well, hope that is not due to a driver-private 
event for GPU reset, coz I think I have seen some of those in a few DRM 
drivers.


- Shashank


BR,
-R


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-10 Thread Rob Clark
On Thu, Mar 10, 2022 at 8:27 AM Andrey Grodzovsky
 wrote:
>
>
> On 2022-03-10 11:21, Sharma, Shashank wrote:
> >
> >
> > On 3/10/2022 4:24 PM, Rob Clark wrote:
> >> On Thu, Mar 10, 2022 at 1:55 AM Christian König
> >>  wrote:
> >>>
> >>>
> >>>
> >>> Am 09.03.22 um 19:12 schrieb Rob Clark:
>  On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
>   wrote:
> > From: Shashank Sharma 
> >
> > This patch adds a new sysfs event, which will indicate
> > the userland about a GPU reset, and can also provide
> > some information like:
> > - process ID of the process involved with the GPU reset
> > - process name of the involved process
> > - the GPU status info (using flags)
> >
> > This patch also introduces the first flag of the flags
> > bitmap, which can be appended as and when required.
>  Why invent something new, rather than using the already existing
>  devcoredump?
> >>>
> >>> Yeah, that's a really valid question.
> >>>
>  I don't think we need (or should encourage/allow) something drm
>  specific when there is already an existing solution used by both drm
>  and non-drm drivers.  Userspace should not have to learn to support
>  yet another mechanism to do the same thing.
> >>>
> >>> Question is how is userspace notified about new available core dumps?
> >>
> >> I haven't looked into it too closely, as the CrOS userspace
> >> crash-reporter already had support for devcoredump, so it "just
> >> worked" out of the box[1].  I believe a udev event is what triggers
> >> the crash-reporter to go read the devcore dump out of sysfs.
> >
> > I had a quick look at the devcoredump code, and it doesn't look like
> > that is sending an event to the user, so we still need an event to
> > indicate a GPU reset.
> >
> > - Shashank
>
>
> Another point I raised in another thread is that it looks like we might
> want to use devcoredump during ASIC reset to dump more HW related data
> which is useful
> for debugging. It means the client will have to extract the pid and
> process name out from a bigger data set - is that OK? We can probably
> put it at the beginning
> for easiest parsing.
>

Yes, this is what we do for drm/msm.. the start of the devcore file
has something like:


kernel: 5.14.0-rc3-debug+
module: msm
time: 1632763923.453207637
comm: deqp-gles3:sq0
cmdline: ./deqp-gles31 --deqp-surface-width=256
--deqp-surface-height=256 --deqp-gl-config-name=rgbad24s8ms0
--deqp-visibility=hidden
--deqp-caselist-file=/home/robclark/src/deqp/build/modules/gles31/new-run/c33.r1.caselist.txt
--deqp-log-filename=/home/robclark/src/deqp/build/modules/gles31/new-run/c33.r1.qpa
--deqp-log-flush=disable
--deqp-shadercache-filename=/home/robclark/src/deqp/build/modules/gles31/new-run/t499826814672.shader_cache
--deqp-shadercache-truncate=disable
revision: 618 (6.1.8.0)


We capture quite a lot of state, cmdstream that triggered the hang,
register/state dumps, microcontroller state, etc.  But we do go out of
our way to not capture textures or caches that might contain texture
data by default (for privacy reasons)

It has been hugely useful for debugging a few issues that happen
rarely enough that they are difficult to reproduce.  I guess that is
crowd-sourced debugging ;-)

BR,
-R


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-10 Thread Rob Clark
On Thu, Mar 10, 2022 at 8:21 AM Sharma, Shashank
 wrote:
>
>
>
> On 3/10/2022 4:24 PM, Rob Clark wrote:
> > On Thu, Mar 10, 2022 at 1:55 AM Christian König
> >  wrote:
> >>
> >>
> >>
> >> Am 09.03.22 um 19:12 schrieb Rob Clark:
> >>> On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
> >>>  wrote:
>  From: Shashank Sharma 
> 
>  This patch adds a new sysfs event, which will indicate
>  the userland about a GPU reset, and can also provide
>  some information like:
>  - process ID of the process involved with the GPU reset
>  - process name of the involved process
>  - the GPU status info (using flags)
> 
>  This patch also introduces the first flag of the flags
>  bitmap, which can be appended as and when required.
> >>> Why invent something new, rather than using the already existing 
> >>> devcoredump?
> >>
> >> Yeah, that's a really valid question.
> >>
> >>> I don't think we need (or should encourage/allow) something drm
> >>> specific when there is already an existing solution used by both drm
> >>> and non-drm drivers.  Userspace should not have to learn to support
> >>> yet another mechanism to do the same thing.
> >>
> >> Question is how is userspace notified about new available core dumps?
> >
> > I haven't looked into it too closely, as the CrOS userspace
> > crash-reporter already had support for devcoredump, so it "just
> > worked" out of the box[1].  I believe a udev event is what triggers
> > the crash-reporter to go read the devcore dump out of sysfs.
>
> I had a quick look at the devcoredump code, and it doesn't look like
> that is sending an event to the user, so we still need an event to
> indicate a GPU reset.

There definitely is an event to userspace, I suspect somewhere down
the device_add() path?

BR,
-R


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-10 Thread Andrey Grodzovsky



On 2022-03-10 11:21, Sharma, Shashank wrote:



On 3/10/2022 4:24 PM, Rob Clark wrote:

On Thu, Mar 10, 2022 at 1:55 AM Christian König
 wrote:




Am 09.03.22 um 19:12 schrieb Rob Clark:

On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
 wrote:

From: Shashank Sharma 

This patch adds a new sysfs event, which will indicate
the userland about a GPU reset, and can also provide
some information like:
- process ID of the process involved with the GPU reset
- process name of the involved process
- the GPU status info (using flags)

This patch also introduces the first flag of the flags
bitmap, which can be appended as and when required.
Why invent something new, rather than using the already existing 
devcoredump?


Yeah, that's a really valid question.


I don't think we need (or should encourage/allow) something drm
specific when there is already an existing solution used by both drm
and non-drm drivers.  Userspace should not have to learn to support
yet another mechanism to do the same thing.


Question is how is userspace notified about new available core dumps?


I haven't looked into it too closely, as the CrOS userspace
crash-reporter already had support for devcoredump, so it "just
worked" out of the box[1].  I believe a udev event is what triggers
the crash-reporter to go read the devcore dump out of sysfs.


I had a quick look at the devcoredump code, and it doesn't look like 
that is sending an event to the user, so we still need an event to 
indicate a GPU reset.


- Shashank



Another point I raised in another thread is that it looks like we might 
want to use devcoredump during ASIC reset to dump more HW related data 
which is useful
for debugging. It means the client will have to extract the pid and
process name out from a bigger data set - is that OK? We can probably
put it at the beginning

for easiest parsing.

Andrey






BR,
-R

[1] One small change to allow-list the drm/msm devcore dumps was
needed after a privacy review/audit of what is contained in the GPU
devcored to ensure it does not contain any PII .. for CrOS on amd
devices I'd be happy to walk whoever deals with amd CrOS devices
through the process offline.


Regards,
Christian.



BR,
-R




Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-10 Thread Sharma, Shashank




On 3/10/2022 4:24 PM, Rob Clark wrote:

On Thu, Mar 10, 2022 at 1:55 AM Christian König
 wrote:




Am 09.03.22 um 19:12 schrieb Rob Clark:

On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
 wrote:

From: Shashank Sharma 

This patch adds a new sysfs event, which will indicate
the userland about a GPU reset, and can also provide
some information like:
- process ID of the process involved with the GPU reset
- process name of the involved process
- the GPU status info (using flags)

This patch also introduces the first flag of the flags
bitmap, which can be appended as and when required.

Why invent something new, rather than using the already existing devcoredump?


Yeah, that's a really valid question.


I don't think we need (or should encourage/allow) something drm
specific when there is already an existing solution used by both drm
and non-drm drivers.  Userspace should not have to learn to support
yet another mechanism to do the same thing.


Question is how is userspace notified about new available core dumps?


I haven't looked into it too closely, as the CrOS userspace
crash-reporter already had support for devcoredump, so it "just
worked" out of the box[1].  I believe a udev event is what triggers
the crash-reporter to go read the devcore dump out of sysfs.


I had a quick look at the devcoredump code, and it doesn't look like 
that is sending an event to the user, so we still need an event to 
indicate a GPU reset.


- Shashank



BR,
-R

[1] One small change to allow-list the drm/msm devcore dumps was
needed after a privacy review/audit of what is contained in the GPU
devcored to ensure it does not contain any PII .. for CrOS on amd
devices I'd be happy to walk whoever deals with amd CrOS devices
through the process offline.


Regards,
Christian.



BR,
-R




Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-10 Thread Rob Clark
On Thu, Mar 10, 2022 at 1:55 AM Christian König
 wrote:
>
>
>
> Am 09.03.22 um 19:12 schrieb Rob Clark:
> > On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
> >  wrote:
> >> From: Shashank Sharma 
> >>
> >> This patch adds a new sysfs event, which will indicate
> >> the userland about a GPU reset, and can also provide
> >> some information like:
> >> - process ID of the process involved with the GPU reset
> >> - process name of the involved process
> >> - the GPU status info (using flags)
> >>
> >> This patch also introduces the first flag of the flags
> >> bitmap, which can be appended as and when required.
> > Why invent something new, rather than using the already existing 
> > devcoredump?
>
> Yeah, that's a really valid question.
>
> > I don't think we need (or should encourage/allow) something drm
> > specific when there is already an existing solution used by both drm
> > and non-drm drivers.  Userspace should not have to learn to support
> > yet another mechanism to do the same thing.
>
> Question is how is userspace notified about new available core dumps?

I haven't looked into it too closely, as the CrOS userspace
crash-reporter already had support for devcoredump, so it "just
worked" out of the box[1].  I believe a udev event is what triggers
the crash-reporter to go read the devcore dump out of sysfs.

BR,
-R

[1] One small change to allow-list the drm/msm devcore dumps was
needed after a privacy review/audit of what is contained in the GPU
devcored to ensure it does not contain any PII .. for CrOS on amd
devices I'd be happy to walk whoever deals with amd CrOS devices
through the process offline.

> Regards,
> Christian.
>
> >
> > BR,
> > -R
>


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-10 Thread Christian König




Am 09.03.22 um 19:12 schrieb Rob Clark:

On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
 wrote:

From: Shashank Sharma 

This patch adds a new sysfs event, which will indicate
the userland about a GPU reset, and can also provide
some information like:
- process ID of the process involved with the GPU reset
- process name of the involved process
- the GPU status info (using flags)

This patch also introduces the first flag of the flags
bitmap, which can be appended as and when required.

Why invent something new, rather than using the already existing devcoredump?


Yeah, that's a really valid question.


I don't think we need (or should encourage/allow) something drm
specific when there is already an existing solution used by both drm
and non-drm drivers.  Userspace should not have to learn to support
yet another mechanism to do the same thing.


Question is how is userspace notified about new available core dumps?

Regards,
Christian.



BR,
-R




Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-09 Thread Rob Clark
On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
 wrote:
>
> From: Shashank Sharma 
>
> This patch adds a new sysfs event, which will indicate
> the userland about a GPU reset, and can also provide
> some information like:
> - process ID of the process involved with the GPU reset
> - process name of the involved process
> - the GPU status info (using flags)
>
> This patch also introduces the first flag of the flags
> bitmap, which can be appended as and when required.

Why invent something new, rather than using the already existing devcoredump?

I don't think we need (or should encourage/allow) something drm
specific when there is already an existing solution used by both drm
and non-drm drivers.  Userspace should not have to learn to support
yet another mechanism to do the same thing.

BR,
-R


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-09 Thread Sharma, Shashank




On 3/9/2022 8:47 AM, Simon Ser wrote:

Hi,

Maybe it would be a good idea to state the intended use-case in the
commit message? 


It was added in the second patch, but yeah, it makes more sense to add a 
cover-letter probably.


And explain why the current (driver-specific IIRC) APIs

aren't enough?

As you mentioned, we also started with a driver specific API, and then 
realized that most of the drivers can benefit from this infrastructure, 
as most of them would be more or less interested in similar
information.



Since this introduces new uAPI, can you point to a user-space patch
which uses the new uAPI? See this link for more info on DRM user-space
requirements:
https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#open-source-userspace-requirements

The userspace work is in progress, this was to get a general feedback 
from the community and the interest.



Please use drm_dbg_* and drm_warn_* helpers instead of DRM_DEBUG and
DRM_WARN. This allows identifying on which device the uevent happens.


Well, I don't want to break the consistency across the file, as other
functions seem to be using the DRM_DEBUG family of prints.

On Tuesday, March 8th, 2022 at 19:04, Shashank Sharma 
 wrote:


+void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event 
*reset_info)


reset_info can be const.

Sure.

- Shashank


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-09 Thread Pierre-Eric Pelloux-Prayer



On 09/03/2022 11:24, Christian König wrote:
> Am 09.03.22 um 11:10 schrieb Simon Ser:
>> On Wednesday, March 9th, 2022 at 10:56, Pierre-Eric Pelloux-Prayer 
>>  wrote:
>>
>>> Would it be possible to include the app parameters as well?
>> Can all processes read sysfs events?
> 
> No, but application parameters are usually not secret.
> 
>> There might be security implications here. The app parameters might
>> contain sensitive information, like passwords or tokens.
> 
> It's a well-known security vulnerability to provide passwords etc. as
> application parameters since everybody can read them through procfs.
> 
>> Can't the uevent receiver query the kernel to get that info from the
>> PID?
> 
> I'm leaning also towards just providing the PID and not the application name. 
> The information should be as short as possible.
> 

Indeed you're both right. The PID + reading procfs should be enough.

Thanks,
Pierre-Eric
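
For the record, the procfs part is only a few lines; a rough sketch,
with the caveat that it only works while the reported process still
exists, and the helper name below is made up:

#include <stdio.h>

static void print_cmdline(unsigned int pid)
{
        char path[64], buf[4096];
        size_t n, i;
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%u/cmdline", pid);
        f = fopen(path, "r");
        if (!f)
                return;         /* already gone, or the PID was reused */

        n = fread(buf, 1, sizeof(buf) - 1, f);
        fclose(f);

        /* arguments are NUL-separated; make them printable */
        for (i = 0; i + 1 < n; i++)
                if (buf[i] == '\0')
                        buf[i] = ' ';
        buf[n] = '\0';

        printf("cmdline of %u: %s\n", pid, buf);
}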


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-09 Thread Simon Ser
On Wednesday, March 9th, 2022 at 11:24, Christian König 
 wrote:

> Am 09.03.22 um 11:10 schrieb Simon Ser:
> > On Wednesday, March 9th, 2022 at 10:56, Pierre-Eric Pelloux-Prayer 
> >  wrote:
> >
> >> Would it be possible to include the app parameters as well?
> > Can all processes read sysfs events?
>
> No, but application parameters are usually not secret.

It's a bad practice, yes. But people still do it.

> > There might be security implications here. The app parameters might
> > contain sensitive information, like passwords or tokens.
>
> It's a well-known security vulnerability to provide passwords etc. as
> application parameters since everybody can read them through procfs.

I was thinking mostly about Flatpak apps here. Flatpak apps are running
in a process namespace so they can't query the CLI parameters of PIDs
outside of the namespace. But they might still receive sysfs uevents.


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-09 Thread Christian König

Am 09.03.22 um 11:10 schrieb Simon Ser:

On Wednesday, March 9th, 2022 at 10:56, Pierre-Eric Pelloux-Prayer 
 wrote:


Would it be possible to include the app parameters as well?

Can all processes read sysfs events?


No, but application parameters are usually not secret.


There might be security implications here. The app parameters might
contain sensitive information, like passwords or tokens.


It's a well-known security vulnerability to provide passwords etc. as
application parameters since everybody can read them through procfs.



Can't the uevent receiver query the kernel to get that info from the
PID?


I'm leaning also towards just providing the PID and not the application 
name. The information should be as short as possible.


Regards,
Christian.


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-09 Thread Simon Ser
On Wednesday, March 9th, 2022 at 10:56, Pierre-Eric Pelloux-Prayer 
 wrote:

> Would it be possible to include the app parameters as well?

Can all processes read sysfs events?

There might be security implications here. The app parameters might
contain sensitive information, like passwords or tokens.

Can't the uevent receiver query the kernel to get that info from the
PID?


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-09 Thread Pierre-Eric Pelloux-Prayer
Hi Shashank,

On 08/03/2022 19:04, Shashank Sharma wrote:
> From: Shashank Sharma 
> 
> This patch adds a new sysfs event, which will indicate
> the userland about a GPU reset, and can also provide
> some information like:
> - process ID of the process involved with the GPU reset
> - process name of the involved process

Would it be possible to include the app parameters as well?

piglit has a shader_runner test application that can cause hangs,
and it's run like this:

   shader_runner 1.shader_test

Knowing that shader_runner caused the hang is not really useful
without the "1.shader_test" part.

Thanks,
Pierre-Eric 

> - the GPU status info (using flags)
> 
> This patch also introduces the first flag of the flags
> bitmap, which can be appended as and when required.
> 
> V2: Addressed review comments from Christian and Amar
>- move the reset information structure to DRM layer
>- drop _ctx from struct name
>- make pid 32 bit(than 64)
>- set flag when VRAM invalid (than valid)
>- add process name as well (Amar)
> 
> Cc: Alexandar Deucher 
> Cc: Christian Koenig 
> Cc: Amaranath Somalapuram 
> Signed-off-by: Shashank Sharma 
> ---
>  drivers/gpu/drm/drm_sysfs.c | 31 +++
>  include/drm/drm_sysfs.h | 10 ++
>  2 files changed, 41 insertions(+)
> 
> diff --git a/drivers/gpu/drm/drm_sysfs.c b/drivers/gpu/drm/drm_sysfs.c
> index 430e00b16eec..840994810910 100644
> --- a/drivers/gpu/drm/drm_sysfs.c
> +++ b/drivers/gpu/drm/drm_sysfs.c
> @@ -409,6 +409,37 @@ void drm_sysfs_hotplug_event(struct drm_device *dev)
>  }
>  EXPORT_SYMBOL(drm_sysfs_hotplug_event);
>  
> +/**
> + * drm_sysfs_reset_event - generate a DRM uevent to indicate GPU reset
> + * @dev: DRM device
> + * @reset_info: The contextual information about the reset (like PID, flags)
> + *
> + * Send a uevent for the DRM device specified by @dev. This informs the
> + * user that a GPU reset has occurred, so that an interested client
> + * can take any recovery or profiling measure.
> + */
> +void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event 
> *reset_info)
> +{
> + unsigned char pid_str[13];
> + unsigned char flags_str[15];
> + unsigned char pname_str[TASK_COMM_LEN + 6];
> + unsigned char reset_str[] = "RESET=1";
> + char *envp[] = { reset_str, pid_str, pname_str, flags_str, NULL };
> +
> + if (!reset_info) {
> + DRM_WARN("No reset info, not sending the event\n");
> + return;
> + }
> +
> + DRM_DEBUG("generating reset event\n");
> +
> + snprintf(pid_str, ARRAY_SIZE(pid_str), "PID=%u", reset_info->pid);
> + snprintf(pname_str, ARRAY_SIZE(pname_str), "NAME=%s", 
> reset_info->pname);
> + snprintf(flags_str, ARRAY_SIZE(flags_str), "FLAGS=%u", 
> reset_info->flags);
> + kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
> +}
> +EXPORT_SYMBOL(drm_sysfs_reset_event);
> +
>  /**
>   * drm_sysfs_connector_hotplug_event - generate a DRM uevent for any 
> connector
>   * change
> diff --git a/include/drm/drm_sysfs.h b/include/drm/drm_sysfs.h
> index 6273cac44e47..5ba11c760619 100644
> --- a/include/drm/drm_sysfs.h
> +++ b/include/drm/drm_sysfs.h
> @@ -1,16 +1,26 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
>  #ifndef _DRM_SYSFS_H_
>  #define _DRM_SYSFS_H_
> +#include 
> +
> +#define DRM_GPU_RESET_FLAG_VRAM_INVALID (1 << 0)
>  
>  struct drm_device;
>  struct device;
>  struct drm_connector;
>  struct drm_property;
>  
> +struct drm_reset_event {
> + uint32_t pid;
> + uint32_t flags;
> + char pname[TASK_COMM_LEN];
> +};
> +
>  int drm_class_device_register(struct device *dev);
>  void drm_class_device_unregister(struct device *dev);
>  
>  void drm_sysfs_hotplug_event(struct drm_device *dev);
> +void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event 
> *reset_info);
>  void drm_sysfs_connector_hotplug_event(struct drm_connector *connector);
>  void drm_sysfs_connector_status_event(struct drm_connector *connector,
> struct drm_property *property);
> 


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-09 Thread Christian König

On 08.03.22 at 19:04, Shashank Sharma wrote:

From: Shashank Sharma 

This patch adds a new sysfs event, which will notify
userland about a GPU reset, and can also provide
some information like:
- process ID of the process involved with the GPU reset
- process name of the involved process
- the GPU status info (using flags)

This patch also introduces the first flag of the flags
bitmap, which can be appended as and when required.

V2: Addressed review comments from Christian and Amar
- move the reset information structure to DRM layer
- drop _ctx from struct name
- make pid 32 bit (rather than 64)
- set flag when VRAM invalid (rather than valid)
- add process name as well (Amar)

Cc: Alexander Deucher 
Cc: Christian Koenig 
Cc: Amaranath Somalapuram 
Signed-off-by: Shashank Sharma 
---
  drivers/gpu/drm/drm_sysfs.c | 31 +++
  include/drm/drm_sysfs.h | 10 ++
  2 files changed, 41 insertions(+)

diff --git a/drivers/gpu/drm/drm_sysfs.c b/drivers/gpu/drm/drm_sysfs.c
index 430e00b16eec..840994810910 100644
--- a/drivers/gpu/drm/drm_sysfs.c
+++ b/drivers/gpu/drm/drm_sysfs.c
@@ -409,6 +409,37 @@ void drm_sysfs_hotplug_event(struct drm_device *dev)
  }
  EXPORT_SYMBOL(drm_sysfs_hotplug_event);
  
+/**

+ * drm_sysfs_reset_event - generate a DRM uevent to indicate GPU reset
+ * @dev: DRM device
+ * @reset_info: The contextual information about the reset (like PID, flags)
+ *
+ * Send a uevent for the DRM device specified by @dev. This informs the
+ * user that a GPU reset has occurred, so that an interested client
+ * can take any recovery or profiling measure.
+ */
+void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event 
*reset_info)
+{
+   unsigned char pid_str[13];
+   unsigned char flags_str[15];
+   unsigned char pname_str[TASK_COMM_LEN + 6];
+   unsigned char reset_str[] = "RESET=1";
+   char *envp[] = { reset_str, pid_str, pname_str, flags_str, NULL };


In general we try to sort those longest first and maybe break envp on 
multiple lines.
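
For reference, one possible layout following that convention, using the sizes
from the patch as posted (a sketch, not a revised patch):

	unsigned char pname_str[TASK_COMM_LEN + 6];
	unsigned char reset_str[] = "RESET=1";
	unsigned char flags_str[15];
	unsigned char pid_str[13];
	char *envp[] = { reset_str, pid_str,
			 pname_str, flags_str,
			 NULL };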



+
+   if (!reset_info) {
+   DRM_WARN("No reset info, not sending the event\n");
+   return;
+   }


Please don't do stuff like that, providing no reset_info is a coding 
error and should really result in a crash.


Regards,
Christian.


+
+   DRM_DEBUG("generating reset event\n");
+
+   snprintf(pid_str, ARRAY_SIZE(pid_str), "PID=%u", reset_info->pid);
+   snprintf(pname_str, ARRAY_SIZE(pname_str), "NAME=%s", 
reset_info->pname);
+   snprintf(flags_str, ARRAY_SIZE(flags_str), "FLAGS=%u", 
reset_info->flags);
+   kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
+}
+EXPORT_SYMBOL(drm_sysfs_reset_event);
+
  /**
   * drm_sysfs_connector_hotplug_event - generate a DRM uevent for any connector
   * change
diff --git a/include/drm/drm_sysfs.h b/include/drm/drm_sysfs.h
index 6273cac44e47..5ba11c760619 100644
--- a/include/drm/drm_sysfs.h
+++ b/include/drm/drm_sysfs.h
@@ -1,16 +1,26 @@
  /* SPDX-License-Identifier: GPL-2.0 */
  #ifndef _DRM_SYSFS_H_
  #define _DRM_SYSFS_H_
+#include 
+
+#define DRM_GPU_RESET_FLAG_VRAM_INVALID (1 << 0)
  
  struct drm_device;

  struct device;
  struct drm_connector;
  struct drm_property;
  
+struct drm_reset_event {

+   uint32_t pid;
+   uint32_t flags;
+   char pname[TASK_COMM_LEN];
+};
+
  int drm_class_device_register(struct device *dev);
  void drm_class_device_unregister(struct device *dev);
  
  void drm_sysfs_hotplug_event(struct drm_device *dev);

+void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event 
*reset_info);
  void drm_sysfs_connector_hotplug_event(struct drm_connector *connector);
  void drm_sysfs_connector_status_event(struct drm_connector *connector,
  struct drm_property *property);




Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-08 Thread Simon Ser
Hi,

Maybe it would be a good idea to state the intended use-case in the
commit message? And explain why the current (driver-specific IIRC) APIs
aren't enough?

Since this introduces new uAPI, can you point to a user-space patch
which uses the new uAPI? See this link for more info on DRM user-space
requirements:
https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#open-source-userspace-requirements

Please use drm_dbg_* and drm_warn_* helpers instead of DRM_DEBUG and
DRM_WARN. This allows identifying on which device the uevent happens.
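
A sketch of what that substitution would look like for the two messages in
the patch, assuming the drm_warn()/drm_dbg() helpers from drm_print.h (not a
posted revision); both take the struct drm_device so the log line names the
device:

	drm_warn(dev, "No reset info, not sending the event\n");
	/* ... */
	drm_dbg(dev, "generating reset event\n");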

On Tuesday, March 8th, 2022 at 19:04, Shashank Sharma wrote:

> +void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event 
> *reset_info)

reset_info can be const.
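
i.e. a sketch of the adjusted prototype (not a posted revision):

	void drm_sysfs_reset_event(struct drm_device *dev,
				   const struct drm_reset_event *reset_info);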