RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management

2023-12-13 Thread Liu, Shaoyun
[Public]

Check the MES API,  It's  my fault , originally I think it will use  trap_en as 
parameter to tell MES to enable/disable shader debugger , but actually it's not 
.  So we either need to add a parameter for this (ex . add flag for 
enable/disable)  or as your solution add flag for flush.  Consider it's 
possible user process can  be killed after call set_shader but before any 
add_queue , then notify MES to do a process context flush after process 
termination seems  more reasonable .

Regards
Shaoyun.liu



From: Kim, Jonathan 
Sent: Tuesday, December 12, 2023 11:43 PM
To: Liu, Shaoyun ; Huang, JinHuiEric 
; amd-gfx@lists.freedesktop.org
Cc: Wong, Alice ; Kuehling, Felix 
; Kasiviswanathan, Harish 

Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management


[Public]

Again, MES only knows to flush if there was something enqueued in the first 
place.
SET_SHADER dictates what's on the process list.
SET_SHADER can be the last call prior to process termination with nothing 
enqueued, hence no MES auto flush occurs.

MES doesn't block anything on the flush flag request.
The driver guarantees that flush is only done on process termination after 
device dequeue, whether there were queues or not.
MES has no idea what an invalid context is.
It just has a value stored in its linked list that's associated with a driver 
allocated BO that no longer exists after process termination.

If you're still not sure about this solution, then this should be discussed 
offline with the MES team.
We're not going to gain ground discussing this here.  The solution has already 
been merged.
Feel free to propose a better solution if you're not satisfied with this one.

Jon

From: Liu, Shaoyun mailto:shaoyun@amd.com>>
Sent: Tuesday, December 12, 2023 11:08 PM
To: Kim, Jonathan mailto:jonathan@amd.com>>; Huang, 
JinHuiEric mailto:jinhuieric.hu...@amd.com>>; 
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Cc: Wong, Alice mailto:shiwei.w...@amd.com>>; Kuehling, 
Felix mailto:felix.kuehl...@amd.com>>; Kasiviswanathan, 
Harish mailto:harish.kasiviswanat...@amd.com>>
Subject: Re: [PATCH] drm/amdkfd: fix mes set shader debugger process management


[Public]

You try to add one new interface to inform mes about the context flush after 
driver side finish process termination , from my understanding, mes already 
know the process context need to be purged after all the related queues been 
removed even without this notification. What do you expect mes to do about this 
context flush flag ? Mes should block this process context for next set_sched 
command? Mes can achieve  this by ignore the set_sched command with trap 
disable parameter on an invalid process context .

Shaoyun.liu

Get Outlook for iOS<https://aka.ms/o0ukef>

From: Kim, Jonathan mailto:jonathan@amd.com>>
Sent: Tuesday, December 12, 2023 8:19:09 PM
To: Liu, Shaoyun mailto:shaoyun@amd.com>>; Huang, 
JinHuiEric mailto:jinhuieric.hu...@amd.com>>; 
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> 
mailto:amd-gfx@lists.freedesktop.org>>
Cc: Wong, Alice mailto:shiwei.w...@amd.com>>; Kuehling, 
Felix mailto:felix.kuehl...@amd.com>>; Kasiviswanathan, 
Harish mailto:harish.kasiviswanat...@amd.com>>
Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management

[Public]

> -Original Message-
> From: Liu, Shaoyun mailto:shaoyun@amd.com>>
> Sent: Tuesday, December 12, 2023 7:08 PM
> To: Kim, Jonathan mailto:jonathan@amd.com>>; Huang, 
> JinHuiEric
> mailto:jinhuieric.hu...@amd.com>>; 
> amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
> Cc: Wong, Alice mailto:shiwei.w...@amd.com>>; Kuehling, 
> Felix
> mailto:felix.kuehl...@amd.com>>; Kasiviswanathan, 
> Harish
> mailto:harish.kasiviswanat...@amd.com>>
> Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> management
>
> [Public]
>
> I see,  so the  problem is after process context , set_shader been  called 
> with
> disable parameter,  do you know the  reason why  MES re-added the
> process context into the  list ?

Because MES has no idea what disable means.

All it knows is that without the flush flag, set_shader should update the 
necessary per-VMID (process) registers as requested by the driver, which 
requires persistent per-process HW settings so that potential future waves can 
inherit those settings i.e. ADD_QUEUE.skip_process_ctx_clear is set (why 
ADD_QUEUE auto clears the process context otherwise is another long story, 
basically an unsolvable MES cache bug problem).

Common use case example:
add_queue -> set_shader call either transiently stalls the SPI per-VMID or 
transiently dequeues the HWS per-VMID depending on the request settings 

RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management

2023-12-12 Thread Kim, Jonathan
[Public]

Again, MES only knows to flush if there was something enqueued in the first 
place.
SET_SHADER dictates what's on the process list.
SET_SHADER can be the last call prior to process termination with nothing 
enqueued, hence no MES auto flush occurs.

MES doesn't block anything on the flush flag request.
The driver guarantees that flush is only done on process termination after 
device dequeue, whether there were queues or not.
MES has no idea what an invalid context is.
It just has a value stored in its linked list that's associated with a driver 
allocated BO that no longer exists after process termination.

If you're still not sure about this solution, then this should be discussed 
offline with the MES team.
We're not going to gain ground discussing this here.  The solution has already 
been merged.
Feel free to propose a better solution if you're not satisfied with this one.

Jon

From: Liu, Shaoyun 
Sent: Tuesday, December 12, 2023 11:08 PM
To: Kim, Jonathan ; Huang, JinHuiEric 
; amd-gfx@lists.freedesktop.org
Cc: Wong, Alice ; Kuehling, Felix 
; Kasiviswanathan, Harish 

Subject: Re: [PATCH] drm/amdkfd: fix mes set shader debugger process management


[Public]

You try to add one new interface to inform mes about the context flush after 
driver side finish process termination , from my understanding, mes already 
know the process context need to be purged after all the related queues been 
removed even without this notification. What do you expect mes to do about this 
context flush flag ? Mes should block this process context for next set_sched 
command? Mes can achieve  this by ignore the set_sched command with trap 
disable parameter on an invalid process context .

Shaoyun.liu

Get Outlook for iOS<https://aka.ms/o0ukef>

From: Kim, Jonathan mailto:jonathan@amd.com>>
Sent: Tuesday, December 12, 2023 8:19:09 PM
To: Liu, Shaoyun mailto:shaoyun@amd.com>>; Huang, 
JinHuiEric mailto:jinhuieric.hu...@amd.com>>; 
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> 
mailto:amd-gfx@lists.freedesktop.org>>
Cc: Wong, Alice mailto:shiwei.w...@amd.com>>; Kuehling, 
Felix mailto:felix.kuehl...@amd.com>>; Kasiviswanathan, 
Harish mailto:harish.kasiviswanat...@amd.com>>
Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management

[Public]

> -Original Message-
> From: Liu, Shaoyun mailto:shaoyun@amd.com>>
> Sent: Tuesday, December 12, 2023 7:08 PM
> To: Kim, Jonathan mailto:jonathan@amd.com>>; Huang, 
> JinHuiEric
> mailto:jinhuieric.hu...@amd.com>>; 
> amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
> Cc: Wong, Alice mailto:shiwei.w...@amd.com>>; Kuehling, 
> Felix
> mailto:felix.kuehl...@amd.com>>; Kasiviswanathan, 
> Harish
> mailto:harish.kasiviswanat...@amd.com>>
> Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> management
>
> [Public]
>
> I see,  so the  problem is after process context , set_shader been  called 
> with
> disable parameter,  do you know the  reason why  MES re-added the
> process context into the  list ?

Because MES has no idea what disable means.

All it knows is that without the flush flag, set_shader should update the 
necessary per-VMID (process) registers as requested by the driver, which 
requires persistent per-process HW settings so that potential future waves can 
inherit those settings i.e. ADD_QUEUE.skip_process_ctx_clear is set (why 
ADD_QUEUE auto clears the process context otherwise is another long story, 
basically an unsolvable MES cache bug problem).

Common use case example:
add_queue -> set_shader call either transiently stalls the SPI per-VMID or 
transiently dequeues the HWS per-VMID depending on the request settings -> 
fulfils the per-VMID register write updates -> resumes process queues so that 
potential waves on those queues inherit new debug settings.

You can't do this kind of operation at the queue level alone.

The problem that this patch solves (along with the MES FW upgrade) is an 
unfortunate quirk of having to operate between process (debug requests) and 
queue space (non-debug requests).
Old HWS used to operate at the per-process level via MAP_PROCESS so it was a 
lot easier to balance debug versus non-debug requests back then (but it was 
also lot less efficient performance wise).

Jon

>
> Shaoyun.liu
>
> -Original Message-
> From: Kim, Jonathan mailto:jonathan@amd.com>>
> Sent: Tuesday, December 12, 2023 6:07 PM
> To: Liu, Shaoyun mailto:shaoyun@amd.com>>; Huang, 
> JinHuiEric
> mailto:jinhuieric.hu...@amd.com>>; 
> amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
> Cc: Wong, Alice mailto:shiwei.w...@amd.com>>; Kuehling, 
> Felix
> mailto:f

Re: [PATCH] drm/amdkfd: fix mes set shader debugger process management

2023-12-12 Thread Liu, Shaoyun
[Public]

You try to add one new interface to inform mes about the context flush after 
driver side finish process termination , from my understanding, mes already 
know the process context need to be purged after all the related queues been 
removed even without this notification. What do you expect mes to do about this 
context flush flag ? Mes should block this process context for next set_sched 
command? Mes can achieve  this by ignore the set_sched command with trap 
disable parameter on an invalid process context .

Shaoyun.liu

Get Outlook for iOS<https://aka.ms/o0ukef>

From: Kim, Jonathan 
Sent: Tuesday, December 12, 2023 8:19:09 PM
To: Liu, Shaoyun ; Huang, JinHuiEric 
; amd-gfx@lists.freedesktop.org 

Cc: Wong, Alice ; Kuehling, Felix 
; Kasiviswanathan, Harish 

Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management

[Public]

> -Original Message-
> From: Liu, Shaoyun 
> Sent: Tuesday, December 12, 2023 7:08 PM
> To: Kim, Jonathan ; Huang, JinHuiEric
> ; amd-gfx@lists.freedesktop.org
> Cc: Wong, Alice ; Kuehling, Felix
> ; Kasiviswanathan, Harish
> 
> Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> management
>
> [Public]
>
> I see,  so the  problem is after process context , set_shader been  called 
> with
> disable parameter,  do you know the  reason why  MES re-added the
> process context into the  list ?

Because MES has no idea what disable means.

All it knows is that without the flush flag, set_shader should update the 
necessary per-VMID (process) registers as requested by the driver, which 
requires persistent per-process HW settings so that potential future waves can 
inherit those settings i.e. ADD_QUEUE.skip_process_ctx_clear is set (why 
ADD_QUEUE auto clears the process context otherwise is another long story, 
basically an unsolvable MES cache bug problem).

Common use case example:
add_queue -> set_shader call either transiently stalls the SPI per-VMID or 
transiently dequeues the HWS per-VMID depending on the request settings -> 
fulfils the per-VMID register write updates -> resumes process queues so that 
potential waves on those queues inherit new debug settings.

You can't do this kind of operation at the queue level alone.

The problem that this patch solves (along with the MES FW upgrade) is an 
unfortunate quirk of having to operate between process (debug requests) and 
queue space (non-debug requests).
Old HWS used to operate at the per-process level via MAP_PROCESS so it was a 
lot easier to balance debug versus non-debug requests back then (but it was 
also lot less efficient performance wise).

Jon

>
> Shaoyun.liu
>
> -Original Message-
> From: Kim, Jonathan 
> Sent: Tuesday, December 12, 2023 6:07 PM
> To: Liu, Shaoyun ; Huang, JinHuiEric
> ; amd-gfx@lists.freedesktop.org
> Cc: Wong, Alice ; Kuehling, Felix
> ; Kasiviswanathan, Harish
> 
> Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> management
>
> [Public]
>
> > -Original Message-
> > From: Liu, Shaoyun 
> > Sent: Tuesday, December 12, 2023 5:44 PM
> > To: Kim, Jonathan ; Huang, JinHuiEric
> > ; amd-gfx@lists.freedesktop.org
> > Cc: Wong, Alice ; Kuehling, Felix
> > ; Kasiviswanathan, Harish
> > 
> > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> > management
> >
> > [Public]
> >
> > Do you mean SET_SHADER_DEBUGER can  be called before ADD_QUEUE ?
> I
> > think  even in that  situation MES should still be able to handle it
> > as long as MES already  remove the process context from its list , MES
> > will treat the process context as a new item. I still don't understand why
> MES haven't
> > purged the  process context from the list after process termination .   Will
> > debug queue itself  also use the add/remove queue interface  and  is
> > it possible the debug queue itself from the  old process  still not be
> > removed ?
>
> SET_SHADER_DEBUGGER can be called independently from ADD_QUEUE.
> The process list is updated on either on SET_SHADER_DEBUGGER or
> ADD_QUEUE.
> e.g. runtime_enable (set_shader) -> add_queue -> remove_queue (list
> purged) -> runtime_disable (set_shader process re-added) -> process
> termination (stale list) or debug attach (set_shader) -> add_queue ->
> remove_queue (list purged) -> debug detach (set_shader process re-added) -
> >process termination (stale list)
>
> MES has no idea what process termination means.  The new flag is a proxy
> for this.
> There are reasons for process settings to take place prior to queue add
> (debugger, gfx11 cwsr workaround, core dump etc need this).
>
> I'm not sure what kernel/debug 

RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management

2023-12-12 Thread Kim, Jonathan
[Public]

> -Original Message-
> From: Liu, Shaoyun 
> Sent: Tuesday, December 12, 2023 7:08 PM
> To: Kim, Jonathan ; Huang, JinHuiEric
> ; amd-gfx@lists.freedesktop.org
> Cc: Wong, Alice ; Kuehling, Felix
> ; Kasiviswanathan, Harish
> 
> Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> management
>
> [Public]
>
> I see,  so the  problem is after process context , set_shader been  called 
> with
> disable parameter,  do you know the  reason why  MES re-added the
> process context into the  list ?

Because MES has no idea what disable means.

All it knows is that without the flush flag, set_shader should update the 
necessary per-VMID (process) registers as requested by the driver, which 
requires persistent per-process HW settings so that potential future waves can 
inherit those settings i.e. ADD_QUEUE.skip_process_ctx_clear is set (why 
ADD_QUEUE auto clears the process context otherwise is another long story, 
basically an unsolvable MES cache bug problem).

Common use case example:
add_queue -> set_shader call either transiently stalls the SPI per-VMID or 
transiently dequeues the HWS per-VMID depending on the request settings -> 
fulfils the per-VMID register write updates -> resumes process queues so that 
potential waves on those queues inherit new debug settings.

You can't do this kind of operation at the queue level alone.

The problem that this patch solves (along with the MES FW upgrade) is an 
unfortunate quirk of having to operate between process (debug requests) and 
queue space (non-debug requests).
Old HWS used to operate at the per-process level via MAP_PROCESS so it was a 
lot easier to balance debug versus non-debug requests back then (but it was 
also lot less efficient performance wise).

Jon

>
> Shaoyun.liu
>
> -Original Message-
> From: Kim, Jonathan 
> Sent: Tuesday, December 12, 2023 6:07 PM
> To: Liu, Shaoyun ; Huang, JinHuiEric
> ; amd-gfx@lists.freedesktop.org
> Cc: Wong, Alice ; Kuehling, Felix
> ; Kasiviswanathan, Harish
> 
> Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> management
>
> [Public]
>
> > -Original Message-
> > From: Liu, Shaoyun 
> > Sent: Tuesday, December 12, 2023 5:44 PM
> > To: Kim, Jonathan ; Huang, JinHuiEric
> > ; amd-gfx@lists.freedesktop.org
> > Cc: Wong, Alice ; Kuehling, Felix
> > ; Kasiviswanathan, Harish
> > 
> > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> > management
> >
> > [Public]
> >
> > Do you mean SET_SHADER_DEBUGER can  be called before ADD_QUEUE ?
> I
> > think  even in that  situation MES should still be able to handle it
> > as long as MES already  remove the process context from its list , MES
> > will treat the process context as a new item. I still don't understand why
> MES haven't
> > purged the  process context from the list after process termination .   Will
> > debug queue itself  also use the add/remove queue interface  and  is
> > it possible the debug queue itself from the  old process  still not be
> > removed ?
>
> SET_SHADER_DEBUGGER can be called independently from ADD_QUEUE.
> The process list is updated on either on SET_SHADER_DEBUGGER or
> ADD_QUEUE.
> e.g. runtime_enable (set_shader) -> add_queue -> remove_queue (list
> purged) -> runtime_disable (set_shader process re-added) -> process
> termination (stale list) or debug attach (set_shader) -> add_queue ->
> remove_queue (list purged) -> debug detach (set_shader process re-added) -
> >process termination (stale list)
>
> MES has no idea what process termination means.  The new flag is a proxy
> for this.
> There are reasons for process settings to take place prior to queue add
> (debugger, gfx11 cwsr workaround, core dump etc need this).
>
> I'm not sure what kernel/debug queues have to do with this.
> By that argument, the list should be purged.
>
> Jon
>
> >
> > Shaoyun.liu
> >
> > -Original Message-
> > From: Kim, Jonathan 
> > Sent: Tuesday, December 12, 2023 4:48 PM
> > To: Liu, Shaoyun ; Huang, JinHuiEric
> > ; amd-gfx@lists.freedesktop.org
> > Cc: Wong, Alice ; Kuehling, Felix
> > ; Kasiviswanathan, Harish
> > 
> > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> > management
> >
> > [Public]
> >
> > > -Original Message-
> > > From: Liu, Shaoyun 
> > > Sent: Tuesday, December 12, 2023 4:45 PM
> > > To: Kim, Jonathan ; Huang, JinHuiEric
> > > ; amd-gfx@lists.freedesktop.org
> > > Cc: Wong, Alice ; Kuehling, Felix
> > > ; Kasiviswanathan, Harish
> &g

RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management

2023-12-12 Thread Liu, Shaoyun
[Public]

I see,  so the  problem is after process context , set_shader been  called with 
disable parameter,  do you know the  reason why  MES re-added the  process 
context into the  list ?

Shaoyun.liu

-Original Message-
From: Kim, Jonathan 
Sent: Tuesday, December 12, 2023 6:07 PM
To: Liu, Shaoyun ; Huang, JinHuiEric 
; amd-gfx@lists.freedesktop.org
Cc: Wong, Alice ; Kuehling, Felix 
; Kasiviswanathan, Harish 

Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management

[Public]

> -Original Message-
> From: Liu, Shaoyun 
> Sent: Tuesday, December 12, 2023 5:44 PM
> To: Kim, Jonathan ; Huang, JinHuiEric
> ; amd-gfx@lists.freedesktop.org
> Cc: Wong, Alice ; Kuehling, Felix
> ; Kasiviswanathan, Harish
> 
> Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> management
>
> [Public]
>
> Do you mean SET_SHADER_DEBUGER can  be called before ADD_QUEUE ?  I
> think  even in that  situation MES should still be able to handle it
> as long as MES already  remove the process context from its list , MES
> will treat the process context as a new item. I still don't understand why 
> MES haven't
> purged the  process context from the list after process termination .   Will
> debug queue itself  also use the add/remove queue interface  and  is
> it possible the debug queue itself from the  old process  still not be
> removed ?

SET_SHADER_DEBUGGER can be called independently from ADD_QUEUE.
The process list is updated on either on SET_SHADER_DEBUGGER or ADD_QUEUE.
e.g. runtime_enable (set_shader) -> add_queue -> remove_queue (list purged) -> 
runtime_disable (set_shader process re-added) -> process termination (stale 
list) or debug attach (set_shader) -> add_queue -> remove_queue (list purged) 
-> debug detach (set_shader process re-added) ->process termination (stale list)

MES has no idea what process termination means.  The new flag is a proxy for 
this.
There are reasons for process settings to take place prior to queue add 
(debugger, gfx11 cwsr workaround, core dump etc need this).

I'm not sure what kernel/debug queues have to do with this.
By that argument, the list should be purged.

Jon

>
> Shaoyun.liu
>
> -Original Message-
> From: Kim, Jonathan 
> Sent: Tuesday, December 12, 2023 4:48 PM
> To: Liu, Shaoyun ; Huang, JinHuiEric
> ; amd-gfx@lists.freedesktop.org
> Cc: Wong, Alice ; Kuehling, Felix
> ; Kasiviswanathan, Harish
> 
> Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> management
>
> [Public]
>
> > -Original Message-
> > From: Liu, Shaoyun 
> > Sent: Tuesday, December 12, 2023 4:45 PM
> > To: Kim, Jonathan ; Huang, JinHuiEric
> > ; amd-gfx@lists.freedesktop.org
> > Cc: Wong, Alice ; Kuehling, Felix
> > ; Kasiviswanathan, Harish
> > 
> > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> > management
> >
> > [Public]
> >
> > Shouldn't the driver side  remove all the remaining  queues for the
> > process during  process termination ?  If all the  queues been
> > removed for the process ,  MES should purge the  process context
> > automatically , otherwise it's bug inside MES .
>
> That's only if there were queues added to begin with.
>
> Jon
>
> >
> > Regard
> > Sshaoyun.liu
> >
> > -----Original Message-
> > From: Kim, Jonathan 
> > Sent: Tuesday, December 12, 2023 4:33 PM
> > To: Liu, Shaoyun ; Huang, JinHuiEric
> > ; amd-gfx@lists.freedesktop.org
> > Cc: Wong, Alice ; Kuehling, Felix
> > ; Kasiviswanathan, Harish
> > 
> > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> > management
> >
> > [Public]
> >
> > > -Original Message-
> > > From: Liu, Shaoyun 
> > > Sent: Tuesday, December 12, 2023 4:00 PM
> > > To: Huang, JinHuiEric ; Kim, Jonathan
> > > ; amd-gfx@lists.freedesktop.org
> > > Cc: Wong, Alice ; Kuehling, Felix
> > > ; Kasiviswanathan, Harish
> > > 
> > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger
> > > process management
> > >
> > > [AMD Official Use Only - General]
> > >
> > > Does this requires the  new MES FW for this process_ctx_flush
> > > requirement ?  Can driver side add logic to guaranty when  call
> > > SET_SHADER_DEBUGGER, the process address  is always valid ?
> >
> > Call to flush on old fw is a NOP so it's harmless in that case.
> > Full solution will still require a new MES version as this is a
> > workaround on corner cases and not a new feature i.e. we can't stop
> > ROCm from runnin

RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management

2023-12-12 Thread Kim, Jonathan
[Public]

> -Original Message-
> From: Liu, Shaoyun 
> Sent: Tuesday, December 12, 2023 5:44 PM
> To: Kim, Jonathan ; Huang, JinHuiEric
> ; amd-gfx@lists.freedesktop.org
> Cc: Wong, Alice ; Kuehling, Felix
> ; Kasiviswanathan, Harish
> 
> Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> management
>
> [Public]
>
> Do you mean SET_SHADER_DEBUGER can  be called before ADD_QUEUE ?  I
> think  even in that  situation MES should still be able to handle it as long 
> as
> MES already  remove the process context from its list , MES will treat the
> process context as a new item. I still don't understand why MES haven't
> purged the  process context from the list after process termination .   Will
> debug queue itself  also use the add/remove queue interface  and  is it
> possible the debug queue itself from the  old process  still not be
> removed ?

SET_SHADER_DEBUGGER can be called independently from ADD_QUEUE.
The process list is updated on either on SET_SHADER_DEBUGGER or ADD_QUEUE.
e.g. runtime_enable (set_shader) -> add_queue -> remove_queue (list purged) -> 
runtime_disable (set_shader process re-added) -> process termination (stale 
list)
or debug attach (set_shader) -> add_queue -> remove_queue (list purged) -> 
debug detach (set_shader process re-added) ->process termination (stale list)

MES has no idea what process termination means.  The new flag is a proxy for 
this.
There are reasons for process settings to take place prior to queue add 
(debugger, gfx11 cwsr workaround, core dump etc need this).

I'm not sure what kernel/debug queues have to do with this.
By that argument, the list should be purged.

Jon

>
> Shaoyun.liu
>
> -Original Message-
> From: Kim, Jonathan 
> Sent: Tuesday, December 12, 2023 4:48 PM
> To: Liu, Shaoyun ; Huang, JinHuiEric
> ; amd-gfx@lists.freedesktop.org
> Cc: Wong, Alice ; Kuehling, Felix
> ; Kasiviswanathan, Harish
> 
> Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> management
>
> [Public]
>
> > -Original Message-
> > From: Liu, Shaoyun 
> > Sent: Tuesday, December 12, 2023 4:45 PM
> > To: Kim, Jonathan ; Huang, JinHuiEric
> > ; amd-gfx@lists.freedesktop.org
> > Cc: Wong, Alice ; Kuehling, Felix
> > ; Kasiviswanathan, Harish
> > 
> > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> > management
> >
> > [Public]
> >
> > Shouldn't the driver side  remove all the remaining  queues for the
> > process during  process termination ?  If all the  queues been removed
> > for the process ,  MES should purge the  process context automatically
> > , otherwise it's bug inside MES .
>
> That's only if there were queues added to begin with.
>
> Jon
>
> >
> > Regard
> > Sshaoyun.liu
> >
> > -----Original Message-
> > From: Kim, Jonathan 
> > Sent: Tuesday, December 12, 2023 4:33 PM
> > To: Liu, Shaoyun ; Huang, JinHuiEric
> > ; amd-gfx@lists.freedesktop.org
> > Cc: Wong, Alice ; Kuehling, Felix
> > ; Kasiviswanathan, Harish
> > 
> > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> > management
> >
> > [Public]
> >
> > > -Original Message-
> > > From: Liu, Shaoyun 
> > > Sent: Tuesday, December 12, 2023 4:00 PM
> > > To: Huang, JinHuiEric ; Kim, Jonathan
> > > ; amd-gfx@lists.freedesktop.org
> > > Cc: Wong, Alice ; Kuehling, Felix
> > > ; Kasiviswanathan, Harish
> > > 
> > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> > > management
> > >
> > > [AMD Official Use Only - General]
> > >
> > > Does this requires the  new MES FW for this process_ctx_flush
> > > requirement ?  Can driver side add logic to guaranty when  call
> > > SET_SHADER_DEBUGGER, the process address  is always valid ?
> >
> > Call to flush on old fw is a NOP so it's harmless in that case.
> > Full solution will still require a new MES version as this is a
> > workaround on corner cases and not a new feature i.e. we can't stop
> > ROCm from running on old fw.
> > The process address is always valid from the driver side.  It's the
> > MES side of things that gets stale as mentioned in the description
> > (passed value to MES is reused with new BO but MES doesn't refresh).
> > i.e. MES auto refreshes it's process list assuming process queues were
> > all drained but driver can't guarantee that SET_SHADER_DEBUGGER
> (which
> > adds to MES's process list) will get called after queues ge

RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management

2023-12-12 Thread Liu, Shaoyun
[Public]

Do you mean SET_SHADER_DEBUGER can  be called before ADD_QUEUE ?  I think  even 
in that  situation MES should still be able to handle it as long as MES already 
 remove the process context from its list , MES will treat the  process context 
as a new item. I still don't understand why MES haven't  purged the  process 
context from the list after process termination .   Will debug queue itself  
also use the add/remove queue interface  and  is it possible the debug queue 
itself from the  old process  still not be  removed ?

Shaoyun.liu

-Original Message-
From: Kim, Jonathan 
Sent: Tuesday, December 12, 2023 4:48 PM
To: Liu, Shaoyun ; Huang, JinHuiEric 
; amd-gfx@lists.freedesktop.org
Cc: Wong, Alice ; Kuehling, Felix 
; Kasiviswanathan, Harish 

Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management

[Public]

> -Original Message-
> From: Liu, Shaoyun 
> Sent: Tuesday, December 12, 2023 4:45 PM
> To: Kim, Jonathan ; Huang, JinHuiEric
> ; amd-gfx@lists.freedesktop.org
> Cc: Wong, Alice ; Kuehling, Felix
> ; Kasiviswanathan, Harish
> 
> Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> management
>
> [Public]
>
> Shouldn't the driver side  remove all the remaining  queues for the
> process during  process termination ?  If all the  queues been removed
> for the process ,  MES should purge the  process context automatically
> , otherwise it's bug inside MES .

That's only if there were queues added to begin with.

Jon

>
> Regard
> Sshaoyun.liu
>
> -Original Message-
> From: Kim, Jonathan 
> Sent: Tuesday, December 12, 2023 4:33 PM
> To: Liu, Shaoyun ; Huang, JinHuiEric
> ; amd-gfx@lists.freedesktop.org
> Cc: Wong, Alice ; Kuehling, Felix
> ; Kasiviswanathan, Harish
> 
> Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> management
>
> [Public]
>
> > -Original Message-
> > From: Liu, Shaoyun 
> > Sent: Tuesday, December 12, 2023 4:00 PM
> > To: Huang, JinHuiEric ; Kim, Jonathan
> > ; amd-gfx@lists.freedesktop.org
> > Cc: Wong, Alice ; Kuehling, Felix
> > ; Kasiviswanathan, Harish
> > 
> > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> > management
> >
> > [AMD Official Use Only - General]
> >
> > Does this requires the  new MES FW for this process_ctx_flush
> > requirement ?  Can driver side add logic to guaranty when  call
> > SET_SHADER_DEBUGGER, the process address  is always valid ?
>
> Call to flush on old fw is a NOP so it's harmless in that case.
> Full solution will still require a new MES version as this is a
> workaround on corner cases and not a new feature i.e. we can't stop
> ROCm from running on old fw.
> The process address is always valid from the driver side.  It's the
> MES side of things that gets stale as mentioned in the description
> (passed value to MES is reused with new BO but MES doesn't refresh).
> i.e. MES auto refreshes it's process list assuming process queues were
> all drained but driver can't guarantee that SET_SHADER_DEBUGGER (which
> adds to MES's process list) will get called after queues get added (in
> fact it's a requirements that it can be called at any time).
> We can attempt to defer calls these calls in the KFD, considering all cases.
> But that would be a large shift in debugger/runtime_enable/KFD code,
> which is already complicated and could get buggy plus it would not be
> intuitive at all as to why we're doing this.
> I think a single flag set to flush MES on process termination is a
> simpler compromise that shows the limitation in a more obvious way.
>
> Thanks,
>
> Jon
>
>
> >
> > Regards
> > Shaoyun.liu
> >
> >
> > -----Original Message-----
> > From: amd-gfx  On Behalf Of
> > Eric Huang
> > Sent: Tuesday, December 12, 2023 12:49 PM
> > To: Kim, Jonathan ; amd-
> > g...@lists.freedesktop.org
> > Cc: Wong, Alice ; Kuehling, Felix
> > ; Kasiviswanathan, Harish
> > 
> > Subject: Re: [PATCH] drm/amdkfd: fix mes set shader debugger process
> > management
> >
> >
> > On 2023-12-11 16:16, Jonathan Kim wrote:
> > > MES provides the driver a call to explicitly flush stale process
> > > memory within the MES to avoid a race condition that results in a
> > > fatal memory violation.
> > >
> > > When SET_SHADER_DEBUGGER is called, the driver passes a memory
> > address
> > > that represents a process context address MES uses to keep track
> > > of future per-process calls.
> > >
> > > Normally, MES will purge its process context list when the last
> > > queue has be

RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management

2023-12-12 Thread Kim, Jonathan
[Public]

> -Original Message-
> From: Liu, Shaoyun 
> Sent: Tuesday, December 12, 2023 4:45 PM
> To: Kim, Jonathan ; Huang, JinHuiEric
> ; amd-gfx@lists.freedesktop.org
> Cc: Wong, Alice ; Kuehling, Felix
> ; Kasiviswanathan, Harish
> 
> Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> management
>
> [Public]
>
> Shouldn't the driver side  remove all the remaining  queues for the process
> during  process termination ?  If all the  queues been removed for the
> process ,  MES should purge the  process context automatically , otherwise
> it's bug inside MES .

That's only if there were queues added to begin with.

Jon

>
> Regard
> Sshaoyun.liu
>
> -Original Message-
> From: Kim, Jonathan 
> Sent: Tuesday, December 12, 2023 4:33 PM
> To: Liu, Shaoyun ; Huang, JinHuiEric
> ; amd-gfx@lists.freedesktop.org
> Cc: Wong, Alice ; Kuehling, Felix
> ; Kasiviswanathan, Harish
> 
> Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> management
>
> [Public]
>
> > -Original Message-
> > From: Liu, Shaoyun 
> > Sent: Tuesday, December 12, 2023 4:00 PM
> > To: Huang, JinHuiEric ; Kim, Jonathan
> > ; amd-gfx@lists.freedesktop.org
> > Cc: Wong, Alice ; Kuehling, Felix
> > ; Kasiviswanathan, Harish
> > 
> > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> > management
> >
> > [AMD Official Use Only - General]
> >
> > Does this requires the  new MES FW for this process_ctx_flush
> > requirement ?  Can driver side add logic to guaranty when  call
> > SET_SHADER_DEBUGGER, the process address  is always valid ?
>
> Call to flush on old fw is a NOP so it's harmless in that case.
> Full solution will still require a new MES version as this is a workaround on
> corner cases and not a new feature i.e. we can't stop ROCm from running on
> old fw.
> The process address is always valid from the driver side.  It's the MES side 
> of
> things that gets stale as mentioned in the description (passed value to MES
> is reused with new BO but MES doesn't refresh).
> i.e. MES auto refreshes it's process list assuming process queues were all
> drained but driver can't guarantee that SET_SHADER_DEBUGGER (which
> adds to MES's process list) will get called after queues get added (in fact 
> it's
> a requirements that it can be called at any time).
> We can attempt to defer calls these calls in the KFD, considering all cases.
> But that would be a large shift in debugger/runtime_enable/KFD code,
> which is already complicated and could get buggy plus it would not be
> intuitive at all as to why we're doing this.
> I think a single flag set to flush MES on process termination is a simpler
> compromise that shows the limitation in a more obvious way.
>
> Thanks,
>
> Jon
>
>
> >
> > Regards
> > Shaoyun.liu
> >
> >
> > -Original Message-
> > From: amd-gfx  On Behalf Of
> > Eric Huang
> > Sent: Tuesday, December 12, 2023 12:49 PM
> > To: Kim, Jonathan ; amd-
> > g...@lists.freedesktop.org
> > Cc: Wong, Alice ; Kuehling, Felix
> > ; Kasiviswanathan, Harish
> > 
> > Subject: Re: [PATCH] drm/amdkfd: fix mes set shader debugger process
> > management
> >
> >
> > On 2023-12-11 16:16, Jonathan Kim wrote:
> > > MES provides the driver a call to explicitly flush stale process
> > > memory within the MES to avoid a race condition that results in a
> > > fatal memory violation.
> > >
> > > When SET_SHADER_DEBUGGER is called, the driver passes a memory
> > address
> > > that represents a process context address MES uses to keep track of
> > > future per-process calls.
> > >
> > > Normally, MES will purge its process context list when the last
> > > queue has been removed.  The driver, however, can call
> > > SET_SHADER_DEBUGGER regardless of whether a queue has been added
> or not.
> > >
> > > If SET_SHADER_DEBUGGER has been called with no queues as the last
> > > call prior to process termination, the passed process context
> > > address will still reside within MES.
> > >
> > > On a new process call to SET_SHADER_DEBUGGER, the driver may end
> up
> > > passing an identical process context address value (based on
> > > per-process gpu memory address) to MES but is now pointing to a new
> > > allocated buffer object during KFD process creation.  Since the MES
> > > is unaware of this, access of the passed address points to the stale
> > > object within MES and triggers a

RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management

2023-12-12 Thread Liu, Shaoyun
[Public]

Shouldn't the driver side  remove all the remaining  queues for the process 
during  process termination ?  If all the  queues been removed for the process 
,  MES should purge the  process context automatically , otherwise it's bug 
inside MES .

Regard
Sshaoyun.liu

-Original Message-
From: Kim, Jonathan 
Sent: Tuesday, December 12, 2023 4:33 PM
To: Liu, Shaoyun ; Huang, JinHuiEric 
; amd-gfx@lists.freedesktop.org
Cc: Wong, Alice ; Kuehling, Felix 
; Kasiviswanathan, Harish 

Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management

[Public]

> -Original Message-
> From: Liu, Shaoyun 
> Sent: Tuesday, December 12, 2023 4:00 PM
> To: Huang, JinHuiEric ; Kim, Jonathan
> ; amd-gfx@lists.freedesktop.org
> Cc: Wong, Alice ; Kuehling, Felix
> ; Kasiviswanathan, Harish
> 
> Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> management
>
> [AMD Official Use Only - General]
>
> Does this requires the  new MES FW for this process_ctx_flush
> requirement ?  Can driver side add logic to guaranty when  call
> SET_SHADER_DEBUGGER, the process address  is always valid ?

Call to flush on old fw is a NOP so it's harmless in that case.
Full solution will still require a new MES version as this is a workaround on 
corner cases and not a new feature i.e. we can't stop ROCm from running on old 
fw.
The process address is always valid from the driver side.  It's the MES side of 
things that gets stale as mentioned in the description (passed value to MES is 
reused with new BO but MES doesn't refresh).
i.e. MES auto refreshes it's process list assuming process queues were all 
drained but driver can't guarantee that SET_SHADER_DEBUGGER (which adds to 
MES's process list) will get called after queues get added (in fact it's a 
requirements that it can be called at any time).
We can attempt to defer calls these calls in the KFD, considering all cases.
But that would be a large shift in debugger/runtime_enable/KFD code, which is 
already complicated and could get buggy plus it would not be intuitive at all 
as to why we're doing this.
I think a single flag set to flush MES on process termination is a simpler 
compromise that shows the limitation in a more obvious way.

Thanks,

Jon


>
> Regards
> Shaoyun.liu
>
>
> -Original Message-
> From: amd-gfx  On Behalf Of
> Eric Huang
> Sent: Tuesday, December 12, 2023 12:49 PM
> To: Kim, Jonathan ; amd-
> g...@lists.freedesktop.org
> Cc: Wong, Alice ; Kuehling, Felix
> ; Kasiviswanathan, Harish
> 
> Subject: Re: [PATCH] drm/amdkfd: fix mes set shader debugger process
> management
>
>
> On 2023-12-11 16:16, Jonathan Kim wrote:
> > MES provides the driver a call to explicitly flush stale process
> > memory within the MES to avoid a race condition that results in a
> > fatal memory violation.
> >
> > When SET_SHADER_DEBUGGER is called, the driver passes a memory
> address
> > that represents a process context address MES uses to keep track of
> > future per-process calls.
> >
> > Normally, MES will purge its process context list when the last
> > queue has been removed.  The driver, however, can call
> > SET_SHADER_DEBUGGER regardless of whether a queue has been added or not.
> >
> > If SET_SHADER_DEBUGGER has been called with no queues as the last
> > call prior to process termination, the passed process context
> > address will still reside within MES.
> >
> > On a new process call to SET_SHADER_DEBUGGER, the driver may end up
> > passing an identical process context address value (based on
> > per-process gpu memory address) to MES but is now pointing to a new
> > allocated buffer object during KFD process creation.  Since the MES
> > is unaware of this, access of the passed address points to the stale
> > object within MES and triggers a fatal memory violation.
> >
> > The solution is for KFD to explicitly flush the process context
> > address from MES on process termination.
> >
> > Note that the flush call and the MES debugger calls use the same MES
> > interface but are separated as KFD calls to avoid conflicting with
> > each other.
> >
> > Signed-off-by: Jonathan Kim 
> > Tested-by: Alice Wong 
> Reviewed-by: Eric Huang 
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c   | 31
> +++
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h   | 10 +++---
> >   .../amd/amdkfd/kfd_process_queue_manager.c|  1 +
> >   drivers/gpu/drm/amd/include/mes_v11_api_def.h |  3 +-
> >   4 files changed, 40 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> > b/drivers/gpu/drm/amd/amdgpu/amd

RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management

2023-12-12 Thread Kim, Jonathan
[Public]

> -Original Message-
> From: Liu, Shaoyun 
> Sent: Tuesday, December 12, 2023 4:00 PM
> To: Huang, JinHuiEric ; Kim, Jonathan
> ; amd-gfx@lists.freedesktop.org
> Cc: Wong, Alice ; Kuehling, Felix
> ; Kasiviswanathan, Harish
> 
> Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> management
>
> [AMD Official Use Only - General]
>
> Does this requires the  new MES FW for this process_ctx_flush
> requirement ?  Can driver side add logic to guaranty when  call
> SET_SHADER_DEBUGGER, the process address  is always valid ?

Call to flush on old fw is a NOP so it's harmless in that case.
Full solution will still require a new MES version as this is a workaround on 
corner cases and not a new feature i.e. we can't stop ROCm from running on old 
fw.
The process address is always valid from the driver side.  It's the MES side of 
things that gets stale as mentioned in the description (passed value to MES is 
reused with new BO but MES doesn't refresh).
i.e. MES auto refreshes it's process list assuming process queues were all 
drained but driver can't guarantee that SET_SHADER_DEBUGGER (which adds to 
MES's process list) will get called after queues get added (in fact it's a 
requirements that it can be called at any time).
We can attempt to defer calls these calls in the KFD, considering all cases.
But that would be a large shift in debugger/runtime_enable/KFD code, which is 
already complicated and could get buggy plus it would not be intuitive at all 
as to why we're doing this.
I think a single flag set to flush MES on process termination is a simpler 
compromise that shows the limitation in a more obvious way.

Thanks,

Jon


>
> Regards
> Shaoyun.liu
>
>
> -Original Message-
> From: amd-gfx  On Behalf Of Eric
> Huang
> Sent: Tuesday, December 12, 2023 12:49 PM
> To: Kim, Jonathan ; amd-
> g...@lists.freedesktop.org
> Cc: Wong, Alice ; Kuehling, Felix
> ; Kasiviswanathan, Harish
> 
> Subject: Re: [PATCH] drm/amdkfd: fix mes set shader debugger process
> management
>
>
> On 2023-12-11 16:16, Jonathan Kim wrote:
> > MES provides the driver a call to explicitly flush stale process
> > memory within the MES to avoid a race condition that results in a
> > fatal memory violation.
> >
> > When SET_SHADER_DEBUGGER is called, the driver passes a memory
> address
> > that represents a process context address MES uses to keep track of
> > future per-process calls.
> >
> > Normally, MES will purge its process context list when the last queue
> > has been removed.  The driver, however, can call SET_SHADER_DEBUGGER
> > regardless of whether a queue has been added or not.
> >
> > If SET_SHADER_DEBUGGER has been called with no queues as the last call
> > prior to process termination, the passed process context address will
> > still reside within MES.
> >
> > On a new process call to SET_SHADER_DEBUGGER, the driver may end up
> > passing an identical process context address value (based on
> > per-process gpu memory address) to MES but is now pointing to a new
> > allocated buffer object during KFD process creation.  Since the MES is
> > unaware of this, access of the passed address points to the stale
> > object within MES and triggers a fatal memory violation.
> >
> > The solution is for KFD to explicitly flush the process context
> > address from MES on process termination.
> >
> > Note that the flush call and the MES debugger calls use the same MES
> > interface but are separated as KFD calls to avoid conflicting with
> > each other.
> >
> > Signed-off-by: Jonathan Kim 
> > Tested-by: Alice Wong 
> Reviewed-by: Eric Huang 
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c   | 31
> +++
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h   | 10 +++---
> >   .../amd/amdkfd/kfd_process_queue_manager.c|  1 +
> >   drivers/gpu/drm/amd/include/mes_v11_api_def.h |  3 +-
> >   4 files changed, 40 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> > index e544b823abf6..e98de23250dc 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> > @@ -916,6 +916,11 @@ int amdgpu_mes_set_shader_debugger(struct
> amdgpu_device *adev,
> >   op_input.op = MES_MISC_OP_SET_SHADER_DEBUGGER;
> >   op_input.set_shader_debugger.process_context_addr =
> process_context_addr;
> >   op_input.set_shader_debugger.flags.u32all = flags;
> > +
> > + /* use amdgpu mes_flush_shader_debugger instead */
> > + if (

RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management

2023-12-12 Thread Liu, Shaoyun
[AMD Official Use Only - General]

Does this requires the  new MES FW for this process_ctx_flush  requirement ?  
Can driver side add logic to guaranty when  call SET_SHADER_DEBUGGER, the 
process address  is always valid ?

Regards
Shaoyun.liu


-Original Message-
From: amd-gfx  On Behalf Of Eric Huang
Sent: Tuesday, December 12, 2023 12:49 PM
To: Kim, Jonathan ; amd-gfx@lists.freedesktop.org
Cc: Wong, Alice ; Kuehling, Felix 
; Kasiviswanathan, Harish 

Subject: Re: [PATCH] drm/amdkfd: fix mes set shader debugger process management


On 2023-12-11 16:16, Jonathan Kim wrote:
> MES provides the driver a call to explicitly flush stale process
> memory within the MES to avoid a race condition that results in a
> fatal memory violation.
>
> When SET_SHADER_DEBUGGER is called, the driver passes a memory address
> that represents a process context address MES uses to keep track of
> future per-process calls.
>
> Normally, MES will purge its process context list when the last queue
> has been removed.  The driver, however, can call SET_SHADER_DEBUGGER
> regardless of whether a queue has been added or not.
>
> If SET_SHADER_DEBUGGER has been called with no queues as the last call
> prior to process termination, the passed process context address will
> still reside within MES.
>
> On a new process call to SET_SHADER_DEBUGGER, the driver may end up
> passing an identical process context address value (based on
> per-process gpu memory address) to MES but is now pointing to a new
> allocated buffer object during KFD process creation.  Since the MES is
> unaware of this, access of the passed address points to the stale
> object within MES and triggers a fatal memory violation.
>
> The solution is for KFD to explicitly flush the process context
> address from MES on process termination.
>
> Note that the flush call and the MES debugger calls use the same MES
> interface but are separated as KFD calls to avoid conflicting with
> each other.
>
> Signed-off-by: Jonathan Kim 
> Tested-by: Alice Wong 
Reviewed-by: Eric Huang 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c   | 31 +++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h   | 10 +++---
>   .../amd/amdkfd/kfd_process_queue_manager.c|  1 +
>   drivers/gpu/drm/amd/include/mes_v11_api_def.h |  3 +-
>   4 files changed, 40 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> index e544b823abf6..e98de23250dc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> @@ -916,6 +916,11 @@ int amdgpu_mes_set_shader_debugger(struct amdgpu_device 
> *adev,
>   op_input.op = MES_MISC_OP_SET_SHADER_DEBUGGER;
>   op_input.set_shader_debugger.process_context_addr = 
> process_context_addr;
>   op_input.set_shader_debugger.flags.u32all = flags;
> +
> + /* use amdgpu mes_flush_shader_debugger instead */
> + if (op_input.set_shader_debugger.flags.process_ctx_flush)
> + return -EINVAL;
> +
>   op_input.set_shader_debugger.spi_gdbg_per_vmid_cntl = 
> spi_gdbg_per_vmid_cntl;
>   memcpy(op_input.set_shader_debugger.tcp_watch_cntl, tcp_watch_cntl,
>   sizeof(op_input.set_shader_debugger.tcp_watch_cntl));
> @@ -935,6 +940,32 @@ int amdgpu_mes_set_shader_debugger(struct amdgpu_device 
> *adev,
>   return r;
>   }
>
> +int amdgpu_mes_flush_shader_debugger(struct amdgpu_device *adev,
> +  uint64_t process_context_addr) {
> + struct mes_misc_op_input op_input = {0};
> + int r;
> +
> + if (!adev->mes.funcs->misc_op) {
> + DRM_ERROR("mes flush shader debugger is not supported!\n");
> + return -EINVAL;
> + }
> +
> + op_input.op = MES_MISC_OP_SET_SHADER_DEBUGGER;
> + op_input.set_shader_debugger.process_context_addr = 
> process_context_addr;
> + op_input.set_shader_debugger.flags.process_ctx_flush = true;
> +
> + amdgpu_mes_lock(>mes);
> +
> + r = adev->mes.funcs->misc_op(>mes, _input);
> + if (r)
> + DRM_ERROR("failed to set_shader_debugger\n");
> +
> + amdgpu_mes_unlock(>mes);
> +
> + return r;
> +}
> +
>   static void
>   amdgpu_mes_ring_to_queue_props(struct amdgpu_device *adev,
>  struct amdgpu_ring *ring,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> index 894b9b133000..7d4f93fea937 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> @@ -296,9 +296,10 @@ struct mes_mi

Re: [PATCH] drm/amdkfd: fix mes set shader debugger process management

2023-12-12 Thread Eric Huang



On 2023-12-11 16:16, Jonathan Kim wrote:

MES provides the driver a call to explicitly flush stale process memory
within the MES to avoid a race condition that results in a fatal
memory violation.

When SET_SHADER_DEBUGGER is called, the driver passes a memory address
that represents a process context address MES uses to keep track of
future per-process calls.

Normally, MES will purge its process context list when the last queue
has been removed.  The driver, however, can call SET_SHADER_DEBUGGER
regardless of whether a queue has been added or not.

If SET_SHADER_DEBUGGER has been called with no queues as the last call
prior to process termination, the passed process context address will
still reside within MES.

On a new process call to SET_SHADER_DEBUGGER, the driver may end up
passing an identical process context address value (based on per-process
gpu memory address) to MES but is now pointing to a new allocated buffer
object during KFD process creation.  Since the MES is unaware of this,
access of the passed address points to the stale object within MES and
triggers a fatal memory violation.

The solution is for KFD to explicitly flush the process context address
from MES on process termination.

Note that the flush call and the MES debugger calls use the same MES
interface but are separated as KFD calls to avoid conflicting with each
other.

Signed-off-by: Jonathan Kim 
Tested-by: Alice Wong 

Reviewed-by: Eric Huang 

---
  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c   | 31 +++
  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h   | 10 +++---
  .../amd/amdkfd/kfd_process_queue_manager.c|  1 +
  drivers/gpu/drm/amd/include/mes_v11_api_def.h |  3 +-
  4 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
index e544b823abf6..e98de23250dc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
@@ -916,6 +916,11 @@ int amdgpu_mes_set_shader_debugger(struct amdgpu_device 
*adev,
op_input.op = MES_MISC_OP_SET_SHADER_DEBUGGER;
op_input.set_shader_debugger.process_context_addr = 
process_context_addr;
op_input.set_shader_debugger.flags.u32all = flags;
+
+   /* use amdgpu mes_flush_shader_debugger instead */
+   if (op_input.set_shader_debugger.flags.process_ctx_flush)
+   return -EINVAL;
+
op_input.set_shader_debugger.spi_gdbg_per_vmid_cntl = 
spi_gdbg_per_vmid_cntl;
memcpy(op_input.set_shader_debugger.tcp_watch_cntl, tcp_watch_cntl,
sizeof(op_input.set_shader_debugger.tcp_watch_cntl));
@@ -935,6 +940,32 @@ int amdgpu_mes_set_shader_debugger(struct amdgpu_device 
*adev,
return r;
  }
  
+int amdgpu_mes_flush_shader_debugger(struct amdgpu_device *adev,

+uint64_t process_context_addr)
+{
+   struct mes_misc_op_input op_input = {0};
+   int r;
+
+   if (!adev->mes.funcs->misc_op) {
+   DRM_ERROR("mes flush shader debugger is not supported!\n");
+   return -EINVAL;
+   }
+
+   op_input.op = MES_MISC_OP_SET_SHADER_DEBUGGER;
+   op_input.set_shader_debugger.process_context_addr = 
process_context_addr;
+   op_input.set_shader_debugger.flags.process_ctx_flush = true;
+
+   amdgpu_mes_lock(>mes);
+
+   r = adev->mes.funcs->misc_op(>mes, _input);
+   if (r)
+   DRM_ERROR("failed to set_shader_debugger\n");
+
+   amdgpu_mes_unlock(>mes);
+
+   return r;
+}
+
  static void
  amdgpu_mes_ring_to_queue_props(struct amdgpu_device *adev,
   struct amdgpu_ring *ring,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
index 894b9b133000..7d4f93fea937 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
@@ -296,9 +296,10 @@ struct mes_misc_op_input {
uint64_t process_context_addr;
union {
struct {
-   uint64_t single_memop : 1;
-   uint64_t single_alu_op : 1;
-   uint64_t reserved: 30;
+   uint32_t single_memop : 1;
+   uint32_t single_alu_op : 1;
+   uint32_t reserved: 29;
+   uint32_t process_ctx_flush: 1;
};
uint32_t u32all;
} flags;
@@ -374,7 +375,8 @@ int amdgpu_mes_set_shader_debugger(struct amdgpu_device 
*adev,
const uint32_t *tcp_watch_cntl,
uint32_t flags,
bool trap_en);
-
+int amdgpu_mes_flush_shader_debugger(struct amdgpu_device