[Public]

Shouldn't the driver side  remove all the remaining  queues for the process 
during  process termination ?  If all the  queues been removed for the process 
,  MES should purge the  process context automatically , otherwise it's bug 
inside MES .

Regard
Sshaoyun.liu

-----Original Message-----
From: Kim, Jonathan <jonathan....@amd.com>
Sent: Tuesday, December 12, 2023 4:33 PM
To: Liu, Shaoyun <shaoyun....@amd.com>; Huang, JinHuiEric 
<jinhuieric.hu...@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Wong, Alice <shiwei.w...@amd.com>; Kuehling, Felix 
<felix.kuehl...@amd.com>; Kasiviswanathan, Harish 
<harish.kasiviswanat...@amd.com>
Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management

[Public]

> -----Original Message-----
> From: Liu, Shaoyun <shaoyun....@amd.com>
> Sent: Tuesday, December 12, 2023 4:00 PM
> To: Huang, JinHuiEric <jinhuieric.hu...@amd.com>; Kim, Jonathan
> <jonathan....@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Wong, Alice <shiwei.w...@amd.com>; Kuehling, Felix
> <felix.kuehl...@amd.com>; Kasiviswanathan, Harish
> <harish.kasiviswanat...@amd.com>
> Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process
> management
>
> [AMD Official Use Only - General]
>
> Does this requires the  new MES FW for this process_ctx_flush
> requirement ?  Can driver side add logic to guaranty when  call
> SET_SHADER_DEBUGGER, the process address  is always valid ?

Call to flush on old fw is a NOP so it's harmless in that case.
Full solution will still require a new MES version as this is a workaround on 
corner cases and not a new feature i.e. we can't stop ROCm from running on old 
fw.
The process address is always valid from the driver side.  It's the MES side of 
things that gets stale as mentioned in the description (passed value to MES is 
reused with new BO but MES doesn't refresh).
i.e. MES auto refreshes it's process list assuming process queues were all 
drained but driver can't guarantee that SET_SHADER_DEBUGGER (which adds to 
MES's process list) will get called after queues get added (in fact it's a 
requirements that it can be called at any time).
We can attempt to defer calls these calls in the KFD, considering all cases.
But that would be a large shift in debugger/runtime_enable/KFD code, which is 
already complicated and could get buggy plus it would not be intuitive at all 
as to why we're doing this.
I think a single flag set to flush MES on process termination is a simpler 
compromise that shows the limitation in a more obvious way.

Thanks,

Jon


>
> Regards
> Shaoyun.liu
>
>
> -----Original Message-----
> From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of
> Eric Huang
> Sent: Tuesday, December 12, 2023 12:49 PM
> To: Kim, Jonathan <jonathan....@amd.com>; amd-
> g...@lists.freedesktop.org
> Cc: Wong, Alice <shiwei.w...@amd.com>; Kuehling, Felix
> <felix.kuehl...@amd.com>; Kasiviswanathan, Harish
> <harish.kasiviswanat...@amd.com>
> Subject: Re: [PATCH] drm/amdkfd: fix mes set shader debugger process
> management
>
>
> On 2023-12-11 16:16, Jonathan Kim wrote:
> > MES provides the driver a call to explicitly flush stale process
> > memory within the MES to avoid a race condition that results in a
> > fatal memory violation.
> >
> > When SET_SHADER_DEBUGGER is called, the driver passes a memory
> address
> > that represents a process context address MES uses to keep track of
> > future per-process calls.
> >
> > Normally, MES will purge its process context list when the last
> > queue has been removed.  The driver, however, can call
> > SET_SHADER_DEBUGGER regardless of whether a queue has been added or not.
> >
> > If SET_SHADER_DEBUGGER has been called with no queues as the last
> > call prior to process termination, the passed process context
> > address will still reside within MES.
> >
> > On a new process call to SET_SHADER_DEBUGGER, the driver may end up
> > passing an identical process context address value (based on
> > per-process gpu memory address) to MES but is now pointing to a new
> > allocated buffer object during KFD process creation.  Since the MES
> > is unaware of this, access of the passed address points to the stale
> > object within MES and triggers a fatal memory violation.
> >
> > The solution is for KFD to explicitly flush the process context
> > address from MES on process termination.
> >
> > Note that the flush call and the MES debugger calls use the same MES
> > interface but are separated as KFD calls to avoid conflicting with
> > each other.
> >
> > Signed-off-by: Jonathan Kim <jonathan....@amd.com>
> > Tested-by: Alice Wong <shiwei.w...@amd.com>
> Reviewed-by: Eric Huang <jinhuieric.hu...@amd.com>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c       | 31
> +++++++++++++++++++
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h       | 10 +++---
> >   .../amd/amdkfd/kfd_process_queue_manager.c    |  1 +
> >   drivers/gpu/drm/amd/include/mes_v11_api_def.h |  3 +-
> >   4 files changed, 40 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> > index e544b823abf6..e98de23250dc 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> > @@ -916,6 +916,11 @@ int amdgpu_mes_set_shader_debugger(struct
> amdgpu_device *adev,
> >       op_input.op = MES_MISC_OP_SET_SHADER_DEBUGGER;
> >       op_input.set_shader_debugger.process_context_addr =
> process_context_addr;
> >       op_input.set_shader_debugger.flags.u32all = flags;
> > +
> > +     /* use amdgpu mes_flush_shader_debugger instead */
> > +     if (op_input.set_shader_debugger.flags.process_ctx_flush)
> > +             return -EINVAL;
> > +
> >       op_input.set_shader_debugger.spi_gdbg_per_vmid_cntl =
> spi_gdbg_per_vmid_cntl;
> >       memcpy(op_input.set_shader_debugger.tcp_watch_cntl,
> tcp_watch_cntl,
> >
> > sizeof(op_input.set_shader_debugger.tcp_watch_cntl));
> > @@ -935,6 +940,32 @@ int amdgpu_mes_set_shader_debugger(struct
> amdgpu_device *adev,
> >       return r;
> >   }
> >
> > +int amdgpu_mes_flush_shader_debugger(struct amdgpu_device *adev,
> > +                                  uint64_t process_context_addr) {
> > +     struct mes_misc_op_input op_input = {0};
> > +     int r;
> > +
> > +     if (!adev->mes.funcs->misc_op) {
> > +             DRM_ERROR("mes flush shader debugger is not supported!\n");
> > +             return -EINVAL;
> > +     }
> > +
> > +     op_input.op = MES_MISC_OP_SET_SHADER_DEBUGGER;
> > +     op_input.set_shader_debugger.process_context_addr =
> process_context_addr;
> > +     op_input.set_shader_debugger.flags.process_ctx_flush = true;
> > +
> > +     amdgpu_mes_lock(&adev->mes);
> > +
> > +     r = adev->mes.funcs->misc_op(&adev->mes, &op_input);
> > +     if (r)
> > +             DRM_ERROR("failed to set_shader_debugger\n");
> > +
> > +     amdgpu_mes_unlock(&adev->mes);
> > +
> > +     return r;
> > +}
> > +
> >   static void
> >   amdgpu_mes_ring_to_queue_props(struct amdgpu_device *adev,
> >                              struct amdgpu_ring *ring, diff --git
> > a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> > index 894b9b133000..7d4f93fea937 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> > @@ -296,9 +296,10 @@ struct mes_misc_op_input {
> >                       uint64_t process_context_addr;
> >                       union {
> >                               struct {
> > -                                     uint64_t single_memop : 1;
> > -                                     uint64_t single_alu_op : 1;
> > -                                     uint64_t reserved: 30;
> > +                                     uint32_t single_memop : 1;
> > +                                     uint32_t single_alu_op : 1;
> > +                                     uint32_t reserved: 29;
> > +                                     uint32_t process_ctx_flush: 1;
> >                               };
> >                               uint32_t u32all;
> >                       } flags;
> > @@ -374,7 +375,8 @@ int amdgpu_mes_set_shader_debugger(struct
> amdgpu_device *adev,
> >                               const uint32_t *tcp_watch_cntl,
> >                               uint32_t flags,
> >                               bool trap_en);
> > -
> > +int amdgpu_mes_flush_shader_debugger(struct amdgpu_device *adev,
> > +                             uint64_t process_context_addr);
> >   int amdgpu_mes_add_ring(struct amdgpu_device *adev, int gang_id,
> >                       int queue_type, int idx,
> >                       struct amdgpu_mes_ctx_data *ctx_data, diff
> > --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> > b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> > index 77f493262e05..8e55e78fce4e 100644
> > --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> > @@ -87,6 +87,7 @@ void kfd_process_dequeue_from_device(struct
> kfd_process_device *pdd)
> >               return;
> >
> >       dev->dqm->ops.process_termination(dev->dqm, &pdd->qpd);
> > +     amdgpu_mes_flush_shader_debugger(dev->adev, pdd-
> >proc_ctx_gpu_addr);
> >       pdd->already_dequeued = true;
> >   }
> >
> > diff --git a/drivers/gpu/drm/amd/include/mes_v11_api_def.h
> > b/drivers/gpu/drm/amd/include/mes_v11_api_def.h
> > index 1fbfd1aa987e..ec5b9ab67c5e 100644
> > --- a/drivers/gpu/drm/amd/include/mes_v11_api_def.h
> > +++ b/drivers/gpu/drm/amd/include/mes_v11_api_def.h
> > @@ -572,7 +572,8 @@ struct SET_SHADER_DEBUGGER {
> >               struct {
> >                       uint32_t single_memop : 1;  /* SQ_DEBUG.single_memop 
> > */
> >                       uint32_t single_alu_op : 1; /* SQ_DEBUG.single_alu_op 
> > */
> > -                     uint32_t reserved : 30;
> > +                     uint32_t reserved : 29;
> > +                     uint32_t process_ctx_flush : 1;
> >               };
> >               uint32_t u32all;
> >       } flags;
>


Reply via email to