[Public] > -----Original Message----- > From: Liu, Shaoyun <shaoyun....@amd.com> > Sent: Tuesday, December 12, 2023 4:45 PM > To: Kim, Jonathan <jonathan....@amd.com>; Huang, JinHuiEric > <jinhuieric.hu...@amd.com>; amd-gfx@lists.freedesktop.org > Cc: Wong, Alice <shiwei.w...@amd.com>; Kuehling, Felix > <felix.kuehl...@amd.com>; Kasiviswanathan, Harish > <harish.kasiviswanat...@amd.com> > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process > management > > [Public] > > Shouldn't the driver side remove all the remaining queues for the process > during process termination ? If all the queues been removed for the > process , MES should purge the process context automatically , otherwise > it's bug inside MES .
That's only if there were queues added to begin with. Jon > > Regard > Sshaoyun.liu > > -----Original Message----- > From: Kim, Jonathan <jonathan....@amd.com> > Sent: Tuesday, December 12, 2023 4:33 PM > To: Liu, Shaoyun <shaoyun....@amd.com>; Huang, JinHuiEric > <jinhuieric.hu...@amd.com>; amd-gfx@lists.freedesktop.org > Cc: Wong, Alice <shiwei.w...@amd.com>; Kuehling, Felix > <felix.kuehl...@amd.com>; Kasiviswanathan, Harish > <harish.kasiviswanat...@amd.com> > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process > management > > [Public] > > > -----Original Message----- > > From: Liu, Shaoyun <shaoyun....@amd.com> > > Sent: Tuesday, December 12, 2023 4:00 PM > > To: Huang, JinHuiEric <jinhuieric.hu...@amd.com>; Kim, Jonathan > > <jonathan....@amd.com>; amd-gfx@lists.freedesktop.org > > Cc: Wong, Alice <shiwei.w...@amd.com>; Kuehling, Felix > > <felix.kuehl...@amd.com>; Kasiviswanathan, Harish > > <harish.kasiviswanat...@amd.com> > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process > > management > > > > [AMD Official Use Only - General] > > > > Does this requires the new MES FW for this process_ctx_flush > > requirement ? Can driver side add logic to guaranty when call > > SET_SHADER_DEBUGGER, the process address is always valid ? > > Call to flush on old fw is a NOP so it's harmless in that case. > Full solution will still require a new MES version as this is a workaround on > corner cases and not a new feature i.e. we can't stop ROCm from running on > old fw. > The process address is always valid from the driver side. It's the MES side > of > things that gets stale as mentioned in the description (passed value to MES > is reused with new BO but MES doesn't refresh). > i.e. MES auto refreshes it's process list assuming process queues were all > drained but driver can't guarantee that SET_SHADER_DEBUGGER (which > adds to MES's process list) will get called after queues get added (in fact > it's > a requirements that it can be called at any time). > We can attempt to defer calls these calls in the KFD, considering all cases. > But that would be a large shift in debugger/runtime_enable/KFD code, > which is already complicated and could get buggy plus it would not be > intuitive at all as to why we're doing this. > I think a single flag set to flush MES on process termination is a simpler > compromise that shows the limitation in a more obvious way. > > Thanks, > > Jon > > > > > > Regards > > Shaoyun.liu > > > > > > -----Original Message----- > > From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of > > Eric Huang > > Sent: Tuesday, December 12, 2023 12:49 PM > > To: Kim, Jonathan <jonathan....@amd.com>; amd- > > g...@lists.freedesktop.org > > Cc: Wong, Alice <shiwei.w...@amd.com>; Kuehling, Felix > > <felix.kuehl...@amd.com>; Kasiviswanathan, Harish > > <harish.kasiviswanat...@amd.com> > > Subject: Re: [PATCH] drm/amdkfd: fix mes set shader debugger process > > management > > > > > > On 2023-12-11 16:16, Jonathan Kim wrote: > > > MES provides the driver a call to explicitly flush stale process > > > memory within the MES to avoid a race condition that results in a > > > fatal memory violation. > > > > > > When SET_SHADER_DEBUGGER is called, the driver passes a memory > > address > > > that represents a process context address MES uses to keep track of > > > future per-process calls. > > > > > > Normally, MES will purge its process context list when the last > > > queue has been removed. The driver, however, can call > > > SET_SHADER_DEBUGGER regardless of whether a queue has been added > or not. > > > > > > If SET_SHADER_DEBUGGER has been called with no queues as the last > > > call prior to process termination, the passed process context > > > address will still reside within MES. > > > > > > On a new process call to SET_SHADER_DEBUGGER, the driver may end > up > > > passing an identical process context address value (based on > > > per-process gpu memory address) to MES but is now pointing to a new > > > allocated buffer object during KFD process creation. Since the MES > > > is unaware of this, access of the passed address points to the stale > > > object within MES and triggers a fatal memory violation. > > > > > > The solution is for KFD to explicitly flush the process context > > > address from MES on process termination. > > > > > > Note that the flush call and the MES debugger calls use the same MES > > > interface but are separated as KFD calls to avoid conflicting with > > > each other. > > > > > > Signed-off-by: Jonathan Kim <jonathan....@amd.com> > > > Tested-by: Alice Wong <shiwei.w...@amd.com> > > Reviewed-by: Eric Huang <jinhuieric.hu...@amd.com> > > > --- > > > drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 31 > > +++++++++++++++++++ > > > drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h | 10 +++--- > > > .../amd/amdkfd/kfd_process_queue_manager.c | 1 + > > > drivers/gpu/drm/amd/include/mes_v11_api_def.h | 3 +- > > > 4 files changed, 40 insertions(+), 5 deletions(-) > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > > > index e544b823abf6..e98de23250dc 100644 > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > > > @@ -916,6 +916,11 @@ int amdgpu_mes_set_shader_debugger(struct > > amdgpu_device *adev, > > > op_input.op = MES_MISC_OP_SET_SHADER_DEBUGGER; > > > op_input.set_shader_debugger.process_context_addr = > > process_context_addr; > > > op_input.set_shader_debugger.flags.u32all = flags; > > > + > > > + /* use amdgpu mes_flush_shader_debugger instead */ > > > + if (op_input.set_shader_debugger.flags.process_ctx_flush) > > > + return -EINVAL; > > > + > > > op_input.set_shader_debugger.spi_gdbg_per_vmid_cntl = > > spi_gdbg_per_vmid_cntl; > > > memcpy(op_input.set_shader_debugger.tcp_watch_cntl, > > tcp_watch_cntl, > > > > > > sizeof(op_input.set_shader_debugger.tcp_watch_cntl)); > > > @@ -935,6 +940,32 @@ int amdgpu_mes_set_shader_debugger(struct > > amdgpu_device *adev, > > > return r; > > > } > > > > > > +int amdgpu_mes_flush_shader_debugger(struct amdgpu_device *adev, > > > + uint64_t process_context_addr) { > > > + struct mes_misc_op_input op_input = {0}; > > > + int r; > > > + > > > + if (!adev->mes.funcs->misc_op) { > > > + DRM_ERROR("mes flush shader debugger is not supported!\n"); > > > + return -EINVAL; > > > + } > > > + > > > + op_input.op = MES_MISC_OP_SET_SHADER_DEBUGGER; > > > + op_input.set_shader_debugger.process_context_addr = > > process_context_addr; > > > + op_input.set_shader_debugger.flags.process_ctx_flush = true; > > > + > > > + amdgpu_mes_lock(&adev->mes); > > > + > > > + r = adev->mes.funcs->misc_op(&adev->mes, &op_input); > > > + if (r) > > > + DRM_ERROR("failed to set_shader_debugger\n"); > > > + > > > + amdgpu_mes_unlock(&adev->mes); > > > + > > > + return r; > > > +} > > > + > > > static void > > > amdgpu_mes_ring_to_queue_props(struct amdgpu_device *adev, > > > struct amdgpu_ring *ring, diff --git > > > a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h > > > index 894b9b133000..7d4f93fea937 100644 > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h > > > @@ -296,9 +296,10 @@ struct mes_misc_op_input { > > > uint64_t process_context_addr; > > > union { > > > struct { > > > - uint64_t single_memop : 1; > > > - uint64_t single_alu_op : 1; > > > - uint64_t reserved: 30; > > > + uint32_t single_memop : 1; > > > + uint32_t single_alu_op : 1; > > > + uint32_t reserved: 29; > > > + uint32_t process_ctx_flush: 1; > > > }; > > > uint32_t u32all; > > > } flags; > > > @@ -374,7 +375,8 @@ int amdgpu_mes_set_shader_debugger(struct > > amdgpu_device *adev, > > > const uint32_t *tcp_watch_cntl, > > > uint32_t flags, > > > bool trap_en); > > > - > > > +int amdgpu_mes_flush_shader_debugger(struct amdgpu_device *adev, > > > + uint64_t process_context_addr); > > > int amdgpu_mes_add_ring(struct amdgpu_device *adev, int gang_id, > > > int queue_type, int idx, > > > struct amdgpu_mes_ctx_data *ctx_data, diff > > > --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c > > > b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c > > > index 77f493262e05..8e55e78fce4e 100644 > > > --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c > > > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c > > > @@ -87,6 +87,7 @@ void kfd_process_dequeue_from_device(struct > > kfd_process_device *pdd) > > > return; > > > > > > dev->dqm->ops.process_termination(dev->dqm, &pdd->qpd); > > > + amdgpu_mes_flush_shader_debugger(dev->adev, pdd- > > >proc_ctx_gpu_addr); > > > pdd->already_dequeued = true; > > > } > > > > > > diff --git a/drivers/gpu/drm/amd/include/mes_v11_api_def.h > > > b/drivers/gpu/drm/amd/include/mes_v11_api_def.h > > > index 1fbfd1aa987e..ec5b9ab67c5e 100644 > > > --- a/drivers/gpu/drm/amd/include/mes_v11_api_def.h > > > +++ b/drivers/gpu/drm/amd/include/mes_v11_api_def.h > > > @@ -572,7 +572,8 @@ struct SET_SHADER_DEBUGGER { > > > struct { > > > uint32_t single_memop : 1; /* SQ_DEBUG.single_memop > */ > > > uint32_t single_alu_op : 1; /* > > > SQ_DEBUG.single_alu_op */ > > > - uint32_t reserved : 30; > > > + uint32_t reserved : 29; > > > + uint32_t process_ctx_flush : 1; > > > }; > > > uint32_t u32all; > > > } flags; > > > >