RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management
[Public] Check the MES API, It's my fault , originally I think it will use trap_en as parameter to tell MES to enable/disable shader debugger , but actually it's not . So we either need to add a parameter for this (ex . add flag for enable/disable) or as your solution add flag for flush. Consider it's possible user process can be killed after call set_shader but before any add_queue , then notify MES to do a process context flush after process termination seems more reasonable . Regards Shaoyun.liu From: Kim, Jonathan Sent: Tuesday, December 12, 2023 11:43 PM To: Liu, Shaoyun ; Huang, JinHuiEric ; amd-gfx@lists.freedesktop.org Cc: Wong, Alice ; Kuehling, Felix ; Kasiviswanathan, Harish Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management [Public] Again, MES only knows to flush if there was something enqueued in the first place. SET_SHADER dictates what's on the process list. SET_SHADER can be the last call prior to process termination with nothing enqueued, hence no MES auto flush occurs. MES doesn't block anything on the flush flag request. The driver guarantees that flush is only done on process termination after device dequeue, whether there were queues or not. MES has no idea what an invalid context is. It just has a value stored in its linked list that's associated with a driver allocated BO that no longer exists after process termination. If you're still not sure about this solution, then this should be discussed offline with the MES team. We're not going to gain ground discussing this here. The solution has already been merged. Feel free to propose a better solution if you're not satisfied with this one. Jon From: Liu, Shaoyun mailto:shaoyun@amd.com>> Sent: Tuesday, December 12, 2023 11:08 PM To: Kim, Jonathan mailto:jonathan@amd.com>>; Huang, JinHuiEric mailto:jinhuieric.hu...@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> Cc: Wong, Alice mailto:shiwei.w...@amd.com>>; Kuehling, Felix mailto:felix.kuehl...@amd.com>>; Kasiviswanathan, Harish mailto:harish.kasiviswanat...@amd.com>> Subject: Re: [PATCH] drm/amdkfd: fix mes set shader debugger process management [Public] You try to add one new interface to inform mes about the context flush after driver side finish process termination , from my understanding, mes already know the process context need to be purged after all the related queues been removed even without this notification. What do you expect mes to do about this context flush flag ? Mes should block this process context for next set_sched command? Mes can achieve this by ignore the set_sched command with trap disable parameter on an invalid process context . Shaoyun.liu Get Outlook for iOS<https://aka.ms/o0ukef> From: Kim, Jonathan mailto:jonathan@amd.com>> Sent: Tuesday, December 12, 2023 8:19:09 PM To: Liu, Shaoyun mailto:shaoyun@amd.com>>; Huang, JinHuiEric mailto:jinhuieric.hu...@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> mailto:amd-gfx@lists.freedesktop.org>> Cc: Wong, Alice mailto:shiwei.w...@amd.com>>; Kuehling, Felix mailto:felix.kuehl...@amd.com>>; Kasiviswanathan, Harish mailto:harish.kasiviswanat...@amd.com>> Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management [Public] > -Original Message- > From: Liu, Shaoyun mailto:shaoyun@amd.com>> > Sent: Tuesday, December 12, 2023 7:08 PM > To: Kim, Jonathan mailto:jonathan@amd.com>>; Huang, > JinHuiEric > mailto:jinhuieric.hu...@amd.com>>; > amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> > Cc: Wong, Alice mailto:shiwei.w...@amd.com>>; Kuehling, > Felix > mailto:felix.kuehl...@amd.com>>; Kasiviswanathan, > Harish > mailto:harish.kasiviswanat...@amd.com>> > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process > management > > [Public] > > I see, so the problem is after process context , set_shader been called > with > disable parameter, do you know the reason why MES re-added the > process context into the list ? Because MES has no idea what disable means. All it knows is that without the flush flag, set_shader should update the necessary per-VMID (process) registers as requested by the driver, which requires persistent per-process HW settings so that potential future waves can inherit those settings i.e. ADD_QUEUE.skip_process_ctx_clear is set (why ADD_QUEUE auto clears the process context otherwise is another long story, basically an unsolvable MES cache bug problem). Common use case example: add_queue -> set_shader call either transiently stalls the SPI per-VMID or transiently dequeues the HWS per-VMID depending on the request settings
RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management
[Public] Again, MES only knows to flush if there was something enqueued in the first place. SET_SHADER dictates what's on the process list. SET_SHADER can be the last call prior to process termination with nothing enqueued, hence no MES auto flush occurs. MES doesn't block anything on the flush flag request. The driver guarantees that flush is only done on process termination after device dequeue, whether there were queues or not. MES has no idea what an invalid context is. It just has a value stored in its linked list that's associated with a driver allocated BO that no longer exists after process termination. If you're still not sure about this solution, then this should be discussed offline with the MES team. We're not going to gain ground discussing this here. The solution has already been merged. Feel free to propose a better solution if you're not satisfied with this one. Jon From: Liu, Shaoyun Sent: Tuesday, December 12, 2023 11:08 PM To: Kim, Jonathan ; Huang, JinHuiEric ; amd-gfx@lists.freedesktop.org Cc: Wong, Alice ; Kuehling, Felix ; Kasiviswanathan, Harish Subject: Re: [PATCH] drm/amdkfd: fix mes set shader debugger process management [Public] You try to add one new interface to inform mes about the context flush after driver side finish process termination , from my understanding, mes already know the process context need to be purged after all the related queues been removed even without this notification. What do you expect mes to do about this context flush flag ? Mes should block this process context for next set_sched command? Mes can achieve this by ignore the set_sched command with trap disable parameter on an invalid process context . Shaoyun.liu Get Outlook for iOS<https://aka.ms/o0ukef> From: Kim, Jonathan mailto:jonathan@amd.com>> Sent: Tuesday, December 12, 2023 8:19:09 PM To: Liu, Shaoyun mailto:shaoyun@amd.com>>; Huang, JinHuiEric mailto:jinhuieric.hu...@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> mailto:amd-gfx@lists.freedesktop.org>> Cc: Wong, Alice mailto:shiwei.w...@amd.com>>; Kuehling, Felix mailto:felix.kuehl...@amd.com>>; Kasiviswanathan, Harish mailto:harish.kasiviswanat...@amd.com>> Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management [Public] > -Original Message- > From: Liu, Shaoyun mailto:shaoyun@amd.com>> > Sent: Tuesday, December 12, 2023 7:08 PM > To: Kim, Jonathan mailto:jonathan@amd.com>>; Huang, > JinHuiEric > mailto:jinhuieric.hu...@amd.com>>; > amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> > Cc: Wong, Alice mailto:shiwei.w...@amd.com>>; Kuehling, > Felix > mailto:felix.kuehl...@amd.com>>; Kasiviswanathan, > Harish > mailto:harish.kasiviswanat...@amd.com>> > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process > management > > [Public] > > I see, so the problem is after process context , set_shader been called > with > disable parameter, do you know the reason why MES re-added the > process context into the list ? Because MES has no idea what disable means. All it knows is that without the flush flag, set_shader should update the necessary per-VMID (process) registers as requested by the driver, which requires persistent per-process HW settings so that potential future waves can inherit those settings i.e. ADD_QUEUE.skip_process_ctx_clear is set (why ADD_QUEUE auto clears the process context otherwise is another long story, basically an unsolvable MES cache bug problem). Common use case example: add_queue -> set_shader call either transiently stalls the SPI per-VMID or transiently dequeues the HWS per-VMID depending on the request settings -> fulfils the per-VMID register write updates -> resumes process queues so that potential waves on those queues inherit new debug settings. You can't do this kind of operation at the queue level alone. The problem that this patch solves (along with the MES FW upgrade) is an unfortunate quirk of having to operate between process (debug requests) and queue space (non-debug requests). Old HWS used to operate at the per-process level via MAP_PROCESS so it was a lot easier to balance debug versus non-debug requests back then (but it was also lot less efficient performance wise). Jon > > Shaoyun.liu > > -Original Message- > From: Kim, Jonathan mailto:jonathan@amd.com>> > Sent: Tuesday, December 12, 2023 6:07 PM > To: Liu, Shaoyun mailto:shaoyun@amd.com>>; Huang, > JinHuiEric > mailto:jinhuieric.hu...@amd.com>>; > amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> > Cc: Wong, Alice mailto:shiwei.w...@amd.com>>; Kuehling, > Felix > mailto:f
Re: [PATCH] drm/amdkfd: fix mes set shader debugger process management
[Public] You try to add one new interface to inform mes about the context flush after driver side finish process termination , from my understanding, mes already know the process context need to be purged after all the related queues been removed even without this notification. What do you expect mes to do about this context flush flag ? Mes should block this process context for next set_sched command? Mes can achieve this by ignore the set_sched command with trap disable parameter on an invalid process context . Shaoyun.liu Get Outlook for iOS<https://aka.ms/o0ukef> From: Kim, Jonathan Sent: Tuesday, December 12, 2023 8:19:09 PM To: Liu, Shaoyun ; Huang, JinHuiEric ; amd-gfx@lists.freedesktop.org Cc: Wong, Alice ; Kuehling, Felix ; Kasiviswanathan, Harish Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management [Public] > -Original Message- > From: Liu, Shaoyun > Sent: Tuesday, December 12, 2023 7:08 PM > To: Kim, Jonathan ; Huang, JinHuiEric > ; amd-gfx@lists.freedesktop.org > Cc: Wong, Alice ; Kuehling, Felix > ; Kasiviswanathan, Harish > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process > management > > [Public] > > I see, so the problem is after process context , set_shader been called > with > disable parameter, do you know the reason why MES re-added the > process context into the list ? Because MES has no idea what disable means. All it knows is that without the flush flag, set_shader should update the necessary per-VMID (process) registers as requested by the driver, which requires persistent per-process HW settings so that potential future waves can inherit those settings i.e. ADD_QUEUE.skip_process_ctx_clear is set (why ADD_QUEUE auto clears the process context otherwise is another long story, basically an unsolvable MES cache bug problem). Common use case example: add_queue -> set_shader call either transiently stalls the SPI per-VMID or transiently dequeues the HWS per-VMID depending on the request settings -> fulfils the per-VMID register write updates -> resumes process queues so that potential waves on those queues inherit new debug settings. You can't do this kind of operation at the queue level alone. The problem that this patch solves (along with the MES FW upgrade) is an unfortunate quirk of having to operate between process (debug requests) and queue space (non-debug requests). Old HWS used to operate at the per-process level via MAP_PROCESS so it was a lot easier to balance debug versus non-debug requests back then (but it was also lot less efficient performance wise). Jon > > Shaoyun.liu > > -Original Message- > From: Kim, Jonathan > Sent: Tuesday, December 12, 2023 6:07 PM > To: Liu, Shaoyun ; Huang, JinHuiEric > ; amd-gfx@lists.freedesktop.org > Cc: Wong, Alice ; Kuehling, Felix > ; Kasiviswanathan, Harish > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process > management > > [Public] > > > -Original Message- > > From: Liu, Shaoyun > > Sent: Tuesday, December 12, 2023 5:44 PM > > To: Kim, Jonathan ; Huang, JinHuiEric > > ; amd-gfx@lists.freedesktop.org > > Cc: Wong, Alice ; Kuehling, Felix > > ; Kasiviswanathan, Harish > > > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process > > management > > > > [Public] > > > > Do you mean SET_SHADER_DEBUGER can be called before ADD_QUEUE ? > I > > think even in that situation MES should still be able to handle it > > as long as MES already remove the process context from its list , MES > > will treat the process context as a new item. I still don't understand why > MES haven't > > purged the process context from the list after process termination . Will > > debug queue itself also use the add/remove queue interface and is > > it possible the debug queue itself from the old process still not be > > removed ? > > SET_SHADER_DEBUGGER can be called independently from ADD_QUEUE. > The process list is updated on either on SET_SHADER_DEBUGGER or > ADD_QUEUE. > e.g. runtime_enable (set_shader) -> add_queue -> remove_queue (list > purged) -> runtime_disable (set_shader process re-added) -> process > termination (stale list) or debug attach (set_shader) -> add_queue -> > remove_queue (list purged) -> debug detach (set_shader process re-added) - > >process termination (stale list) > > MES has no idea what process termination means. The new flag is a proxy > for this. > There are reasons for process settings to take place prior to queue add > (debugger, gfx11 cwsr workaround, core dump etc need this). > > I'm not sure what kernel/debug
RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management
[Public] > -Original Message- > From: Liu, Shaoyun > Sent: Tuesday, December 12, 2023 7:08 PM > To: Kim, Jonathan ; Huang, JinHuiEric > ; amd-gfx@lists.freedesktop.org > Cc: Wong, Alice ; Kuehling, Felix > ; Kasiviswanathan, Harish > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process > management > > [Public] > > I see, so the problem is after process context , set_shader been called > with > disable parameter, do you know the reason why MES re-added the > process context into the list ? Because MES has no idea what disable means. All it knows is that without the flush flag, set_shader should update the necessary per-VMID (process) registers as requested by the driver, which requires persistent per-process HW settings so that potential future waves can inherit those settings i.e. ADD_QUEUE.skip_process_ctx_clear is set (why ADD_QUEUE auto clears the process context otherwise is another long story, basically an unsolvable MES cache bug problem). Common use case example: add_queue -> set_shader call either transiently stalls the SPI per-VMID or transiently dequeues the HWS per-VMID depending on the request settings -> fulfils the per-VMID register write updates -> resumes process queues so that potential waves on those queues inherit new debug settings. You can't do this kind of operation at the queue level alone. The problem that this patch solves (along with the MES FW upgrade) is an unfortunate quirk of having to operate between process (debug requests) and queue space (non-debug requests). Old HWS used to operate at the per-process level via MAP_PROCESS so it was a lot easier to balance debug versus non-debug requests back then (but it was also lot less efficient performance wise). Jon > > Shaoyun.liu > > -Original Message- > From: Kim, Jonathan > Sent: Tuesday, December 12, 2023 6:07 PM > To: Liu, Shaoyun ; Huang, JinHuiEric > ; amd-gfx@lists.freedesktop.org > Cc: Wong, Alice ; Kuehling, Felix > ; Kasiviswanathan, Harish > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process > management > > [Public] > > > -Original Message- > > From: Liu, Shaoyun > > Sent: Tuesday, December 12, 2023 5:44 PM > > To: Kim, Jonathan ; Huang, JinHuiEric > > ; amd-gfx@lists.freedesktop.org > > Cc: Wong, Alice ; Kuehling, Felix > > ; Kasiviswanathan, Harish > > > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process > > management > > > > [Public] > > > > Do you mean SET_SHADER_DEBUGER can be called before ADD_QUEUE ? > I > > think even in that situation MES should still be able to handle it > > as long as MES already remove the process context from its list , MES > > will treat the process context as a new item. I still don't understand why > MES haven't > > purged the process context from the list after process termination . Will > > debug queue itself also use the add/remove queue interface and is > > it possible the debug queue itself from the old process still not be > > removed ? > > SET_SHADER_DEBUGGER can be called independently from ADD_QUEUE. > The process list is updated on either on SET_SHADER_DEBUGGER or > ADD_QUEUE. > e.g. runtime_enable (set_shader) -> add_queue -> remove_queue (list > purged) -> runtime_disable (set_shader process re-added) -> process > termination (stale list) or debug attach (set_shader) -> add_queue -> > remove_queue (list purged) -> debug detach (set_shader process re-added) - > >process termination (stale list) > > MES has no idea what process termination means. The new flag is a proxy > for this. > There are reasons for process settings to take place prior to queue add > (debugger, gfx11 cwsr workaround, core dump etc need this). > > I'm not sure what kernel/debug queues have to do with this. > By that argument, the list should be purged. > > Jon > > > > > Shaoyun.liu > > > > -Original Message- > > From: Kim, Jonathan > > Sent: Tuesday, December 12, 2023 4:48 PM > > To: Liu, Shaoyun ; Huang, JinHuiEric > > ; amd-gfx@lists.freedesktop.org > > Cc: Wong, Alice ; Kuehling, Felix > > ; Kasiviswanathan, Harish > > > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process > > management > > > > [Public] > > > > > -Original Message- > > > From: Liu, Shaoyun > > > Sent: Tuesday, December 12, 2023 4:45 PM > > > To: Kim, Jonathan ; Huang, JinHuiEric > > > ; amd-gfx@lists.freedesktop.org > > > Cc: Wong, Alice ; Kuehling, Felix > > > ; Kasiviswanathan, Harish > &g
RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management
[Public] I see, so the problem is after process context , set_shader been called with disable parameter, do you know the reason why MES re-added the process context into the list ? Shaoyun.liu -Original Message- From: Kim, Jonathan Sent: Tuesday, December 12, 2023 6:07 PM To: Liu, Shaoyun ; Huang, JinHuiEric ; amd-gfx@lists.freedesktop.org Cc: Wong, Alice ; Kuehling, Felix ; Kasiviswanathan, Harish Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management [Public] > -Original Message- > From: Liu, Shaoyun > Sent: Tuesday, December 12, 2023 5:44 PM > To: Kim, Jonathan ; Huang, JinHuiEric > ; amd-gfx@lists.freedesktop.org > Cc: Wong, Alice ; Kuehling, Felix > ; Kasiviswanathan, Harish > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process > management > > [Public] > > Do you mean SET_SHADER_DEBUGER can be called before ADD_QUEUE ? I > think even in that situation MES should still be able to handle it > as long as MES already remove the process context from its list , MES > will treat the process context as a new item. I still don't understand why > MES haven't > purged the process context from the list after process termination . Will > debug queue itself also use the add/remove queue interface and is > it possible the debug queue itself from the old process still not be > removed ? SET_SHADER_DEBUGGER can be called independently from ADD_QUEUE. The process list is updated on either on SET_SHADER_DEBUGGER or ADD_QUEUE. e.g. runtime_enable (set_shader) -> add_queue -> remove_queue (list purged) -> runtime_disable (set_shader process re-added) -> process termination (stale list) or debug attach (set_shader) -> add_queue -> remove_queue (list purged) -> debug detach (set_shader process re-added) ->process termination (stale list) MES has no idea what process termination means. The new flag is a proxy for this. There are reasons for process settings to take place prior to queue add (debugger, gfx11 cwsr workaround, core dump etc need this). I'm not sure what kernel/debug queues have to do with this. By that argument, the list should be purged. Jon > > Shaoyun.liu > > -Original Message- > From: Kim, Jonathan > Sent: Tuesday, December 12, 2023 4:48 PM > To: Liu, Shaoyun ; Huang, JinHuiEric > ; amd-gfx@lists.freedesktop.org > Cc: Wong, Alice ; Kuehling, Felix > ; Kasiviswanathan, Harish > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process > management > > [Public] > > > -Original Message- > > From: Liu, Shaoyun > > Sent: Tuesday, December 12, 2023 4:45 PM > > To: Kim, Jonathan ; Huang, JinHuiEric > > ; amd-gfx@lists.freedesktop.org > > Cc: Wong, Alice ; Kuehling, Felix > > ; Kasiviswanathan, Harish > > > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process > > management > > > > [Public] > > > > Shouldn't the driver side remove all the remaining queues for the > > process during process termination ? If all the queues been > > removed for the process , MES should purge the process context > > automatically , otherwise it's bug inside MES . > > That's only if there were queues added to begin with. > > Jon > > > > > Regard > > Sshaoyun.liu > > > > -----Original Message- > > From: Kim, Jonathan > > Sent: Tuesday, December 12, 2023 4:33 PM > > To: Liu, Shaoyun ; Huang, JinHuiEric > > ; amd-gfx@lists.freedesktop.org > > Cc: Wong, Alice ; Kuehling, Felix > > ; Kasiviswanathan, Harish > > > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process > > management > > > > [Public] > > > > > -Original Message- > > > From: Liu, Shaoyun > > > Sent: Tuesday, December 12, 2023 4:00 PM > > > To: Huang, JinHuiEric ; Kim, Jonathan > > > ; amd-gfx@lists.freedesktop.org > > > Cc: Wong, Alice ; Kuehling, Felix > > > ; Kasiviswanathan, Harish > > > > > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger > > > process management > > > > > > [AMD Official Use Only - General] > > > > > > Does this requires the new MES FW for this process_ctx_flush > > > requirement ? Can driver side add logic to guaranty when call > > > SET_SHADER_DEBUGGER, the process address is always valid ? > > > > Call to flush on old fw is a NOP so it's harmless in that case. > > Full solution will still require a new MES version as this is a > > workaround on corner cases and not a new feature i.e. we can't stop > > ROCm from runnin
RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management
[Public] > -Original Message- > From: Liu, Shaoyun > Sent: Tuesday, December 12, 2023 5:44 PM > To: Kim, Jonathan ; Huang, JinHuiEric > ; amd-gfx@lists.freedesktop.org > Cc: Wong, Alice ; Kuehling, Felix > ; Kasiviswanathan, Harish > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process > management > > [Public] > > Do you mean SET_SHADER_DEBUGER can be called before ADD_QUEUE ? I > think even in that situation MES should still be able to handle it as long > as > MES already remove the process context from its list , MES will treat the > process context as a new item. I still don't understand why MES haven't > purged the process context from the list after process termination . Will > debug queue itself also use the add/remove queue interface and is it > possible the debug queue itself from the old process still not be > removed ? SET_SHADER_DEBUGGER can be called independently from ADD_QUEUE. The process list is updated on either on SET_SHADER_DEBUGGER or ADD_QUEUE. e.g. runtime_enable (set_shader) -> add_queue -> remove_queue (list purged) -> runtime_disable (set_shader process re-added) -> process termination (stale list) or debug attach (set_shader) -> add_queue -> remove_queue (list purged) -> debug detach (set_shader process re-added) ->process termination (stale list) MES has no idea what process termination means. The new flag is a proxy for this. There are reasons for process settings to take place prior to queue add (debugger, gfx11 cwsr workaround, core dump etc need this). I'm not sure what kernel/debug queues have to do with this. By that argument, the list should be purged. Jon > > Shaoyun.liu > > -Original Message- > From: Kim, Jonathan > Sent: Tuesday, December 12, 2023 4:48 PM > To: Liu, Shaoyun ; Huang, JinHuiEric > ; amd-gfx@lists.freedesktop.org > Cc: Wong, Alice ; Kuehling, Felix > ; Kasiviswanathan, Harish > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process > management > > [Public] > > > -Original Message- > > From: Liu, Shaoyun > > Sent: Tuesday, December 12, 2023 4:45 PM > > To: Kim, Jonathan ; Huang, JinHuiEric > > ; amd-gfx@lists.freedesktop.org > > Cc: Wong, Alice ; Kuehling, Felix > > ; Kasiviswanathan, Harish > > > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process > > management > > > > [Public] > > > > Shouldn't the driver side remove all the remaining queues for the > > process during process termination ? If all the queues been removed > > for the process , MES should purge the process context automatically > > , otherwise it's bug inside MES . > > That's only if there were queues added to begin with. > > Jon > > > > > Regard > > Sshaoyun.liu > > > > -----Original Message- > > From: Kim, Jonathan > > Sent: Tuesday, December 12, 2023 4:33 PM > > To: Liu, Shaoyun ; Huang, JinHuiEric > > ; amd-gfx@lists.freedesktop.org > > Cc: Wong, Alice ; Kuehling, Felix > > ; Kasiviswanathan, Harish > > > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process > > management > > > > [Public] > > > > > -Original Message- > > > From: Liu, Shaoyun > > > Sent: Tuesday, December 12, 2023 4:00 PM > > > To: Huang, JinHuiEric ; Kim, Jonathan > > > ; amd-gfx@lists.freedesktop.org > > > Cc: Wong, Alice ; Kuehling, Felix > > > ; Kasiviswanathan, Harish > > > > > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process > > > management > > > > > > [AMD Official Use Only - General] > > > > > > Does this requires the new MES FW for this process_ctx_flush > > > requirement ? Can driver side add logic to guaranty when call > > > SET_SHADER_DEBUGGER, the process address is always valid ? > > > > Call to flush on old fw is a NOP so it's harmless in that case. > > Full solution will still require a new MES version as this is a > > workaround on corner cases and not a new feature i.e. we can't stop > > ROCm from running on old fw. > > The process address is always valid from the driver side. It's the > > MES side of things that gets stale as mentioned in the description > > (passed value to MES is reused with new BO but MES doesn't refresh). > > i.e. MES auto refreshes it's process list assuming process queues were > > all drained but driver can't guarantee that SET_SHADER_DEBUGGER > (which > > adds to MES's process list) will get called after queues ge
RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management
[Public] Do you mean SET_SHADER_DEBUGER can be called before ADD_QUEUE ? I think even in that situation MES should still be able to handle it as long as MES already remove the process context from its list , MES will treat the process context as a new item. I still don't understand why MES haven't purged the process context from the list after process termination . Will debug queue itself also use the add/remove queue interface and is it possible the debug queue itself from the old process still not be removed ? Shaoyun.liu -Original Message- From: Kim, Jonathan Sent: Tuesday, December 12, 2023 4:48 PM To: Liu, Shaoyun ; Huang, JinHuiEric ; amd-gfx@lists.freedesktop.org Cc: Wong, Alice ; Kuehling, Felix ; Kasiviswanathan, Harish Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management [Public] > -Original Message- > From: Liu, Shaoyun > Sent: Tuesday, December 12, 2023 4:45 PM > To: Kim, Jonathan ; Huang, JinHuiEric > ; amd-gfx@lists.freedesktop.org > Cc: Wong, Alice ; Kuehling, Felix > ; Kasiviswanathan, Harish > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process > management > > [Public] > > Shouldn't the driver side remove all the remaining queues for the > process during process termination ? If all the queues been removed > for the process , MES should purge the process context automatically > , otherwise it's bug inside MES . That's only if there were queues added to begin with. Jon > > Regard > Sshaoyun.liu > > -Original Message- > From: Kim, Jonathan > Sent: Tuesday, December 12, 2023 4:33 PM > To: Liu, Shaoyun ; Huang, JinHuiEric > ; amd-gfx@lists.freedesktop.org > Cc: Wong, Alice ; Kuehling, Felix > ; Kasiviswanathan, Harish > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process > management > > [Public] > > > -Original Message- > > From: Liu, Shaoyun > > Sent: Tuesday, December 12, 2023 4:00 PM > > To: Huang, JinHuiEric ; Kim, Jonathan > > ; amd-gfx@lists.freedesktop.org > > Cc: Wong, Alice ; Kuehling, Felix > > ; Kasiviswanathan, Harish > > > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process > > management > > > > [AMD Official Use Only - General] > > > > Does this requires the new MES FW for this process_ctx_flush > > requirement ? Can driver side add logic to guaranty when call > > SET_SHADER_DEBUGGER, the process address is always valid ? > > Call to flush on old fw is a NOP so it's harmless in that case. > Full solution will still require a new MES version as this is a > workaround on corner cases and not a new feature i.e. we can't stop > ROCm from running on old fw. > The process address is always valid from the driver side. It's the > MES side of things that gets stale as mentioned in the description > (passed value to MES is reused with new BO but MES doesn't refresh). > i.e. MES auto refreshes it's process list assuming process queues were > all drained but driver can't guarantee that SET_SHADER_DEBUGGER (which > adds to MES's process list) will get called after queues get added (in > fact it's a requirements that it can be called at any time). > We can attempt to defer calls these calls in the KFD, considering all cases. > But that would be a large shift in debugger/runtime_enable/KFD code, > which is already complicated and could get buggy plus it would not be > intuitive at all as to why we're doing this. > I think a single flag set to flush MES on process termination is a > simpler compromise that shows the limitation in a more obvious way. > > Thanks, > > Jon > > > > > > Regards > > Shaoyun.liu > > > > > > -----Original Message----- > > From: amd-gfx On Behalf Of > > Eric Huang > > Sent: Tuesday, December 12, 2023 12:49 PM > > To: Kim, Jonathan ; amd- > > g...@lists.freedesktop.org > > Cc: Wong, Alice ; Kuehling, Felix > > ; Kasiviswanathan, Harish > > > > Subject: Re: [PATCH] drm/amdkfd: fix mes set shader debugger process > > management > > > > > > On 2023-12-11 16:16, Jonathan Kim wrote: > > > MES provides the driver a call to explicitly flush stale process > > > memory within the MES to avoid a race condition that results in a > > > fatal memory violation. > > > > > > When SET_SHADER_DEBUGGER is called, the driver passes a memory > > address > > > that represents a process context address MES uses to keep track > > > of future per-process calls. > > > > > > Normally, MES will purge its process context list when the last > > > queue has be
RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management
[Public] > -Original Message- > From: Liu, Shaoyun > Sent: Tuesday, December 12, 2023 4:45 PM > To: Kim, Jonathan ; Huang, JinHuiEric > ; amd-gfx@lists.freedesktop.org > Cc: Wong, Alice ; Kuehling, Felix > ; Kasiviswanathan, Harish > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process > management > > [Public] > > Shouldn't the driver side remove all the remaining queues for the process > during process termination ? If all the queues been removed for the > process , MES should purge the process context automatically , otherwise > it's bug inside MES . That's only if there were queues added to begin with. Jon > > Regard > Sshaoyun.liu > > -Original Message- > From: Kim, Jonathan > Sent: Tuesday, December 12, 2023 4:33 PM > To: Liu, Shaoyun ; Huang, JinHuiEric > ; amd-gfx@lists.freedesktop.org > Cc: Wong, Alice ; Kuehling, Felix > ; Kasiviswanathan, Harish > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process > management > > [Public] > > > -Original Message- > > From: Liu, Shaoyun > > Sent: Tuesday, December 12, 2023 4:00 PM > > To: Huang, JinHuiEric ; Kim, Jonathan > > ; amd-gfx@lists.freedesktop.org > > Cc: Wong, Alice ; Kuehling, Felix > > ; Kasiviswanathan, Harish > > > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process > > management > > > > [AMD Official Use Only - General] > > > > Does this requires the new MES FW for this process_ctx_flush > > requirement ? Can driver side add logic to guaranty when call > > SET_SHADER_DEBUGGER, the process address is always valid ? > > Call to flush on old fw is a NOP so it's harmless in that case. > Full solution will still require a new MES version as this is a workaround on > corner cases and not a new feature i.e. we can't stop ROCm from running on > old fw. > The process address is always valid from the driver side. It's the MES side > of > things that gets stale as mentioned in the description (passed value to MES > is reused with new BO but MES doesn't refresh). > i.e. MES auto refreshes it's process list assuming process queues were all > drained but driver can't guarantee that SET_SHADER_DEBUGGER (which > adds to MES's process list) will get called after queues get added (in fact > it's > a requirements that it can be called at any time). > We can attempt to defer calls these calls in the KFD, considering all cases. > But that would be a large shift in debugger/runtime_enable/KFD code, > which is already complicated and could get buggy plus it would not be > intuitive at all as to why we're doing this. > I think a single flag set to flush MES on process termination is a simpler > compromise that shows the limitation in a more obvious way. > > Thanks, > > Jon > > > > > > Regards > > Shaoyun.liu > > > > > > -Original Message- > > From: amd-gfx On Behalf Of > > Eric Huang > > Sent: Tuesday, December 12, 2023 12:49 PM > > To: Kim, Jonathan ; amd- > > g...@lists.freedesktop.org > > Cc: Wong, Alice ; Kuehling, Felix > > ; Kasiviswanathan, Harish > > > > Subject: Re: [PATCH] drm/amdkfd: fix mes set shader debugger process > > management > > > > > > On 2023-12-11 16:16, Jonathan Kim wrote: > > > MES provides the driver a call to explicitly flush stale process > > > memory within the MES to avoid a race condition that results in a > > > fatal memory violation. > > > > > > When SET_SHADER_DEBUGGER is called, the driver passes a memory > > address > > > that represents a process context address MES uses to keep track of > > > future per-process calls. > > > > > > Normally, MES will purge its process context list when the last > > > queue has been removed. The driver, however, can call > > > SET_SHADER_DEBUGGER regardless of whether a queue has been added > or not. > > > > > > If SET_SHADER_DEBUGGER has been called with no queues as the last > > > call prior to process termination, the passed process context > > > address will still reside within MES. > > > > > > On a new process call to SET_SHADER_DEBUGGER, the driver may end > up > > > passing an identical process context address value (based on > > > per-process gpu memory address) to MES but is now pointing to a new > > > allocated buffer object during KFD process creation. Since the MES > > > is unaware of this, access of the passed address points to the stale > > > object within MES and triggers a
RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management
[Public] Shouldn't the driver side remove all the remaining queues for the process during process termination ? If all the queues been removed for the process , MES should purge the process context automatically , otherwise it's bug inside MES . Regard Sshaoyun.liu -Original Message- From: Kim, Jonathan Sent: Tuesday, December 12, 2023 4:33 PM To: Liu, Shaoyun ; Huang, JinHuiEric ; amd-gfx@lists.freedesktop.org Cc: Wong, Alice ; Kuehling, Felix ; Kasiviswanathan, Harish Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management [Public] > -Original Message- > From: Liu, Shaoyun > Sent: Tuesday, December 12, 2023 4:00 PM > To: Huang, JinHuiEric ; Kim, Jonathan > ; amd-gfx@lists.freedesktop.org > Cc: Wong, Alice ; Kuehling, Felix > ; Kasiviswanathan, Harish > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process > management > > [AMD Official Use Only - General] > > Does this requires the new MES FW for this process_ctx_flush > requirement ? Can driver side add logic to guaranty when call > SET_SHADER_DEBUGGER, the process address is always valid ? Call to flush on old fw is a NOP so it's harmless in that case. Full solution will still require a new MES version as this is a workaround on corner cases and not a new feature i.e. we can't stop ROCm from running on old fw. The process address is always valid from the driver side. It's the MES side of things that gets stale as mentioned in the description (passed value to MES is reused with new BO but MES doesn't refresh). i.e. MES auto refreshes it's process list assuming process queues were all drained but driver can't guarantee that SET_SHADER_DEBUGGER (which adds to MES's process list) will get called after queues get added (in fact it's a requirements that it can be called at any time). We can attempt to defer calls these calls in the KFD, considering all cases. But that would be a large shift in debugger/runtime_enable/KFD code, which is already complicated and could get buggy plus it would not be intuitive at all as to why we're doing this. I think a single flag set to flush MES on process termination is a simpler compromise that shows the limitation in a more obvious way. Thanks, Jon > > Regards > Shaoyun.liu > > > -Original Message- > From: amd-gfx On Behalf Of > Eric Huang > Sent: Tuesday, December 12, 2023 12:49 PM > To: Kim, Jonathan ; amd- > g...@lists.freedesktop.org > Cc: Wong, Alice ; Kuehling, Felix > ; Kasiviswanathan, Harish > > Subject: Re: [PATCH] drm/amdkfd: fix mes set shader debugger process > management > > > On 2023-12-11 16:16, Jonathan Kim wrote: > > MES provides the driver a call to explicitly flush stale process > > memory within the MES to avoid a race condition that results in a > > fatal memory violation. > > > > When SET_SHADER_DEBUGGER is called, the driver passes a memory > address > > that represents a process context address MES uses to keep track of > > future per-process calls. > > > > Normally, MES will purge its process context list when the last > > queue has been removed. The driver, however, can call > > SET_SHADER_DEBUGGER regardless of whether a queue has been added or not. > > > > If SET_SHADER_DEBUGGER has been called with no queues as the last > > call prior to process termination, the passed process context > > address will still reside within MES. > > > > On a new process call to SET_SHADER_DEBUGGER, the driver may end up > > passing an identical process context address value (based on > > per-process gpu memory address) to MES but is now pointing to a new > > allocated buffer object during KFD process creation. Since the MES > > is unaware of this, access of the passed address points to the stale > > object within MES and triggers a fatal memory violation. > > > > The solution is for KFD to explicitly flush the process context > > address from MES on process termination. > > > > Note that the flush call and the MES debugger calls use the same MES > > interface but are separated as KFD calls to avoid conflicting with > > each other. > > > > Signed-off-by: Jonathan Kim > > Tested-by: Alice Wong > Reviewed-by: Eric Huang > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 31 > +++ > > drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h | 10 +++--- > > .../amd/amdkfd/kfd_process_queue_manager.c| 1 + > > drivers/gpu/drm/amd/include/mes_v11_api_def.h | 3 +- > > 4 files changed, 40 insertions(+), 5 deletions(-) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > > b/drivers/gpu/drm/amd/amdgpu/amd
RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management
[Public] > -Original Message- > From: Liu, Shaoyun > Sent: Tuesday, December 12, 2023 4:00 PM > To: Huang, JinHuiEric ; Kim, Jonathan > ; amd-gfx@lists.freedesktop.org > Cc: Wong, Alice ; Kuehling, Felix > ; Kasiviswanathan, Harish > > Subject: RE: [PATCH] drm/amdkfd: fix mes set shader debugger process > management > > [AMD Official Use Only - General] > > Does this requires the new MES FW for this process_ctx_flush > requirement ? Can driver side add logic to guaranty when call > SET_SHADER_DEBUGGER, the process address is always valid ? Call to flush on old fw is a NOP so it's harmless in that case. Full solution will still require a new MES version as this is a workaround on corner cases and not a new feature i.e. we can't stop ROCm from running on old fw. The process address is always valid from the driver side. It's the MES side of things that gets stale as mentioned in the description (passed value to MES is reused with new BO but MES doesn't refresh). i.e. MES auto refreshes it's process list assuming process queues were all drained but driver can't guarantee that SET_SHADER_DEBUGGER (which adds to MES's process list) will get called after queues get added (in fact it's a requirements that it can be called at any time). We can attempt to defer calls these calls in the KFD, considering all cases. But that would be a large shift in debugger/runtime_enable/KFD code, which is already complicated and could get buggy plus it would not be intuitive at all as to why we're doing this. I think a single flag set to flush MES on process termination is a simpler compromise that shows the limitation in a more obvious way. Thanks, Jon > > Regards > Shaoyun.liu > > > -Original Message- > From: amd-gfx On Behalf Of Eric > Huang > Sent: Tuesday, December 12, 2023 12:49 PM > To: Kim, Jonathan ; amd- > g...@lists.freedesktop.org > Cc: Wong, Alice ; Kuehling, Felix > ; Kasiviswanathan, Harish > > Subject: Re: [PATCH] drm/amdkfd: fix mes set shader debugger process > management > > > On 2023-12-11 16:16, Jonathan Kim wrote: > > MES provides the driver a call to explicitly flush stale process > > memory within the MES to avoid a race condition that results in a > > fatal memory violation. > > > > When SET_SHADER_DEBUGGER is called, the driver passes a memory > address > > that represents a process context address MES uses to keep track of > > future per-process calls. > > > > Normally, MES will purge its process context list when the last queue > > has been removed. The driver, however, can call SET_SHADER_DEBUGGER > > regardless of whether a queue has been added or not. > > > > If SET_SHADER_DEBUGGER has been called with no queues as the last call > > prior to process termination, the passed process context address will > > still reside within MES. > > > > On a new process call to SET_SHADER_DEBUGGER, the driver may end up > > passing an identical process context address value (based on > > per-process gpu memory address) to MES but is now pointing to a new > > allocated buffer object during KFD process creation. Since the MES is > > unaware of this, access of the passed address points to the stale > > object within MES and triggers a fatal memory violation. > > > > The solution is for KFD to explicitly flush the process context > > address from MES on process termination. > > > > Note that the flush call and the MES debugger calls use the same MES > > interface but are separated as KFD calls to avoid conflicting with > > each other. > > > > Signed-off-by: Jonathan Kim > > Tested-by: Alice Wong > Reviewed-by: Eric Huang > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 31 > +++ > > drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h | 10 +++--- > > .../amd/amdkfd/kfd_process_queue_manager.c| 1 + > > drivers/gpu/drm/amd/include/mes_v11_api_def.h | 3 +- > > 4 files changed, 40 insertions(+), 5 deletions(-) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > > index e544b823abf6..e98de23250dc 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > > @@ -916,6 +916,11 @@ int amdgpu_mes_set_shader_debugger(struct > amdgpu_device *adev, > > op_input.op = MES_MISC_OP_SET_SHADER_DEBUGGER; > > op_input.set_shader_debugger.process_context_addr = > process_context_addr; > > op_input.set_shader_debugger.flags.u32all = flags; > > + > > + /* use amdgpu mes_flush_shader_debugger instead */ > > + if (
RE: [PATCH] drm/amdkfd: fix mes set shader debugger process management
[AMD Official Use Only - General] Does this requires the new MES FW for this process_ctx_flush requirement ? Can driver side add logic to guaranty when call SET_SHADER_DEBUGGER, the process address is always valid ? Regards Shaoyun.liu -Original Message- From: amd-gfx On Behalf Of Eric Huang Sent: Tuesday, December 12, 2023 12:49 PM To: Kim, Jonathan ; amd-gfx@lists.freedesktop.org Cc: Wong, Alice ; Kuehling, Felix ; Kasiviswanathan, Harish Subject: Re: [PATCH] drm/amdkfd: fix mes set shader debugger process management On 2023-12-11 16:16, Jonathan Kim wrote: > MES provides the driver a call to explicitly flush stale process > memory within the MES to avoid a race condition that results in a > fatal memory violation. > > When SET_SHADER_DEBUGGER is called, the driver passes a memory address > that represents a process context address MES uses to keep track of > future per-process calls. > > Normally, MES will purge its process context list when the last queue > has been removed. The driver, however, can call SET_SHADER_DEBUGGER > regardless of whether a queue has been added or not. > > If SET_SHADER_DEBUGGER has been called with no queues as the last call > prior to process termination, the passed process context address will > still reside within MES. > > On a new process call to SET_SHADER_DEBUGGER, the driver may end up > passing an identical process context address value (based on > per-process gpu memory address) to MES but is now pointing to a new > allocated buffer object during KFD process creation. Since the MES is > unaware of this, access of the passed address points to the stale > object within MES and triggers a fatal memory violation. > > The solution is for KFD to explicitly flush the process context > address from MES on process termination. > > Note that the flush call and the MES debugger calls use the same MES > interface but are separated as KFD calls to avoid conflicting with > each other. > > Signed-off-by: Jonathan Kim > Tested-by: Alice Wong Reviewed-by: Eric Huang > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 31 +++ > drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h | 10 +++--- > .../amd/amdkfd/kfd_process_queue_manager.c| 1 + > drivers/gpu/drm/amd/include/mes_v11_api_def.h | 3 +- > 4 files changed, 40 insertions(+), 5 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > index e544b823abf6..e98de23250dc 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c > @@ -916,6 +916,11 @@ int amdgpu_mes_set_shader_debugger(struct amdgpu_device > *adev, > op_input.op = MES_MISC_OP_SET_SHADER_DEBUGGER; > op_input.set_shader_debugger.process_context_addr = > process_context_addr; > op_input.set_shader_debugger.flags.u32all = flags; > + > + /* use amdgpu mes_flush_shader_debugger instead */ > + if (op_input.set_shader_debugger.flags.process_ctx_flush) > + return -EINVAL; > + > op_input.set_shader_debugger.spi_gdbg_per_vmid_cntl = > spi_gdbg_per_vmid_cntl; > memcpy(op_input.set_shader_debugger.tcp_watch_cntl, tcp_watch_cntl, > sizeof(op_input.set_shader_debugger.tcp_watch_cntl)); > @@ -935,6 +940,32 @@ int amdgpu_mes_set_shader_debugger(struct amdgpu_device > *adev, > return r; > } > > +int amdgpu_mes_flush_shader_debugger(struct amdgpu_device *adev, > + uint64_t process_context_addr) { > + struct mes_misc_op_input op_input = {0}; > + int r; > + > + if (!adev->mes.funcs->misc_op) { > + DRM_ERROR("mes flush shader debugger is not supported!\n"); > + return -EINVAL; > + } > + > + op_input.op = MES_MISC_OP_SET_SHADER_DEBUGGER; > + op_input.set_shader_debugger.process_context_addr = > process_context_addr; > + op_input.set_shader_debugger.flags.process_ctx_flush = true; > + > + amdgpu_mes_lock(>mes); > + > + r = adev->mes.funcs->misc_op(>mes, _input); > + if (r) > + DRM_ERROR("failed to set_shader_debugger\n"); > + > + amdgpu_mes_unlock(>mes); > + > + return r; > +} > + > static void > amdgpu_mes_ring_to_queue_props(struct amdgpu_device *adev, > struct amdgpu_ring *ring, > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h > b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h > index 894b9b133000..7d4f93fea937 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h > @@ -296,9 +296,10 @@ struct mes_mi
Re: [PATCH] drm/amdkfd: fix mes set shader debugger process management
On 2023-12-11 16:16, Jonathan Kim wrote: MES provides the driver a call to explicitly flush stale process memory within the MES to avoid a race condition that results in a fatal memory violation. When SET_SHADER_DEBUGGER is called, the driver passes a memory address that represents a process context address MES uses to keep track of future per-process calls. Normally, MES will purge its process context list when the last queue has been removed. The driver, however, can call SET_SHADER_DEBUGGER regardless of whether a queue has been added or not. If SET_SHADER_DEBUGGER has been called with no queues as the last call prior to process termination, the passed process context address will still reside within MES. On a new process call to SET_SHADER_DEBUGGER, the driver may end up passing an identical process context address value (based on per-process gpu memory address) to MES but is now pointing to a new allocated buffer object during KFD process creation. Since the MES is unaware of this, access of the passed address points to the stale object within MES and triggers a fatal memory violation. The solution is for KFD to explicitly flush the process context address from MES on process termination. Note that the flush call and the MES debugger calls use the same MES interface but are separated as KFD calls to avoid conflicting with each other. Signed-off-by: Jonathan Kim Tested-by: Alice Wong Reviewed-by: Eric Huang --- drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 31 +++ drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h | 10 +++--- .../amd/amdkfd/kfd_process_queue_manager.c| 1 + drivers/gpu/drm/amd/include/mes_v11_api_def.h | 3 +- 4 files changed, 40 insertions(+), 5 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c index e544b823abf6..e98de23250dc 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c @@ -916,6 +916,11 @@ int amdgpu_mes_set_shader_debugger(struct amdgpu_device *adev, op_input.op = MES_MISC_OP_SET_SHADER_DEBUGGER; op_input.set_shader_debugger.process_context_addr = process_context_addr; op_input.set_shader_debugger.flags.u32all = flags; + + /* use amdgpu mes_flush_shader_debugger instead */ + if (op_input.set_shader_debugger.flags.process_ctx_flush) + return -EINVAL; + op_input.set_shader_debugger.spi_gdbg_per_vmid_cntl = spi_gdbg_per_vmid_cntl; memcpy(op_input.set_shader_debugger.tcp_watch_cntl, tcp_watch_cntl, sizeof(op_input.set_shader_debugger.tcp_watch_cntl)); @@ -935,6 +940,32 @@ int amdgpu_mes_set_shader_debugger(struct amdgpu_device *adev, return r; } +int amdgpu_mes_flush_shader_debugger(struct amdgpu_device *adev, +uint64_t process_context_addr) +{ + struct mes_misc_op_input op_input = {0}; + int r; + + if (!adev->mes.funcs->misc_op) { + DRM_ERROR("mes flush shader debugger is not supported!\n"); + return -EINVAL; + } + + op_input.op = MES_MISC_OP_SET_SHADER_DEBUGGER; + op_input.set_shader_debugger.process_context_addr = process_context_addr; + op_input.set_shader_debugger.flags.process_ctx_flush = true; + + amdgpu_mes_lock(>mes); + + r = adev->mes.funcs->misc_op(>mes, _input); + if (r) + DRM_ERROR("failed to set_shader_debugger\n"); + + amdgpu_mes_unlock(>mes); + + return r; +} + static void amdgpu_mes_ring_to_queue_props(struct amdgpu_device *adev, struct amdgpu_ring *ring, diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h index 894b9b133000..7d4f93fea937 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h @@ -296,9 +296,10 @@ struct mes_misc_op_input { uint64_t process_context_addr; union { struct { - uint64_t single_memop : 1; - uint64_t single_alu_op : 1; - uint64_t reserved: 30; + uint32_t single_memop : 1; + uint32_t single_alu_op : 1; + uint32_t reserved: 29; + uint32_t process_ctx_flush: 1; }; uint32_t u32all; } flags; @@ -374,7 +375,8 @@ int amdgpu_mes_set_shader_debugger(struct amdgpu_device *adev, const uint32_t *tcp_watch_cntl, uint32_t flags, bool trap_en); - +int amdgpu_mes_flush_shader_debugger(struct amdgpu_device