Re: [PATCH] drm/amdkfd: fix shift out of bounds about gpu debug
On 3/1/2024 00:35, Kim, Jonathan wrote:
> The range check should probably flag any exception prefixed as
> EC_QUEUE_PACKET_* as valid defined in kfd_dbg_trap_exception_code:
> https://github.com/torvalds/linux/blob/master/include/uapi/linux/kfd_ioctl.h#L857
> + Jay to confirm this is the correct exception range for CP_BAD_OPCODE

Yes, that covers the full range of possible values.
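For readers following the thread, the range check discussed above could look roughly like the sketch below. It is only an illustration: the EX_-prefixed names are hypothetical, and the two bounds are placeholders rather than values copied from kfd_ioctl.h. A real version would use the first and last EC_QUEUE_PACKET_* entries of kfd_dbg_trap_exception_code.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Placeholder bounds, NOT taken from kfd_ioctl.h. They stand in for the
 * first and last EC_QUEUE_PACKET_* codes in kfd_dbg_trap_exception_code.
 */
enum {
    EX_EC_QUEUE_PACKET_FIRST = 16,
    EX_EC_QUEUE_PACKET_LAST = 25,
};

/* True only when ecode lies inside the (assumed) queue packet range. */
static inline bool ex_ec_is_queue_packet(uint32_t ecode)
{
    return ecode >= EX_EC_QUEUE_PACKET_FIRST &&
           ecode <= EX_EC_QUEUE_PACKET_LAST;
}

/*
 * Convert ecode to a mask only after the range check has passed.
 * Returns false and leaves *mask untouched for a null or out-of-range
 * code, so the caller can drop the interrupt instead of raising an event.
 */
static inline bool ex_ec_mask_checked(uint32_t ecode, uint64_t *mask)
{
    if (!ex_ec_is_queue_packet(ecode))
        return false;

    *mask = 1ULL << (ecode - 1); /* shift is 15..24 with the bounds above */
    return true;
}
```

Checking before the mask conversion means a null code never reaches the shift, and nothing is reported to the debugger or runtime for it.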
RE: [PATCH] drm/amdkfd: fix shift out of bounds about gpu debug
[Public]

> -----Original Message-----
> From: Zhang, Jesse(Jie)
> Sent: Friday, March 1, 2024 12:50 AM
> To: Kim, Jonathan; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander; Kuehling, Felix; Zhang, Yifan
> Subject: RE: [PATCH] drm/amdkfd: fix shift out of bounds about gpu debug
>
> [Public]
>
> Hi Jon,
>
> -----Original Message-----
> From: Kim, Jonathan
> Sent: Thursday, February 29, 2024 11:58 PM
> To: Zhang, Jesse(Jie); amd-g...@lists.freedesktop.org
> Cc: Deucher, Alexander; Kuehling, Felix; Zhang, Yifan; Zhang, Jesse(Jie);
> Zhang, Jesse(Jie)
> Subject: RE: [PATCH] drm/amdkfd: fix shift out of bounds about gpu debug
>
> [Public]
>
> I think this was discussed in another thread.
> Exception codes should be range checked prior to applying the mask. Raising
> null events to the debugger or runtime isn't useful.
> I haven't gotten around to fixing this yet. I should have time this week.
> Just to double check, the out of bounds shift is because of a CP interrupt
> that generates a null exception code?
>
> [Zhang, Jesse(Jie)] Thanks for your reminder, I saw that discussion.
> In this interrupt, the other fields (such as source id, client id, pasid) are
> correct. Only the value of context_id0 (0xf) is invalid.
> How about doing the check like this:
>     } else if (source_id == SOC15_INTSRC_CP_BAD_OPCODE) {
> +       /* filter out the invalidate context_id0 */
> +       if (!(context_id0 >> KFD_DEBUG_CP_BAD_OP_ECODE_SHIFT) ||
> +           (context_id0 >> KFD_DEBUG_CP_BAD_OP_ECODE_SHIFT) > EC_MAX)
> +               return;

The range check should probably treat any exception code prefixed with
EC_QUEUE_PACKET_*, as defined in kfd_dbg_trap_exception_code, as valid:
https://github.com/torvalds/linux/blob/master/include/uapi/linux/kfd_ioctl.h#L857

+ Jay to confirm this is the correct exception range for CP_BAD_OPCODE

If that's the case, then I think we can define a
KFD_DBG_EC_TYPE_IS_QUEUE_PACKET macro similar to:
https://github.com/torvalds/linux/blob/master/include/uapi/linux/kfd_ioctl.h#L917

That way, KFD process interrupts v9, v10 and v11 can use that check prior to
mask conversion, and user space may find it useful as well.

Jon

>         kfd_set_dbg_ev_from_interrupt(dev, pasid,
>                 KFD_DEBUG_DOORBELL_ID(context_id0),
>                 KFD_EC_MASK(KFD_DEBUG_CP_BAD_OP_ECODE(context_id0)),
> Thanks
> Jesse
RE: [PATCH] drm/amdkfd: fix shift out of bounds about gpu debug
[Public]

Hi Jon,

-----Original Message-----
From: Kim, Jonathan
Sent: Thursday, February 29, 2024 11:58 PM
To: Zhang, Jesse(Jie); amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander; Kuehling, Felix; Zhang, Yifan; Zhang, Jesse(Jie); Zhang, Jesse(Jie)
Subject: RE: [PATCH] drm/amdkfd: fix shift out of bounds about gpu debug

[Public]

I think this was discussed in another thread.
Exception codes should be range checked prior to applying the mask. Raising null events to the debugger or runtime isn't useful.
I haven't gotten around to fixing this yet. I should have time this week.
Just to double check, the out of bounds shift is because of a CP interrupt that generates a null exception code?

[Zhang, Jesse(Jie)] Thanks for your reminder, I saw that discussion.
In this interrupt, the other fields (such as source id, client id, pasid) are correct.
Only the value of context_id0 (0xf) is invalid.
How about doing the check like this:
    } else if (source_id == SOC15_INTSRC_CP_BAD_OPCODE) {
+       /* filter out the invalidate context_id0 */
+       if (!(context_id0 >> KFD_DEBUG_CP_BAD_OP_ECODE_SHIFT) ||
+           (context_id0 >> KFD_DEBUG_CP_BAD_OP_ECODE_SHIFT) > EC_MAX)
+               return;
        kfd_set_dbg_ev_from_interrupt(dev, pasid,
                KFD_DEBUG_DOORBELL_ID(context_id0),
                KFD_EC_MASK(KFD_DEBUG_CP_BAD_OP_ECODE(context_id0)),

Thanks
Jesse

Jon
RE: [PATCH] drm/amdkfd: fix shift out of bounds about gpu debug
[Public]

I think this was discussed in another thread.
Exception codes should be range checked prior to applying the mask. Raising null events to the debugger or runtime isn't useful.
I haven't gotten around to fixing this yet. I should have time this week.
Just to double check, the out of bounds shift is because of a CP interrupt that generates a null exception code?

Jon
[PATCH] drm/amdkfd: fix shift out of bounds about gpu debug
The issue is:
[ 388.151802] UBSAN: shift-out-of-bounds in drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_int_process_v10.c:346:5
[ 388.151807] shift exponent 4294967295 is too large for 64-bit type 'long long unsigned int'
[ 388.151812] CPU: 6 PID: 347 Comm: kworker/6:1H Tainted: GE 6.7.0+ #1
[ 388.151814] Hardware name: AMD Splinter/Splinter-GNR, BIOS WS54117N_140 01/16/2024
[ 388.151816] Workqueue: KFD IH interrupt_wq [amdgpu]
[ 388.152084] Call Trace:
[ 388.152086] <TASK>
[ 388.152089] dump_stack_lvl+0x4c/0x70
[ 388.152096] dump_stack+0x14/0x20
[ 388.152098] ubsan_epilogue+0x9/0x40
[ 388.152101] __ubsan_handle_shift_out_of_bounds+0x113/0x170
[ 388.152103] ? vprintk+0x40/0x70
[ 388.152106] ? swsusp_check+0x131/0x190
[ 388.152110] event_interrupt_wq_v10.cold+0x16/0x1e [amdgpu]
[ 388.152411] ? raw_spin_rq_unlock+0x14/0x40
[ 388.152415] ? finish_task_switch+0x85/0x2a0
[ 388.152417] ? kfifo_copy_out+0x5f/0x70
[ 388.152420] interrupt_wq+0xb2/0x120 [amdgpu]
[ 388.152642] ? interrupt_wq+0xb2/0x120 [amdgpu]
[ 388.152728] process_scheduled_works+0x9a/0x3a0
[ 388.152731] ? __pfx_worker_thread+0x10/0x10
[ 388.152732] worker_thread+0x15f/0x2d0
[ 388.152733] ? __pfx_worker_thread+0x10/0x10
[ 388.152734] kthread+0xfb/0x130
[ 388.152735] ? __pfx_kthread+0x10/0x10
[ 388.152736] ret_from_fork+0x3d/0x60
[ 388.152738] ? __pfx_kthread+0x10/0x10
[ 388.152739] ret_from_fork_asm+0x1b/0x30
[ 388.152742] </TASK>

Signed-off-by: Jesse Zhang
---
 include/uapi/linux/kfd_ioctl.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h
index 9ce46edc62a5..3d5867df17e8 100644
--- a/include/uapi/linux/kfd_ioctl.h
+++ b/include/uapi/linux/kfd_ioctl.h
@@ -887,7 +887,7 @@ enum kfd_dbg_trap_exception_code {
 };

 /* Mask generated by ecode in kfd_dbg_trap_exception_code */
-#define KFD_EC_MASK(ecode) (1ULL << (ecode - 1))
+#define KFD_EC_MASK(ecode) (ecode ? (1ULL << (ecode - 1)) : 0ULL)

 /* Masks for exception code type checks below */
 #define KFD_EC_MASK_QUEUE (KFD_EC_MASK(EC_QUEUE_WAVE_ABORT) | \
--
2.25.1
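As a reading aid, the arithmetic behind the report can be reproduced in user space. The sketch below assumes the exception code reaches the macro as a 32-bit unsigned value, which is consistent with the reported shift exponent of 4294967295 (0 - 1 wrapped around). The EX_ and ex_ names are hypothetical stand-ins, not the kernel's definitions, and the macro here additionally parenthesizes its argument.

```c
#include <stdint.h>

/* Hypothetical stand-in for the patched macro, with the argument parenthesized. */
#define EX_EC_MASK(ecode) ((ecode) ? (1ULL << ((ecode) - 1)) : 0ULL)

/*
 * Shift exponent the unpatched macro would compute for a 32-bit unsigned
 * exception code. For ecode == 0 the subtraction wraps to 4294967295.
 */
static inline uint32_t ex_unguarded_shift_exponent(uint32_t ecode)
{
    return ecode - 1u;
}

/*
 * Mask for a single exception code using the guarded macro. A null code
 * yields an empty mask instead of an out-of-bounds shift. Codes above 64
 * are still not handled here, so callers must keep ecode within 0..64.
 */
static inline uint64_t ex_ec_mask(uint32_t ecode)
{
    return EX_EC_MASK(ecode);
}
```

The guard only covers the null code. A value above 64 would still shift out of bounds, which is one reason the replies in this thread prefer a range check in the interrupt handlers before the mask conversion.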