Re: [PATCH] drm/amdkfd: fix shift out of bounds about gpu debug

2024-03-01 Thread Jay Cornwall
On 3/1/2024 00:35, Kim, Jonathan wrote:

> The range check should probably treat any exception prefixed with
> EC_QUEUE_PACKET_*, as defined in kfd_dbg_trap_exception_code, as valid:
> https://github.com/torvalds/linux/blob/master/include/uapi/linux/kfd_ioctl.h#L857
> + Jay to confirm this is the correct exception range for CP_BAD_OPCODE

Yes, that covers the full range of possible values.


RE: [PATCH] drm/amdkfd: fix shift out of bounds about gpu debug

2024-02-29 Thread Kim, Jonathan
[Public]

> -----Original Message-----
> From: Zhang, Jesse(Jie) 
> Sent: Friday, March 1, 2024 12:50 AM
> To: Kim, Jonathan ; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Kuehling, Felix
> ; Zhang, Yifan 
> Subject: RE: [PATCH] drm/amdkfd: fix shift out of bounds about gpu debug
>
> [Public]
>
> Hi Jon,
>
> -----Original Message-----
> From: Kim, Jonathan 
> Sent: Thursday, February 29, 2024 11:58 PM
> To: Zhang, Jesse(Jie) ; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Kuehling, Felix
> ; Zhang, Yifan ; Zhang,
> Jesse(Jie) ; Zhang, Jesse(Jie)
> Subject: RE: [PATCH] drm/amdkfd: fix shift out of bounds about gpu debug
>
> [Public]
>
> I think this was discussed in another thread.
> Exception codes should be range checked prior to applying the mask.  Raising
> null events to the debugger or runtime isn't useful.
> I haven't gotten around to fixing this yet.  I should have time this week.
> Just to double check, the out of bounds shift is because of a CP interrupt
> that generates a null exception code?
>
> [Zhang, Jesse(Jie)] Thanks for the reminder, I saw that discussion.
> In this interrupt, the other fields (such as source id, client id, and pasid)
> are correct; only the value of context_id0 (0xf) is invalid.
> How about adding a check like this:
>   } else if (source_id == SOC15_INTSRC_CP_BAD_OPCODE) {
> +   /* filter out the invalid context_id0 */
> +   if (!(context_id0 >> KFD_DEBUG_CP_BAD_OP_ECODE_SHIFT) ||
> +       (context_id0 >> KFD_DEBUG_CP_BAD_OP_ECODE_SHIFT) > EC_MAX)
> +           return;

The range check should probably treat any exception prefixed with
EC_QUEUE_PACKET_*, as defined in kfd_dbg_trap_exception_code, as valid:
https://github.com/torvalds/linux/blob/master/include/uapi/linux/kfd_ioctl.h#L857
+ Jay to confirm this is the correct exception range for CP_BAD_OPCODE

If that's the case, then I think we can define a 
KFD_DBG_EC_TYPE_IS_QUEUE_PACKET macro similar to:
https://github.com/torvalds/linux/blob/master/include/uapi/linux/kfd_ioctl.h#L917

That way, KFD process interrupts v9, v10, v11 can use that check prior to mask 
conversion and user space may find it useful as well.
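
As a rough sketch of what such a check could look like: the DEMO_* names and
the enum values below are placeholders for illustration only, not the actual
kfd_ioctl.h definitions, and the real macro may well end up looking different.
The idea is simply to validate the range first so that the mask conversion is
never evaluated for a null or out-of-range code.

```c
#include <stdint.h>

/*
 * Illustrative stand-ins for kfd_dbg_trap_exception_code.
 * The numeric values are assumptions; check kfd_ioctl.h for the real ones.
 */
enum demo_exception_code {
	DEMO_EC_NONE = 0,
	DEMO_EC_QUEUE_PACKET_FIRST = 16,
	DEMO_EC_QUEUE_PACKET_LAST = 23,
	DEMO_EC_MAX = 50
};

/* A code is usable only if it lies strictly between NONE and MAX. */
#define DEMO_EC_IS_VALID(ecode) \
	((ecode) > DEMO_EC_NONE && (ecode) < DEMO_EC_MAX)

/* Same shape as the unguarded mask macro; only safe after the range check. */
#define DEMO_EC_MASK(ecode)	(1ULL << ((ecode) - 1))

/* All bits from PACKET_FIRST through PACKET_LAST inclusive. */
#define DEMO_EC_MASK_QUEUE_PACKET \
	(((1ULL << DEMO_EC_QUEUE_PACKET_LAST) - 1) & \
	 ~((1ULL << (DEMO_EC_QUEUE_PACKET_FIRST - 1)) - 1))

/* && short-circuits, so the shift is skipped for invalid codes. */
#define DEMO_EC_TYPE_IS_QUEUE_PACKET(ecode) \
	(DEMO_EC_IS_VALID(ecode) && \
	 !!(DEMO_EC_MASK(ecode) & DEMO_EC_MASK_QUEUE_PACKET))
```

An interrupt handler could then return early whenever
DEMO_EC_TYPE_IS_QUEUE_PACKET(ecode) is false, before any mask is built.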

Jon
>     kfd_set_dbg_ev_from_interrupt(dev, pasid,
>             KFD_DEBUG_DOORBELL_ID(context_id0),
>             KFD_EC_MASK(KFD_DEBUG_CP_BAD_OP_ECODE(context_id0)),
>
> Thanks
> Jesse
> Jon
>
> > -----Original Message-----
> > From: Jesse Zhang 
> > Sent: Thursday, February 29, 2024 3:45 AM
> > To: amd-gfx@lists.freedesktop.org
> > Cc: Deucher, Alexander ; Kuehling, Felix
> > ; Kim, Jonathan ; Zhang,
> > Yifan ; Zhang, Jesse(Jie) ;
> > Zhang, Jesse(Jie) 
> > Subject: [PATCH] drm/amdkfd: fix shift out of bounds about gpu debug
> >
> >  the issue is :
> > [  388.151802] UBSAN: shift-out-of-bounds in drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_int_process_v10.c:346:5
> > [  388.151807] shift exponent 4294967295 is too large for 64-bit type 'long long unsigned int'
> > [  388.151812] CPU: 6 PID: 347 Comm: kworker/6:1H Tainted: GE 6.7.0+ #1
> > [  388.151814] Hardware name: AMD Splinter/Splinter-GNR, BIOS WS54117N_140 01/16/2024
> > [  388.151816] Workqueue: KFD IH interrupt_wq [amdgpu]
> > [  388.152084] Call Trace:
> > [  388.152086]  <TASK>
> > [  388.152089]  dump_stack_lvl+0x4c/0x70
> > [  388.152096]  dump_stack+0x14/0x20
> > [  388.152098]  ubsan_epilogue+0x9/0x40
> > [  388.152101]  __ubsan_handle_shift_out_of_bounds+0x113/0x170
> > [  388.152103]  ? vprintk+0x40/0x70
> > [  388.152106]  ? swsusp_check+0x131/0x190
> > [  388.152110]  event_interrupt_wq_v10.cold+0x16/0x1e [amdgpu]
> > [  388.152411]  ? raw_spin_rq_unlock+0x14/0x40
> > [  388.152415]  ? finish_task_switch+0x85/0x2a0
> > [  388.152417]  ? kfifo_copy_out+0x5f/0x70
> > [  388.152420]  interrupt_wq+0xb2/0x120 [amdgpu]
> > [  388.152642]  ? interrupt_wq+0xb2/0x120 [amdgpu]
> > [  388.152728]  process_scheduled_works+0x9a/0x3a0
> > [  388.152731]  ? __pfx_worker_thread+0x10/0x10
> > [  388.152732]  worker_thread+0x15f/0x2d0
> > [  388.152733]  ? __pfx_worker_thread+0x10/0x10
> > [  388.152734]  kthread+0xfb/0x130
> > [  388.152735]  ? __pfx_kthread+0x10/0x10
> > [  388.152736]  ret_from_fork+0x3d/0x60
> > [  388.152738]  ? __pfx_kthread+0x10/0x10
> > [  388.152739]  ret_from_fork_asm+0x1b/0x30
> > [  388.152742]  </TASK>
> >
> > Signed-off-by: Jesse Zhang 
> > ---
> >  include/uapi/linux/kfd_ioctl.h | 2 +-
> >  1 file changed, 1 

RE: [PATCH] drm/amdkfd: fix shift out of bounds about gpu debug

2024-02-29 Thread Zhang, Jesse(Jie)
[Public]

Hi Jon,

-----Original Message-----
From: Kim, Jonathan 
Sent: Thursday, February 29, 2024 11:58 PM
To: Zhang, Jesse(Jie) ; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Kuehling, Felix 
; Zhang, Yifan ; Zhang, 
Jesse(Jie) ; Zhang, Jesse(Jie) 
Subject: RE: [PATCH] drm/amdkfd: fix shift out of bounds about gpu debug

[Public]

I think this was discussed in another thread.
Exception codes should be range checked prior to applying the mask.  Raising 
null events to the debugger or runtime isn't useful.
I haven't gotten around to fixing this yet.  I should have time this week.
Just to double check, the out of bounds shift is because of a CP interrupt that 
generates a null exception code?

[Zhang, Jesse(Jie)] Thanks for the reminder, I saw that discussion.
In this interrupt, the other fields (such as source id, client id, and pasid)
are correct; only the value of context_id0 (0xf) is invalid.
How about adding a check like this:
  } else if (source_id == SOC15_INTSRC_CP_BAD_OPCODE) {
+   /* filter out the invalid context_id0 */
+   if (!(context_id0 >> KFD_DEBUG_CP_BAD_OP_ECODE_SHIFT) ||
+       (context_id0 >> KFD_DEBUG_CP_BAD_OP_ECODE_SHIFT) > EC_MAX)
+           return;
    kfd_set_dbg_ev_from_interrupt(dev, pasid,
            KFD_DEBUG_DOORBELL_ID(context_id0),
            KFD_EC_MASK(KFD_DEBUG_CP_BAD_OP_ECODE(context_id0)),

Thanks
Jesse
Jon
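
For illustration, the filtering idea in the snippet above can also be written
as a standalone helper that extracts the exception code from context_id0 and
refuses to build a mask for a null or out-of-range code. This is only a sketch:
the shift, mask, and upper bound are assumed values, the demo_* names are
hypothetical rather than taken from the KFD sources, and treating the upper
bound as exclusive is likewise an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed field layout, for illustration only. */
#define DEMO_ECODE_SHIFT	10
#define DEMO_ECODE_MASK		0x3fffc00u
#define DEMO_EC_MAX		50u	/* hypothetical exclusive upper bound */

/* Pull the exception code field out of the interrupt's context_id0. */
static inline uint32_t demo_bad_op_ecode(uint32_t context_id0)
{
	return (context_id0 & DEMO_ECODE_MASK) >> DEMO_ECODE_SHIFT;
}

/*
 * Returns true and stores the event mask only for in-range codes.
 * Returns false (leaving *mask untouched) so the caller can drop the
 * interrupt instead of shifting by an invalid count.
 */
static inline bool demo_bad_op_event_mask(uint32_t context_id0, uint64_t *mask)
{
	uint32_t ecode = demo_bad_op_ecode(context_id0);

	if (ecode == 0 || ecode >= DEMO_EC_MAX)
		return false;

	*mask = 1ULL << (ecode - 1);
	return true;
}
```

With this layout, a context_id0 of 0xf decodes to exception code 0, so the
helper returns false and no event would be raised.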

> -----Original Message-----
> From: Jesse Zhang 
> Sent: Thursday, February 29, 2024 3:45 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Kuehling, Felix
> ; Kim, Jonathan ; Zhang,
> Yifan ; Zhang, Jesse(Jie) ;
> Zhang, Jesse(Jie) 
> Subject: [PATCH] drm/amdkfd: fix shift out of bounds about gpu debug
>
>  the issue is :
> [  388.151802] UBSAN: shift-out-of-bounds in drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_int_process_v10.c:346:5
> [  388.151807] shift exponent 4294967295 is too large for 64-bit type 'long long unsigned int'
> [  388.151812] CPU: 6 PID: 347 Comm: kworker/6:1H Tainted: GE 6.7.0+ #1
> [  388.151814] Hardware name: AMD Splinter/Splinter-GNR, BIOS WS54117N_140 01/16/2024
> [  388.151816] Workqueue: KFD IH interrupt_wq [amdgpu]
> [  388.152084] Call Trace:
> [  388.152086]  <TASK>
> [  388.152089]  dump_stack_lvl+0x4c/0x70
> [  388.152096]  dump_stack+0x14/0x20
> [  388.152098]  ubsan_epilogue+0x9/0x40
> [  388.152101]  __ubsan_handle_shift_out_of_bounds+0x113/0x170
> [  388.152103]  ? vprintk+0x40/0x70
> [  388.152106]  ? swsusp_check+0x131/0x190
> [  388.152110]  event_interrupt_wq_v10.cold+0x16/0x1e [amdgpu]
> [  388.152411]  ? raw_spin_rq_unlock+0x14/0x40
> [  388.152415]  ? finish_task_switch+0x85/0x2a0
> [  388.152417]  ? kfifo_copy_out+0x5f/0x70
> [  388.152420]  interrupt_wq+0xb2/0x120 [amdgpu]
> [  388.152642]  ? interrupt_wq+0xb2/0x120 [amdgpu]
> [  388.152728]  process_scheduled_works+0x9a/0x3a0
> [  388.152731]  ? __pfx_worker_thread+0x10/0x10
> [  388.152732]  worker_thread+0x15f/0x2d0
> [  388.152733]  ? __pfx_worker_thread+0x10/0x10
> [  388.152734]  kthread+0xfb/0x130
> [  388.152735]  ? __pfx_kthread+0x10/0x10
> [  388.152736]  ret_from_fork+0x3d/0x60
> [  388.152738]  ? __pfx_kthread+0x10/0x10
> [  388.152739]  ret_from_fork_asm+0x1b/0x30
> [  388.152742]  </TASK>
>
> Signed-off-by: Jesse Zhang 
> ---
>  include/uapi/linux/kfd_ioctl.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h
> index 9ce46edc62a5..3d5867df17e8 100644
> --- a/include/uapi/linux/kfd_ioctl.h
> +++ b/include/uapi/linux/kfd_ioctl.h
> @@ -887,7 +887,7 @@ enum kfd_dbg_trap_exception_code {
>  };
>
>  /* Mask generated by ecode in kfd_dbg_trap_exception_code */
> -#define KFD_EC_MASK(ecode)   (1ULL << (ecode - 1))
> +#define KFD_EC_MASK(ecode)   (ecode ? (1ULL << (ecode - 1)) : 0ULL)
>
>  /* Masks for exception code type checks below */
>  #define KFD_EC_MASK_QUEUE  (KFD_EC_MASK(EC_QUEUE_WAVE_ABORT) | \
> --
> 2.25.1




RE: [PATCH] drm/amdkfd: fix shift out of bounds about gpu debug

2024-02-29 Thread Kim, Jonathan
[Public]

I think this was discussed in another thread.
Exception codes should be range checked prior to applying the mask.  Raising 
null events to the debugger or runtime isn't useful.
I haven't gotten around to fixing this yet.  I should have time this week.
Just to double check, the out of bounds shift is because of a CP interrupt that 
generates a null exception code?

Jon

> -----Original Message-----
> From: Jesse Zhang 
> Sent: Thursday, February 29, 2024 3:45 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Kuehling, Felix
> ; Kim, Jonathan ;
> Zhang, Yifan ; Zhang, Jesse(Jie)
> ; Zhang, Jesse(Jie) 
> Subject: [PATCH] drm/amdkfd: fix shift out of bounds about gpu debug
>
>  the issue is :
> [  388.151802] UBSAN: shift-out-of-bounds in
> drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_int_process_v10.c:346:5
> [  388.151807] shift exponent 4294967295 is too large for 64-bit type 'long
> long unsigned int'
> [  388.151812] CPU: 6 PID: 347 Comm: kworker/6:1H Tainted: GE
> 6.7.0+ #1
> [  388.151814] Hardware name: AMD Splinter/Splinter-GNR, BIOS
> WS54117N_140 01/16/2024
> [  388.151816] Workqueue: KFD IH interrupt_wq [amdgpu]
> [  388.152084] Call Trace:
> [  388.152086]  <TASK>
> [  388.152089]  dump_stack_lvl+0x4c/0x70
> [  388.152096]  dump_stack+0x14/0x20
> [  388.152098]  ubsan_epilogue+0x9/0x40
> [  388.152101]  __ubsan_handle_shift_out_of_bounds+0x113/0x170
> [  388.152103]  ? vprintk+0x40/0x70
> [  388.152106]  ? swsusp_check+0x131/0x190
> [  388.152110]  event_interrupt_wq_v10.cold+0x16/0x1e [amdgpu]
> [  388.152411]  ? raw_spin_rq_unlock+0x14/0x40
> [  388.152415]  ? finish_task_switch+0x85/0x2a0
> [  388.152417]  ? kfifo_copy_out+0x5f/0x70
> [  388.152420]  interrupt_wq+0xb2/0x120 [amdgpu]
> [  388.152642]  ? interrupt_wq+0xb2/0x120 [amdgpu]
> [  388.152728]  process_scheduled_works+0x9a/0x3a0
> [  388.152731]  ? __pfx_worker_thread+0x10/0x10
> [  388.152732]  worker_thread+0x15f/0x2d0
> [  388.152733]  ? __pfx_worker_thread+0x10/0x10
> [  388.152734]  kthread+0xfb/0x130
> [  388.152735]  ? __pfx_kthread+0x10/0x10
> [  388.152736]  ret_from_fork+0x3d/0x60
> [  388.152738]  ? __pfx_kthread+0x10/0x10
> [  388.152739]  ret_from_fork_asm+0x1b/0x30
> [  388.152742]  </TASK>
>
> Signed-off-by: Jesse Zhang 
> ---
>  include/uapi/linux/kfd_ioctl.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h
> index 9ce46edc62a5..3d5867df17e8 100644
> --- a/include/uapi/linux/kfd_ioctl.h
> +++ b/include/uapi/linux/kfd_ioctl.h
> @@ -887,7 +887,7 @@ enum kfd_dbg_trap_exception_code {
>  };
>
>  /* Mask generated by ecode in kfd_dbg_trap_exception_code */
> -#define KFD_EC_MASK(ecode)   (1ULL << (ecode - 1))
> +#define KFD_EC_MASK(ecode)   (ecode ? (1ULL << (ecode - 1)) : 0ULL)
>
>  /* Masks for exception code type checks below */
>  #define KFD_EC_MASK_QUEUE  (KFD_EC_MASK(EC_QUEUE_WAVE_ABORT) | \
> --
> 2.25.1



[PATCH] drm/amdkfd: fix shift out of bounds about gpu debug

2024-02-29 Thread Jesse Zhang
The issue is:
[  388.151802] UBSAN: shift-out-of-bounds in drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_int_process_v10.c:346:5
[  388.151807] shift exponent 4294967295 is too large for 64-bit type 'long long unsigned int'
[  388.151812] CPU: 6 PID: 347 Comm: kworker/6:1H Tainted: GE 6.7.0+ #1
[  388.151814] Hardware name: AMD Splinter/Splinter-GNR, BIOS WS54117N_140 01/16/2024
[  388.151816] Workqueue: KFD IH interrupt_wq [amdgpu]
[  388.152084] Call Trace:
[  388.152086]  <TASK>
[  388.152089]  dump_stack_lvl+0x4c/0x70
[  388.152096]  dump_stack+0x14/0x20
[  388.152098]  ubsan_epilogue+0x9/0x40
[  388.152101]  __ubsan_handle_shift_out_of_bounds+0x113/0x170
[  388.152103]  ? vprintk+0x40/0x70
[  388.152106]  ? swsusp_check+0x131/0x190
[  388.152110]  event_interrupt_wq_v10.cold+0x16/0x1e [amdgpu]
[  388.152411]  ? raw_spin_rq_unlock+0x14/0x40
[  388.152415]  ? finish_task_switch+0x85/0x2a0
[  388.152417]  ? kfifo_copy_out+0x5f/0x70
[  388.152420]  interrupt_wq+0xb2/0x120 [amdgpu]
[  388.152642]  ? interrupt_wq+0xb2/0x120 [amdgpu]
[  388.152728]  process_scheduled_works+0x9a/0x3a0
[  388.152731]  ? __pfx_worker_thread+0x10/0x10
[  388.152732]  worker_thread+0x15f/0x2d0
[  388.152733]  ? __pfx_worker_thread+0x10/0x10
[  388.152734]  kthread+0xfb/0x130
[  388.152735]  ? __pfx_kthread+0x10/0x10
[  388.152736]  ret_from_fork+0x3d/0x60
[  388.152738]  ? __pfx_kthread+0x10/0x10
[  388.152739]  ret_from_fork_asm+0x1b/0x30
[  388.152742]  </TASK>

Signed-off-by: Jesse Zhang 
---
 include/uapi/linux/kfd_ioctl.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h
index 9ce46edc62a5..3d5867df17e8 100644
--- a/include/uapi/linux/kfd_ioctl.h
+++ b/include/uapi/linux/kfd_ioctl.h
@@ -887,7 +887,7 @@ enum kfd_dbg_trap_exception_code {
 };
 
 /* Mask generated by ecode in kfd_dbg_trap_exception_code */
-#define KFD_EC_MASK(ecode) (1ULL << (ecode - 1))
+#define KFD_EC_MASK(ecode) (ecode ? (1ULL << (ecode - 1)) : 0ULL)
 
 /* Masks for exception code type checks below */
 #define KFD_EC_MASK_QUEUE  (KFD_EC_MASK(EC_QUEUE_WAVE_ABORT) | \
-- 
2.25.1
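
To illustrate the report above: the exponent 4294967295 in the UBSAN message is
what (ecode - 1) becomes when a 32-bit unsigned exception code is 0. The sketch
below reproduces that arithmetic and shows the guarded form of the mask. The
DEMO_/demo_ names are hypothetical, and the 32-bit unsigned type is an
assumption inferred from the reported exponent.

```c
#include <stdint.h>

/*
 * Shift count the unguarded macro would use for a given code.
 * Unsigned subtraction wraps, so a code of 0 yields 4294967295.
 */
static inline uint32_t demo_shift_count(uint32_t ecode)
{
	return ecode - 1u;
}

/*
 * Guarded form, mirroring the patch: a null code gives an empty mask
 * instead of an out-of-bounds shift. The argument is parenthesized
 * here, which the patch does not do.
 */
#define DEMO_EC_MASK(ecode)	((ecode) ? (1ULL << ((ecode) - 1)) : 0ULL)
```

Note that the guard only covers the null code. A code above 64 would still
shift past the width of the 64-bit type, which is one reason a range check
before the mask conversion is attractive.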