On 3/25/26 11:26, Donet Tom wrote:
> 
> On 3/25/26 3:04 PM, Christian König wrote:
>> On 3/25/26 03:26, Kuehling, Felix wrote:
>>> On 2026-03-24 14:19, Donet Tom wrote:
>>>> On 3/23/26 6:42 PM, Christian König wrote:
>>>>> On 3/23/26 12:50, Donet Tom wrote:
>>>>>> On 3/23/26 3:41 PM, Christian König wrote:
>>>>>>
>>>>>> Hi Christian
>>>>>>
>>>>>>> On 3/23/26 05:28, Donet Tom wrote:
>>>>>>>> Currently, AMDGPU_VA_RESERVED_TRAP_SIZE is hardcoded to 8KB, while
>>>>>>>> KFD_CWSR_TBA_TMA_SIZE is defined as 2 * PAGE_SIZE. On systems with
>>>>>>>> 4K pages, both values match (8KB), so allocation and reserved space
>>>>>>>> are consistent.
>>>>>>>>
>>>>>>>> However, on 64K page-size systems, KFD_CWSR_TBA_TMA_SIZE becomes 128KB,
>>>>>>>> while the reserved trap area remains 8KB. This mismatch causes the
>>>>>>>> kernel to crash when running rocminfo or rccl unit tests.
>>>>>>>>
>>>>>>>> Kernel attempted to read user page (2) - exploit attempt? (uid: 1001)
>>>>>>>> BUG: Kernel NULL pointer dereference on read at 0x00000002
>>>>>>>> Faulting instruction address: 0xc0000000002c8a64
>>>>>>>> Oops: Kernel access of bad area, sig: 11 [#1]
>>>>>>>> LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
>>>>>>>> CPU: 34 UID: 1001 PID: 9379 Comm: rocminfo Tainted: G E
>>>>>>>> 6.19.0-rc4-amdgpu-00320-gf23176405700 #56 VOLUNTARY
>>>>>>>> Tainted: [E]=UNSIGNED_MODULE
>>>>>>>> Hardware name: IBM,9105-42A POWER10 (architected) 0x800200 0xf000006
>>>>>>>> of:IBM,FW1060.30 (ML1060_896) hv:phyp pSeries
>>>>>>>> NIP:  c0000000002c8a64 LR: c00000000125dbc8 CTR: c00000000125e730
>>>>>>>> REGS: c0000001e0957580 TRAP: 0300 Tainted: G E
>>>>>>>> MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24008268
>>>>>>>> XER: 00000036
>>>>>>>> CFAR: c00000000125dbc4 DAR: 0000000000000002 DSISR: 40000000
>>>>>>>> IRQMASK: 1
>>>>>>>> GPR00: c00000000125d908 c0000001e0957820 c0000000016e8100
>>>>>>>> c00000013d814540
>>>>>>>> GPR04: 0000000000000002 c00000013d814550 0000000000000045
>>>>>>>> 0000000000000000
>>>>>>>> GPR08: c00000013444d000 c00000013d814538 c00000013d814538
>>>>>>>> 0000000084002268
>>>>>>>> GPR12: c00000000125e730 c000007e2ffd5f00 ffffffffffffffff
>>>>>>>> 0000000000020000
>>>>>>>> GPR16: 0000000000000000 0000000000000002 c00000015f653000
>>>>>>>> 0000000000000000
>>>>>>>> GPR20: c000000138662400 c00000013d814540 0000000000000000
>>>>>>>> c00000013d814500
>>>>>>>> GPR24: 0000000000000000 0000000000000002 c0000001e0957888
>>>>>>>> c0000001e0957878
>>>>>>>> GPR28: c00000013d814548 0000000000000000 c00000013d814540
>>>>>>>> c0000001e0957888
>>>>>>>> NIP [c0000000002c8a64] __mutex_add_waiter+0x24/0xc0
>>>>>>>> LR [c00000000125dbc8] __mutex_lock.constprop.0+0x318/0xd00
>>>>>>>> Call Trace:
>>>>>>>> 0xc0000001e0957890 (unreliable)
>>>>>>>> __mutex_lock.constprop.0+0x58/0xd00
>>>>>>>> amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x6fc/0xb60 [amdgpu]
>>>>>>>> kfd_process_alloc_gpuvm+0x54/0x1f0 [amdgpu]
>>>>>>>> kfd_process_device_init_cwsr_dgpu+0xa4/0x1a0 [amdgpu]
>>>>>>>> kfd_process_device_init_vm+0xd8/0x2e0 [amdgpu]
>>>>>>>> kfd_ioctl_acquire_vm+0xd0/0x130 [amdgpu]
>>>>>>>> kfd_ioctl+0x514/0x670 [amdgpu]
>>>>>>>> sys_ioctl+0x134/0x180
>>>>>>>> system_call_exception+0x114/0x300
>>>>>>>> system_call_vectored_common+0x15c/0x2ec
>>>>>>>>
>>>>>>>> This patch changes AMDGPU_VA_RESERVED_TRAP_SIZE to 2 * PAGE_SIZE,
>>>>>>>> ensuring that the reserved trap area matches the allocation size
>>>>>>>> across all page sizes.
>>>>>>>>
>>>>>>>> cc: [email protected]
>>>>>>>> Fixes: 34a1de0f7935 ("drm/amdkfd: Relocate TBA/TMA to opposite side of 
>>>>>>>> VM hole")
>>>>>>>> Reviewed-by: Ritesh Harjani (IBM) <[email protected]>
>>>>>>>> Signed-off-by: Donet Tom <[email protected]>
>>>>>>>> ---
>>>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 2 +-
>>>>>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h 
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>>>>>> index 139642eacdd0..a5eae49f9471 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>>>>>> @@ -173,7 +173,7 @@ struct amdgpu_bo_vm;
>>>>>>>>    #define AMDGPU_VA_RESERVED_SEQ64_SIZE        (2ULL << 20)
>>>>>>>>    #define AMDGPU_VA_RESERVED_SEQ64_START(adev) 
>>>>>>>> (AMDGPU_VA_RESERVED_CSA_START(adev) \
>>>>>>>>                             - AMDGPU_VA_RESERVED_SEQ64_SIZE)
>>>>>>>> -#define AMDGPU_VA_RESERVED_TRAP_SIZE        (2ULL << 12)
>>>>>>>> +#define AMDGPU_VA_RESERVED_TRAP_SIZE        (2ULL << PAGE_SHIFT)
>>>>>>> Well using PAGE_SHIFT in amdgpu_vm.h looks quite broken to me.
>>>>>>>
>>>>>>> That makes the GPU VA reservation depend on the CPU page size and that 
>>>>>>> is clearly not something we want to have.
>>>>>>>
>>>>>>> Where is KFD_CWSR_TBA_TMA_SIZE defined?
>>>>>>>
>>>>>> Thanks Christian for reviewing this patch.
>>>>>>
>>>>>> It is defined in kfd_priv.h.
>>>>>>
>>>>>> /*
>>>>>>    * Size of the per-process TBA+TMA buffer: 2 pages
>>>>>>    *
>>>>>>    * The first chunk is the TBA used for the CWSR ISA code. The second
>>>>>>    * chunk is used as TMA for user-mode trap handler setup in 
>>>>>> daisy-chain mode.
>>>>>>    */
>>>>>> #define KFD_CWSR_TBA_TMA_SIZE (PAGE_SIZE * 2)
>>>>>>
>>>>>>
>>>>>>
>>>>>> Could you please suggest the correct way to fix this issue?
>>>>> I'm only looking from the POV of the VM code on this, but my educated 
>>>>> guess is that KFD_CWSR_TBA_TMA_SIZE should be 8k independent of the CPU 
>>>>> page size.
>>>>>
>>>>> Background is that this is written by the shader trap handler and that 
>>>>> byte code doesn't care what CPU architecture you have.
>>>>>
>>>>> But I think only the engineers working on that trap handler can really 
>>>>> answer this. @Felix / @Philip?
>>>>
>>>> Hi @christian @Felix @Philip
>>>>
>>>> To remove the dependency on CPU page size, can we use
>>>>
>>>> +#define AMDGPU_VA_RESERVED_TRAP_SIZE    (2ULL << 16)
>>>>
>>>> During reservation, we reserve 128 bytes, but during
>>>> allocation, we use 2 * PAGE_SIZE.
>>> We only need two GPU pages here. I think what Christian is objecting to is, 
>>> that the GPU VM layout should not depend on the CPU page size.
>> Yes, exactly that was my concern.
>>
>>> @Christian, it sounds like the BO allocations happen with 64KB granularity, 
>>> but the mapping is still using 4KB granularity. Is the right solution to 
>>> GPU-map only the first 8KB of the trap handler BO to keep the layout the 
>>> same across CPU architectures?
>> Well that would work technically, but I agree that it also sounds a bit 
>> questionable as well.
>>
>>> I guess then the "correct" solution would be to change 
>>> amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu and 
>>> amdgpu_amdkfd_gpuvm_map_memory_to_gpu to support mapping of the requested 
>>> size with GPU page size granularity regardless of the CPU page size. But 
>>> that would increase complexity for a very niche uses case.
>>>
>>> An easier solution would be to PAGE_ALIGN 8KB to the system page size. But 
>>> that results in the virtual address space layout to depend on the system 
>>> page size.
>> Yeah, that dependency is certainly undesirable. We could easily end up with 
>> issues which can only be reproduced on systems with 64k page size.
>>
>>> If that's objectionable, then the next best solution is to round up the 
>>> trap handler size to 64KB byte unconditionally, so its the same with 4KB or 
>>> 64KB system page size. But that would mean unnecessarily wasting a little 
>>> memory per process/GPU on x86.
>> How about we always reserve 64KiB address space (or maybe even more, if you 
>> reserve 2MiB or 64KiB doesn't matter), but only map as large as the 
>> allocated buffer actually is?
>>
>> I think that this would be my preferred solution.
> 
> 
> Hi @Christian @Felix
> 
> Thanks for the review.
> 
> I have made the suggested change. I am now reserving 64 KB
> in the  address space for the trap, while allocating
> only 8 KB for both 4K and 64K page sizes. With this change,
> I am no longer seeing crashes on either 4K or 64K systems.
> 
> Does this approach look reasonable to you?

Looks correct to me, but Felix clearly has the last word on that.

Regards,
Christian.

> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> index bb276c0ad06d..d5b7061556ba 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> @@ -173,7 +173,7 @@ struct amdgpu_bo_vm;
>  #define AMDGPU_VA_RESERVED_SEQ64_SIZE          (2ULL << 20)
>  #define AMDGPU_VA_RESERVED_SEQ64_START(adev)  
> (AMDGPU_VA_RESERVED_CSA_START(adev) \
>                                                  - 
> AMDGPU_VA_RESERVED_SEQ64_SIZE)
> -#define AMDGPU_VA_RESERVED_TRAP_SIZE           (2ULL << 12)
> +#define AMDGPU_VA_RESERVED_TRAP_SIZE           (1ULL << 16)
>  #define AMDGPU_VA_RESERVED_TRAP_START(adev) 
> (AMDGPU_VA_RESERVED_SEQ64_START(adev) \
>                                                  - 
> AMDGPU_VA_RESERVED_TRAP_SIZE)
>  #define AMDGPU_VA_RESERVED_BOTTOM              (1ULL << 16)
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
> b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> index e5b56412931b..035687a17d89 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> @@ -102,8 +102,8 @@
>   * The first chunk is the TBA used for the CWSR ISA code. The second
>   * chunk is used as TMA for user-mode trap handler setup in daisy-chain mode.
>   */
> -#define KFD_CWSR_TBA_TMA_SIZE (PAGE_SIZE * 2)
> -#define KFD_CWSR_TMA_OFFSET (PAGE_SIZE + 2048)
> +#define KFD_CWSR_TBA_TMA_SIZE (AMDGPU_GPU_PAGE_SIZE * 2)
> +#define KFD_CWSR_TMA_OFFSET (AMDGPU_GPU_PAGE_SIZE + 2048)
> 
>  #define KFD_MAX_NUM_OF_QUEUES_PER_DEVICE               \
>         (KFD_MAX_NUM_OF_PROCESSES *
> 
> 
> 
>>
>> Regards,
>> Christian.
>>
>>> Regards,
>>>    Felix
>>>
>>>
>>>>
>>>> -Donet
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> -Donet
>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>>    #define AMDGPU_VA_RESERVED_TRAP_START(adev) 
>>>>>>>> (AMDGPU_VA_RESERVED_SEQ64_START(adev) \
>>>>>>>>                             - AMDGPU_VA_RESERVED_TRAP_SIZE)
>>>>>>>>    #define AMDGPU_VA_RESERVED_BOTTOM        (1ULL << 16)

Reply via email to