On 3/25/26 11:26, Donet Tom wrote: > > On 3/25/26 3:04 PM, Christian König wrote: >> On 3/25/26 03:26, Kuehling, Felix wrote: >>> On 2026-03-24 14:19, Donet Tom wrote: >>>> On 3/23/26 6:42 PM, Christian König wrote: >>>>> On 3/23/26 12:50, Donet Tom wrote: >>>>>> On 3/23/26 3:41 PM, Christian König wrote: >>>>>> >>>>>> Hi Christian >>>>>> >>>>>>> On 3/23/26 05:28, Donet Tom wrote: >>>>>>>> Currently, AMDGPU_VA_RESERVED_TRAP_SIZE is hardcoded to 8KB, while >>>>>>>> KFD_CWSR_TBA_TMA_SIZE is defined as 2 * PAGE_SIZE. On systems with >>>>>>>> 4K pages, both values match (8KB), so allocation and reserved space >>>>>>>> are consistent. >>>>>>>> >>>>>>>> However, on 64K page-size systems, KFD_CWSR_TBA_TMA_SIZE becomes 128KB, >>>>>>>> while the reserved trap area remains 8KB. This mismatch causes the >>>>>>>> kernel to crash when running rocminfo or rccl unit tests. >>>>>>>> >>>>>>>> Kernel attempted to read user page (2) - exploit attempt? (uid: 1001) >>>>>>>> BUG: Kernel NULL pointer dereference on read at 0x00000002 >>>>>>>> Faulting instruction address: 0xc0000000002c8a64 >>>>>>>> Oops: Kernel access of bad area, sig: 11 [#1] >>>>>>>> LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries >>>>>>>> CPU: 34 UID: 1001 PID: 9379 Comm: rocminfo Tainted: G E >>>>>>>> 6.19.0-rc4-amdgpu-00320-gf23176405700 #56 VOLUNTARY >>>>>>>> Tainted: [E]=UNSIGNED_MODULE >>>>>>>> Hardware name: IBM,9105-42A POWER10 (architected) 0x800200 0xf000006 >>>>>>>> of:IBM,FW1060.30 (ML1060_896) hv:phyp pSeries >>>>>>>> NIP: c0000000002c8a64 LR: c00000000125dbc8 CTR: c00000000125e730 >>>>>>>> REGS: c0000001e0957580 TRAP: 0300 Tainted: G E >>>>>>>> MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24008268 >>>>>>>> XER: 00000036 >>>>>>>> CFAR: c00000000125dbc4 DAR: 0000000000000002 DSISR: 40000000 >>>>>>>> IRQMASK: 1 >>>>>>>> GPR00: c00000000125d908 c0000001e0957820 c0000000016e8100 >>>>>>>> c00000013d814540 >>>>>>>> GPR04: 0000000000000002 c00000013d814550 0000000000000045 >>>>>>>> 0000000000000000 >>>>>>>> GPR08: c00000013444d000 c00000013d814538 c00000013d814538 >>>>>>>> 0000000084002268 >>>>>>>> GPR12: c00000000125e730 c000007e2ffd5f00 ffffffffffffffff >>>>>>>> 0000000000020000 >>>>>>>> GPR16: 0000000000000000 0000000000000002 c00000015f653000 >>>>>>>> 0000000000000000 >>>>>>>> GPR20: c000000138662400 c00000013d814540 0000000000000000 >>>>>>>> c00000013d814500 >>>>>>>> GPR24: 0000000000000000 0000000000000002 c0000001e0957888 >>>>>>>> c0000001e0957878 >>>>>>>> GPR28: c00000013d814548 0000000000000000 c00000013d814540 >>>>>>>> c0000001e0957888 >>>>>>>> NIP [c0000000002c8a64] __mutex_add_waiter+0x24/0xc0 >>>>>>>> LR [c00000000125dbc8] __mutex_lock.constprop.0+0x318/0xd00 >>>>>>>> Call Trace: >>>>>>>> 0xc0000001e0957890 (unreliable) >>>>>>>> __mutex_lock.constprop.0+0x58/0xd00 >>>>>>>> amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x6fc/0xb60 [amdgpu] >>>>>>>> kfd_process_alloc_gpuvm+0x54/0x1f0 [amdgpu] >>>>>>>> kfd_process_device_init_cwsr_dgpu+0xa4/0x1a0 [amdgpu] >>>>>>>> kfd_process_device_init_vm+0xd8/0x2e0 [amdgpu] >>>>>>>> kfd_ioctl_acquire_vm+0xd0/0x130 [amdgpu] >>>>>>>> kfd_ioctl+0x514/0x670 [amdgpu] >>>>>>>> sys_ioctl+0x134/0x180 >>>>>>>> system_call_exception+0x114/0x300 >>>>>>>> system_call_vectored_common+0x15c/0x2ec >>>>>>>> >>>>>>>> This patch changes AMDGPU_VA_RESERVED_TRAP_SIZE to 2 * PAGE_SIZE, >>>>>>>> ensuring that the reserved trap area matches the allocation size >>>>>>>> across all page sizes. >>>>>>>> >>>>>>>> cc: [email protected] >>>>>>>> Fixes: 34a1de0f7935 ("drm/amdkfd: Relocate TBA/TMA to opposite side of >>>>>>>> VM hole") >>>>>>>> Reviewed-by: Ritesh Harjani (IBM) <[email protected]> >>>>>>>> Signed-off-by: Donet Tom <[email protected]> >>>>>>>> --- >>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 2 +- >>>>>>>> 1 file changed, 1 insertion(+), 1 deletion(-) >>>>>>>> >>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h >>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h >>>>>>>> index 139642eacdd0..a5eae49f9471 100644 >>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h >>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h >>>>>>>> @@ -173,7 +173,7 @@ struct amdgpu_bo_vm; >>>>>>>> #define AMDGPU_VA_RESERVED_SEQ64_SIZE (2ULL << 20) >>>>>>>> #define AMDGPU_VA_RESERVED_SEQ64_START(adev) >>>>>>>> (AMDGPU_VA_RESERVED_CSA_START(adev) \ >>>>>>>> - AMDGPU_VA_RESERVED_SEQ64_SIZE) >>>>>>>> -#define AMDGPU_VA_RESERVED_TRAP_SIZE (2ULL << 12) >>>>>>>> +#define AMDGPU_VA_RESERVED_TRAP_SIZE (2ULL << PAGE_SHIFT) >>>>>>> Well using PAGE_SHIFT in amdgpu_vm.h looks quite broken to me. >>>>>>> >>>>>>> That makes the GPU VA reservation depend on the CPU page size and that >>>>>>> is clearly not something we want to have. >>>>>>> >>>>>>> Where is KFD_CWSR_TBA_TMA_SIZE defined? >>>>>>> >>>>>> Thanks Christian for reviewing this patch. >>>>>> >>>>>> It is defined in kfd_priv.h. >>>>>> >>>>>> /* >>>>>> * Size of the per-process TBA+TMA buffer: 2 pages >>>>>> * >>>>>> * The first chunk is the TBA used for the CWSR ISA code. The second >>>>>> * chunk is used as TMA for user-mode trap handler setup in >>>>>> daisy-chain mode. >>>>>> */ >>>>>> #define KFD_CWSR_TBA_TMA_SIZE (PAGE_SIZE * 2) >>>>>> >>>>>> >>>>>> >>>>>> Could you please suggest the correct way to fix this issue? >>>>> I'm only looking from the POV of the VM code on this, but my educated >>>>> guess is that KFD_CWSR_TBA_TMA_SIZE should be 8k independent of the CPU >>>>> page size. >>>>> >>>>> Background is that this is written by the shader trap handler and that >>>>> byte code doesn't care what CPU architecture you have. >>>>> >>>>> But I think only the engineers working on that trap handler can really >>>>> answer this. @Felix / @Philip? >>>> >>>> Hi @christian @Felix @Philip >>>> >>>> To remove the dependency on CPU page size, can we use >>>> >>>> +#define AMDGPU_VA_RESERVED_TRAP_SIZE (2ULL << 16) >>>> >>>> During reservation, we reserve 128 bytes, but during >>>> allocation, we use 2 * PAGE_SIZE. >>> We only need two GPU pages here. I think what Christian is objecting to is, >>> that the GPU VM layout should not depend on the CPU page size. >> Yes, exactly that was my concern. >> >>> @Christian, it sounds like the BO allocations happen with 64KB granularity, >>> but the mapping is still using 4KB granularity. Is the right solution to >>> GPU-map only the first 8KB of the trap handler BO to keep the layout the >>> same across CPU architectures? >> Well that would work technically, but I agree that it also sounds a bit >> questionable as well. >> >>> I guess then the "correct" solution would be to change >>> amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu and >>> amdgpu_amdkfd_gpuvm_map_memory_to_gpu to support mapping of the requested >>> size with GPU page size granularity regardless of the CPU page size. But >>> that would increase complexity for a very niche uses case. >>> >>> An easier solution would be to PAGE_ALIGN 8KB to the system page size. But >>> that results in the virtual address space layout to depend on the system >>> page size. >> Yeah, that dependency is certainly undesirable. We could easily end up with >> issues which can only be reproduced on systems with 64k page size. >> >>> If that's objectionable, then the next best solution is to round up the >>> trap handler size to 64KB byte unconditionally, so its the same with 4KB or >>> 64KB system page size. But that would mean unnecessarily wasting a little >>> memory per process/GPU on x86. >> How about we always reserve 64KiB address space (or maybe even more, if you >> reserve 2MiB or 64KiB doesn't matter), but only map as large as the >> allocated buffer actually is? >> >> I think that this would be my preferred solution. > > > Hi @Christian @Felix > > Thanks for the review. > > I have made the suggested change. I am now reserving 64 KB > in the address space for the trap, while allocating > only 8 KB for both 4K and 64K page sizes. With this change, > I am no longer seeing crashes on either 4K or 64K systems. > > Does this approach look reasonable to you?
Looks correct to me, but Felix clearly has the last word on that. Regards, Christian. > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h > index bb276c0ad06d..d5b7061556ba 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h > @@ -173,7 +173,7 @@ struct amdgpu_bo_vm; > #define AMDGPU_VA_RESERVED_SEQ64_SIZE (2ULL << 20) > #define AMDGPU_VA_RESERVED_SEQ64_START(adev) > (AMDGPU_VA_RESERVED_CSA_START(adev) \ > - > AMDGPU_VA_RESERVED_SEQ64_SIZE) > -#define AMDGPU_VA_RESERVED_TRAP_SIZE (2ULL << 12) > +#define AMDGPU_VA_RESERVED_TRAP_SIZE (1ULL << 16) > #define AMDGPU_VA_RESERVED_TRAP_START(adev) > (AMDGPU_VA_RESERVED_SEQ64_START(adev) \ > - > AMDGPU_VA_RESERVED_TRAP_SIZE) > #define AMDGPU_VA_RESERVED_BOTTOM (1ULL << 16) > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h > b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h > index e5b56412931b..035687a17d89 100644 > --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h > @@ -102,8 +102,8 @@ > * The first chunk is the TBA used for the CWSR ISA code. The second > * chunk is used as TMA for user-mode trap handler setup in daisy-chain mode. > */ > -#define KFD_CWSR_TBA_TMA_SIZE (PAGE_SIZE * 2) > -#define KFD_CWSR_TMA_OFFSET (PAGE_SIZE + 2048) > +#define KFD_CWSR_TBA_TMA_SIZE (AMDGPU_GPU_PAGE_SIZE * 2) > +#define KFD_CWSR_TMA_OFFSET (AMDGPU_GPU_PAGE_SIZE + 2048) > > #define KFD_MAX_NUM_OF_QUEUES_PER_DEVICE \ > (KFD_MAX_NUM_OF_PROCESSES * > > > >> >> Regards, >> Christian. >> >>> Regards, >>> Felix >>> >>> >>>> >>>> -Donet >>>> >>>>> Regards, >>>>> Christian. >>>>> >>>>>> -Donet >>>>>> >>>>>>> Regards, >>>>>>> Christian. >>>>>>> >>>>>>>> #define AMDGPU_VA_RESERVED_TRAP_START(adev) >>>>>>>> (AMDGPU_VA_RESERVED_SEQ64_START(adev) \ >>>>>>>> - AMDGPU_VA_RESERVED_TRAP_SIZE) >>>>>>>> #define AMDGPU_VA_RESERVED_BOTTOM (1ULL << 16)
