On 3/25/26 14:18, SHANMUGAM, SRINIVASAN wrote:
>> -----Original Message-----
>> From: Koenig, Christian <[email protected]>
>> Sent: Wednesday, March 25, 2026 5:39 PM
>> To: SHANMUGAM, SRINIVASAN <[email protected]>;
>> Deucher, Alexander <[email protected]>
>> Cc: [email protected]
>> Subject: Re: [PATCH] drm/amdgpu: Fix PRT VA handling and guard BO access in
>> VA update path
>>
>> On 3/25/26 12:58, Srinivasan Shanmugam wrote:
>>> PRT (Page Request Table) mappings are not backed by a real buffer. In
>>
>> PRT (Partial Resident Texture).
>>
>>> this case, bo_va is valid, but bo_va->bo is NULL, meaning the mapping
>>> exists but does not point to any real buffer object.
>>>
>>> amdgpu_gem_va_ioctl() currently mixes CLEAR and PRT handling, which
>>> can result in incorrect bo_va selection. CLEAR should use bo_va =
>>> NULL, while PRT should use the special fpriv->prt_va mapping.
>>>
>>> Fix this by clearly selecting bo_va:
>>> - use fpriv->prt_va for PRT
>>> - use NULL only for CLEAR
>>> - use amdgpu_vm_bo_find() for normal BO mappings
>>>
>>> Also, amdgpu_gem_va_update_vm() accesses bo_va->base.bo without
>>> checking if it is NULL. This is not valid for PRT mappings.
>>>
>>> This keeps CLEAR, PRT, and normal cases separate and avoids invalid
>>> memory access.
>>>
>>> Cc: Alex Deucher <[email protected]>
>>> Suggested-by: Christian König <[email protected]>
>>> Signed-off-by: Srinivasan Shanmugam <[email protected]>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 18 ++++++++++++++----
>>> 1 file changed, 14 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>> index b0ba2bdaf43a..289d6b58b579 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>> @@ -772,8 +772,10 @@ amdgpu_gem_va_update_vm(struct amdgpu_device *adev,
>>> if (r)
>>> goto error;
>>>
>>> + /* Only do BO-specific handling if this VA is backed by a real BO */
>>> if ((operation == AMDGPU_VA_OP_MAP ||
>>> operation == AMDGPU_VA_OP_REPLACE) &&
>>> + bo_va->base.bo &&
>>
>> That is not correct. This branch here should also be taken for PRT mappings.
>>
>>> !amdgpu_vm_is_bo_always_valid(vm, bo_va->base.bo)) {
>>>
>>> /*
>>> @@ -909,15 +911,23 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>> goto error;
>>> }
>>>
>>> - /* Resolve the BO-VA mapping for this VM/BO combination. */
>>> - if (abo) {
>>> + /* Resolve the BO-VA mapping for this VM/BO combination.
>>> + *
>>> + * Depending on the case decide bo_va:
>>> + * - PRT: use special per-file prt_va (bo_va valid, but bo_va->bo == NULL)
>>> + * - CLEAR: no BO involved → bo_va = NULL
>>> + * - Normal BO path: lookup mapping from VM
>>> + */
>>> + if (args->flags & AMDGPU_VM_PAGE_PRT) {
>>> + bo_va = fpriv->prt_va;
>>> + } else if (args->operation == AMDGPU_VA_OP_CLEAR) {
>>> + bo_va = NULL;
>>> + } else if (abo) {
>>> bo_va = amdgpu_vm_bo_find(&fpriv->vm, abo);
>>> if (!bo_va) {
>>> r = -ENOENT;
>>> goto error;
>>> }
>>> - } else if (args->operation != AMDGPU_VA_OP_CLEAR) {
>>> - bo_va = fpriv->prt_va;
>>
>> That code already looks correct to me. I don't think we need to change
>> anything here.
>>
>> Where is your crash actually coming from?
>
> Hi Christian,
>
> The issue was observed in CI during IGT (amd_bo) runs, but I have not
> yet been able to reproduce it locally. Will continue investigating to
> identify the exact failing path.
That is most likely something completely different. As far as I can see the
bo_va handling is correct.
>
> Below is the crash signature for reference:
>
> BUG: KASAN: null-ptr-deref in amdgpu_gem_va_ioctl+0x380/0x1130 [amdgpu]
> Write of size 4 at addr 0000000000000000 by task amd_bo
That sounds a bit like fallout from Pike's patch:
drm/amdgpu: fix syncobj leak for amdgpu_gem_va_ioctl()
It requires freeing the syncobj and chain
allocation resource.
Not sure what exactly goes wrong here.
Regards,
Christian.
>
> RIP: amdgpu_gem_va_ioctl+0x385/0x1130 [amdgpu]
> CR2: 0000000000000000
>
> I also tried to map the crash offset using gdb/objdump, but the results
> were not conclusive. The reported amdgpu_gem_va_ioctl+0x380 offset did
> not map cleanly to a single obvious source line.
>
> So at this point I can localize the crash to amdgpu_gem_va_ioctl(), but
> still need to identify the exact failing pointer/path.
>
>
> [ 325.779102]
> ==================================================================
> [ 325.786483] BUG: KASAN: null-ptr-deref in amdgpu_gem_va_ioctl+0x380/0x1130
> [amdgpu]
> [ 325.795105] Write of size 4 at addr 0000000000000000 by task amd_bo/7893
> [ 325.801997]
> [ 325.803595] CPU: 12 UID: 0 PID: 7893 Comm: amd_bo Not tainted
> 6.19.0-1314135.2.zuul.928a0cbbebc74c4f8d5a99a4d0a7ca55 #1 PREEMPT(voluntary)
> [ 325.803602] Hardware name: TYAN B8021G88V2HR-2T/S8021GM2NR-2T, BIOS
> V1.03.B10 04/01/2019
> [ 325.803606] Call Trace:
> [ 325.803609] <TASK>
> [ 325.803612] dump_stack_lvl+0x64/0x80
> [ 325.803623] kasan_report+0xb8/0xf0
> [ 325.803631] ? amdgpu_gem_va_ioctl+0x380/0x1130 [amdgpu]
> [ 325.804427] kasan_check_range+0x105/0x1b0
> [ 325.804432] amdgpu_gem_va_ioctl+0x380/0x1130 [amdgpu]
> [ 325.805229] ? __pfx_amdgpu_gem_create_ioctl+0x10/0x10 [amdgpu]
> [ 325.806022] ? __pfx_amdgpu_gem_va_ioctl+0x10/0x10 [amdgpu]
> [ 325.806815] ? __pfx___drm_dev_dbg+0x10/0x10 [drm]
> [ 325.806894] ? __pfx_amdgpu_gem_va_ioctl+0x10/0x10 [amdgpu]
> [ 325.807686] drm_ioctl_kernel+0x13d/0x2b0 [drm]
> [ 325.807767] ? __pfx_file_has_perm+0x10/0x10
> [ 325.807777] ? __pfx_drm_ioctl_kernel+0x10/0x10 [drm]
> [ 325.807857] drm_ioctl+0x4be/0xae0 [drm]
> [ 325.807936] ? __pfx_amdgpu_gem_va_ioctl+0x10/0x10 [amdgpu]
> [ 325.808728] ? __pfx_sock_write_iter+0x10/0x10
> [ 325.808737] ? __pfx_drm_ioctl+0x10/0x10 [drm]
> [ 325.808816] ? ioctl_has_perm.constprop.0.isra.0+0x2ad/0x490
> [ 325.808823] ? __pfx_ioctl_has_perm.constprop.0.isra.0+0x10/0x10
> [ 325.808827] ? _raw_spin_lock_irqsave+0x86/0xd0
> [ 325.808835] ? __pfx__raw_spin_lock_irqsave+0x10/0x10
> [ 325.808841] amdgpu_drm_ioctl+0xce/0x180 [amdgpu]
> [ 325.809622] __x64_sys_ioctl+0x139/0x1c0
> [ 325.809630] do_syscall_64+0x64/0x880
> [ 325.809638] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 325.809645] RIP: 0033:0x7f205fd12e1d
> [ 325.809650] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0
> 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2
> 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
> [ 325.809654] RSP: 002b:00007ffe9032b510 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000010
> [ 325.809660] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
> 00007f205fd12e1d
> [ 325.809663] RDX: 00007ffe9032b5b0 RSI: 00000000c0406448 RDI:
> 0000000000000006
> [ 325.809665] RBP: 00007ffe9032b560 R08: 0000000100000000 R09:
> 000000000000000e
> [ 325.809668] R10: 0000000000000000 R11: 0000000000000246 R12:
> 00000000c0406448
> [ 325.809670] R13: 0000000000000006 R14: 0000000000001000 R15:
> 0000000000000001
> [ 325.809675] </TASK>
> [ 325.809678]
> ==================================================================
> [ 326.029964] Disabling lock debugging due to kernel taint
> [ 326.035486] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [ 326.042557] #PF: supervisor write access in kernel mode
> [ 326.047887] #PF: error_code(0x0002) - not-present page
> [ 326.053132] PGD 0 P4D 0
> [ 326.055766] Oops: Oops: 0002 [#1] SMP KASAN NOPTI
> [ 326.060577] CPU: 12 UID: 0 PID: 7893 Comm: amd_bo Tainted: G B
> 6.19.0-1314135.2.zuul.928a0cbbebc74c4f8d5a99a4d0a7ca55 #1
> PREEMPT(voluntary)
> [ 326.074815] Tainted: [B]=BAD_PAGE
> [ 326.078233] Hardware name: TYAN B8021G88V2HR-2T/S8021GM2NR-2T, BIOS
> V1.03.B10 04/01/2019
> RIP: 0010:amdgpu_gem_va_ioctl+0x385/0x1130 [amdgpu]
> [ 326.093279] Code: 00 00 75 aa 85 c0 74 a6 41 89 c7 31 ed 45 31 f6 48 89 ef
> e8 dd bf 09 ce be 04 00 00 00 4c 89 f7 e8 90 0e 13 ce b8 ff ff ff ff <f0> 41
> 0f c1 06 83 f8 01 0f 84 3c 05 00 00 85 c0 0f 8e 75 05 00 00
> [ 326.112237] RSP: 0018:ffff88a0d02d7b60 EFLAGS: 00010246
> [ 326.117568] RAX: 00000000ffffffff RBX: ffff88907f0c2848 RCX:
> ffffffff8f43434a
> [ 326.124813] RDX: fffffbfff2a16c0d RSI: 0000000000000008 RDI:
> ffffffff950b6060
> [ 326.132056] RBP: 0000000000000000 R08: 0000000000000001 R09:
> fffffbfff2a16c0c
> [ 326.139303] R10: ffffffff950b6067 R11: 0000000000000001 R12:
> ffff88b1349d7778
> [ 326.146548] R13: ffff88a0d02d7c00 R14: 0000000000000000 R15:
> 0000000000000000
> [ 326.153794] FS: 00007f205dbad940(0000) GS:ffff88c00aa09000(0000)
> knlGS:0000000000000000
> [ 326.162023] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 326.167872] CR2: 0000000000000000 CR3: 000000207940e000 CR4:
> 00000000003506f0
> [ 326.175113] Call Trace:
> [ 326.177661] <TASK>
> [ 326.179861] ? __pfx_amdgpu_gem_create_ioctl+0x10/0x10 [amdgpu]
> [ 326.186637] ? __pfx_amdgpu_gem_va_ioctl+0x10/0x10 [amdgpu]
> [ 326.193168] ? __pfx___drm_dev_dbg+0x10/0x10 [drm]
> [ 326.198141] ? __pfx_amdgpu_gem_va_ioctl+0x10/0x10 [amdgpu]
> [ 326.204608] drm_ioctl_kernel+0x13d/0x2b0 [drm]
> [ 326.209319] ? __pfx_file_has_perm+0x10/0x10
> [ 326.213696] ? __pfx_drm_ioctl_kernel+0x10/0x10 [drm]
> [ 326.218934] drm_ioctl+0x4be/0xae0 [drm]
> [ 326.223109] ? __pfx_amdgpu_gem_va_ioctl+0x10/0x10 [amdgpu]
> [ 326.229576] ? __pfx_sock_write_iter+0x10/0x10
> [ 326.234130] ? __pfx_drm_ioctl+0x10/0x10 [drm]
> [ 326.238752] ? ioctl_has_perm.constprop.0.isra.0+0x2ad/0x490
> [ 326.244518] ? __pfx_ioctl_has_perm.constprop.0.isra.0+0x10/0x10
> [ 326.250630] ? _raw_spin_lock_irqsave+0x86/0xd0
> [ 326.255268] ? __pfx__raw_spin_lock_irqsave+0x10/0x10
> [ 326.260429] amdgpu_drm_ioctl+0xce/0x180 [amdgpu]
> [ 326.266018] __x64_sys_ioctl+0x139/0x1c0
> [ 326.270056] do_syscall_64+0x64/0x880
> [ 326.273827] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 326.278983] RIP: 0033:0x7f205fd12e1d
> [ 326.282660] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0
> 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2
> 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
> [ 326.301609] RSP: 002b:00007ffe9032b510 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000010
> [ 326.309316] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
> 00007f205fd12e1d
> [ 326.316560] RDX: 00007ffe9032b5b0 RSI: 00000000c0406448 RDI:
> 0000000000000006
> [ 326.323855] RBP: 00007ffe9032b560 R08: 0000000100000000 R09:
> 000000000000000e
> [ 326.331103] R10: 0000000000000000 R11: 0000000000000246 R12:
> 00000000c0406448
> [ 326.338347] R13: 0000000000000006 R14: 0000000000001000 R15:
> 0000000000000001
> [ 326.345595] </TASK>
>
> Thanks!
> Srini
>
>>
>> Regards,
>> Christian.
>>
>>> } else {
>>> bo_va = NULL;
>>> }
>