Hi Janusz,

On Fri, May 08, 2026 at 02:23:51PM +0200, Janusz Krzysztofik wrote:
> TLDR: The bo->ttm object might be changed by calling ttm_bo_validate(),
> move casting it to an i915_tt object later to actually get the right
> pointer.
>
> A user reported hitting the following bug under heavy use on DG2:
>
> [26620.095550] Oops: general protection fault, probably for non-canonical
> address 0xa56b6b6b6b6b6b8b: 0000 1 SMP NOPTI
> [26620.095556] CPU: 2 UID: 0 PID: 631 Comm: Xorg Not tainted 6.18.8 #1
> PREEMPT(lazy)
> [26620.095558] Hardware name: ASRock B850M Steel Legend WiFi/B850M Steel
> Legend WiFi, BIOS 3.50 09/18/2025
> [26620.095559] RIP: 0010:i915_ttm_purge+0x84/0x100 [i915]
> [26620.095604] Code: 00 00 00 48 8d 54 24 10 48 89 e6 48 89 fb e8 83 aa ae ff
> 85 c0 75 6f 48 83 bb a8 01 00 00 00 74 2c 48 8b 45 78 48 85 c0 74 23 <48> 8b
> 78 20 48 c7 c2 ff ff ff ff 31 f6 e8 7a 73 e3 e0 48 8b 7d 78
> [26620.095605] RSP: 0018:ffffc90005fd7430 EFLAGS: 00010282
> [26620.095607] RAX: a56b6b6b6b6b6b6b RBX: ffff8881f46c3dc0 RCX:
> 0000000000000000
> [26620.095608] RDX: 0000000000000000 RSI: 0000000000000246 RDI:
> 00000000ffffffff
> [26620.095609] RBP: ffff888289610f00 R08: 0000000000000001 R09:
> ffff88823b022000
> [26620.095609] R10: ffff888103029b28 R11: ffff8881fc7f3800 R12:
> ffff88810b6150d0
> [26620.095609] R13: ffff888289610f00 R14: 0000000000000000 R15:
> ffff8881f46c3dc0
> [26620.095610] FS: 00007f1004d86900(0000) GS:ffff88901c858000(0000)
> knlGS:0000000000000000
> [26620.095611] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [26620.095611] CR2: 00007f0fdf489000 CR3: 000000035b0c1000 CR4:
> 0000000000750ef0
> [26620.095612] PKRU: 55555554
> [26620.095612] Call Trace:
> [26620.095615] <TASK>
> [26620.095615] i915_ttm_move+0x2b9/0x420 [i915]
> [26620.095642] ? ttm_tt_init+0x65/0x80 [ttm]
> [26620.095644] ? i915_ttm_tt_create+0xc6/0x150 [i915]
> [26620.095667] ttm_bo_handle_move_mem+0xb6/0x160 [ttm]
> [26620.095669] ttm_bo_evict+0x100/0x150 [ttm]
> [26620.095671] ? preempt_count_add+0x64/0xa0
> [26620.095673] ? _raw_spin_lock+0xe/0x30
> [26620.095675] ? _raw_spin_unlock+0xd/0x30
> [26620.095675] ? i915_gem_object_evictable+0xb7/0xd0 [i915]
> [26620.095704] ttm_bo_evict_cb+0x6e/0xd0 [ttm]
> [26620.095705] ttm_lru_walk_for_evict+0xa6/0x200 [ttm]
> [26620.095708] ttm_bo_alloc_resource+0x185/0x4f0 [ttm]
> [26620.095709] ? init_object+0x62/0xd0
> [26620.095712] ttm_bo_validate+0x7a/0x180 [ttm]
> [26620.095713] ? _raw_spin_unlock_irqrestore+0x16/0x30
> [26620.095714] __i915_ttm_get_pages+0xb0/0x170 [i915]
> [26620.095737] i915_ttm_get_pages+0x9f/0x150 [i915]
> [26620.095759] ? i915_gem_do_execbuffer+0xedc/0x2b40 [i915]
> [26620.095786] ? alloc_debug_processing+0xd0/0x100
> [26620.095787] ? _raw_spin_unlock_irqrestore+0x16/0x30
> [26620.095788] ? i915_vma_instance+0xa0/0x4e0 [i915]
> [26620.095822] __i915_gem_object_get_pages+0x2f/0x40 [i915]
> [26620.095848] i915_vma_pin_ww+0x706/0x980 [i915]
> [26620.095875] ? i915_gem_do_execbuffer+0xedc/0x2b40 [i915]
> [26620.095904] eb_validate_vmas+0x170/0xa00 [i915]
> [26620.095930] i915_gem_do_execbuffer+0x1201/0x2b40 [i915]
> [26620.095953] ? alloc_debug_processing+0xd0/0x100
> [26620.095954] ? _raw_spin_unlock_irqrestore+0x16/0x30
> [26620.095955] ? i915_gem_execbuffer2_ioctl+0xc9/0x240 [i915]
> [26620.095977] ? __wake_up_sync_key+0x32/0x50
> [26620.095979] ? i915_gem_execbuffer2_ioctl+0xc9/0x240 [i915]
> [26620.096001] ? __slab_alloc.isra.0+0x67/0xc0
> [26620.096003] i915_gem_execbuffer2_ioctl+0x11a/0x240 [i915]
>
> Results from decode_stacktrace.sh pointed to dereference of a file pointer
> field of a i915 TTM page vector container associated with an object being
> purged on eviction. That path is taken when the object is marked as no
> longer needed.
>
> Code analysis revealed a possibility of the i915 TTM page vector container
> being replaced with a new instance inside a function that purges content
> of the object, should it be still busy. That function is called,
> indirectly via a more general function that changes the object's placement
> and caching policy, before the problematic dereference, but still after
> a pointer to the container is captured, rendering the pointer no longer
> valid.
>
> Fix the issue by capturing the pointer to the container only after its
> potential replacement.
>
> v2: Move the container_of() inside the if block (Sebastian),
> - a simplified version of the commit description that explains briefly
> why the change is necessary (Christian).
>
> Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/work_items/14882
> Fixes: 7ae034590ceae ("drm/i915/ttm: add tt shmem backend")
> Signed-off-by: Janusz Krzysztofik <[email protected]>
> Cc: [email protected] # v5.17+
> Cc: Matthew Auld <[email protected]>
> Cc: "Thomas Hellström" <[email protected]>
> Cc: Sebastian Brzezinka <[email protected]>
> Cc: "Christian König" <[email protected]>

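For anyone following along, the ordering issue described above can be
sketched in plain userspace C. This is only an illustration under stated
assumptions, not the i915 code: struct tt, struct wrapped_tt, struct bo,
tt_create(), validate() and purge() are hypothetical stand-ins, and
validate() merely mimics a call that may free and replace bo->ttm. What it
shows is the container pointer being derived only after that call returns.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace equivalent of the kernel's container_of() helper. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-ins, not the real TTM/i915 structures. */
struct tt { int pages; };
struct wrapped_tt { int id; struct tt base; };
struct bo { struct tt *ttm; };

static struct tt *tt_create(int id)
{
	struct wrapped_tt *w = calloc(1, sizeof(*w));

	assert(w);
	w->id = id;
	return &w->base;
}

/*
 * Mimics a validate-style call that may replace bo->ttm. Any container
 * pointer derived from the old bo->ttm before this call would dangle.
 */
static int validate(struct bo *bo, int replace, int new_id)
{
	if (replace) {
		free(container_of(bo->ttm, struct wrapped_tt, base));
		bo->ttm = tt_create(new_id);
	}
	return 0;
}

/* Derive the container only after the potential replacement. */
static int purge(struct bo *bo, int replace)
{
	int ret = validate(bo, replace, 2);

	if (ret)
		return ret;

	if (bo->ttm) {
		struct wrapped_tt *w =
			container_of(bo->ttm, struct wrapped_tt, base);

		return w->id; /* always reads the current container */
	}
	return -1;
}
```

Had purge() computed w at the top of the function, the replace case would
read freed memory, which matches the poisoned RAX value in the trace.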
Reviewed-by: Andi Shyti <[email protected]>

Thanks,
Andi
