On 5/9/26 12:20, Chengjun Yao wrote:
> commit 9c85025c7ac2 ("drm/amdgpu: nuke amdgpu_userq_fence_slab v2") removed
> the dedicated slab for userq fences along with the rcu_barrier() call that
> was in amdgpu_userq_fence_slab_fini(). However, the amdgpu module still
> registers RCU callbacks via call_rcu() in amdgpu_userq_fence_release() and
> amdgpu_fence_release(). Without rcu_barrier(), pending RCU callbacks can
> reference freed module text after the module is unloaded, causing a page
> fault in rcu_do_batch():
>
> BUG: unable to handle page fault for address: ffffffffc115e910
> RIP: 0010:0xffffffffc115e910
> Call Trace:
> <IRQ>
> rcu_do_batch+0x1c4/0x7f0
> rcu_core+0x14d/0x330
> handle_softirqs+0xd0/0x2b0
>
> Add rcu_barrier() to amdgpu_exit() to ensure all pending RCU callbacks
> have completed before the module code pages are freed.

That is just papering over the fact that we don't support module unload with
our amd-staging-drm-next tree in the first place.

The fence code can crash even with that RCU barrier at the moment.

The patches to allow this are still not backmerged from upstream since they
went into the Linux kernel through a different path.

Regards,
Christian.

>
> Fixes: 9c85025c7ac2 ("drm/amdgpu: nuke amdgpu_userq_fence_slab v2")
> Signed-off-by: Chengjun Yao <[email protected]>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 99688391e70b..e9681eea122c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -3193,6 +3193,7 @@ static void __exit amdgpu_exit(void)
>  	amdgpu_unregister_atpx_handler();
>  	amdgpu_acpi_release();
>  	amdgpu_sync_fini();
> +	rcu_barrier();
>  	mmu_notifier_synchronize();
>  	amdgpu_xcp_drv_release();
>  }
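
For anyone following the thread, the ordering argument in the commit message
(callbacks queued with call_rcu() may still be pending when the module text
goes away) can be modelled in user space. The following is only a toy sketch
under stated assumptions: every toy_* name is hypothetical, the queue is a
stand-in for the real RCU machinery, and nothing here is amdgpu or kernel
code. It shows why draining the queue before exit leaves no pending callbacks;
it does not show that a barrier alone makes amdgpu unload safe, which the
reply above disputes.

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for struct rcu_head: one callback queued for later. */
struct toy_deferred {
	struct toy_deferred *next;
	void (*func)(struct toy_deferred *);
};

static struct toy_deferred *pending_head; /* queued, not yet run */
static int pending_count;
static int freed_count;

/* Toy model of call_rcu(): queue the callback instead of running it now. */
static void toy_call_deferred(struct toy_deferred *d,
			      void (*func)(struct toy_deferred *))
{
	d->func = func;
	d->next = pending_head;
	pending_head = d;
	pending_count++;
}

/* Toy model of rcu_barrier(): run every callback that is already queued. */
static void toy_barrier(void)
{
	while (pending_head) {
		struct toy_deferred *d = pending_head;

		/* Unlink first: the callback frees the memory holding d. */
		pending_head = d->next;
		pending_count--;
		d->func(d);
	}
}

/* Hypothetical fence object with an embedded deferred-callback head. */
struct toy_fence {
	int seqno;
	struct toy_deferred rcu;
};

/* Deferred free, analogous in spirit to a fence free callback. */
static void toy_fence_free(struct toy_deferred *d)
{
	struct toy_fence *f = (struct toy_fence *)
		((char *)d - offsetof(struct toy_fence, rcu));

	free(f);
	freed_count++;
}

/* Allocate a fence and immediately queue its deferred free. */
static int toy_fence_release(int seqno)
{
	struct toy_fence *f = malloc(sizeof(*f));

	if (!f)
		return -1;
	f->seqno = seqno;
	toy_call_deferred(&f->rcu, toy_fence_free);
	return 0;
}

/*
 * Returns how many callbacks are still queued at the point where the
 * module text would be freed. Anything non-zero models a callback that
 * would later jump into unmapped code.
 */
static int toy_module_exit(int use_barrier)
{
	if (use_barrier)
		toy_barrier();
	return pending_count;
}
```

Calling toy_fence_release() twice and then toy_module_exit(0) reports two
callbacks still queued, while toy_module_exit(1) drains them first and reports
none.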