Thank you for your contribution! Sashiko AI review found 2 potential issue(s) 
to consider:
- [Critical] The newly registered dmem cgroup region is never assigned to 
`mgr->cg_region`, causing a memory leak and leaving dangling reclaim callbacks 
that lead to a Use-After-Free.
- [High] Calling `dmem_cgroup_unregister_region()` before 
`ttm_resource_manager_set_used(man, false)` creates a race window where 
concurrent allocations can dereference a freed cgroup region.
--

commit 02aedad5c82131466a2fb9b6bba851f053487de8
Author: Thomas Hellström <[email protected]>

drm/amdgpu: Wire up dmem cgroup reclaim for VRAM manager

Register the VRAM manager with the dmem cgroup reclaim infrastructure
so that lowering dmem.max below current VRAM usage triggers TTM
eviction rather than failing with -EBUSY.

> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> index 08f05c3aed1d5..ee98b963e84a0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> @@ -933,12 +938,15 @@ int amdgpu_vram_mgr_init(struct amdgpu_device *adev)
>       if (err)
>               return err;
>  
> -     man->cg = drmm_cgroup_register_region(adev_to_drm(adev), "vram",
> -                                           &(struct dmem_cgroup_init){
> -                                             .size = 
> adev->gmc.real_vram_size,
> -                                           });
> -     if (IS_ERR(man->cg))
> -             return PTR_ERR(man->cg);
> +     cg = dmem_cgroup_register_region(&(struct dmem_cgroup_init){
> +                                          .size = adev->gmc.real_vram_size,
> +                                          .ops = &amdgpu_vram_mgr_dmem_ops,
> +                                          .reclaim_priv = man,
> +                                      }, "vram");
> +     if (IS_ERR(cg))
> +             return PTR_ERR(cg);
> +
> +     ttm_resource_manager_set_dmem_region(man, cg);

[Severity: Critical]
Is cg missing an assignment to mgr->cg_region here?

Because mgr->cg_region is zero-initialized and never updated,
amdgpu_vram_mgr_fini() will later pass NULL to dmem_cgroup_unregister_region().
This causes the cgroup region to leak and remain on the global list.

If cgroup reclaim is triggered after the driver unbinds and the man pointer
is freed, will the callback execute and dereference reclaim_priv, resulting
in a use-after-free?

>       ttm_set_driver_manager(&adev->mman.bdev, TTM_PL_VRAM, &mgr->manager);
>       ttm_resource_manager_set_used(man, true);

[ ... ]

> @@ -960,6 +968,15 @@ void amdgpu_vram_mgr_fini(struct amdgpu_device *adev)
>       int ret;
>       struct amdgpu_vram_reservation *rsv, *temp;
>  
> +     /*
> +      * Drain any in-flight dmem cgroup reclaim callbacks and remove the
> +      * region from the global list before tearing down the manager.
> +      * This must happen first so no reclaim callback can access the
> +      * manager after this point.
> +      */
> +     dmem_cgroup_unregister_region(mgr->cg_region);
> +     mgr->cg_region = NULL;
> +
>       ttm_resource_manager_set_used(man, false);

[Severity: High]
Does unregistering the cgroup region before calling
ttm_resource_manager_set_used(man, false) create a race condition with
concurrent BO allocations?

Assuming mgr->cg_region was properly set, dmem_cgroup_unregister_region()
will remove the region from global lists and drop its references. Meanwhile,
the manager remains active (man->use_type == true).

Because amdgpu_vram_mgr_fini() runs during driver unbind where concurrent
ioctls may still be executing, could a concurrent BO allocation pass the check
in ttm_bo_mem_space() and call dmem_cgroup_try_charge(man->cg, ...)?

This would access the unregistered region and cause a use-after-free. Should
ttm_resource_manager_set_used(man, false) be executed before unregistering
the region?

-- 
Sashiko AI review · 
https://sashiko.dev/#/patchset/[email protected]?part=6

Reply via email to