On 3/23/2026 4:13 AM, Christian König wrote:
Hi Mario,

first of all, please loop me in on TTM changes explicitly as maintainer. I 
don't see everything which flies by on dri-devel.

Sure. I was initially just looking for anyone's comments on it; I didn't think it was worth bubbling to the top of your mailbox for an RFC.


Then changing the 50% limit is an absolute NO-GO. It's completely irrelevant 
that AI wants to use more; HPC use cases have complained about that for decades, 
but we simply can't do that reliably.

What does HPC do now when they need more? Tell people to put a page limit on the kernel command line? This shouldn't be any different from the status quo, except that user intent can persist.


Regards,
Christian.

On 3/20/26 15:34, Mario Limonciello wrote:
I think there is actually a very easy way to trigger it and it's not obvious 
that a user messed it up.

Assume you're on a 128GB system with VRAM set to 512MB.
1) Set TTM page limit corresponding to 96GB
2) Use uma_carveout sysfs or BIOS to set VRAM to 96GB
3) Reboot system
4) Now VRAM is 96GB, but the page limit was a module parameter and will be 
wrong.
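
To put numbers on the steps above, here is a rough sketch of the arithmetic. It assumes 4 KiB pages and ignores firmware and kernel reservations; the helper names are made up for illustration and are not TTM or amdgpu functions.

```c
#include <stdbool.h>
#include <stdint.h>

#define GIB (1024ULL * 1024ULL * 1024ULL)
#define PAGE_BYTES 4096ULL /* assumption: 4 KiB pages */

/* Convert a byte count into a number of pages. */
static uint64_t bytes_to_pages(uint64_t bytes)
{
	return bytes / PAGE_BYTES;
}

/*
 * Pages left to the OS once the carveout is taken out of installed memory.
 * Assumption: firmware and kernel reservations are ignored.
 */
static uint64_t system_pages(uint64_t installed_bytes, uint64_t carveout_bytes)
{
	return bytes_to_pages(installed_bytes - carveout_bytes);
}

/* True when a previously chosen page limit exceeds what the OS now has. */
static bool limit_is_stale(uint64_t pages_limit, uint64_t installed_bytes,
			   uint64_t carveout_bytes)
{
	return pages_limit > system_pages(installed_bytes, carveout_bytes);
}
```

Under those assumptions, a 96GB limit is 25165824 pages. That fits while the carveout is 512MB, but after the reboot only 8388608 pages (32GB) are left to the OS, so the stored limit is three times larger than system memory.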

I actually /think/ that the RFC [1] I proposed a few weeks ago could be a good 
way to prevent this.  By using an EFI variable instead, TTM could sanity check 
anything it reads at startup and save sane values to EFI for the next reboot 
(if they're insane).

[1] https://lore.kernel.org/dri-devel/[email protected]/

On 3/20/2026 9:28 AM, Zhang, Yifan wrote:
[AMD Official Use Only - AMD Internal Distribution Only]

Yes, I agree. I’ve just been notified that this memory configuration is a 
mistake rather than a valid use case. So the fix is low priority for now.

-----Original Message-----
From: Limonciello, Mario <[email protected]>
Sent: Friday, March 20, 2026 11:14 AM
To: Zhang, Yifan <[email protected]>; [email protected]
Cc: Deucher, Alexander <[email protected]>; Koenig, Christian 
<[email protected]>; Limonciello, Mario <[email protected]>; Yuan, Perry 
<[email protected]>
Subject: Re: [PATCH v2] drm/amdkfd: check system memory when set apu_prefer_gtt



On 3/19/2026 2:32 AM, Yifan Zhang wrote:
The current apu_prefer_gtt setting only checks gtt_size, which the user
can set to a value larger than system memory (via the ttm module
parameter pages_limit). E.g. carveout VRAM 32GB, gtt_size 50GB (via
the ttm module parameter pages_limit), system memory 31GB. In that case,
apu_prefer_gtt will be set incorrectly. Take system memory into
account when setting apu_prefer_gtt.
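
As a rough illustration of the example in this commit message, the sketch below compares the check before and after the patch. The function names are hypothetical stand-ins, not the driver's code, and the 15/16 factor is an assumption taken from the "System (TTM + userptr) memory - 15/16th System RAM" comment quoted in the diff.

```c
#include <stdbool.h>
#include <stdint.h>

#define GIB (1024ULL * 1024ULL * 1024ULL)

/* Check before the patch: only compares carveout VRAM with gtt_size. */
static bool prefer_gtt_old(uint64_t real_vram_size, uint64_t gtt_size)
{
	return real_vram_size < gtt_size;
}

/* Check after the patch: VRAM must also be below the system memory limit. */
static bool prefer_gtt_new(uint64_t real_vram_size, uint64_t gtt_size,
			   uint64_t max_system_mem_limit)
{
	return real_vram_size < gtt_size &&
	       real_vram_size < max_system_mem_limit;
}

/* Assumption: the limit is 15/16 of system RAM, per the quoted comment. */
static uint64_t system_mem_limit(uint64_t system_ram)
{
	return system_ram - (system_ram >> 4);
}
```

With 32GB of carveout VRAM, a 50GB gtt_size and 31GB of system memory, the old check returns true while the new one returns false, because 15/16 of 31GB is below 32GB.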


Wouldn't it be cleaner to do this in TTM?  I.e. check whether the user set an 
invalid pages_limit value and, if so, do something like:

if (user > possible) {
       pr_warn("Requested invalid %d pages, limiting to %d pages\n", user, possible);
       user = possible;
}

Then we can always trust what we get from TTM.
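
A minimal userspace sketch of that clamp, for illustration only. clamp_pages_limit() is a made-up name, fprintf() stands in for pr_warn(), and unsigned long is an assumed type for the page counts; none of this is existing TTM code.

```c
#include <stdio.h>

/*
 * Return the page limit to use: the user's request if it fits,
 * otherwise the largest possible value, with a warning.
 */
static unsigned long clamp_pages_limit(unsigned long user,
				       unsigned long possible)
{
	if (user > possible) {
		fprintf(stderr,
			"Requested invalid %lu pages, limiting to %lu pages\n",
			user, possible);
		return possible;
	}
	return user;
}
```

Returning the clamped value, rather than only warning, is what would let callers such as amdgpu rely on the limit without repeating the check.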

Signed-off-by: Yifan Zhang <[email protected]>
---
    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c       | 2 --
    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h       | 4 ++--
    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 6 ++++--
    drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c          | 7 ++++++-
    4 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 3bfd79c89df3..a6ee9d9bfafb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -170,8 +170,6 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
        int i;
        int last_valid_bit;

-     amdgpu_amdkfd_gpuvm_init_mem_limits();
-
        if (adev->kfd.dev) {
                struct kgd2kfd_shared_resources gpu_resources = {
                        .compute_vmid_bitmap =
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index cdbab7f8cee8..13cada7da4a9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -369,7 +369,7 @@ u64 amdgpu_amdkfd_xcp_memory_size(struct amdgpu_device *adev, int xcp_id);


    #if IS_ENABLED(CONFIG_HSA_AMD)
-void amdgpu_amdkfd_gpuvm_init_mem_limits(void);
+uint64_t amdgpu_amdkfd_gpuvm_init_mem_limits(void);
    void amdgpu_amdkfd_gpuvm_destroy_cb(struct amdgpu_device *adev,
                                struct amdgpu_vm *vm);

@@ -382,7 +382,7 @@ void amdgpu_amdkfd_release_notify(struct amdgpu_bo *bo);
    void amdgpu_amdkfd_reserve_system_mem(uint64_t size);
    #else
    static inline
-void amdgpu_amdkfd_gpuvm_init_mem_limits(void)
+uint64_t amdgpu_amdkfd_gpuvm_init_mem_limits(void)
    {
    }

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 8a869fe41acd..4fba7d2f34a9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -109,13 +109,13 @@ static bool reuse_dmamap(struct amdgpu_device *adev, struct amdgpu_device *bo_ad
     *  System (TTM + userptr) memory - 15/16th System RAM
     *  TTM memory - 3/8th System RAM
     */
-void amdgpu_amdkfd_gpuvm_init_mem_limits(void)
+uint64_t amdgpu_amdkfd_gpuvm_init_mem_limits(void)
    {
        struct sysinfo si;
        uint64_t mem;

        if (kfd_mem_limit.max_system_mem_limit)
-             return;
+             return kfd_mem_limit.max_system_mem_limit;

        si_meminfo(&si);
        mem = si.totalram - si.totalhigh;
@@ -132,6 +132,8 @@ void amdgpu_amdkfd_gpuvm_init_mem_limits(void)
        pr_debug("Kernel memory limit %lluM, TTM limit %lluM\n",
                (kfd_mem_limit.max_system_mem_limit >> 20),
                (kfd_mem_limit.max_ttm_mem_limit >> 20));
+
+     return kfd_mem_limit.max_system_mem_limit;
    }

    void amdgpu_amdkfd_reserve_system_mem(uint64_t size)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 714fd8d12ca5..df98ece071e1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -2071,6 +2071,7 @@ static void amdgpu_ttm_buffer_entity_fini(struct amdgpu_gtt_mgr *mgr,
    int amdgpu_ttm_init(struct amdgpu_device *adev)
    {
        uint64_t gtt_size;
+     uint64_t max_system_mem_limit;
        int r;

        dma_set_max_seg_size(adev->dev, UINT_MAX);
@@ -2210,8 +2211,12 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
        dev_info(adev->dev, " %uM of GTT memory ready.\n",
                 (unsigned int)(gtt_size / (1024 * 1024)));

+
+     max_system_mem_limit = amdgpu_amdkfd_gpuvm_init_mem_limits();
+
        if (adev->flags & AMD_IS_APU) {
-             if (adev->gmc.real_vram_size < gtt_size)
+             if (adev->gmc.real_vram_size < gtt_size &&
+                     adev->gmc.real_vram_size < max_system_mem_limit)
                        adev->apu_prefer_gtt = true;
        }




