Hi Mario,

On 3/23/26 13:56, Mario Limonciello wrote:
>
>
> On 3/23/2026 4:13 AM, Christian König wrote:
>> Hi Mario,
>>
>> first of all please loop me in on TTM changes as maintainer explicitly. I
>> don't see everything which flies by on dri-devel.
>
> Sure.  I was initially just looking for anyone's comments on it, didn't think
> it was worth bubbling to top of your mailbox for an RFC.

I usually completely miss such stuff otherwise. I'm not very proud of it, but I
have a backlog of several thousand mailing list mails I couldn't look into.

>>
>> Then changing the 50% limit is an absolute NO-GO. It's completely
>> irrelevant that AI wants to use more, HPC use cases have complained about
>> that for decades, but we simply can't do that reliably.
>
> What does HPC do now when they need more?  Tell people to put page limit on
> the kernel command line?

Yes, either that or other similar workarounds.

> This shouldn't be any different than status quo before - except that user
> intent can persist.

The key point is that the system starts to become unstable when you go over
50%. We have tons of complaints about that as well from HPC customers.

The problem is that TTM's eviction code needs memory to swap GPU buffers out
to disk, that's why we use the 50% limit here.

Intel has been working on and provides an alternative shrinker callback (see
drivers/gpu/drm/xe/xe_shrinker.c) to work around that and so lift the 50%
limit. But so far that is only implemented for XE.

If you want to fix this for amdgpu, just take the xe_shrinker as an example
and implement the same thing for us as well.

Regards,
Christian.

>
>>
>> Regards,
>> Christian.
>>
>> On 3/20/26 15:34, Mario Limonciello wrote:
>>> I think there is actually a very easy way to trigger it and it's not
>>> obvious that a user messed it up.
>>>
>>> Assume you're on a 128GB system with VRAM set to 512MB.
>>> 1) Set TTM page limit corresponding to 96GB
>>> 2) Use uma_carveout sysfs or BIOS to set VRAM to 96GB
>>> 3) Reboot system
>>> 4) Now VRAM is 96GB, but the page limit was a module parameter and will be
>>> wrong.
>>>
>>> I actually /think/ that the RFC [1] I proposed a few weeks ago could be a
>>> good way to prevent this.  By using EFI variable instead, TTM could sanity
>>> check anything it reads at startup and save sane values to EFI for the next
>>> reboot (if they're insane).
>>>
>>> [1] https://lore.kernel.org/dri-devel/[email protected]/
>>>
>>> On 3/20/2026 9:28 AM, Zhang, Yifan wrote:
>>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>>
>>>> Yes, I agree. I’ve just been notified that this memory configuration is a
>>>> mistake rather than a valid use case. So the fix is low priority for now.
>>>>
>>>> -----Original Message-----
>>>> From: Limonciello, Mario <[email protected]>
>>>> Sent: Friday, March 20, 2026 11:14 AM
>>>> To: Zhang, Yifan <[email protected]>; [email protected]
>>>> Cc: Deucher, Alexander <[email protected]>; Koenig, Christian
>>>> <[email protected]>; Limonciello, Mario
>>>> <[email protected]>; Yuan, Perry <[email protected]>
>>>> Subject: Re: [PATCH v2] drm/amdkfd: check system memory when set
>>>> apu_prefer_gtt
>>>>
>>>>
>>>>
>>>> On 3/19/2026 2:32 AM, Yifan Zhang wrote:
>>>>> The current apu_prefer_gtt setting only checks gtt_size, which could be
>>>>> set by the user to a value larger than system memory (via the ttm module
>>>>> parameter pages_limit). E.g. carveout vram 32GB, gtt_size 50GB (via
>>>>> the ttm module parameter pages_limit), system memory 31GB. In that case,
>>>>> apu_prefer_gtt will be set incorrectly. Take system memory into
>>>>> account when setting apu_prefer_gtt.
>>>>>
>>>>
>>>> Wouldn't it be cleaner to do this in TTM?  IE test that a bad option was
>>>> set by user pages_limit value and then show something like:
>>>>
>>>> if (user > possible) {
>>>>         pr_warn("Requested invalid %d pages, limiting to %d pages", user, possible);
>>>>         user = possible;
>>>> }
>>>>
>>>> Then we can always trust what we get from TTM.
>>>>
>>>>> Signed-off-by: Yifan Zhang <[email protected]>
>>>>> ---
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c       | 2 --
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h       | 4 ++--
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 6 ++++--
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c          | 7 ++++++-
>>>>>   4 files changed, 12 insertions(+), 7 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>> index 3bfd79c89df3..a6ee9d9bfafb 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>> @@ -170,8 +170,6 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
>>>>>          int i;
>>>>>          int last_valid_bit;
>>>>>
>>>>> -        amdgpu_amdkfd_gpuvm_init_mem_limits();
>>>>> -
>>>>>          if (adev->kfd.dev) {
>>>>>                  struct kgd2kfd_shared_resources gpu_resources = {
>>>>>                          .compute_vmid_bitmap =
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>> index cdbab7f8cee8..13cada7da4a9 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>> @@ -369,7 +369,7 @@ u64 amdgpu_amdkfd_xcp_memory_size(struct amdgpu_device *adev, int xcp_id);
>>>>>
>>>>>
>>>>>  #if IS_ENABLED(CONFIG_HSA_AMD)
>>>>> -void amdgpu_amdkfd_gpuvm_init_mem_limits(void);
>>>>> +uint64_t amdgpu_amdkfd_gpuvm_init_mem_limits(void);
>>>>>  void amdgpu_amdkfd_gpuvm_destroy_cb(struct amdgpu_device *adev,
>>>>>                                       struct amdgpu_vm *vm);
>>>>>
>>>>> @@ -382,7 +382,7 @@ void amdgpu_amdkfd_release_notify(struct amdgpu_bo *bo);
>>>>>  void amdgpu_amdkfd_reserve_system_mem(uint64_t size);
>>>>>  #else
>>>>>  static inline
>>>>> -void amdgpu_amdkfd_gpuvm_init_mem_limits(void)
>>>>> +uint64_t amdgpu_amdkfd_gpuvm_init_mem_limits(void)
>>>>>  {
>>>>>  }
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>>> index 8a869fe41acd..4fba7d2f34a9 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>>> @@ -109,13 +109,13 @@ static bool reuse_dmamap(struct amdgpu_device *adev, struct amdgpu_device *bo_ad
>>>>>   *  System (TTM + userptr) memory - 15/16th System RAM
>>>>>   *  TTM memory - 3/8th System RAM
>>>>>   */
>>>>> -void amdgpu_amdkfd_gpuvm_init_mem_limits(void)
>>>>> +uint64_t amdgpu_amdkfd_gpuvm_init_mem_limits(void)
>>>>>  {
>>>>>          struct sysinfo si;
>>>>>          uint64_t mem;
>>>>>
>>>>>          if (kfd_mem_limit.max_system_mem_limit)
>>>>> -                return;
>>>>> +                return kfd_mem_limit.max_system_mem_limit;
>>>>>
>>>>>          si_meminfo(&si);
>>>>>          mem = si.totalram - si.totalhigh;
>>>>> @@ -132,6 +132,8 @@ void amdgpu_amdkfd_gpuvm_init_mem_limits(void)
>>>>>          pr_debug("Kernel memory limit %lluM, TTM limit %lluM\n",
>>>>>                  (kfd_mem_limit.max_system_mem_limit >> 20),
>>>>>                  (kfd_mem_limit.max_ttm_mem_limit >> 20));
>>>>> +
>>>>> +        return kfd_mem_limit.max_system_mem_limit;
>>>>>  }
>>>>>
>>>>>  void amdgpu_amdkfd_reserve_system_mem(uint64_t size)
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>>> index 714fd8d12ca5..df98ece071e1 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>>> @@ -2071,6 +2071,7 @@ static void amdgpu_ttm_buffer_entity_fini(struct amdgpu_gtt_mgr *mgr,
>>>>>  int amdgpu_ttm_init(struct amdgpu_device *adev)
>>>>>  {
>>>>>          uint64_t gtt_size;
>>>>> +        uint64_t max_system_mem_limit;
>>>>>          int r;
>>>>>
>>>>>          dma_set_max_seg_size(adev->dev, UINT_MAX);
>>>>> @@ -2210,8 +2211,12 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
>>>>>          dev_info(adev->dev, " %uM of GTT memory ready.\n",
>>>>>                   (unsigned int)(gtt_size / (1024 * 1024)));
>>>>>
>>>>> +
>>>>> +        max_system_mem_limit = amdgpu_amdkfd_gpuvm_init_mem_limits();
>>>>> +
>>>>>          if (adev->flags & AMD_IS_APU) {
>>>>> -                if (adev->gmc.real_vram_size < gtt_size)
>>>>> +                if (adev->gmc.real_vram_size < gtt_size &&
>>>>> +                        adev->gmc.real_vram_size < max_system_mem_limit)
>>>>>                          adev->apu_prefer_gtt = true;
>>>>>          }
>>>>>
>>>>
>>>
>>
>
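
For reference, the two checks discussed in the quoted thread can be sketched
in plain userspace C. This is only an illustration, not kernel code and not a
proposed patch: clamp_pages_limit() and apu_prefers_gtt() are hypothetical
names, fprintf() stands in for pr_warn(), and what counts as "possible" is
left to the caller. The first helper follows the clamp suggested for TTM, the
second restates the condition from the quoted amdgpu_ttm.c hunk as a pure
function.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: clamp a user-requested page limit to the number of
 * pages the caller considers possible, warning when the request is reduced.
 */
uint64_t clamp_pages_limit(uint64_t user, uint64_t possible)
{
	if (user > possible) {
		/* Stand-in for pr_warn() in this userspace sketch. */
		fprintf(stderr,
			"Requested invalid %llu pages, limiting to %llu pages\n",
			(unsigned long long)user,
			(unsigned long long)possible);
		user = possible;
	}
	return user;
}

/*
 * Hypothetical helper mirroring the patched condition: prefer GTT only when
 * VRAM is smaller than both the GTT size and the system memory limit.
 * All sizes are in bytes.
 */
bool apu_prefers_gtt(uint64_t real_vram_size, uint64_t gtt_size,
		     uint64_t max_system_mem_limit)
{
	return real_vram_size < gtt_size &&
	       real_vram_size < max_system_mem_limit;
}
```

With the numbers from the commit message (32GB carveout, 50GB gtt_size, and
31GB taken as the system memory limit), apu_prefers_gtt() returns false,
whereas checking gtt_size alone would have returned true.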
