On 19.09.25 12:42, Tvrtko Ursulin wrote:
>
> On 19/09/2025 11:17, Christian König wrote:
>> On 19.09.25 10:46, Tvrtko Ursulin wrote:
>>>
>>> On 19/09/2025 09:01, Christian König wrote:
>>>> On 19.09.25 09:43, Tvrtko Ursulin wrote:
>>>>> On 19/09/2025 07:46, Christian König wrote:
>>>>>> On 18.09.25 22:09, Thadeu Lima de Souza Cascardo wrote:
>>>>>>> On certain workloads, like on ChromeOS when opening multiple tabs and windows and switching desktops, memory pressure can build up and latency is observed as high-order allocations result in memory reclaim. This was observed when running on amdgpu.
>>>>>>>
>>>>>>> This is caused by TTM pool allocations, and turning off direct reclaim when doing those higher-order allocations leads to lower memory pressure.
>>>>>>>
>>>>>>> Since turning direct reclaim off might also lead to lower throughput, make it tunable, both as a module parameter that can be changed in sysfs and as a flag when allocating a GEM object.
>>>>>>>
>>>>>>> A latency option will avoid direct reclaim for higher-order allocations.
>>>>>>>
>>>>>>> The throughput option could later be used to compact or reclaim pages more aggressively, by not using __GFP_NORETRY.
>>>>>>
>>>>>> Well, I can only repeat it: at least for amdgpu this is a clear NAK from my side.
>>>>>>
>>>>>> The ability to allocate huge pages is a must-have for the driver.
>>>>>
>>>>> Disclaimer that I wouldn't go system-wide but per device - so somewhere in sysfs rather than a modparam. That kind of a toggle would not sound problematic to me since it leaves the policy outside the kernel and allows people to tune to their liking.
>>>>
>>>> Yeah, I've also written before that when this is somehow beneficial for nouveau (for example), then I don't have any problem with making the policy device dependent.
>>>>
>>>> But for amdgpu we have had so many bad experiences with this approach that I absolutely can't accept that.
>>>>
>>>>> One side question though - does AMD benefit from larger than 2MiB contiguous blocks? IIUC the maximum PTE is 2MiB so maybe not? In which case it may make sense to add some TTM API letting drivers tell the pool allocator what is the maximum order to bother with. Larger than that may have diminishing benefit relative to the disproportionate pressure it puts on the memory allocator and reclaim.
>>>>
>>>> Using 1GiB allocations would allow the page tables to skip another layer on AMD GPUs, but most of the benefit is between 4KiB and 2MiB since that range can be handled more efficiently by the L1. Having 2MiB allocations then also has an additional benefit for the L2.
>>>>
>>>> Apart from performance, on AMD GPUs there are also some HW features which only work with huge pages; e.g. on some laptops you can get flickering on the display if the scanout buffer is backed by too many small pages.
>>>>
>>>> NVidia used to work on 1GiB allocations, which as far as I know was the kickoff for the whole ongoing switch to using folios instead of pages. And from reading publicly available documentation I have the impression that NVidia GPUs work more or less the same as AMD GPUs regarding the TLB.
>>>
>>> 1GiB is beyond the TTM pool allocator scope, right?
>>
>> Yes, on 64-bit x86 the pool allocator can allocate at most 2MiB by default IIRC.
>
> I think 10 is the max order so 4MiB. So it wouldn't be much relief to the allocator, but better than nothing(tm)?

Good point, that can certainly be.
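For completeness, the sizes we keep throwing around here are just the order arithmetic, assuming the usual 4KiB base pages and the default maximum order of 10 - both of which are arch/Kconfig dependent, so treat the numbers as the common x86-64 case rather than anything TTM guarantees:

/*
 * Standalone illustration only, not from any of the patches: how an
 * allocation order maps to an allocation size. PAGE_SHIFT and
 * MAX_PAGE_ORDER are hardcoded to the usual x86-64 defaults so this
 * builds in userspace.
 */
#include <stdio.h>

#define PAGE_SHIFT     12   /* 4 KiB base pages */
#define MAX_PAGE_ORDER 10   /* default, tunable via CONFIG_ARCH_FORCE_MAX_ORDER */

int main(void)
{
	unsigned int order;

	for (order = 0; order <= MAX_PAGE_ORDER; order++)
		printf("order %2u -> %5lu KiB\n",
		       order, (1UL << order << PAGE_SHIFT) >> 10);

	/*
	 * Order 9 is the 2MiB huge page amdgpu cares about (what
	 * get_order(2 * SZ_1M) evaluates to in the diff further down),
	 * order 10 is the 4MiB maximum mentioned above.
	 */
	return 0;
}

So with the defaults the gap between what the allocator can hand out at most and what amdgpu actually wants is exactly one order.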

>>> From what you wrote it sounds like my idea would actually be okay: a very gentle approach (minimal change in behaviour) to only disable direct reclaim above the threshold set by the driver.
>>
>> Well, the problem is that the threshold set by amdgpu would be 2MiB and by default there isn't anything above it on x86. So that would be a no-op. On ARM64 that idea could potentially help.
>
> Some architectures appear to default to more than 10, and some offer a Kconfig option to change the default.

x86 also has a Kconfig option for that IIRC. So yeah, your idea is not that bad if somebody has adjusted that setting.

> I think this means that in the patch I proposed I am missing a min(MAX_PAGE_ORDER, max_beneficial_order) when setting the pool property.
>
>> I could look into the HW documentation again for what we would need as a minimum for functional correctness, but there are quite a number of use cases, and lowering from 2MiB to something like 256KiB or 512KiB potentially won't really help and would still cause a number of performance issues in the L2.
>
> It would be very good if you could check the requirements regarding functional correctness. Could that also differ per generation/part, and if so, maybe it should also be made configurable in the ttm_pool API, as an order below which it is better to fail instead of moving to a lower order?

I do remember that it used to be 256KiB on some really old parts, but that is for >10 year old HW. For anything newer, 2MiB has always been what we have been testing with.

Failing lower-order allocations is also something we have tested, but that resulted in a lot of unhappy people as well.

Regards,
Christian.

> Regards,
>
> Tvrtko
>
>>> Along the lines of:
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>> index 428265046815..06b243f05edd 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>> @@ -1824,7 +1824,7 @@ static int amdgpu_ttm_pools_init(struct amdgpu_device *adev)
>>>  	for (i = 0; i < adev->gmc.num_mem_partitions; i++) {
>>>  		ttm_pool_init(&adev->mman.ttm_pools[i], adev->dev,
>>>  			      adev->gmc.mem_partitions[i].numa.node,
>>> -			      false, false);
>>> +			      false, false, get_order(2 * SZ_1M));
>>>  	}
>>>  	return 0;
>>>  }
>>> @@ -1865,7 +1865,8 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
>>>  			       adev_to_drm(adev)->anon_inode->i_mapping,
>>>  			       adev_to_drm(adev)->vma_offset_manager,
>>>  			       adev->need_swiotlb,
>>> -			       dma_addressing_limited(adev->dev));
>>> +			       dma_addressing_limited(adev->dev),
>>> +			       get_order(2 * SZ_1M));
>>>  	if (r) {
>>>  		dev_err(adev->dev,
>>>  			"failed initializing buffer object driver(%d).\n", r);
>>> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
>>> index baf27c70a419..5d54e8373230 100644
>>> --- a/drivers/gpu/drm/ttm/ttm_pool.c
>>> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
>>> @@ -726,8 +726,12 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
>>>
>>>  	page_caching = tt->caching;
>>>  	allow_pools = true;
>>> -	for (order = ttm_pool_alloc_find_order(MAX_PAGE_ORDER, alloc);
>>> -	     alloc->remaining_pages;
>>> +
>>> +	order = ttm_pool_alloc_find_order(MAX_PAGE_ORDER, alloc);
>>> +	if (order > pool->max_beneficial_order)
>>> +		gfp_flags &= ~__GFP_DIRECT_RECLAIM;
>>> +
>>> +	for (; alloc->remaining_pages;
>>>  	     order = ttm_pool_alloc_find_order(order, alloc)) {
>>>  		struct ttm_pool_type *pt;
>>>
>>> @@ -745,6 +749,8 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
>>>  		if (!p) {
>>>  			page_caching = ttm_cached;
>>>  			allow_pools = false;
>>> +			if (order <= pool->max_beneficial_order)
>>> +				gfp_flags |= __GFP_DIRECT_RECLAIM;
>>>  			p = ttm_pool_alloc_page(pool, gfp_flags, order);
>>>  		}
>>>  		/* If that fails, lower the order if possible and retry. */
>>> @@ -1064,7 +1070,8 @@ long ttm_pool_backup(struct ttm_pool *pool, struct ttm_tt *tt,
>>>   * Initialize the pool and its pool types.
>>>   */
>>>  void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
>>> -		   int nid, bool use_dma_alloc, bool use_dma32)
>>> +		   int nid, bool use_dma_alloc, bool use_dma32,
>>> +		   unsigned int max_beneficial_order)
>>>  {
>>>  	unsigned int i, j;
>>>
>>> @@ -1074,6 +1081,7 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
>>>  	pool->nid = nid;
>>>  	pool->use_dma_alloc = use_dma_alloc;
>>>  	pool->use_dma32 = use_dma32;
>>> +	pool->max_beneficial_order = max_beneficial_order;
>>>
>>>  	for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) {
>>>  		for (j = 0; j < NR_PAGE_ORDERS; ++j) {
>>>
>>> That should have the page allocator working less hard and lower the latency with large buffers.
>>>
>>> Then a more aggressive change on top could be:
>>>
>>> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
>>> index 5d54e8373230..152164f79927 100644
>>> --- a/drivers/gpu/drm/ttm/ttm_pool.c
>>> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
>>> @@ -726,12 +726,8 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
>>>
>>>  	page_caching = tt->caching;
>>>  	allow_pools = true;
>>> -
>>> -	order = ttm_pool_alloc_find_order(MAX_PAGE_ORDER, alloc);
>>> -	if (order > pool->max_beneficial_order)
>>> -		gfp_flags &= ~__GFP_DIRECT_RECLAIM;
>>> -
>>> -	for (; alloc->remaining_pages;
>>> +	for (order = ttm_pool_alloc_find_order(pool->max_beneficial_order, alloc);
>>> +	     alloc->remaining_pages;
>>>  	     order = ttm_pool_alloc_find_order(order, alloc)) {
>>>  		struct ttm_pool_type *pt;
>>>
>>> @@ -749,8 +745,6 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
>>>  		if (!p) {
>>>  			page_caching = ttm_cached;
>>>  			allow_pools = false;
>>> -			if (order <= pool->max_beneficial_order)
>>> -				gfp_flags |= __GFP_DIRECT_RECLAIM;
>>>  			p = ttm_pool_alloc_page(pool, gfp_flags, order);
>>>  		}
>>>  		/* If that fails, lower the order if possible and retry. */
>>>
>>> I.e. don't even bother trying to allocate orders above what the driver says is useful. Could be made a driver's choice as well.
>>>
>>> And all of this could be combined with some sort of a sysfs control, as Cascardo was suggesting, to disable direct reclaim completely if someone wants that.
>>>
>>> Regards,
>>>
>>> Tvrtko
>>>
>>>> Another alternative would be that we add a WARN_ONCE() when we have to fall back to lower-order pages, but that wouldn't help the end user either. It just makes it more obvious that you need more memory for a specific use case, without triggering the OOM killer.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> Regards,
>>>>>
>>>>> Tvrtko
>>>>>
>>>>>> The alternative I can offer is to disable the fallback, which in your case would trigger the OOM killer.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>>
>>>>>>> Other drivers can later opt to use this mechanism too.
>>>>>>>
>>>>>>> Signed-off-by: Thadeu Lima de Souza Cascardo <casca...@igalia.com>
>>>>>>> ---
>>>>>>> Changes in v2:
>>>>>>> - Make disabling direct reclaim an option.
>>>>>>> - Link to v1: https://lore.kernel.org/r/20250910-ttm_pool_no_direct_reclaim-v1-1-53b0fa7f8...@igalia.com
>>>>>>>
>>>>>>> ---
>>>>>>> Thadeu Lima de Souza Cascardo (3):
>>>>>>>       ttm: pool: allow requests to prefer latency over throughput
>>>>>>>       ttm: pool: add a module parameter to set latency preference
>>>>>>>       drm/amdgpu: allow allocation preferences when creating GEM object
>>>>>>>
>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c    |  3 ++-
>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |  3 ++-
>>>>>>>  drivers/gpu/drm/ttm/ttm_pool.c             | 23 +++++++++++++++++------
>>>>>>>  drivers/gpu/drm/ttm/ttm_tt.c               |  2 +-
>>>>>>>  include/drm/ttm/ttm_bo.h                   |  5 +++++
>>>>>>>  include/drm/ttm/ttm_pool.h                 |  2 +-
>>>>>>>  include/drm/ttm/ttm_tt.h                   |  2 +-
>>>>>>>  include/uapi/drm/amdgpu_drm.h              |  9 +++++++++
>>>>>>>  8 files changed, 38 insertions(+), 11 deletions(-)
>>>>>>> ---
>>>>>>> base-commit: f83ec76bf285bea5727f478a68b894f5543ca76e
>>>>>>> change-id: 20250909-ttm_pool_no_direct_reclaim-ee0807a2d3fe
>>>>>>>
>>>>>>> Best regards,
>>>>>>
>>>>>
>>>>
>>>
>>
>
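
P.S.: To make the WARN_ONCE() idea mentioned further up a bit more concrete, it would be something along the lines of the snippet below, placed where __ttm_pool_alloc() drops to a lower order. This is an untested sketch written against Tvrtko's excerpt rather than the actual file (and it also assumes his proposed pool->max_beneficial_order field), so the surrounding control flow is only illustrative:

	p = ttm_pool_alloc_page(pool, gfp_flags, order);
	/* If that fails, lower the order if possible and retry. */
	if (!p && order) {
		/*
		 * Leave a one-time note in dmesg when we have to hand out
		 * something smaller than what the driver says it needs, so
		 * fragmentation problems become visible instead of silently
		 * degrading into many small pages.
		 */
		WARN_ONCE(order <= pool->max_beneficial_order,
			  "ttm_pool: order-%u allocation failed, falling back to order-%u\n",
			  order, order - 1);
		/* ...then lower the order and retry as the existing code does... */
	}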