On 19.09.25 12:42, Tvrtko Ursulin wrote:
> 
> On 19/09/2025 11:17, Christian König wrote:
>> On 19.09.25 10:46, Tvrtko Ursulin wrote:
>>>
>>> On 19/09/2025 09:01, Christian König wrote:
>>>> On 19.09.25 09:43, Tvrtko Ursulin wrote:
>>>>> On 19/09/2025 07:46, Christian König wrote:
>>>>>> On 18.09.25 22:09, Thadeu Lima de Souza Cascardo wrote:
>>>>>>> On certain workloads, like on ChromeOS when opening multiple tabs and
>>>>>>> windows and switching desktops, memory pressure can build up, and
>>>>>>> latency is observed as high order allocations result in memory reclaim.
>>>>>>> This was observed when running on amdgpu.
>>>>>>>
>>>>>>> This is caused by TTM pool allocations, and turning off direct reclaim
>>>>>>> when doing those higher order allocations leads to lower memory
>>>>>>> pressure.
>>>>>>>
>>>>>>> Since turning direct reclaim off might also lead to lower throughput,
>>>>>>> make it tunable, both as a module parameter that can be changed in sysfs
>>>>>>> and as a flag when allocating a GEM object.
>>>>>>>
>>>>>>> A latency option will avoid direct reclaim for higher order allocations.
>>>>>>>
>>>>>>> The throughput option could later be used to compact pages or reclaim
>>>>>>> more aggressively, by not using __GFP_NORETRY.
>>>>>>
>>>>>> Well, I can only repeat it: at least for amdgpu this is a clear NAK
>>>>>> from my side.
>>>>>>
>>>>>> The ability to allocate huge pages is a must-have for the driver.
>>>>>
>>>>> Disclaimer that I wouldn't go system-wide but per device - so somewhere
>>>>> in sysfs rather than a modparam. That kind of toggle does not sound
>>>>> problematic to me since it leaves the policy outside the kernel and
>>>>> allows people to tune to their liking.
>>>>
>>>> Yeah, I've also written before that when this is somehow beneficial for
>>>> nouveau (for example) then I don't have any problem with making the
>>>> policy device dependent.
>>>>
>>>> But for amdgpu we have had so many bad experiences with this approach
>>>> that I absolutely can't accept that.
>>>>
>>>>> One side question though - does AMD benefit from contiguous blocks
>>>>> larger than 2MiB? IIUC the maximum PTE is 2MiB so maybe not? In which
>>>>> case it may make sense to add some TTM API letting drivers tell the pool
>>>>> allocator the maximum order worth bothering with. Larger than that may
>>>>> have diminishing benefit relative to the disproportionate pressure it
>>>>> puts on the memory allocator and reclaim.
>>>>
>>>> Using 1GiB allocations would allow the page tables to skip another layer
>>>> on AMD GPUs, but most of the benefit is between 4KiB and 2MiB since that
>>>> can be handled more efficiently by the L1. Having 2MiB allocations then
>>>> also has an additional benefit for the L2.
>>>>
>>>> Apart from performance on AMD GPUs there are also some HW features which
>>>> only work with huge pages, e.g. on some laptops you can get flickering on
>>>> the display if the scanout buffer is backed by too many small pages.
>>>>
>>>> NVidia used to work on 1GiB allocations which, as far as I know, was the
>>>> kickoff for the whole ongoing switch to using folios instead of pages.
>>>> And from reading publicly available documentation I have the impression
>>>> that NVidia GPUs work more or less the same as AMD GPUs regarding the TLB.
>>>
>>> 1GiB is beyond the TTM pool allocator scope, right?
>>
>> Yes, on x86-64 the pool allocator can allocate at most 2MiB by default,
>> IIRC.
> 
> I think 10 is the max order so 4MiB. So it wouldn't be much relief to the 
> allocator but better than nothing(tm)?

Good point, that can certainly be the case.
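Just to spell out the arithmetic (assuming 4KiB pages): with MAX_PAGE_ORDER =
10 the largest chunk the buddy allocator hands out is (1 << 10) * 4KiB = 4MiB,
so only one order above the 2MiB we actually care about.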

>>> From what you wrote it sounds like my idea would actually be okay. A very
>>> gentle approach (minimal change in behaviour) to only disable direct
>>> reclaim above the threshold set by the driver.
>>
>> Well the problem is that the threshold set by amdgpu would be 2MiB and by
>> default there isn't anything above it on x86. So that would be a no-op. On
>> ARM64 that idea could potentially help.
> 
> Some architectures appear to default to more than 10, and some offer Kconfig 
> to change the default.

x86 also has a Kconfig option for that IIRC. So yeah, your idea is not that
bad if somebody has adjusted that setting.

> 
> I think this means in the patch I proposed I am missing a min(MAX_PAGE_ORDER, 
> max_beneficial_order) when setting the pool property.
> 
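Right, as an illustrative sketch (the helper name below is made up, it is not
from any of the patches), that clamp could simply be:

    /*
     * Purely illustrative: limit the order the driver asks for to what the
     * buddy allocator can actually hand out on this architecture/config.
     * min_t() comes from <linux/minmax.h>, MAX_PAGE_ORDER from <linux/mmzone.h>.
     */
    static unsigned int ttm_pool_clamp_order(unsigned int max_beneficial_order)
    {
            return min_t(unsigned int, MAX_PAGE_ORDER, max_beneficial_order);
    }

with ttm_pool_init() storing the clamped value instead of the raw argument.
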
>> I could look into the HW documentation again to see what we would need as
>> a minimum for functional correctness, but there are quite a number of use
>> cases, and lowering from 2MiB to something like 256KiB or 512KiB potentially
>> won't really help and would still cause a number of performance issues in
>> the L2.
> 
> It would be very good if you could check the requirements regarding
> functional correctness. Could that also differ per generation/part, and if
> so, maybe it should be made configurable in the ttm_pool API as well, as an
> order below which it is better to fail instead of moving to a lower order?

I do remember that it used to be 256KiB on some really old parts, but that is
for >10 year old HW. For anything newer, 2MiB has always been what we have
been testing with.

Failing instead of falling back to a lower order is also something we have
tested, but that resulted in a lot of unhappy people as well.

Regards,
Christian.

> 
> Regards,
> 
> Tvrtko
> 
>>> Along the lines of:
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>> index 428265046815..06b243f05edd 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>> @@ -1824,7 +1824,7 @@ static int amdgpu_ttm_pools_init(struct amdgpu_device 
>>> *adev)
>>>       for (i = 0; i < adev->gmc.num_mem_partitions; i++) {
>>>           ttm_pool_init(&adev->mman.ttm_pools[i], adev->dev,
>>>                     adev->gmc.mem_partitions[i].numa.node,
>>> -                  false, false);
>>> +                  false, false, get_order(2 * SZ_1M));
>>>       }
>>>       return 0;
>>>   }
>>> @@ -1865,7 +1865,8 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
>>>                      adev_to_drm(adev)->anon_inode->i_mapping,
>>>                      adev_to_drm(adev)->vma_offset_manager,
>>>                      adev->need_swiotlb,
>>> -                   dma_addressing_limited(adev->dev));
>>> +                   dma_addressing_limited(adev->dev),
>>> +                   get_order(2 * SZ_1M));
>>>       if (r) {
>>>           dev_err(adev->dev,
>>>               "failed initializing buffer object driver(%d).\n", r);
>>> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
>>> index baf27c70a419..5d54e8373230 100644
>>> --- a/drivers/gpu/drm/ttm/ttm_pool.c
>>> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
>>> @@ -726,8 +726,12 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, 
>>> struct ttm_tt *tt,
>>>
>>>       page_caching = tt->caching;
>>>       allow_pools = true;
>>> -    for (order = ttm_pool_alloc_find_order(MAX_PAGE_ORDER, alloc);
>>> -         alloc->remaining_pages;
>>> +
>>> +    order = ttm_pool_alloc_find_order(MAX_PAGE_ORDER, alloc);
>>> +    if (order > pool->max_beneficial_order)
>>> +        gfp_flags &= ~__GFP_DIRECT_RECLAIM;
>>> +
>>> +    for (; alloc->remaining_pages;
>>>            order = ttm_pool_alloc_find_order(order, alloc)) {
>>>           struct ttm_pool_type *pt;
>>>
>>> @@ -745,6 +749,8 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, 
>>> struct ttm_tt *tt,
>>>           if (!p) {
>>>               page_caching = ttm_cached;
>>>               allow_pools = false;
>>> +            if (order <= pool->max_beneficial_order)
>>> +                gfp_flags |= __GFP_DIRECT_RECLAIM;
>>>               p = ttm_pool_alloc_page(pool, gfp_flags, order);
>>>           }
>>>           /* If that fails, lower the order if possible and retry. */
>>> @@ -1064,7 +1070,8 @@ long ttm_pool_backup(struct ttm_pool *pool, struct 
>>> ttm_tt *tt,
>>>    * Initialize the pool and its pool types.
>>>    */
>>>   void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
>>> -           int nid, bool use_dma_alloc, bool use_dma32)
>>> +           int nid, bool use_dma_alloc, bool use_dma32,
>>> +           unsigned int max_beneficial_order)
>>>   {
>>>       unsigned int i, j;
>>>
>>> @@ -1074,6 +1081,7 @@ void ttm_pool_init(struct ttm_pool *pool, struct 
>>> device *dev,
>>>       pool->nid = nid;
>>>       pool->use_dma_alloc = use_dma_alloc;
>>>       pool->use_dma32 = use_dma32;
>>> +    pool->max_beneficial_order = max_beneficial_order;
>>>
>>>       for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) {
>>>           for (j = 0; j < NR_PAGE_ORDERS; ++j) {
>>>
>>>
>>> That should have the page allocator working less hard and lower the latency 
>>> with large buffers.
>>>
>>> Then a more aggressive change on top could be:
>>>
>>> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
>>> index 5d54e8373230..152164f79927 100644
>>> --- a/drivers/gpu/drm/ttm/ttm_pool.c
>>> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
>>> @@ -726,12 +726,8 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, 
>>> struct ttm_tt *tt,
>>>
>>>       page_caching = tt->caching;
>>>       allow_pools = true;
>>> -
>>> -    order = ttm_pool_alloc_find_order(MAX_PAGE_ORDER, alloc);
>>> -    if (order > pool->max_beneficial_order)
>>> -        gfp_flags &= ~__GFP_DIRECT_RECLAIM;
>>> -
>>> -    for (; alloc->remaining_pages;
>>> +    for (order = ttm_pool_alloc_find_order(pool->max_beneficial_order, 
>>> alloc);
>>> +         alloc->remaining_pages;
>>>            order = ttm_pool_alloc_find_order(order, alloc)) {
>>>           struct ttm_pool_type *pt;
>>>
>>> @@ -749,8 +745,6 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, 
>>> struct ttm_tt *tt,
>>>           if (!p) {
>>>               page_caching = ttm_cached;
>>>               allow_pools = false;
>>> -            if (order <= pool->max_beneficial_order)
>>> -                gfp_flags |= __GFP_DIRECT_RECLAIM;
>>>               p = ttm_pool_alloc_page(pool, gfp_flags, order);
>>>           }
>>>           /* If that fails, lower the order if possible and retry. */
>>>
>>> I.e. don't even bother trying to allocate orders above what the driver
>>> says is useful. Could be made a driver's choice as well.
>>>
>>> And all could be combined with some sort of a sysfs control, as Cascardo 
>>> was suggesting, to disable direct reclaim completely if someone wants that.
>>>
>>> Regards,
>>>
>>> Tvrtko
>>>
>>>> Another alternative would be that we add a WARN_ONCE() when we have to
>>>> fall back to lower order pages, but that wouldn't help the end user
>>>> either. It just makes it more obvious that you need more memory for a
>>>> specific use case, without triggering the OOM killer.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> Regards,
>>>>>
>>>>> Tvrtko
>>>>>
>>>>>> The alternative I can offer is to disable the fallback, which in your
>>>>>> case would trigger the OOM killer.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>>
>>>>>>> Other drivers can later opt to use this mechanism too.
>>>>>>>
>>>>>>> Signed-off-by: Thadeu Lima de Souza Cascardo <casca...@igalia.com>
>>>>>>> ---
>>>>>>> Changes in v2:
>>>>>>> - Make disabling direct reclaim an option.
>>>>>>> - Link to v1: 
>>>>>>> https://lore.kernel.org/r/20250910-ttm_pool_no_direct_reclaim-v1-1-53b0fa7f8...@igalia.com
>>>>>>>
>>>>>>> ---
>>>>>>> Thadeu Lima de Souza Cascardo (3):
>>>>>>>          ttm: pool: allow requests to prefer latency over throughput
>>>>>>>          ttm: pool: add a module parameter to set latency preference
>>>>>>>          drm/amdgpu: allow allocation preferences when creating GEM 
>>>>>>> object
>>>>>>>
>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c    |  3 ++-
>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |  3 ++-
>>>>>>>     drivers/gpu/drm/ttm/ttm_pool.c             | 23 
>>>>>>> +++++++++++++++++------
>>>>>>>     drivers/gpu/drm/ttm/ttm_tt.c               |  2 +-
>>>>>>>     include/drm/ttm/ttm_bo.h                   |  5 +++++
>>>>>>>     include/drm/ttm/ttm_pool.h                 |  2 +-
>>>>>>>     include/drm/ttm/ttm_tt.h                   |  2 +-
>>>>>>>     include/uapi/drm/amdgpu_drm.h              |  9 +++++++++
>>>>>>>     8 files changed, 38 insertions(+), 11 deletions(-)
>>>>>>> ---
>>>>>>> base-commit: f83ec76bf285bea5727f478a68b894f5543ca76e
>>>>>>> change-id: 20250909-ttm_pool_no_direct_reclaim-ee0807a2d3fe
>>>>>>>
>>>>>>> Best regards,
>>>>>>
>>>>>
>>>>
>>>
>>
> 
