On Wed, May 06, 2026 at 09:14:13AM -0700, Matthew Brost wrote: > On Wed, May 06, 2026 at 04:23:29PM +0200, Thomas Hellström wrote: > > Hi, Matt > > > > On Tue, 2026-05-05 at 13:04 -0700, Matthew Brost wrote: > > > ttm_pool_split_for_swap() splits high-order pool pages into order-0 > > > pages during backup so each 4K page can be released to the system as > > > soon as it has been written to shmem. While this minimizes the > > > allocator's working set during reclaim, it actively fragments memory: > > > every TTM-backed compound page that the shrinker touches is shattered > > > into order-0 pages, even when the rest of the system would prefer > > > that > > > the high-order block stay intact. Under sustained kswapd pressure > > > this > > > is enough to drive other parts of MM into recovery loops from which > > > they cannot easily escape, because the memory TTM just freed is no > > > longer contiguous. > > > > > > Stop unconditionally splitting on the backup path and back up each > > > compound at its native order in ttm_pool_backup(): > > > > > > - For each non-handle slot, read the order from the head page and > > > back up all 1<<order subpages to consecutive shmem indices, > > > writing the resulting handles into tt->pages[] as we go. > > > - On success, the compound is freed once at its native order. No > > > split_page(), no per-4K refcount juggling, no fragmentation > > > introduced from this path. > > > - Slots that already hold a backup handle from a previous partial > > > attempt are skipped. A compound that would extend past a > > > fault-injection-truncated num_pages is skipped rather than split. > > > > > > A per-subpage backup failure cannot be made fully atomic: backing up > > > a > > > subpage allocates a shmem folio before the source page can be > > > released, > > > so under true OOM any subpage in a compound (not just the first) may > > > fail to be backed up with the rest of the source compound still live > > > and contiguous. To make forward progress in that case, fall back to > > > splitting the source compound and backing up its remaining subpages > > > individually: > > > > > > - On the first per-subpage failure for a compound (and only if > > > order > 0), call ttm_pool_split_for_swap() to split the source > > > compound, release the subpages whose contents already live in > > > shmem (their handles in tt->pages stay valid), and retry the > > > failing subpage at order 0. > > > - Subsequent successful subpage backups in the now-split compound > > > free their source page individually as soon as the handle is > > > written. > > > - A second failure after splitting terminates the loop with partial > > > progress; the remaining order-0 subpages stay in tt->pages as > > > plain page pointers and are cleaned up by the normal > > > ttm_pool_drop_backed_up() / ttm_pool_free_range() paths. > > > > > > This restores the original split-on-OOM fallback behavior while > > > keeping the common, non-OOM case fragmentation-free. It also > > > preserves the "partial backup is allowed" contract: shrunken is > > > incremented per backed-up subpage so the caller still sees forward > > > progress when a compound only partially succeeds. > > > > > > The restore-side leftover-page branch in ttm_pool_restore_commit() is > > > left as-is for now: that path can still split a previously-retained > > > compound, but in practice it is unreachable under realistic workloads > > > (per profiling we have not been able to trigger it), so it is not > > > worth complicating the restore state machine to avoid the split > > > there. > > > If it ever becomes a problem in practice it can be addressed > > > independently. > > > > > > ttm_pool_split_for_swap() itself is retained both for the OOM > > > fallback above and for the restore path's remaining caller. The > > > DMA-mapped pre-backup unmap loop, the purge path, ttm_pool_free_*, > > > and ttm_pool_unmap_and_free() already operate at native order and > > > are unchanged. > > > > > > Cc: Christian Koenig <[email protected]> > > > Cc: Huang Rui <[email protected]> > > > Cc: Matthew Auld <[email protected]> > > > Cc: Maarten Lankhorst <[email protected]> > > > Cc: Maxime Ripard <[email protected]> > > > Cc: Thomas Zimmermann <[email protected]> > > > Cc: David Airlie <[email protected]> > > > Cc: Simona Vetter <[email protected]> > > > Cc: [email protected] > > > Cc: [email protected] > > > Cc: [email protected] > > > Fixes: b63d715b8090 ("drm/ttm/pool, drm/ttm/tt: Provide a helper to > > > shrink pages") > > > Suggested-by: Thomas Hellström <[email protected]> > > > Assisted-by: Claude:claude-opus-4.6 > > > Signed-off-by: Matthew Brost <[email protected]> > > > > > > --- > > > > > > A follow-up should attempt writeback to shmem at folio order as well, > > > but the API for doing so is unclear and may be incomplete. > > > > > > This patch is related to the pending series [1] and significantly > > > reduces the likelihood of Xe entering a kswapd loop under > > > fragmentation. > > > The kswapd → shrinker → Xe shrinker → TTM backup path is still > > > exercised; however, with this change the backup path no longer > > > worsens > > > fragmentation, which previously amplified reclaim pressure and > > > reinforced the kswapd loop. > > > > > > Nonetheless, the pathological case that [1] aims to address still > > > exists > > > and requires a proper solution. Even with this patch, a kswapd loop > > > due > > > to severe fragmentation can still be triggered, although it is now > > > substantially harder to reproduce. > > > > > > v2: > > > - Split pages and free immediately if backup fails are higher order > > > (Thomas) > > > v3: > > > - Skip handles in purge path (sashiko) > > > v5: > > > - Refactor into ttm_pool_backup_folio (Thomas) > > > > > > [1] https://patchwork.freedesktop.org/series/165330/ > > > --- > > > drivers/gpu/drm/ttm/ttm_pool.c | 110 ++++++++++++++++++++++++++++--- > > > -- > > > 1 file changed, 94 insertions(+), 16 deletions(-) > > > > > > diff --git a/drivers/gpu/drm/ttm/ttm_pool.c > > > b/drivers/gpu/drm/ttm/ttm_pool.c > > > index d380a3c7fe40..78efc8524133 100644 > > > --- a/drivers/gpu/drm/ttm/ttm_pool.c > > > +++ b/drivers/gpu/drm/ttm/ttm_pool.c > > > @@ -1019,6 +1019,70 @@ void ttm_pool_drop_backed_up(struct ttm_tt > > > *tt) > > > ttm_pool_free_range(NULL, tt, ttm_cached, start_page, tt- > > > >num_pages); > > > } > > > > > > +static int ttm_pool_backup_folio(struct ttm_pool *pool, struct > > > ttm_tt *tt, > > > + struct file *backup, struct folio > > > *folio, > > > + unsigned int order, bool writeback, > > > + pgoff_t idx, gfp_t page_gfp, gfp_t > > > alloc_gfp) > > > > I don't really understand why we can't end up with a > > ttm_backup_backup_folio(), which I believe is the proper layering, > > already at this point? Please see a suggestion at > > > > https://gitlab.freedesktop.org/thomash/xe-vibe/-/commits/ttm_swapout?ref_type=heads > > > > Here the splitting logic is kept in the ttm_pool, but ttm_backup > > supports handing large folios to it. > > > > Although the cumulative diffstat becomes larger, the end code becomes > > smaller and IMO easier to read, and we don't need to introduce code > > that we immediately have to refactor. > > That version looks fine too. If that is preference no issue. > > My goal with this series is get something than can reasonably be > backported to LTS kernels so the desktop doesn't frequently enter kswapd > because of fragmentation. We now have at least 3 reports of this being > an issue. > > This is larger fix [1] which works in tandem but seemly unlikely to > backportable given it add new concepts to the core MM [1]. > > [1] https://patchwork.freedesktop.org/series/165329/ > > > > > But I'm starting to question the general approach: Even if the > > *shrinker* can recover from a total kernel memory reserve depletion, it > > can't really be considered a reasonable practice, since if we > > frequently deplete the reserves, *other* important allocations in the > > system like GFP_ATOMIC, PF_MEMALLOC may spuriously start to fail and > > people will have a hard time finding out why. > > > > Wouldn’t GFP_ATOMIC enter direct reclaim, hit our shrinker, and > eventually make progress—i.e., take the split path if needed? I’m not > 100% sure, but my initial reaction is that this concern may not be > valid; however, MM is hard to reason about. > > Again, FWIW, I’ve tried a lot of things to trigger OOM—for example, > running WebGL tabs and then kicking off various very memory-intensive > workloads from the CLI—and I still haven’t hit OOM or seen memory > allocation failures or warnings. > > > So I actually don't think we can be avoiding the splitting without > > direct insertion. FWIW, up until recently when shmem started supporting > > I agree direct insertion is better solution. Do you think this something > we could reasonably get working and backport? I haven't done any > research on direct insertion yet, thus why I'm asking. > > > huge page swapping, other GPU drivers basically also split pages at > > swapout. > > I wonder if other drivers have the same issue? The deadly combo is allow > GPUs to subscribe all of system memory, allocate THP pages (or higher > order pages), and split them in the shrinker. Xe might be the only > driver with right combo to hit this but not 100% sure without a deep > dive. >
+ For completeness the THP allocation must have GFP flags to enter reclaim. Matt > > > > Another idea for improving on the compaction loop, perhaps worth trying > > is this change, shamelessly stolen from i915: > > > > https://gitlab.freedesktop.org/thomash/xe-vibe/-/commits/shrinker_batch?ref_type=heads > > > > I'd have to give this a try - I'm quickly running out of time before I > leave for month though. > > Matt > > > /Thomas > > > > > > > +{ > > > + struct page *page = folio_page(folio, 0); > > > + int shrunken = 0, npages = 1UL << order, ret = 0, i; > > > + bool folio_has_been_split = false; > > > + > > > + for (i = 0; i < npages; ++i) { > > > + s64 shandle; > > > + > > > +try_again_after_split: > > > + if (IS_ENABLED(CONFIG_FAULT_INJECTION) && > > > + should_fail(&backup_fault_inject, 1)) > > > + shandle = -ENOMEM; > > > + else > > > + shandle = ttm_backup_backup_page(backup, > > > page + i, > > > + writeback, > > > idx + i, > > > + page_gfp, > > > alloc_gfp); > > > + > > > + if (shandle < 0 && !folio_has_been_split && order) { > > > + pgoff_t j; > > > + > > > + /* > > > + * True OOM: could not allocate a shmem > > > folio > > > + * for the next subpage. Fall back to > > > splitting > > > + * the source compound and backing up > > > subpages > > > + * individually. Release the already-backed- > > > up > > > + * subpages whose contents now live in > > > shmem; > > > + * any further failure terminates the loop > > > with > > > + * partial progress (handled by the caller). > > > + */ > > > + folio_has_been_split = true; > > > + ttm_pool_split_for_swap(pool, page); > > > + > > > + for (j = 0; j < i; ++j) { > > > + __free_pages_gpu_account(page + j, > > > 0, false); > > > + shrunken++; > > > + } > > > + > > > + goto try_again_after_split; > > > + } else if (shandle < 0) { > > > + ret = shandle; > > > + goto out; > > > + } else if (folio_has_been_split) { > > > + __free_pages_gpu_account(page + i, 0, > > > false); > > > + shrunken++; > > > + } > > > + > > > + tt->pages[idx + i] = > > > ttm_backup_handle_to_page_ptr(shandle); > > > + } > > > + > > > + if (!folio_has_been_split) { > > > + /* Compound fully backed up; free at native order. > > > */ > > > + page->private = 0; > > > + __free_pages_gpu_account(page, order, false); > > > + shrunken += npages; > > > + } > > > + > > > +out: > > > + return shrunken ? shrunken : ret; > > > +} > > > + > > > /** > > > * ttm_pool_backup() - Back up or purge a struct ttm_tt > > > * @pool: The pool used when allocating the struct ttm_tt. > > > @@ -1045,12 +1109,11 @@ long ttm_pool_backup(struct ttm_pool *pool, > > > struct ttm_tt *tt, > > > { > > > struct file *backup = tt->backup; > > > struct page *page; > > > - unsigned long handle; > > > gfp_t alloc_gfp; > > > gfp_t gfp; > > > int ret = 0; > > > pgoff_t shrunken = 0; > > > - pgoff_t i, num_pages; > > > + pgoff_t i, num_pages, npages; > > > > > > if (WARN_ON(ttm_tt_is_backed_up(tt))) > > > return -EINVAL; > > > @@ -1070,7 +1133,8 @@ long ttm_pool_backup(struct ttm_pool *pool, > > > struct ttm_tt *tt, > > > unsigned int order; > > > > > > page = tt->pages[i]; > > > - if (unlikely(!page)) { > > > + if (unlikely(!page || > > > + > > > ttm_backup_page_ptr_is_handle(page))) { > > > num_pages = 1; > > > continue; > > > } > > > @@ -1106,26 +1170,40 @@ long ttm_pool_backup(struct ttm_pool *pool, > > > struct ttm_tt *tt, > > > if (IS_ENABLED(CONFIG_FAULT_INJECTION) && > > > should_fail(&backup_fault_inject, 1)) > > > num_pages = DIV_ROUND_UP(num_pages, 2); > > > > > > - for (i = 0; i < num_pages; ++i) { > > > - s64 shandle; > > > + for (i = 0; i < num_pages; i += npages) { > > > + unsigned int order; > > > > > > + npages = 1; > > > page = tt->pages[i]; > > > if (unlikely(!page)) > > > continue; > > > > > > - ttm_pool_split_for_swap(pool, page); > > > + /* Already-handled entry from a previous attempt. */ > > > + if (unlikely(ttm_backup_page_ptr_is_handle(page))) > > > + continue; > > > > > > - shandle = ttm_backup_backup_page(backup, page, > > > flags->writeback, i, > > > - gfp, alloc_gfp); > > > - if (shandle < 0) { > > > - /* We allow partially shrunken tts */ > > > - ret = shandle; > > > + order = ttm_pool_page_order(pool, page); > > > + npages = 1UL << order; > > > + > > > + /* > > > + * Back up the compound atomically at its native > > > order. If > > > + * fault injection truncated num_pages mid-compound, > > > skip > > > + * the partial tail rather than splitting. > > > + */ > > > + if (unlikely(i + npages > num_pages)) > > > + break; > > > + > > > + ret = ttm_pool_backup_folio(pool, tt, backup, > > > page_folio(page), > > > + order, flags->writeback, > > > i, gfp, > > > + alloc_gfp); > > > + if (unlikely(ret < 0)) > > > + break; > > > + > > > + shrunken += ret; > > > + > > > + /* partial backup */ > > > + if (unlikely(ret != npages)) > > > break; > > > - } > > > - handle = shandle; > > > - tt->pages[i] = > > > ttm_backup_handle_to_page_ptr(handle); > > > - __free_pages_gpu_account(page, 0, false); > > > - shrunken++; > > > } > > > > > > return shrunken ? shrunken : ret;
