force_merge with decoupled clear tracker

Matthew Auld Fri, 29 May 2026 10:41:16 -0700

Hi,

On 27/05/2026 12:29, Arunpravin Paneer Selvam wrote:

The current buddy allocator maintains separate clear_tree[] and
dirty_tree[] rbtrees per order, preventing coalescing between cleared
and dirty buddies. Under mixed workloads, this creates a merge barrier:
adjacent buddies frequently end up split across trees, forcing reliance
on __force_merge() during allocation.


__force_merge() performs an O(N x max_order) scan under the VRAM manager
lock, leading to allocation stalls and failures for large contiguous
requests even when sufficient total free memory is available.


So is this contig with non power-of-two sizes?

Do we know if we could force_merge everything in one go or somehow be more aggressive and do more than needed now, at the first sign of contention here, instead of doing it piecemeal? Downside would be losing more of the clear tracking, when this happens, but more re-merging.

Could we have another per-order list, of all blocks that we failed to merge, when we did the free step? When doing the force merge step, we maybe don't need to search blindly and can focus instead on the stuff tracked in those lists? Maybe it doesn't need to be a list, but could be another rb-tree?

We know the size of the total allocation, if we trigger force_merge, could we try to merge enough in one go for the entire allocation, instead of restarting the entire thing on the next iteration? Would that help at all?

But I guess these are more for the stalling side, and won't help much with the contig angle?

For the extent idea, is there any merit in maybe doing this for all contig blobs, and not just cleared stuff? Or does the workload you are seeing only benefit users that want cleared stuff? Wondering if this would benefit all users that want contig? Like if we hypothetically kept clear and dirty separate, like we do now, but with an improved force_merge, and then have extent tracking for all contig blobs and replace the try_harder stuff? When you do a contig alloc, the individual clear/dirty is still all there within the range, so you can skip re-clearing in some cases. I guess the downside is an overall more fuzzy contig + clear/free path, but I guess you would never get allocation failures when there is sufficient contig space?


Solution

Replace the dual-tree design with:
- A single free_tree[order] rbtree for dirty and mixed free blocks
   (fully cleared free blocks float outside this tree)
- A lightweight out-of-band clear tracker (gpu_clear_tracker)

Fully cleared free blocks are tracked outside the buddy trees using an
augmented interval rbtree, enabling O(log E) lookup of the largest
cleared extents.
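
To make the "largest cleared extent" lookup concrete, here is a minimal
sketch of the augmentation idea only. It is not the patch's
gpu_clear_tracker: the toy_* names are hypothetical, and a plain
unbalanced binary search tree stands in for the rbtree, so the lookup
here is O(height) and only becomes O(log E) once the tree is balanced.
Each node caches the largest extent length in its subtree, which lets
the search follow the cached maximum down a single path.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical toy node, not the kernel structure. */
struct toy_extent {
	uint64_t start;            /* first chunk of the cleared extent */
	uint64_t len;              /* extent length in chunks */
	uint64_t subtree_max_len;  /* largest len in this subtree */
	struct toy_extent *left, *right;
};

static uint64_t toy_max_len(const struct toy_extent *n)
{
	return n ? n->subtree_max_len : 0;
}

/* Recompute the cached maximum from the node and its children. */
static void toy_update(struct toy_extent *n)
{
	uint64_t m = n->len;

	if (toy_max_len(n->left) > m)
		m = toy_max_len(n->left);
	if (toy_max_len(n->right) > m)
		m = toy_max_len(n->right);
	n->subtree_max_len = m;
}

/*
 * Insert keyed by start offset and fix up the cached maxima on the
 * way back up. Returns the (possibly new) subtree root. If the
 * allocation fails the extent is silently dropped.
 */
static struct toy_extent *toy_insert(struct toy_extent *root,
				     uint64_t start, uint64_t len)
{
	if (!root) {
		struct toy_extent *n = calloc(1, sizeof(*n));

		if (!n)
			return NULL;
		n->start = start;
		n->len = len;
		n->subtree_max_len = len;
		return n;
	}

	if (start < root->start)
		root->left = toy_insert(root->left, start, len);
	else
		root->right = toy_insert(root->right, start, len);

	toy_update(root);
	return root;
}

/* Follow the cached maximum down one path to the largest extent. */
static const struct toy_extent *toy_find_largest(const struct toy_extent *root)
{
	while (root) {
		if (root->len == root->subtree_max_len)
			return root;
		if (toy_max_len(root->left) == root->subtree_max_len)
			root = root->left;
		else
			root = root->right;
	}
	return NULL;
}
```

A real tracker would also need removal, splitting and merging of
adjacent extents, which this sketch leaves out.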

Buddy coalescing is now unconditional in __gpu_buddy_free(), regardless
of clear/dirty state. This removes the merge barrier and eliminates the
need for __force_merge().
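
The difference between the two merge rules can be sketched with a toy
model. Everything below is hypothetical (toy_buddy, toy_free and the
require_same_clear switch are not in the patch) and it tracks free
blocks in flat arrays rather than rbtrees. With require_same_clear set
it mimics the current behaviour, where a cleared block and its dirty
buddy stay split; with it unset, the free path always coalesces. The
toy keeps only a per-block clear flag, so a merged mixed block is
simply marked not clear; in the proposed design the cleared range
would instead stay recorded in the out-of-band tracker.

```c
#include <stdbool.h>

#define TOY_MAX_ORDER 4
#define TOY_CHUNKS (1u << TOY_MAX_ORDER)

/* Hypothetical toy state: one slot per (order, offset >> order). */
struct toy_buddy {
	bool free_at[TOY_MAX_ORDER + 1][TOY_CHUNKS];
	bool clear_at[TOY_MAX_ORDER + 1][TOY_CHUNKS];
};

/*
 * Free the block at (offset, order) and coalesce upwards while the
 * buddy is free. Returns the order of the block finally inserted.
 */
static unsigned toy_free(struct toy_buddy *mm, unsigned offset,
			 unsigned order, bool clear, bool require_same_clear)
{
	while (order < TOY_MAX_ORDER) {
		unsigned buddy = offset ^ (1u << order);
		unsigned bidx = buddy >> order;

		if (!mm->free_at[order][bidx])
			break;
		/* The merge barrier: buddies with differing state stay split. */
		if (require_same_clear && mm->clear_at[order][bidx] != clear)
			break;

		/* Take the buddy out and form the parent block. */
		mm->free_at[order][bidx] = false;
		clear = clear && mm->clear_at[order][bidx];
		offset &= ~(1u << order);
		order++;
	}

	mm->free_at[order][offset >> order] = true;
	mm->clear_at[order][offset >> order] = clear;
	return order;
}
```

Freeing a cleared order-2 block at offset 0 and then a dirty order-2
block at offset 4 leaves two order-2 blocks under the first rule and a
single order-3 block under the second.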

Benefits

- Correct high-order allocations after mixed clear/dirty workloads
- Elimination of O(N x max_order) merge cost from the allocation path
- O(log E) cleared-extent lookup replacing O(N) scans
- Predictable allocation latency under fragmentation
- Reduced complexity with a single tree per order

Since there is no separate tracking for dirty stuff, is the non-cleared alloc path a bit more "fuzzy" now, with it potentially stealing cleared memory, or is it the same behaviour still?

For drivers that don't use free tracking, is there some benefit? Are there any downsides there? I assume that clear tracker is always empty.

Re: [PATCH v4 1/2] gpu/buddy: replace dual-tree/force_merge with decoupled clear tracker
