On 5/29/2026 11:11 PM, Matthew Auld wrote:
Hi,
On 27/05/2026 12:29, Arunpravin Paneer Selvam wrote:
The current buddy allocator maintains separate clear_tree[] and
dirty_tree[] rbtrees per order, preventing coalescing between cleared
and dirty buddies. Under mixed workloads, this creates a merge barrier:
adjacent buddies frequently end up split across trees, forcing reliance
on __force_merge() during allocation.
__force_merge() performs an O(N x max_order) scan under the VRAM manager
lock, leading to allocation stalls and failures for large contiguous
requests even when sufficient total free memory is available.
So is this contig with non power-of-two sizes?
Both power-of-two and non-power-of-two contiguous requests are affected
- in either case, the required higher-order block can't form when its
lower-order buddies are separated by clear/dirty state across the dual
trees. But the core issue we are seeing is VRAM fragmentation caused by
massive small allocations (e.g., thousands of 4 KiB–8 KiB buffers) that
end up split across clear and dirty trees, preventing buddy coalescing.
This leads to allocation failures and OOM in later workloads even when
sufficient total free VRAM is available.
Do we know if we could force_merge everything in one go or somehow be
more aggressive and do more than needed now, at the first sign of
contention here, instead of doing it piecemeal? Downside would be
losing more of the clear tracking, when this happens, but more
re-merging.
Could we have another per-order list, of all blocks that we failed to
merge, when we did the free step? When doing the force merge step, we
maybe don't need to search blindly and can focus instead on the stuff
tracked in those lists? Maybe it doesn't need to be a list, but could
be another rb-tree?
We know the size of the total allocation, if we trigger force_merge,
could we try to merge enough in one go for the entire allocation,
instead of restarting the entire thing on the next iteration? Would
that help at all?
But I guess these are more for the stalling side, and won't help much
with the contig angle?
The memory is highly fragmented into mostly 4 KiB chunks and small
scattered blocks across the dual trees, so although total free memory
exists, it is split into low-order fragments. The workload then requests
very large contiguous allocations (tens of GBs, e.g., ~64 GiB), which
fail with OOM because the allocator cannot form sufficiently large
high-order blocks from the fragmented space. We could go with more
aggressive merging or merge-in-one-go approaches, but this might waste
more cleared memory. I think fundamentally the buddy allocator should be
allowed to merge unconditionally - the single-tree approach with
unconditional coalescing would improve the fragmentation and benefit
contiguous allocations along with addressing the stalling and latency
issues.
For the extent idea, is there any merit in maybe doing this for all
contig blobs, and not just cleared stuff? Or is the workload you are
seeing only benefit users that want cleared stuff? Wondering if this
would benefit all users that want contig? Like if we hypothetically
kept clear and dirty separate, like we do now, but with an improved
force_merge, and then have extent tracking for all contig blobs and
replace the try_harder stuff? When you do a contig alloc, the
individual clear/dirty is still all there within the range, so you can
skip re-clearing in some cases. I guess downside is overall more fuzzy
contig + clear/free path, but I guess you would never get allocation
failures, when there is sufficient contig space?
Yes, extending extent tracking to all contig allocations has merit, but
the core problem remains - with the dual-tree design, we still need
force_merge to undo the clear/dirty split before those extents can form.
In cases like heavy small-allocation workloads (thousands of 4 KiB
buffers) running first, the memory ends up massively fragmented across
both trees. When a very large contiguous allocation (e.g., ~64 GiB)
comes in later, the allocator fails with OOM even though sufficient
total free memory exists, because the extent tracker can't find a
contiguous range that was never allowed to merge in the first place. I
think the dirty/clear split is fundamentally the problem - allowing the
buddy allocator to merge unconditionally removes this barrier, and the
clear tracker can then be layered on top as an optimization without
blocking coalescing.
Solution
Replace the dual-tree design with:
- A single free_tree[order] rbtree for dirty and mixed free blocks
(fully cleared free blocks float outside this tree)
- A lightweight out-of-band clear tracker (gpu_clear_tracker)
Fully cleared free blocks are tracked outside the buddy trees using an
augmented interval rbtree, enabling O(log E) lookup of the largest
cleared extents.
Buddy coalescing is now unconditional in __gpu_buddy_free(), regardless
of clear/dirty state. This removes the merge barrier and eliminates the
need for __force_merge().
Benefits
- Correct high-order allocations after mixed clear/dirty workloads
- Elimination of O(N x max_order) merge cost from the allocation path
- O(log E) cleared-extent lookup replacing O(N) scans
- Predictable allocation latency under fragmentation
- Reduced complexity with a single tree per order
Since there is no separate tracking for dirty stuff, is the
non-cleared alloc path a bit more "fuzzy" now, with it potentially
stealing cleared memory, or is it the same behaviour still?
Right, on v4, the dirty and mixed (partially cleared) blocks are
allocated for the non-cleared alloc path, which can end up stealing
cleared memory. On v5, I plan to address this with a three-tier dirty
allocation fallback: dirty → mixed → clear, driven by rbtree augment
bits (subtree_has_dirty, subtree_has_mixed), each pass O(log N). The
split-descent also applies the same preference at every level when
carving a higher-order block, so cleared memory is preserved as much as
possible and only used as a last resort.
Thoughts ?
For drivers that don't use free tracking, is there some benefit? Are
there any downsides there? I assume that clear tracker is always empty.
Correct, for drivers that don't clear memory, the clear tracker is
always empty and they simply allocate from the free_tree[]. Benefits:
Single tree per order instead of dual trees (fewer rbtree operations)
No force_merge path at all (unconditional coalescing at free time)
Simpler code path overall
No real downsides - the clear tracker adds zero overhead when empty, and
the augment bits would simply show all blocks as dirty, so the walk
degenerates to a normal rbtree lookup with no extra cost.
Regards,
Arun.