+}
+
+/*
+ * Augmented rbtree callbacks: each node caches subtree_max_alignment,
+ * the maximum gpu_buddy_block_offset_alignment() over its subtree, so
+ * searches can prune subtrees that cannot satisfy an alignment request.
+ */
+RB_DECLARE_CALLBACKS_MAX(static, gpu_buddy_augment_cb,
+			 struct gpu_buddy_block, rb,
+			 unsigned int, subtree_max_alignment,
+			 gpu_buddy_block_offset_alignment);
+
static struct gpu_buddy_block *gpu_block_alloc(struct gpu_buddy *mm,
struct gpu_buddy_block *parent,
unsigned int order,
@@ -31,6 +41,9 @@ static struct gpu_buddy_block *gpu_block_alloc(struct gpu_buddy *mm,
block->header |= order;
block->parent = parent;
+ block->subtree_max_alignment =
+ gpu_buddy_block_offset_alignment(block);
+
RB_CLEAR_NODE(&block->rb);
BUG_ON(block->header & GPU_BUDDY_HEADER_UNUSED);
@@ -67,26 +80,42 @@ static bool rbtree_is_empty(struct rb_root *root)
return RB_EMPTY_ROOT(root);
}
-static bool gpu_buddy_block_offset_less(const struct gpu_buddy_block *block,
-					const struct gpu_buddy_block *node)
-{
-	return gpu_buddy_block_offset(block) < gpu_buddy_block_offset(node);
-}
-
-static bool rbtree_block_offset_less(struct rb_node *block,
-				     const struct rb_node *node)
-{
-	return gpu_buddy_block_offset_less(rbtree_get_free_block(block),
-					   rbtree_get_free_block(node));
-}
-
static void rbtree_insert(struct gpu_buddy *mm,
struct gpu_buddy_block *block,
enum gpu_buddy_free_tree tree)
{
- rb_add(&block->rb,
- &mm->free_trees[tree][gpu_buddy_block_order(block)],
- rbtree_block_offset_less);
+ struct rb_node **link, *parent = NULL;
+ unsigned int block_alignment, order;
+ struct gpu_buddy_block *node;
+ struct rb_root *root;
+
+ order = gpu_buddy_block_order(block);
+ block_alignment = gpu_buddy_block_offset_alignment(block);
+
+ root = &mm->free_trees[tree][order];
+ link = &root->rb_node;
+
+ while (*link) {
+ parent = *link;
+ node = rbtree_get_free_block(parent);
+ /*
+ * Manual augmentation update during insertion traversal.
Required
+ * because rb_insert_augmented() only calls rotate callback
during
+ * rotations. This ensures all ancestors on the insertion
path have
+ * correct subtree_max_alignment values.
+ */
+ if (node->subtree_max_alignment < block_alignment)
+ node->subtree_max_alignment = block_alignment;
+
+ if (gpu_buddy_block_offset(block) <
gpu_buddy_block_offset(node))
+ link = &parent->rb_left;
+ else
+ link = &parent->rb_right;
+ }
+
+ block->subtree_max_alignment = block_alignment;
+ rb_link_node(&block->rb, parent, link);
+ rb_insert_augmented(&block->rb, root, &gpu_buddy_augment_cb);
}
static void rbtree_remove(struct gpu_buddy *mm,
@@ -99,7 +128,7 @@ static void rbtree_remove(struct gpu_buddy *mm,
tree = get_block_tree(block);
root = &mm->free_trees[tree][order];
- rb_erase(&block->rb, root);
+ rb_erase_augmented(&block->rb, root, &gpu_buddy_augment_cb);
RB_CLEAR_NODE(&block->rb);
}
@@ -790,6 +819,132 @@ alloc_from_freetree(struct gpu_buddy *mm,
return ERR_PTR(err);
}
+/*
+ * Returns true when @size is a power of two smaller than @min_block_size.
+ * Such requests take the offset-aligned allocation path instead of being
+ * rounded up to @min_block_size (see gpu_buddy_alloc_blocks()).
+ */
+static bool
+gpu_buddy_can_offset_align(u64 size, u64 min_block_size)
+{
+	return size < min_block_size && is_power_of_2(size);
+}
+
+/*
+ * Check whether any block in the subtree rooted at @node has an offset
+ * alignment of at least @alignment, using the cached
+ * subtree_max_alignment maintained by gpu_buddy_augment_cb.
+ *
+ * Callers currently test the child pointer before descending, but the
+ * NULL check is kept so the helper is safe on an empty subtree.
+ */
+static bool gpu_buddy_subtree_can_satisfy(struct rb_node *node,
+					  unsigned int alignment)
+{
+	struct gpu_buddy_block *block;
+
+	if (!node)
+		return false;
+
+	block = rbtree_get_free_block(node);
+	return block->subtree_max_alignment >= alignment;
+}
+
+/*
+ * Search the free tree of order @tmp for a block whose offset alignment
+ * is at least @alignment.  The augmented subtree_max_alignment values
+ * let the walk prune subtrees that cannot satisfy the request, trying
+ * the right (higher-offset) subtree first.
+ *
+ * Each tree holds blocks of a single order and the caller only passes
+ * @tmp >= @order, so the order test below is defensive.  @flags is
+ * currently unused here but kept for symmetry with the other lookups.
+ *
+ * Returns the matching block, or NULL if none exists in this tree.
+ */
+static struct gpu_buddy_block *
+gpu_buddy_find_block_aligned(struct gpu_buddy *mm,
+			     enum gpu_buddy_free_tree tree,
+			     unsigned int order,
+			     unsigned int tmp,
+			     unsigned int alignment,
+			     unsigned long flags)
+{
+	struct rb_root *root = &mm->free_trees[tree][tmp];
+	struct rb_node *rb = root->rb_node;
+
+	while (rb) {
+		struct gpu_buddy_block *block = rbtree_get_free_block(rb);
+		struct rb_node *left_node = rb->rb_left;
+		struct rb_node *right_node = rb->rb_right;
+
+		if (gpu_buddy_subtree_can_satisfy(right_node, alignment)) {
+			rb = right_node;
+			continue;
+		}
+
+		/*
+		 * Use gpu_buddy_block_offset_alignment() rather than raw
+		 * __ffs() on the offset: __ffs(0) is undefined, and the
+		 * helper matches the units stored in subtree_max_alignment.
+		 */
+		if (gpu_buddy_block_order(block) >= order &&
+		    gpu_buddy_block_offset_alignment(block) >= alignment)
+			return block;
+
+		if (gpu_buddy_subtree_can_satisfy(left_node, alignment)) {
+			rb = left_node;
+			continue;
+		}
+
+		break;
+	}
+
+	return NULL;
+}
+
+/*
+ * Allocate a block of exactly @size (a power of two smaller than
+ * @min_block_size) whose offset is aligned to @min_block_size, without
+ * rounding the size up.  Finds an aligned free block of at least the
+ * needed order, then splits down, always keeping an aligned child.
+ *
+ * Returns the allocated block, or ERR_PTR(-ENOSPC)/ERR_PTR(err) on
+ * failure.
+ */
+static struct gpu_buddy_block *
+gpu_buddy_offset_aligned_allocation(struct gpu_buddy *mm,
+				    u64 size,
+				    u64 min_block_size,
+				    unsigned long flags)
+{
+	struct gpu_buddy_block *block = NULL;
+	unsigned int order, tmp, alignment;
+	struct gpu_buddy_block *buddy;
+	enum gpu_buddy_free_tree tree;
+	unsigned long pages;
+	int err;
+
+	alignment = ilog2(min_block_size);
+	pages = size >> ilog2(mm->chunk_size);
+	order = fls(pages) - 1;
+
+	tree = (flags & GPU_BUDDY_CLEAR_ALLOCATION) ?
+		GPU_BUDDY_CLEAR_TREE : GPU_BUDDY_DIRTY_TREE;
+
+	/*
+	 * Walk orders from the requested one upwards, trying the preferred
+	 * tree first, then the other tree at the same order.
+	 */
+	for (tmp = order; tmp <= mm->max_order; ++tmp) {
+		block = gpu_buddy_find_block_aligned(mm, tree, order,
+						     tmp, alignment, flags);
+		if (!block) {
+			tree = (tree == GPU_BUDDY_CLEAR_TREE) ?
+				GPU_BUDDY_DIRTY_TREE : GPU_BUDDY_CLEAR_TREE;
+			block = gpu_buddy_find_block_aligned(mm, tree, order,
+							     tmp, alignment,
+							     flags);
+		}
+
+		if (block)
+			break;
+	}
+
+	if (!block)
+		return ERR_PTR(-ENOSPC);
+
+	/* Split down to the requested order, keeping an aligned child. */
+	while (gpu_buddy_block_order(block) > order) {
+		struct gpu_buddy_block *left, *right;
+
+		err = split_block(mm, block);
+		if (unlikely(err))
+			goto err_undo;
+
+		left = block->left;
+		right = block->right;
+
+		/*
+		 * Prefer the right child when it is sufficiently aligned.
+		 * Use the alignment helper rather than raw __ffs() for
+		 * consistency with the tree search (and __ffs(0) safety).
+		 */
+		if (gpu_buddy_block_offset_alignment(right) >= alignment)
+			block = right;
+		else
+			block = left;
+	}
+
+	return block;
+
+err_undo:
+	/*
+	 * We really don't want to leave around a bunch of split blocks,
+	 * since bigger is better, so make sure we merge everything back
+	 * before we free the allocated blocks.
+	 */
+	buddy = __get_buddy(block);
+	if (buddy &&
+	    (gpu_buddy_block_is_free(block) &&
+	     gpu_buddy_block_is_free(buddy)))
+		__gpu_buddy_free(mm, block, false);
+	return ERR_PTR(err);
+}
+
static int __alloc_range(struct gpu_buddy *mm,
struct list_head *dfs,
u64 start, u64 size,
@@ -1059,6 +1214,7 @@ EXPORT_SYMBOL(gpu_buddy_block_trim);
static struct gpu_buddy_block *
__gpu_buddy_alloc_blocks(struct gpu_buddy *mm,
u64 start, u64 end,
+ u64 size, u64 min_block_size,
unsigned int order,
unsigned long flags)
{
@@ -1066,6 +1222,11 @@ __gpu_buddy_alloc_blocks(struct gpu_buddy *mm,
/* Allocate traversing within the range */
return __gpu_buddy_alloc_range_bias(mm, start, end,
order, flags);
+ else if (size < min_block_size)
+		/* Allocate from an offset-aligned region without size rounding */
+ return gpu_buddy_offset_aligned_allocation(mm, size,
+ min_block_size,
+ flags);
else
/* Allocate from freetree */
return alloc_from_freetree(mm, order, flags);
@@ -1137,8 +1298,11 @@ int gpu_buddy_alloc_blocks(struct gpu_buddy *mm,
if (flags & GPU_BUDDY_CONTIGUOUS_ALLOCATION) {
size = roundup_pow_of_two(size);
min_block_size = size;
- /* Align size value to min_block_size */
- } else if (!IS_ALIGNED(size, min_block_size)) {
+	/*
+	 * Normalize the requested size to min_block_size for regular
+	 * allocations.  Offset-aligned allocations intentionally skip
+	 * size rounding.
+	 */
+ } else if (!gpu_buddy_can_offset_align(size, min_block_size)) {
size = round_up(size, min_block_size);
}
@@ -1158,43 +1322,60 @@ int gpu_buddy_alloc_blocks(struct gpu_buddy *mm,
do {
order = min(order, (unsigned int)fls(pages) - 1);
BUG_ON(order > mm->max_order);
- BUG_ON(order < min_order);
+		/*
+		 * Regular allocations must not allocate blocks smaller than
+		 * min_block_size.  Offset-aligned allocations deliberately
+		 * bypass this constraint.
+		 */
+ BUG_ON(size >= min_block_size && order < min_order);
do {
+ unsigned int fallback_order;
+
block = __gpu_buddy_alloc_blocks(mm, start,
end,
+ size,
+ min_block_size,
order,
flags);
if (!IS_ERR(block))
break;
- if (order-- == min_order) {
- /* Try allocation through force merge method */
- if (mm->clear_avail &&
- !__force_merge(mm, start, end, min_order)) {
- block = __gpu_buddy_alloc_blocks(mm, start,
- end,
- min_order,
- flags);
- if (!IS_ERR(block)) {
- order = min_order;
- break;
- }
- }
+ if (size < min_block_size) {
+ fallback_order = order;
+ } else if (order == min_order) {
+ fallback_order = min_order;
+ } else {
+ order--;
+ continue;
+ }
- /*
- * Try contiguous block allocation through
- * try harder method.
- */
- if (flags & GPU_BUDDY_CONTIGUOUS_ALLOCATION &&
- !(flags & GPU_BUDDY_RANGE_ALLOCATION))
- return __alloc_contig_try_harder(mm,
- original_size,
- original_min_size,
- blocks);
- err = -ENOSPC;
- goto err_free;
+ /* Try allocation through force merge method */
+ if (mm->clear_avail &&
+ !__force_merge(mm, start, end, fallback_order)) {
+ block = __gpu_buddy_alloc_blocks(mm, start,
+ end,
+ size,
+ min_block_size,
+ fallback_order,
+ flags);
+ if (!IS_ERR(block)) {
+ order = fallback_order;
+ break;
+ }
}
+
+ /*
+ * Try contiguous block allocation through
+ * try harder method.
+ */
+ if (flags & GPU_BUDDY_CONTIGUOUS_ALLOCATION &&
+ !(flags & GPU_BUDDY_RANGE_ALLOCATION))
+ return __alloc_contig_try_harder(mm,
+ original_size,
+ original_min_size,
+ blocks);
+ err = -ENOSPC;
+ goto err_free;
} while (1);
mark_allocated(mm, block);
diff --git a/include/linux/gpu_buddy.h b/include/linux/gpu_buddy.h
index 07ac65db6d2e..7ad817c69ec6 100644
--- a/include/linux/gpu_buddy.h
+++ b/include/linux/gpu_buddy.h
@@ -11,6 +11,7 @@
#include <linux/slab.h>
#include <linux/sched.h>
#include <linux/rbtree.h>
+#include <linux/rbtree_augmented.h>
#define GPU_BUDDY_RANGE_ALLOCATION BIT(0)
#define GPU_BUDDY_TOPDOWN_ALLOCATION BIT(1)
@@ -58,6 +59,7 @@ struct gpu_buddy_block {
};
struct list_head tmp_link;
+ unsigned int subtree_max_alignment;
};
/* Order-zero must be at least SZ_4K */
base-commit: 9d757669b2b22cd224c334924f798393ffca537c