On 12/19/25 3:57 PM, Donet Tom wrote:
On 12/12/25 2:34 PM, Christian König wrote:
On 12/12/25 07:40, Donet Tom wrote:
The ctl_stack_size and wg_data_size values are used to compute the total context save/restore buffer size and the control stack size. These buffers are programmed into the GPU and are used to store the queue state during context save and restore.

Currently, both ctl_stack_size and wg_data_size are aligned to the CPU PAGE_SIZE. On systems with a non-4K CPU page size, this causes unnecessary memory waste because the GPU internally calculates and uses buffer sizes aligned to a fixed 4K GPU page size.

Since the control stack and context save/restore buffers are consumed by the GPU, their sizes should be aligned to the GPU page size (4K), not the CPU page size. This patch updates the alignment of ctl_stack_size and wg_data_size to prevent over-allocation on systems with larger CPU page sizes.
As far as I know, the problem is that the debugger needs to consume that stuff on the CPU side as well.
Thank you for your help.
As mentioned earlier, we were observing some queue preemption and GPU hang issues. To address them, we introduced this patch, and after applying patches 7/8 and 8/8, those issues have not been observed again.
While debugging the GPU hang issue, I made some additional observations.
On my system, I booted a kernel with a 4 KB system page size and
modified both the ROCR runtime and the GPU driver to set the control
stack size to 64 KB. Even on a 4 KB page-size system, using a 64 KB
control stack size reliably reproduces the queue preemption failure
when running RCCL unit tests on 8 GPUs. This suggests that the issue
is not related to the system page size, but rather to the control
stack size being exactly 64 KB.
When the control stack size is set to 64 KB ± 4 KB, the tests pass on
both 4 KB and 64 KB system page-size configurations.
For gfxv9, is there any documented hardware limitation on the control
stack size? Specifically, is it valid to use a control stack size of
exactly 64 KB?
I have one more question based on my understanding of the code. The
control stack size depends on the number of CUs and waves. For GFXv9,
what is the maximum possible control stack size? Can it reach 64K?
For GFX10, I’ve seen that the control stack size must be less than or
equal to 0x7000. Is there a similar limitation for GFXv9?
I’m asking because, with both 4K and 64K page sizes, I’m seeing queue
preemption failures on GFXv9 when the control stack size is set to 64K.
I need to double check that, but I think the alignment is correct as
it is.
The control stack is part of the context save/restore buffer, and we configure it on the GPU as shown below:

m->cp_hqd_ctx_save_base_addr_lo = lower_32_bits(q->ctx_save_restore_area_address);
m->cp_hqd_ctx_save_base_addr_hi = upper_32_bits(q->ctx_save_restore_area_address);
m->cp_hqd_ctx_save_size = q->ctx_save_restore_area_size;
m->cp_hqd_cntl_stack_size = q->ctl_stack_size;
m->cp_hqd_cntl_stack_offset = q->ctl_stack_size;
m->cp_hqd_wg_state_offset = q->ctl_stack_size;
The control stack occupies the region from cp_hqd_cntl_stack_offset down to 0 within the context save/restore area, and the remaining space is used for WG state. This buffer is fully managed by the GPU during preemption and restore operations.
The control stack size is calculated based on hardware configuration
(CU count and wave count). For example, on gfxv9, the size is
typically around 32 KB. If we align this size to the system page size
(e.g., 64 KB), two issues arise:
1. Unnecessary memory overhead.
2. Potential queue preemption issues.
On the CPU side, we copy the control stack contents to other buffers
for processing. Since the control stack size is derived from hardware
configuration, aligning it to the GPU page size seems more
appropriate. Aligning to the system page size would waste memory
without adding value. Using GPU page size alignment ensures
consistency with hardware and avoids unnecessary overhead.
Would you agree that aligning the control stack size to the GPU page
size is the right approach? Or do you see any concerns with this method?
Regards,
Christian.
Signed-off-by: Donet Tom <[email protected]>
---
drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
index dc857450fa16..00ab941c3e86 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
@@ -445,10 +445,11 @@ void kfd_queue_ctx_save_restore_size(struct kfd_topology_device *dev)
 		min(cu_num * 40, props->array_count / props->simd_arrays_per_engine * 512)
 		: cu_num * 32;
-	wg_data_size = ALIGN(cu_num * WG_CONTEXT_DATA_SIZE_PER_CU(gfxv, props), PAGE_SIZE);
+	wg_data_size = ALIGN(cu_num * WG_CONTEXT_DATA_SIZE_PER_CU(gfxv, props),
+			     AMDGPU_GPU_PAGE_SIZE);
 	ctl_stack_size = wave_num * CNTL_STACK_BYTES_PER_WAVE(gfxv) + 8;
 	ctl_stack_size = ALIGN(SIZEOF_HSA_USER_CONTEXT_SAVE_AREA_HEADER + ctl_stack_size,
-			       PAGE_SIZE);
+			       AMDGPU_GPU_PAGE_SIZE);
 	if ((gfxv / 10000 * 10000) == 100000) {
 		/* HW design limits control stack size to 0x7000.
@@ -460,7 +461,7 @@ void kfd_queue_ctx_save_restore_size(struct kfd_topology_device *dev)
 	props->ctl_stack_size = ctl_stack_size;
 	props->debug_memory_size = ALIGN(wave_num * DEBUGGER_BYTES_PER_WAVE, DEBUGGER_BYTES_ALIGN);
-	props->cwsr_size = ctl_stack_size + wg_data_size;
+	props->cwsr_size = ALIGN(ctl_stack_size + wg_data_size, PAGE_SIZE);
 	if (gfxv == 80002) /* GFX_VERSION_TONGA */
 		props->eop_buffer_size = 0x8000;