On 12/19/25 3:57 PM, Donet Tom wrote:
On 12/12/25 2:34 PM, Christian König wrote:
On 12/12/25 07:40, Donet Tom wrote:
The ctl_stack_size and wg_data_size values are used to compute the total context save/restore buffer size and the control stack size. These buffers are programmed into the GPU and are used to store the queue state during context save and restore.

Currently, both ctl_stack_size and wg_data_size are aligned to the CPU PAGE_SIZE. On systems with a non-4K CPU page size, this causes unnecessary memory waste because the GPU internally calculates and uses buffer sizes aligned to a fixed 4K GPU page size.

Since the control stack and context save/restore buffers are consumed by the GPU, their sizes should be aligned to the GPU page size (4K), not the CPU page size. This patch updates the alignment of ctl_stack_size and wg_data_size to prevent over-allocation on systems with larger CPU page sizes.
As far as I know, the problem is that the debugger needs to consume that stuff on the CPU side as well.
Thank you for your help.
As mentioned earlier, we were observing some queue preemption and GPU hang issues. To address them, we introduced this patch, and after applying patches 7/8 and 8/8, those issues have not been observed again.
While debugging the GPU hang issue, I made some additional observations.
On my system, I booted a kernel with a 4 KB system page size and
modified both the ROCR runtime and the GPU driver to set the control
stack size to 64 KB. Even on a 4 KB page-size system, using a 64 KB
control stack size reliably reproduces the queue preemption failure
when running RCCL unit tests on 8 GPUs. This suggests that the issue
is not related to the system page size, but rather to the control
stack size being exactly 64 KB.
When the control stack size is set to 64 KB ± 4 KB, the tests pass on
both 4 KB and 64 KB system page-size configurations.
For gfxv9, is there any documented hardware limitation on the control
stack size? Specifically, is it valid to use a control stack size of
exactly 64 KB?
I have one more question based on my understanding of the code. The
control stack size depends on the number of CUs and waves. For GFXv9,
what is the maximum possible control stack size? Can it reach 64K?
For GFX10, I’ve seen that the control stack size must be less than or
equal to 0x7000. Is there a similar limitation for GFXv9?
I’m asking because, with both 4K and 64K page sizes, I’m seeing queue
preemption failures on GFXv9 when the control stack size is set to 64K.
I need to double check that, but I think the alignment is correct as
it is.
The control stack is part of the context save/restore buffer, and we configure it on the GPU as shown below:

m->cp_hqd_ctx_save_base_addr_lo = lower_32_bits(q->ctx_save_restore_area_address);
m->cp_hqd_ctx_save_base_addr_hi = upper_32_bits(q->ctx_save_restore_area_address);
m->cp_hqd_ctx_save_size = q->ctx_save_restore_area_size;
m->cp_hqd_cntl_stack_size = q->ctl_stack_size;
m->cp_hqd_cntl_stack_offset = q->ctl_stack_size;
m->cp_hqd_wg_state_offset = q->ctl_stack_size;
The control stack occupies the region from cp_hqd_cntl_stack_offset down to 0 within the context save/restore area, and the remaining space is used for WG state. This buffer is fully managed by the GPU during preemption and restore operations.
The control stack size is calculated based on hardware configuration
(CU count and wave count). For example, on gfxv9, the size is
typically around 32 KB. If we align this size to the system page size
(e.g., 64 KB), two issues arise:
1. Unnecessary memory overhead.
2. Potential queue preemption issues.
On the CPU side, we copy the control stack contents to other buffers
for processing. Since the control stack size is derived from hardware
configuration, aligning it to the GPU page size seems more
appropriate. Aligning to the system page size would waste memory
without adding value. Using GPU page size alignment ensures
consistency with hardware and avoids unnecessary overhead.
Would you agree that aligning the control stack size to the GPU page
size is the right approach? Or do you see any concerns with this method?
Regards,
Christian.
Signed-off-by: Donet Tom <[email protected]>
---
drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
index dc857450fa16..00ab941c3e86 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
@@ -445,10 +445,11 @@ void kfd_queue_ctx_save_restore_size(struct kfd_topology_device *dev)
 		min(cu_num * 40, props->array_count / props->simd_arrays_per_engine * 512)
 		: cu_num * 32;
-	wg_data_size = ALIGN(cu_num * WG_CONTEXT_DATA_SIZE_PER_CU(gfxv, props), PAGE_SIZE);
+	wg_data_size = ALIGN(cu_num * WG_CONTEXT_DATA_SIZE_PER_CU(gfxv, props),
+			     AMDGPU_GPU_PAGE_SIZE);
 	ctl_stack_size = wave_num * CNTL_STACK_BYTES_PER_WAVE(gfxv) + 8;
 	ctl_stack_size = ALIGN(SIZEOF_HSA_USER_CONTEXT_SAVE_AREA_HEADER + ctl_stack_size,
-			       PAGE_SIZE);
+			       AMDGPU_GPU_PAGE_SIZE);
 	if ((gfxv / 10000 * 10000) == 100000) {
 		/* HW design limits control stack size to 0x7000.
@@ -460,7 +461,7 @@ void kfd_queue_ctx_save_restore_size(struct kfd_topology_device *dev)
 	props->ctl_stack_size = ctl_stack_size;
 	props->debug_memory_size = ALIGN(wave_num * DEBUGGER_BYTES_PER_WAVE, DEBUGGER_BYTES_ALIGN);
-	props->cwsr_size = ctl_stack_size + wg_data_size;
+	props->cwsr_size = ALIGN(ctl_stack_size + wg_data_size, PAGE_SIZE);
 	if (gfxv == 80002) /* GFX_VERSION_TONGA */
 		props->eop_buffer_size = 0x8000;