Maybe we can turn this check into a warnings if, and only if the
exp_hw_support module param is set. That way we don't water down the
checks on the production code path but allow experimental setups to run
without a seat belt.
Regards,
Felix
On 2024-08-09 01:39, Zhang, Yifan wrote:
[AMD Official Use Only - AMD Internal Distribution Only]
Yes, I think we need that change for a normal code path, but this case is introduced only
with the HSA_OVERRIDE_GFX_VERSION environment setting, which implies that "the
override ASIC is compatible with the real ASIC." It is intended for experimental
purposes. When a user is using HSA_OVERRIDE_GFX_VERSION, they should be aware of the
potential risks it may bring. Usually, HSA_OVERRIDE_GFX_VERSION is used to force an
unsupported APU to be recognized as a ROCm-supported high-end dGPU, which has a large
cwsr save area, making the operation safe. This check was added to KFD two weeks ago, the
HSA_OVERRIDE_GFX_VERSION environment had been working fine before that.
Best Regards,
Yifan
-----Original Message-----
From: Kasiviswanathan, Harish <[email protected]>
Sent: Thursday, August 8, 2024 10:46 PM
To: Zhang, Yifan <[email protected]>; [email protected]
Cc: Kuehling, Felix <[email protected]>; Yang, Philip <[email protected]>;
Zhang, Yifan <[email protected]>
Subject: RE: [PATCH] drm/amdkfd: keep create queue success if cwsr save area
doesn't match
[AMD Official Use Only - AMD Internal Distribution Only]
In this case, shouldn't larger of two sizes be used. Also, we should have an
upper bound check.
-----Original Message-----
From: amd-gfx <[email protected]> On Behalf Of Yifan Zhang
Sent: Thursday, August 8, 2024 4:44 AM
To: [email protected]
Cc: Kuehling, Felix <[email protected]>; Yang, Philip <[email protected]>;
Zhang, Yifan <[email protected]>
Subject: [PATCH] drm/amdkfd: keep create queue success if cwsr save area
doesn't match
If HSA_OVERRIDE_GFX_VERSION is used in ROCm workload, user space and kernel use
different spec to calculate cwsr save area, current check may fail create queue
ioctl. Change error to warn to make create queue succeed in that case.
Signed-off-by: Yifan Zhang <[email protected]>
---
drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
index e0a073ae4a49..9f283aff057a 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
@@ -295,11 +295,9 @@ int kfd_queue_acquire_buffers(struct kfd_process_device
*pdd, struct queue_prope
}
if (properties->ctx_save_restore_area_size !=
topo_dev->node_props.cwsr_size) {
- pr_debug("queue cwsr size 0x%x not equal to node cwsr size
0x%x\n",
+ pr_warn("queue cwsr size 0x%x not equal to node cwsr
+ size 0x%x\n",
properties->ctx_save_restore_area_size,
topo_dev->node_props.cwsr_size);
- err = -EINVAL;
- goto out_err_unreserve;
}
total_cwsr_size = (topo_dev->node_props.cwsr_size +
topo_dev->node_props.debug_memory_size)
--
2.37.3