Hang on a second. If there are production GPUs that only work with 
HSA_OVERRIDE_GFX_VERSION right now, then we should make those GPUs properly 
supported. I thought this was only used internally for bring-up or maybe 
externally as a short-term solution before we upstream proper support for new 
GPUs.

Regards,
  Felix


On 2024-08-11 22:10, Zhang, Yifan wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
> I agree that adding exp_hw_support is a safer approach. My concern is that 
> HSA_OVERRIDE_GFX_VERSION has been used for a while and has become a status 
> quo for running ROCm on unsupported APUs. I'm not sure if this approach will 
> be a burden for APU end users. Adding driver load parameters is more 
> complicated than simply adding an environment variable on consumer PCs.
> 
> Best Regards,
> Yifan
> 
> -----Original Message-----
> From: Kuehling, Felix <felix.kuehl...@amd.com>
> Sent: Saturday, August 10, 2024 7:37 AM
> To: Zhang, Yifan <yifan1.zh...@amd.com>; Kasiviswanathan, Harish 
> <harish.kasiviswanat...@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Yang, Philip <philip.y...@amd.com>
> Subject: Re: [PATCH] drm/amdkfd: keep create queue success if cwsr save area 
> doesn't match
> 
> Maybe we can turn this check into a warnings if, and only if the 
> exp_hw_support module param is set. That way we don't water down the checks 
> on the production code path but allow experimental setups to run without a 
> seat belt.
> 
> Regards,
>    Felix
> 
> 
> On 2024-08-09 01:39, Zhang, Yifan wrote:
>> [AMD Official Use Only - AMD Internal Distribution Only]
>>
>> Yes, I think we need that change for a normal code path, but this case is 
>> introduced only with the HSA_OVERRIDE_GFX_VERSION environment setting, which 
>> implies that "the override ASIC is compatible with the real ASIC." It is 
>> intended for experimental purposes. When a user is using 
>> HSA_OVERRIDE_GFX_VERSION, they should be aware of the potential risks it may 
>> bring. Usually, HSA_OVERRIDE_GFX_VERSION is used to force an unsupported APU 
>> to be recognized as a ROCm-supported high-end dGPU, which has a large cwsr 
>> save area, making the operation safe. This check was added to KFD two weeks 
>> ago, the HSA_OVERRIDE_GFX_VERSION environment had been working fine before 
>> that.
>>
>> Best Regards,
>> Yifan
>>
>> -----Original Message-----
>> From: Kasiviswanathan, Harish <harish.kasiviswanat...@amd.com>
>> Sent: Thursday, August 8, 2024 10:46 PM
>> To: Zhang, Yifan <yifan1.zh...@amd.com>; amd-gfx@lists.freedesktop.org
>> Cc: Kuehling, Felix <felix.kuehl...@amd.com>; Yang, Philip
>> <philip.y...@amd.com>; Zhang, Yifan <yifan1.zh...@amd.com>
>> Subject: RE: [PATCH] drm/amdkfd: keep create queue success if cwsr
>> save area doesn't match
>>
>> [AMD Official Use Only - AMD Internal Distribution Only]
>>
>> In this case, shouldn't larger of two sizes be used. Also, we should have an 
>> upper bound check.
>>
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of
>> Yifan Zhang
>> Sent: Thursday, August 8, 2024 4:44 AM
>> To: amd-gfx@lists.freedesktop.org
>> Cc: Kuehling, Felix <felix.kuehl...@amd.com>; Yang, Philip
>> <philip.y...@amd.com>; Zhang, Yifan <yifan1.zh...@amd.com>
>> Subject: [PATCH] drm/amdkfd: keep create queue success if cwsr save
>> area doesn't match
>>
>> If HSA_OVERRIDE_GFX_VERSION is used in ROCm workload, user space and kernel 
>> use different spec to calculate cwsr save area, current check may fail 
>> create queue ioctl. Change error to warn to make create queue succeed in 
>> that case.
>>
>> Signed-off-by: Yifan Zhang <yifan1.zh...@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 4 +---
>>   1 file changed, 1 insertion(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
>> b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
>> index e0a073ae4a49..9f283aff057a 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
>> @@ -295,11 +295,9 @@ int kfd_queue_acquire_buffers(struct kfd_process_device 
>> *pdd, struct queue_prope
>>          }
>>
>>          if (properties->ctx_save_restore_area_size != 
>> topo_dev->node_props.cwsr_size) {
>> -               pr_debug("queue cwsr size 0x%x not equal to node cwsr size 
>> 0x%x\n",
>> +               pr_warn("queue cwsr size 0x%x not equal to node cwsr
>> + size 0x%x\n",
>>                          properties->ctx_save_restore_area_size,
>>                          topo_dev->node_props.cwsr_size);
>> -               err = -EINVAL;
>> -               goto out_err_unreserve;
>>          }
>>
>>          total_cwsr_size = (topo_dev->node_props.cwsr_size +
>> topo_dev->node_props.debug_memory_size)
>> --
>> 2.37.3
>>
>>

Reply via email to