On 1/15/2025 1:59 PM, Lazar, Lijo wrote:
> 
> 
> On 1/14/2025 10:51 PM, Kim, Jonathan wrote:
>> [Public]
>>
>>> -----Original Message-----
>>> From: Lazar, Lijo <[email protected]>
>>> Sent: Friday, January 10, 2025 10:37 PM
>>> To: Kim, Jonathan <[email protected]>; [email protected]
>>> Cc: Kasiviswanathan, Harish <[email protected]>
>>> Subject: Re: [PATCH] drm/amdgpu: fix gpu recovery disable with per queue 
>>> reset
>>>
>>>
>>>
>>> On 1/11/2025 2:53 AM, Kim, Jonathan wrote:
>>>> [Public]
>>>>
>>>>> -----Original Message-----
>>>>> From: Lazar, Lijo <[email protected]>
>>>>> Sent: Friday, January 10, 2025 11:29 AM
>>>>> To: Kim, Jonathan <[email protected]>; [email protected]
>>>>> Cc: Kasiviswanathan, Harish <[email protected]>
>>>>> Subject: Re: [PATCH] drm/amdgpu: fix gpu recovery disable with per queue 
>>>>> reset
>>>>>
>>>>>
>>>>>
>>>>> On 1/10/2025 9:43 PM, Kim, Jonathan wrote:
>>>>>> [Public]
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Lazar, Lijo <[email protected]>
>>>>>>> Sent: Thursday, January 9, 2025 10:39 PM
>>>>>>> To: Kim, Jonathan <[email protected]>; [email protected]
>>>>>>> Cc: Kasiviswanathan, Harish <[email protected]>
>>>>>>> Subject: Re: [PATCH] drm/amdgpu: fix gpu recovery disable with per
>>>>>>> queue reset
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 1/9/2025 8:27 PM, Kim, Jonathan wrote:
>>>>>>>> [Public]
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Lazar, Lijo <[email protected]>
>>>>>>>>> Sent: Thursday, January 9, 2025 1:14 AM
>>>>>>>>> To: Kim, Jonathan <[email protected]>;
>>>>>>>>> [email protected]
>>>>>>>>> Cc: Kasiviswanathan, Harish <[email protected]>
>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: fix gpu recovery disable with per
>>>>>>>>> queue reset
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 1/9/2025 1:31 AM, Jonathan Kim wrote:
>>>>>>>>>> Per queue reset should be bypassed when gpu recovery is disabled
>>>>>>>>>> with module parameter.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Jonathan Kim <[email protected]>
>>>>>>>>>> ---
>>>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c | 6 ++++++
>>>>>>>>>>  1 file changed, 6 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
>>>>>>>>>> index cc66ebb7bae1..441568163e20 100644
>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
>>>>>>>>>> @@ -1131,6 +1131,9 @@ uint64_t kgd_gfx_v9_hqd_get_pq_addr(struct
>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>     uint32_t low, high;
>>>>>>>>>>     uint64_t queue_addr = 0;
>>>>>>>>>>
>>>>>>>>>> +   if (!amdgpu_gpu_recovery)
>>>>>>>>>> +           return 0;
>>>>>>>>>> +
>>>>>>>>>>     kgd_gfx_v9_acquire_queue(adev, pipe_id, queue_id, inst);
>>>>>>>>>>     amdgpu_gfx_rlc_enter_safe_mode(adev, inst);
>>>>>>>>>>
>>>>>>>>>> @@ -1179,6 +1182,9 @@ uint64_t kgd_gfx_v9_hqd_reset(struct
>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>     uint32_t low, high, pipe_reset_data = 0;
>>>>>>>>>>     uint64_t queue_addr = 0;
>>>>>>>>>>
>>>>>>>>>> +   if (!amdgpu_gpu_recovery)
>>>>>>>>>> +           return 0;
>>>>>>>>>> +
>>>>>>>>>
>>>>>>>>> I think the right place for this check is not inside the callback;
>>>>>>>>> it should be at the place where the callback gets called.
>>>>>>>>
>>>>>>>> I don't think it really matters.  We're going to have different reset
>>>>>>>> types in the future that may come from different callers.
>>>>>>>> It's probably easier to remember to put the bypass where the reset is
>>>>>>>> actually happening.
>>>>>>>>
>>>>>>>
>>>>>>> If I want to define something like amdgpu_gpu_recovery=2 (don't do queue
>>>>>>> reset but perform other resets), then it matters.
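>>>>>>>
>>>>>>> Roughly, at the call site (a sketch only; the '== 2' mode and the
>>>>>>> surrounding caller shape are hypothetical, not existing code):
>>>>>>>
>>>>>>>         /* skip per-queue reset but still allow other recovery paths */
>>>>>>>         if (!amdgpu_gpu_recovery || amdgpu_gpu_recovery == 2)
>>>>>>>                 return -ECANCELED;
>>>>>>>
>>>>>>>         queue_addr = adev->kfd2kgd->hqd_reset(adev, pipe_id, queue_id,
>>>>>>>                                               inst, wait_time);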
>>>>>>
>>>>>> I don't get why that matters.
>>>>>> This callback alone, for example, calls 2 types of resets within itself
>>>>>> (queue then pipe).
>>>>>> If you wanted partial resets between queue and pipe in this case, you'd
>>>>>> have to branch test within the callback itself.
>>>>>> GPU reset bypass checks are invisible to the KFD section of the code as
>>>>>> well.
>>>>>>
>>>>>>>
>>>>>>> Since this is a callback, keeping the check at the wrapper may be more
>>>>>>> maintainable than duplicating it for gfx10/11/12 etc.
>>>>>>
>>>>>> Maybe not.  MES preemption checks are not like MEC preemption checks at
>>>>>> all.
>>>>>> And we probably don't want to jam other future IP resets into a single
>>>>>> caller.
>>>>>> If you look at the kfd2kgd callbacks, most are pretty much copy and
>>>>>> paste from one generation to another.
>>>>>> I don't see how putting the test in the callback makes it less
>>>>>> maintainable.
>>>>>>
>>>>>
>>>>> My thought process was that this could be put in
>>>>> reset_queues_on_hws_hang, and anything similar would handle MES-based
>>>>> queue resets. You are saying there won't be anything like
>>>>> reset_queue_by_mes() for MES-based resets. Is that correct?
>>>>
>>>> No, the opposite.  But now we'd have to remember to put it in 2 places,
>>>> where there's still no visible test for gpu reset bypass in the same file.
>>>> SDMA resets are also being implemented and will probably have to be
>>>> called in different places in the KFD as well.
>>>> We can look at consolidating this later as more devices and IPs get done,
>>>> if it makes sense to abstract this stuff.
>>>> My point is, the callback does the reset and returns a reset status.
>>>> Bypassing by fail return seems easier to remember and leverage.
>>>
>>> Ok, we have SDMA queue reset called from job timeouts. If it's getting
>>> called from KFD too, we will look at consolidating that one.
>>>
>>> BTW, if this is returning a fake success, won't it result in a print like
>>> "queue reset succeeded", which gives the false impression that a queue
>>> reset happened? Or should it return a different error code, like
>>> 'ECANCELED', since the operation is intentionally skipped through the
>>> module param?
>>
>> Well ... the call is supposed to return the address of the queue that got
>> reset, where a null return indicates "no queue got reset".
> 
> We should also somehow indicate to the user that the queue indeed ran
> into a reset situation. Not sure if that is taken care of when the
> address is returned as NULL.
> 
>> I'm thinking, to make this simpler, maybe we change reset_queues_on_hws
>> into a wrapper that takes a queue type input (compute, sdma, etc.) and
>> branches to the right reset call based on that type.
>> That way we only need one place to do the gpu_recovery enablement check
>> in the KFD, and the KFD has the flexibility to call this wrapper wherever
>> it wants without having to worry about the module parameter status in
>> the future.
> 
> Yes, this was the intention of the original comment - to add it at the
> caller rather than inside the callback. It provides one-place (or
> fewer-places) control of the usage.
> 
> While at it, I suggest using amdgpu_device_should_recover_gpu(). It
> gives an info message if recovery is disabled, and we could control
> different meanings of this param (e.g. gpu_recovery = 2 to skip sdma
> queue reset alone) through the same function.
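>
> I.e., at the call site, something along these lines (sketch only; the
> caller shape is illustrative):
>
>         /* prints an info message when recovery is disabled */
>         if (!amdgpu_device_should_recover_gpu(adev))
>                 return -ECANCELED;
>
>         queue_addr = adev->kfd2kgd->hqd_reset(adev, pipe_id, queue_id,
>                                               inst, wait_time);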
> 

As an afterthought, I realized that this function doesn't have a param
for the type of recovery attempted. One may be added later, if at all we
hit a situation that needs new param values, like skipping reset for
sdma alone. I would still recommend using the same function instead of
checking the param value directly.
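
For illustration only, a hypothetical type-aware variant could look like
the following (none of this exists today; the enum, the values, and the
helper name are all made up):

        enum amdgpu_reset_kind {
                AMDGPU_RESET_KIND_FULL,
                AMDGPU_RESET_KIND_QUEUE,
                AMDGPU_RESET_KIND_SDMA_QUEUE,
        };

        static bool amdgpu_device_should_recover_gpu_type(struct amdgpu_device *adev,
                                                          enum amdgpu_reset_kind kind)
        {
                /* e.g. gpu_recovery == 2 could mean: skip sdma queue reset alone */
                if (amdgpu_gpu_recovery == 2 &&
                    kind == AMDGPU_RESET_KIND_SDMA_QUEUE)
                        return false;

                return amdgpu_device_should_recover_gpu(adev);
        }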

Thanks,
Lijo

> Thanks,
> Lijo
> 
>>
>> Jon
>>
>>>
>>> Thanks,
>>> Lijo
>>>
>>>> That being said, putting the test in hqd_get_pq_addr was probably
>>>> overkill, but I don't think anyone really cares to use it with gpu
>>>> recovery off on its own at the moment.
>>>>
>>>> Jon
>>>>
>>>>>
>>>>> Thanks,
>>>>> Lijo
>>>>>
>>>>>> Jon
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Lijo
>>>>>>>
>>>>>>>> Jon
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Lijo
>>>>>>>>>
>>>>>>>>>>     kgd_gfx_v9_acquire_queue(adev, pipe_id, queue_id, inst);
>>>>>>>>>>     amdgpu_gfx_rlc_enter_safe_mode(adev, inst);
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>
> 
