[AMD Official Use Only]

I see, I didn't notice you already have this implemented. So the flr_work 
routine itself is synced now; in this case, I agree it should be safe to 
remove the in_gpu_reset and reset_sem handling in flr_work.

Regards
Shaoyun.liu

-----Original Message-----
From: Grodzovsky, Andrey <[email protected]> 
Sent: Tuesday, January 4, 2022 3:55 PM
To: Liu, Shaoyun <[email protected]>; Koenig, Christian 
<[email protected]>; Liu, Monk <[email protected]>; Chen, JingWen 
<[email protected]>; Christian König <[email protected]>; 
Deng, Emily <[email protected]>; [email protected]; 
[email protected]; Chen, Horace <[email protected]>
Cc: [email protected]
Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection 
for SRIOV

On 2022-01-04 12:13 p.m., Liu, Shaoyun wrote:

> [AMD Official Use Only]
>
> I mostly agree with the sequence Christian described. Just one thing 
> might need discussion here. For an FLR notified from the host, in the new 
> sequence as described, the driver needs to reply READY_TO_RESET in a work 
> item from the reset work queue, which means that inside flr_work the driver 
> cannot directly reply to the host but needs to queue another work item.
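>
> (The reply in question is the mailbox message flr_work sends, i.e. the
> xgpu_*_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0) call visible
> in the patch quoted at the bottom of this thread.)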


Can you clarify why 'the driver cannot directly reply to the host but needs 
to queue another work item'? To my understanding, all steps 3-6 in Christian's 
description happen serially from the same single wq thread.


>   In the current code, the flr_work for SRIOV is itself a work item queued 
> from the ISR. I think we should try to respond to the host driver as soon as 
> possible; queueing another work item from inside the workqueue doesn't sound 
> efficient to me.


Check patch 5 please [1] - I just substituted
schedule_work(&adev->virt.flr_work) with
queue_work(adev->reset_domain.wq, &adev->virt.flr_work), so there is no extra 
requeue here; instead of being sent to the system_wq, the work is sent to the 
dedicated reset wq.

[1] -
https://lore.kernel.org/all/[email protected]/
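
As a rough illustration of the substitution (sketch only; the actual change 
is in the patch above):

    /* Before: FLR handling was queued to the shared system workqueue */
    schedule_work(&adev->virt.flr_work);

    /* After: FLR handling is queued to the dedicated reset domain
     * workqueue, where all reset sources are serialized.
     */
    queue_work(adev->reset_domain.wq, &adev->virt.flr_work);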

Andrey


> Anyway, what we need is a working solution for our project. So if we need 
> to change the sequence, we need to make sure it has been tested first and 
> won't break functionality before the code lands in the branch.
>
> Regards
> Shaoyun.liu
>
>
> -----Original Message-----
> From: amd-gfx <[email protected]> On Behalf Of 
> Christian König
> Sent: Tuesday, January 4, 2022 6:36 AM
> To: Liu, Monk <[email protected]>; Chen, JingWen 
> <[email protected]>; Christian König 
> <[email protected]>; Grodzovsky, Andrey 
> <[email protected]>; Deng, Emily <[email protected]>; 
> [email protected]; [email protected]; Chen, 
> Horace <[email protected]>
> Cc: [email protected]
> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset 
> protection for SRIOV
>
> On 04.01.22 at 11:49, Liu, Monk wrote:
>> [AMD Official Use Only]
>>
>>>> See the FLR request from the hypervisor is just another source of 
>>>> signaling the need for a reset, similar to each job timeout on each queue. 
>>>> Otherwise you have a race condition between the hypervisor and the 
>>>> scheduler.
>> No, it's not. The FLR from the hypervisor is just to notify the guest that 
>> the hw VF FLR is about to start or was already executed, but the host will 
>> do the FLR anyway without waiting for the guest for too long.
>>
> Then we have a major design issue in the SRIOV protocol and really need to 
> question this.
>
> How do you want to prevent a race between the hypervisor resetting the 
> hardware and the client trying the same because of a timeout?
>
> As far as I can see the procedure should be:
> 1. We detect that a reset is necessary, either because of a fault, a timeout, 
> or a signal from the hypervisor.
> 2. For each of those potential reset sources a work item is sent to the 
> single workqueue.
> 3. One of those work items executes first and prepares the reset.
> 4. We either do the reset ourselves or notify the hypervisor that we are ready 
> for the reset.
> 5. Clean up after the reset, eventually resubmitting jobs etc.
> 6. Cancel work items which might have been scheduled from other reset sources.
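>
> A minimal sketch of that single-queue serialization (illustrative names, 
> not the actual amdgpu code):
>
>     /* One ordered workqueue guarantees at most one reset path runs
>      * at a time.
>      */
>     struct workqueue_struct *reset_wq =
>             alloc_ordered_workqueue("reset-domain", 0);
>
>     /* Each reset source (fault, job timeout, hypervisor FLR) sends its
>      * own work item to that same queue (step 2).
>      */
>     queue_work(reset_wq, &timeout_work);
>     queue_work(reset_wq, &flr_work);
>
>     /* The item that executes first performs steps 3-5; it then cancels
>      * the items queued by the other sources (step 6).
>      */
>     cancel_work_sync(&timeout_work);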
>
> It does make sense that the hypervisor resets the hardware without waiting 
> for the clients for too long, but if we don't follow these general steps we 
> will always have a race between the different components.
>
> Regards,
> Christian.
>
> On 04.01.22 at 11:49, Liu, Monk wrote:
>> [AMD Official Use Only]
>>
>>>> See the FLR request from the hypervisor is just another source of 
>>>> signaling the need for a reset, similar to each job timeout on each queue. 
>>>> Otherwise you have a race condition between the hypervisor and the 
>>>> scheduler.
>> No, it's not. The FLR from the hypervisor is just to notify the guest that 
>> the hw VF FLR is about to start or was already executed, but the host will 
>> do the FLR anyway without waiting for the guest for too long.
>>
>>>> In other words I strongly think that the current SRIOV reset 
>>>> implementation is severely broken and what Andrey is doing is actually 
>>>> fixing it.
>> It makes the code crash ... how could it be a fix?
>>
>> I'm afraid the patch is a NAK from me, but the cleanup is welcome if it 
>> does not ruin the logic. Andrey or Jingwen can try it if needed.
>>
>> Thanks
>> -------------------------------------------------------------------
>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>> -------------------------------------------------------------------
>> we are hiring software manager for CVS core team
>> -------------------------------------------------------------------
>>
>> -----Original Message-----
>> From: Koenig, Christian <[email protected]>
>> Sent: Tuesday, January 4, 2022 6:19 PM
>> To: Chen, JingWen <[email protected]>; Christian König 
>> <[email protected]>; Grodzovsky, Andrey 
>> <[email protected]>; Deng, Emily <[email protected]>; Liu, 
>> Monk <[email protected]>; [email protected]; 
>> [email protected]; Chen, Horace <[email protected]>; 
>> Chen, JingWen <[email protected]>
>> Cc: [email protected]
>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset 
>> protection for SRIOV
>>
>> Hi Jingwen,
>>
>> Well, what I mean is that we need to adjust the implementation in amdgpu to 
>> actually match the requirements.
>>
>> It could be that the reset sequence is questionable in general, but I doubt 
>> that, at least for now.
>>
>> See the FLR request from the hypervisor is just another source of signaling 
>> the need for a reset, similar to each job timeout on each queue. Otherwise 
>> you have a race condition between the hypervisor and the scheduler.
>>
>> Properly setting in_gpu_reset is indeed mandatory, but should happen at a 
>> central place and not in the SRIOV specific code.
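>>
>> As a sketch, the central place could look roughly like this (hypothetical, 
>> heavily simplified; the real recovery path is more involved):
>>
>>     int amdgpu_device_gpu_recover(struct amdgpu_device *adev, ...)
>>     {
>>             atomic_set(&adev->in_gpu_reset, 1);  /* set once, centrally */
>>             /* ... perform the actual reset ... */
>>             atomic_set(&adev->in_gpu_reset, 0);
>>             return 0;
>>     }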
>>
>> In other words I strongly think that the current SRIOV reset implementation 
>> is severely broken and what Andrey is doing is actually fixing it.
>>
>> Regards,
>> Christian.
>>
>> On 04.01.22 at 10:07, JingWen Chen wrote:
>>> Hi Christian,
>>> I'm not sure what you mean by "we need to change SRIOV not the driver".
>>>
>>> Do you mean we should change the reset sequence in SRIOV? This will be a 
>>> huge change for our SRIOV solution.
>>>
>>>    From my point of view, we can directly use 
>>> amdgpu_device_lock_adev and amdgpu_device_unlock_adev in flr_work instead 
>>> of try_lock, since no one will conflict with this thread once reset_domain 
>>> is introduced.
>>> But we do need the reset_sem and adev->in_gpu_reset to keep the device 
>>> untouched by user space.
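>>>
>>> A rough sketch of that idea (untested; the exact helper signatures may 
>>> differ):
>>>
>>>     /* In flr_work: block until the lock is held instead of bailing out
>>>      * when try_lock fails; reset_domain already serializes callers.
>>>      */
>>>     amdgpu_device_lock_adev(adev, NULL);
>>>     /* ... reply READY_TO_RESET and wait for FLR completion ... */
>>>     amdgpu_device_unlock_adev(adev);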
>>>
>>> Best Regards,
>>> Jingwen Chen
>>>
>>> On 2022/1/3 6:17 PM, Christian König wrote:
>>>> Please don't. This patch is vital to the cleanup of the reset procedure.
>>>>
>>>> If SRIOV doesn't work with that we need to change SRIOV and not the driver.
>>>>
>>>> Christian.
>>>>
>>>> On 30.12.21 at 19:45, Andrey Grodzovsky wrote:
>>>>> Sure, I guess I can drop this patch then.
>>>>>
>>>>> Andrey
>>>>>
>>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>>> I do agree with Shaoyun: if the host finds the gpu engine hang first 
>>>>>> and does the FLR, the guest-side thread may not know this and may still 
>>>>>> try to access HW (e.g. kfd uses amdgpu_in_reset and reset_sem in a lot 
>>>>>> of places to identify the reset status), and this may lead to very bad 
>>>>>> results.
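>>>>>>
>>>>>> For example, guest-side code commonly guards HW access roughly like 
>>>>>> this (illustrative snippet):
>>>>>>
>>>>>>     if (amdgpu_in_reset(adev))
>>>>>>             return -EIO;  /* don't touch HW while a reset is in flight */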
>>>>>>
>>>>>> On 2021/12/24 4:58 PM, Deng, Emily wrote:
>>>>>>> These patches look good to me. JingWen will pull these patches, do 
>>>>>>> some basic TDR tests in an SRIOV environment, and give feedback.
>>>>>>>
>>>>>>> Best wishes
>>>>>>> Emily Deng
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Liu, Monk <[email protected]>
>>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>>> To: Koenig, Christian <[email protected]>; Grodzovsky, 
>>>>>>>> Andrey <[email protected]>; 
>>>>>>>> [email protected]; amd- 
>>>>>>>> [email protected]; Chen, Horace <[email protected]>; 
>>>>>>>> Chen, JingWen <[email protected]>; Deng, Emily 
>>>>>>>> <[email protected]>
>>>>>>>> Cc: [email protected]
>>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU 
>>>>>>>> reset protection for SRIOV
>>>>>>>>
>>>>>>>> [AMD Official Use Only]
>>>>>>>>
>>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>>
>>>>>>>> Please review Andrey's patch.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> -------------------------------------------------------------------
>>>>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>> -------------------------------------------------------------------
>>>>>>>> we are hiring software manager for CVS core team
>>>>>>>> -------------------------------------------------------------------
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Koenig, Christian <[email protected]>
>>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>>> To: Grodzovsky, Andrey <[email protected]>; dri- 
>>>>>>>> [email protected]; [email protected]
>>>>>>>> Cc: [email protected]; Liu, Monk <[email protected]>; Chen, Horace 
>>>>>>>> <[email protected]>
>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU 
>>>>>>>> reset protection for SRIOV
>>>>>>>>
>>>>>>>> On 22.12.21 at 23:14, Andrey Grodzovsky wrote:
>>>>>>>>> Since flr work is now serialized against GPU resets, there is 
>>>>>>>>> no need for this.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Andrey Grodzovsky <[email protected]>
>>>>>>>> Acked-by: Christian König <[email protected]>
>>>>>>>>
>>>>>>>>> ---
>>>>>>>>>       drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>>       drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>>       2 files changed, 22 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
>>>>>>>>>           struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
>>>>>>>>>           int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>
>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>> -     * the VF FLR.
>>>>>>>>> -     */
>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>> -        return;
>>>>>>>>> -
>>>>>>>>>           amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>
>>>>>>>>>           xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>>>>>>>
>>>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
>>>>>>>>>           } while (timeout > 1);
>>>>>>>>>
>>>>>>>>>       flr_done:
>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>> -
>>>>>>>>>           /* Trigger recovery for world switch failure if no TDR */
>>>>>>>>>           if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>               && (!amdgpu_device_has_job_running(adev) ||
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
>>>>>>>>>           struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
>>>>>>>>>           int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>
>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>> -     * the VF FLR.
>>>>>>>>> -     */
>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>> -        return;
>>>>>>>>> -
>>>>>>>>>           amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>
>>>>>>>>>           xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>>>>>>>
>>>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
>>>>>>>>>           } while (timeout > 1);
>>>>>>>>>
>>>>>>>>>       flr_done:
>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>> -
>>>>>>>>>           /* Trigger recovery for world switch failure if no TDR */
>>>>>>>>>           if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>               && (!amdgpu_device_has_job_running(adev) ||
