On 2022/1/6 2:24 AM, Andrey Grodzovsky wrote:
>
> On 2022-01-05 2:59 a.m., Christian König wrote:
>> On 05.01.22 at 08:34, JingWen Chen wrote:
>>> On 2022/1/5 12:56 AM, Andrey Grodzovsky wrote:
>>>> On 2022-01-04 6:36 a.m., Christian König wrote:
>>>>> On 04.01.22 at 11:49, Liu, Monk wrote:
>>>>>> [AMD Official Use Only]
>>>>>>
>>>>>>>> See the FLR request from the hypervisor is just another source of
>>>>>>>> signaling the need for a reset, similar to each job timeout on each
>>>>>>>> queue. Otherwise you have a race condition between the hypervisor and
>>>>>>>> the scheduler.
>>>>>> No it's not, FLR from hypervisor is just to notify guest the hw VF FLR
>>>>>> is about to start or was already executed, but host will do FLR anyway
>>>>>> without waiting for guest too long
>>>>>>
>>>>> Then we have a major design issue in the SRIOV protocol and really need
>>>>> to question this.
>>>>>
>>>>> How do you want to prevent a race between the hypervisor resetting the
>>>>> hardware and the client trying the same because of a timeout?
>>>>>
>>>>> As far as I can see the procedure should be:
>>>>> 1. We detect that a reset is necessary, either because of a fault, a
>>>>> timeout or a signal from the hypervisor.
>>>>> 2. For each of those potential reset sources a work item is sent to the
>>>>> single workqueue.
>>>>> 3. One of those work items executes first and prepares the reset.
>>>>> 4. We either do the reset ourselves or notify the hypervisor that we are
>>>>> ready for the reset.
>>>>> 5. Cleanup after the reset, eventually resubmit jobs etc.
>>>>> 6. Cancel work items which might have been scheduled from other reset
>>>>> sources.
>>>>>
>>>>> It does make sense that the hypervisor resets the hardware without
>>>>> waiting for the clients for too long, but if we don't follow these
>>>>> general steps we will always have a race between the different
>>>>> components.
>>>>
>>>> Monk, just to add to this - if indeed, as you say, 'FLR from hypervisor
>>>> is just to notify guest the hw VF FLR is about to start or was already
>>>> executed, but host will do FLR anyway without waiting for guest too long'
>>>> and there is no strict waiting from the hypervisor for IDH_READY_TO_RESET
>>>> to be received from the guest before starting the reset, then setting
>>>> in_gpu_reset and locking reset_sem from the guest side is not really
>>>> foolproof protection from MMIO accesses by the guest - it only truly
>>>> helps if the hypervisor waits for that message before initiating the HW
>>>> reset.
>>>>
>>> Hi Andrey, this cannot be done. If somehow the guest kernel hangs and
>>> never has the chance to send the response back, then the other VFs will
>>> have to wait for its reset and all the VFs will hang in this case. Or
>>> sometimes the mailbox has some delay and the other VFs will also wait.
>>> The users of the other VFs will be affected in this case.
>>
>> Yeah, agree completely with JingWen. The hypervisor is the one in charge
>> here, not the guest.
>>
>> What the hypervisor should do (and it already seems to be designed that
>> way) is to send the guest a message that a reset is about to happen and
>> give it some time to respond appropriately.
>>
>> The guest on the other hand then tells the hypervisor that all processing
>> has stopped and it is ready to restart. If that doesn't happen in time the
>> hypervisor should eliminate the guest and probably trigger even more
>> severe consequences, e.g. restart the whole VM etc...
>>
>> Christian.
>
>
> So what's the end conclusion here regarding dropping this particular patch?
> Seems to me we still need to drop it to prevent driver's MMIO access
> to the GPU during reset from various places in the code.
>
> Andrey
>

Hi Andrey & Christian,
I have ported your patch (dropping the reset_sem and in_gpu_reset in flr
work) and run some tests. If an engine hangs during an OCL benchmark (using
KFD), we see the logs below:

[ 397.190727] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[ 397.301496] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[ 397.406601] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[ 397.532343] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[ 397.642251] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[ 397.746634] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[ 397.850761] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[ 397.960544] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[ 398.065218] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[ 398.182173] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[ 398.288264] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[ 398.394712] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[ 428.400582] [drm] clean up the vf2pf work item
[ 428.500528] amdgpu 0000:00:07.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:8 pasid:32771, for process xgemmStandalone pid 3557 thread xgemmStandalone pid 3557)
[ 428.527576] amdgpu 0000:00:07.0: amdgpu: in page starting at address 0x00007fc991c04000 from client 0x1b (UTCL2)
[ 437.531392] amdgpu: qcm fence wait loop timeout expired
[ 437.535738] amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
[ 437.537191] amdgpu 0000:00:07.0: amdgpu: GPU reset begin!
[ 438.087443] [drm] RE-INIT-early: nv_common succeeded

As KFD relies on these to check whether the GPU is in reset, dropping them
makes it very easy to hit page faults and fence errors like the above.

>
>>
>>>> Andrey
>>>>
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>> On 04.01.22 at 11:49, Liu, Monk wrote:
>>>>>> [AMD Official Use Only]
>>>>>>
>>>>>>>> See the FLR request from the hypervisor is just another source of
>>>>>>>> signaling the need for a reset, similar to each job timeout on each
>>>>>>>> queue. Otherwise you have a race condition between the hypervisor and
>>>>>>>> the scheduler.
>>>>>> No it's not, FLR from hypervisor is just to notify guest the hw VF FLR
>>>>>> is about to start or was already executed, but host will do FLR anyway
>>>>>> without waiting for guest too long
>>>>>>
>>>>>>>> In other words I strongly think that the current SRIOV reset
>>>>>>>> implementation is severely broken and what Andrey is doing is actually
>>>>>>>> fixing it.
>>>>>> It makes the code crash ... how could it be a fix?
>>>>>>
>>>>>> I'm afraid the patch is a NAK from me, but the cleanup is welcome if it
>>>>>> does not ruin the logic; Andrey or Jingwen can try it if needed.
>>>>>>
>>>>>> Thanks
>>>>>> -------------------------------------------------------------------
>>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>> -------------------------------------------------------------------
>>>>>> we are hiring software manager for CVS core team
>>>>>> -------------------------------------------------------------------
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Koenig, Christian <christian.koe...@amd.com>
>>>>>> Sent: Tuesday, January 4, 2022 6:19 PM
>>>>>> To: Chen, JingWen <jingwen.ch...@amd.com>; Christian König
>>>>>> <ckoenig.leichtzumer...@gmail.com>; Grodzovsky, Andrey
>>>>>> <andrey.grodzov...@amd.com>; Deng, Emily <emily.d...@amd.com>; Liu, Monk
>>>>>> <monk....@amd.com>; dri-devel@lists.freedesktop.org;
>>>>>> amd-...@lists.freedesktop.org; Chen, Horace <horace.c...@amd.com>;
>>>>>> Chen, JingWen <jingwen.ch...@amd.com>
>>>>>> Cc: dan...@ffwll.ch
>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>> protection for SRIOV
>>>>>>
>>>>>> Hi Jingwen,
>>>>>>
>>>>>> Well, what I mean is that we need to adjust the implementation in
>>>>>> amdgpu to actually match the requirements.
>>>>>>
>>>>>> It could be that the reset sequence is questionable in general, but I
>>>>>> doubt it, at least for now.
>>>>>>
>>>>>> See the FLR request from the hypervisor is just another source of
>>>>>> signaling the need for a reset, similar to each job timeout on each
>>>>>> queue. Otherwise you have a race condition between the hypervisor and
>>>>>> the scheduler.
>>>>>>
>>>>>> Properly setting in_gpu_reset is indeed mandatory, but should happen at
>>>>>> a central place and not in the SRIOV specific code.
>>>>>>
>>>>>> In other words I strongly think that the current SRIOV reset
>>>>>> implementation is severely broken and what Andrey is doing is actually
>>>>>> fixing it.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>> On 04.01.22 at 10:07, JingWen Chen wrote:
>>>>>>> Hi Christian,
>>>>>>> I'm not sure what you mean by "we need to change SRIOV not the
>>>>>>> driver".
>>>>>>>
>>>>>>> Do you mean we should change the reset sequence in SRIOV? This would
>>>>>>> be a huge change for our SRIOV solution.
>>>>>>>
>>>>>>> From my point of view, we can directly use amdgpu_device_lock_adev
>>>>>>> and amdgpu_device_unlock_adev in flr_work instead of try_lock, since
>>>>>>> no one will conflict with this thread now that reset_domain is
>>>>>>> introduced. But we do need the reset_sem and adev->in_gpu_reset to
>>>>>>> keep the device untouched by user space.
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Jingwen Chen
>>>>>>>
>>>>>>> On 2022/1/3 6:17 PM, Christian König wrote:
>>>>>>>> Please don't. This patch is vital to the cleanup of the reset
>>>>>>>> procedure.
>>>>>>>>
>>>>>>>> If SRIOV doesn't work with that we need to change SRIOV and not the
>>>>>>>> driver.
>>>>>>>>
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>> On 30.12.21 at 19:45, Andrey Grodzovsky wrote:
>>>>>>>>> Sure, I guess I can drop this patch then.
>>>>>>>>>
>>>>>>>>> Andrey
>>>>>>>>>
>>>>>>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>>>>>>> I do agree with Shaoyun: if the host finds the GPU engine hang
>>>>>>>>>> first and does the FLR, the guest-side thread may not know this and
>>>>>>>>>> may still try to access the HW (e.g. KFD uses amdgpu_in_reset and
>>>>>>>>>> reset_sem a lot to identify the reset status). And this may lead to
>>>>>>>>>> a very bad result.
>>>>>>>>>>
>>>>>>>>>> On 2021/12/24 4:58 PM, Deng, Emily wrote:
>>>>>>>>>>> These patches look good to me.
>>>>>>>>>>> JingWen will pull these patches and do some basic TDR tests in an
>>>>>>>>>>> SRIOV environment, and give feedback.
>>>>>>>>>>>
>>>>>>>>>>> Best wishes
>>>>>>>>>>> Emily Deng
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Liu, Monk <monk....@amd.com>
>>>>>>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>>>>>>> To: Koenig, Christian <christian.koe...@amd.com>; Grodzovsky,
>>>>>>>>>>>> Andrey <andrey.grodzov...@amd.com>;
>>>>>>>>>>>> dri-devel@lists.freedesktop.org; amd-g...@lists.freedesktop.org;
>>>>>>>>>>>> Chen, Horace <horace.c...@amd.com>; Chen, JingWen
>>>>>>>>>>>> <jingwen.ch...@amd.com>; Deng, Emily <emily.d...@amd.com>
>>>>>>>>>>>> Cc: dan...@ffwll.ch
>>>>>>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>>
>>>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>>>
>>>>>>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>>>>>>
>>>>>>>>>>>> Please take a review on Andrey's patch
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>>>> we are hiring software manager for CVS core team
>>>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Koenig, Christian <christian.koe...@amd.com>
>>>>>>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>>>>>>> To: Grodzovsky, Andrey <andrey.grodzov...@amd.com>;
>>>>>>>>>>>> dri-de...@lists.freedesktop.org; amd-...@lists.freedesktop.org
>>>>>>>>>>>> Cc: dan...@ffwll.ch; Liu, Monk <monk....@amd.com>; Chen, Horace
>>>>>>>>>>>> <horace.c...@amd.com>
>>>>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>>
>>>>>>>>>>>> On 22.12.21 at 23:14, Andrey Grodzovsky wrote:
>>>>>>>>>>>>> Since flr work is now serialized against GPU resets, there is no
>>>>>>>>>>>>> need for this.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzov...@amd.com>
>>>>>>>>>>>> Acked-by: Christian König <christian.koe...@amd.com>
>>>>>>>>>>>>
>>>>>>>>>>>>> ---
>>>>>>>>>>>>>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>>>>>>   drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>>>>>>   2 files changed, 22 deletions(-)
>>>>>>>>>>>>>
>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
>>>>>>>>>>>>>      struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
>>>>>>>>>>>>>      int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>
>>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>>> -     */
>>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>> -        return;
>>>>>>>>>>>>> -
>>>>>>>>>>>>>      amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>
>>>>>>>>>>>>>      xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>>>>>>>>>>>
>>>>>>>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
>>>>>>>>>>>>>      } while (timeout > 1);
>>>>>>>>>>>>>
>>>>>>>>>>>>> flr_done:
>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>>> -
>>>>>>>>>>>>>      /* Trigger recovery for world switch failure if no TDR */
>>>>>>>>>>>>>      if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>          && (!amdgpu_device_has_job_running(adev) ||
>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
>>>>>>>>>>>>>      struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
>>>>>>>>>>>>>      int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>
>>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>>> -     */
>>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>> -        return;
>>>>>>>>>>>>> -
>>>>>>>>>>>>>      amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>
>>>>>>>>>>>>>      xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>>>>>>>>>>>
>>>>>>>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
>>>>>>>>>>>>>      } while (timeout > 1);
>>>>>>>>>>>>>
>>>>>>>>>>>>> flr_done:
>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>>> -
>>>>>>>>>>>>>      /* Trigger recovery for world switch failure if no TDR */
>>>>>>>>>>>>>      if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>          && (!amdgpu_device_has_job_running(adev) ||
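
As a rough sketch (appended for illustration, not part of the original thread) of the single-ordered-workqueue serialization Christian outlines in his six steps above: every reset source queues a work item on one ordered workqueue, the first item to run owns the reset, and the rest are cancelled afterwards. All my_* names below are made up; only the standard Linux workqueue helpers (alloc_ordered_workqueue, INIT_WORK, queue_work, cancel_work_sync) are real kernel API, and this is not the actual amdgpu reset_domain code.

/* Minimal sketch of serializing reset sources through one ordered workqueue. */
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/workqueue.h>

struct my_reset_domain {
	struct workqueue_struct *wq;	/* ordered: only one reset runs at a time */
};

struct my_reset_source {
	struct work_struct work;
	struct my_reset_domain *domain;
	void (*prepare)(struct my_reset_source *src);	/* step 4: do the reset or tell the hypervisor we are ready */
};

/* Steps 3-5: whichever source's work item runs first owns the reset. */
static void my_reset_work(struct work_struct *work)
{
	struct my_reset_source *src =
		container_of(work, struct my_reset_source, work);

	if (src->prepare)
		src->prepare(src);

	/* step 5: cleanup, resubmit jobs, etc. (omitted) */

	/* step 6: work items queued by other sources behind us are either
	 * no-ops by now or get cancelled by the caller via cancel_work_sync().
	 */
}

static void my_reset_source_init(struct my_reset_source *src,
				 struct my_reset_domain *domain,
				 void (*prepare)(struct my_reset_source *src))
{
	src->domain = domain;
	src->prepare = prepare;
	INIT_WORK(&src->work, my_reset_work);
}

/* Steps 1-2: every reset source (job timeout, fault, FLR message from the
 * hypervisor) just queues its work item on the same ordered queue.
 */
static void my_report_reset(struct my_reset_source *src)
{
	queue_work(src->domain->wq, &src->work);
}

static int my_reset_domain_init(struct my_reset_domain *domain)
{
	domain->wq = alloc_ordered_workqueue("my-reset-domain", 0);
	return domain->wq ? 0 : -ENOMEM;
}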
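And a rough sketch of why JingWen's test above falls over once flr_work stops setting the in-reset flag: KFD-style paths check such a flag before touching the hardware, so if nothing sets it during the VF FLR window, the check passes and the MMIO access races with the host-initiated FLR, which matches the kiq fence errors and page fault in the log. Again, the my_* names are hypothetical and this is not the actual amdgpu/KFD code.

/* Minimal sketch of an "in reset" guard on a hardware access path. */
#include <linux/types.h>
#include <linux/atomic.h>
#include <linux/errno.h>

struct my_adev {
	atomic_t in_gpu_reset;		/* set for the whole FLR/reset window */
};

static inline bool my_in_reset(struct my_adev *adev)
{
	return atomic_read(&adev->in_gpu_reset) != 0;
}

/* Example HW access path (think of a queue preemption via the KIQ ring). */
static int my_preempt_queues(struct my_adev *adev)
{
	if (my_in_reset(adev))
		return -EIO;	/* skip MMIO while the VF is being FLR'd */

	/* ... program registers / ring and wait for a fence here ... */
	return 0;
}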