amdgpu: move enable irq later to avoid race with ih resume

Zhao, Victor Thu, 15 Sep 2022 03:11:35 -0700

[AMD Official Use Only - General]

Hi Lijo,


Yes, this is what I observed in sienna cichlid. 


Thanks,
Victor



-----Original Message-----
From: Lazar, Lijo <lijo.la...@amd.com> 
Sent: Thursday, September 15, 2022 4:00 PM
To: Zhao, Victor <victor.z...@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Deng, Emily <emily.d...@amd.com>; Grodzovsky, Andrey 
<andrey.grodzov...@amd.com>
Subject: Re: [PATCH 2/2] drm/amdgpu: move enable irq later to avoid race with 
ih resume



On 9/15/2022 12:08 PM, Zhao, Victor wrote:
> [AMD Official Use Only - General]
> 
> Hi Lijo,
> 
> IH resume was added to resolve an issue found during mode2 bring up on sienna 
> cichlid:
> - close down mode2 reset and do a mode1 reset first
> - open mode2 reset and do a mode2 reset. Mode2 reset was found fail in this 
> case.
> 
> Resume IH helps in this case
> 

Sorry, what do you mean by 'close down' mode2 /'open mode2 reset'? Do you mean 
if mode-1 reset is done first, a subsequent mode-2 reset doesn't work without 
IH resume?

Thanks,
Lijo

> 
> Thanks,
> Victor
> 
> 
> 
> -----Original Message-----
> From: Lazar, Lijo <lijo.la...@amd.com>
> Sent: Thursday, September 15, 2022 1:58 PM
> To: Zhao, Victor <victor.z...@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Deng, Emily <emily.d...@amd.com>; Grodzovsky, Andrey 
> <andrey.grodzov...@amd.com>
> Subject: Re: [PATCH 2/2] drm/amdgpu: move enable irq later to avoid 
> race with ih resume
> 
> 
> 
> On 9/14/2022 3:40 PM, Victor Zhao wrote:
>> [background]
>> On current sienna cichlid mode2 reset, on the slow job hang cases, 
>> since page table context was reverted to completely stop gpu, it will 
>> generate page fault interrupt.
>>
>> Since the irq are open during recovery stage, during ih resume step, 
>> if this interrupt was in processing, which increased ih ring rptr, 
>> and ih resume meanwhile will set rptr and wptr to 0. This may cause
> 
> AFAIK, only GFX/SDMA are affected by mode-2. IH is not suspended before 
> mode-2. Why do you resume IH after mode-2 when it is not suspended? Is it a 
> special case for virtualization?
> 
> Thanks,
> Lijo
> 
>> rptr greater than wptr. Such case was not handled in ih process, and 
>> it will cause rptr continue increasing util reaches the max.
>> Such case will make fence fallback situation happen.
>>
>> [how]
>> Move the enable of irq after ih resumed and before ib test.
>> Adjusting the position of enable irq on other reset paths accordingly.
>>
>> Signed-off-by: Victor Zhao <victor.z...@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  | 8 ++++----
>>    drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c | 1 +
>>    2 files changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index c0cfae52f12b..0b658225e9ef 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4625,8 +4625,6 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device 
>> *adev,
>>              amdgpu_fence_driver_force_completion(ring);
>>      }
>>    
>> -    amdgpu_fence_driver_isr_toggle(adev, false);
>> -
>>      if (job && job->vm)
>>              drm_sched_increase_karma(&job->base);
>>    
>> @@ -4758,6 +4756,10 @@ int amdgpu_do_asic_reset(struct list_head 
>> *device_list_handle,
>>              test_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
>>      skip_hw_reset = test_bit(AMDGPU_SKIP_HW_RESET, 
>> &reset_context->flags);
>>    
>> +    list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
>> +            amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>> +    }
>> +
>>      /*
>>       * ASIC reset has to be done on all XGMI hive nodes ASAP
>>       * to allow proper links negotiation in FW (within 1 sec) @@
>> -5031,8 +5033,6 @@ static void amdgpu_device_recheck_guilty_jobs(
>>                      /* Clear this failed job from fence array */
>>                      amdgpu_fence_driver_clear_job_fences(ring);
>>    
>> -                    amdgpu_fence_driver_isr_toggle(adev, false);
>> -
>>                      /* Since the job won't signal and we go for
>>                       * another resubmit drop this parent pointer
>>                       */
>> diff --git a/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c
>> b/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c
>> index 7aa570c1ce4a..953036482d1f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c
>> @@ -240,6 +240,7 @@ sienna_cichlid_mode2_restore_hwcontext(struct 
>> amdgpu_reset_control *reset_ctl,
>>      * Add this ASIC as tracked as reset was already
>>      * complete successfully.
>>      */
>> +    amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>>      amdgpu_register_gpu_instance(tmp_adev);
>>    
>>      /* Resume RAS */
>>

RE: [PATCH 2/2] drm/amdgpu: move enable irq later to avoid race with ih resume

Reply via email to