Re: [PATCH] drm/amdgpu: amdgpu_device_recover_vram always failed if only one node in shadow_list

Koenig, Christian Tue, 02 Apr 2019 03:34:58 -0700

Am 02.04.19 um 11:23 schrieb Lou, Wentao:
> Thanks Christian.
> You mean msecs_to_jiffies(8000) should be time to complete whole loop, not 
> each loop.


Yeah, a normal desktop system can easily have more than 10000 BOs in 
that list.

So the total timeout could be more than a day, which is a bit long :)

> Just sent out another patch for review.

Going to take a look.

Regards,
Christian.

> Thanks.
>
> BR,
> Wentao
>
>
> -----Original Message-----
> From: Koenig, Christian <[email protected]>
> Sent: Tuesday, April 2, 2019 3:39 PM
> To: Lou, Wentao <[email protected]>; Deng, Emily <[email protected]>; 
> [email protected]
> Subject: Re: [PATCH] drm/amdgpu: amdgpu_device_recover_vram always failed if 
> only one node in shadow_list
>
> Yeah, exactly that's what should happen here.
>
> The value of tmo SHOULD be changed, otherwise we wait tmo jiffies on each 
> loop.
>
> Christian.
>
> Am 02.04.19 um 09:29 schrieb Lou, Wentao:
>> Hi Christian,
>>
>> If " tmo = dma_fence_wait_timeout(fence, false, tmo); " was executed inside 
>> list_for_each_entry, the value of tmo might be changed.
>> But " tmo = dma_fence_wait_timeout(fence, false, tmo); " might be called 
>> after exiting the loop of list_for_each_entry.
>> It might pass a different value of tmo into dma_fence_wait_timeout.
>>
>> BR,
>> Wentao
>>
>>
>> -----Original Message-----
>> From: Christian König <[email protected]>
>> Sent: Tuesday, April 2, 2019 3:01 PM
>> To: Deng, Emily <[email protected]>; Lou, Wentao
>> <[email protected]>; [email protected]
>> Subject: Re: [PATCH] drm/amdgpu: amdgpu_device_recover_vram always
>> failed if only one node in shadow_list
>>
>> Yeah, agree that is much closer to a correct solution.
>>
>> But even better would be to correctly update the timeout as well, e.g:
>>
>> tmo = dma_fence_wait_timeout(fence, false, tmo); dma_fence_put(fence); fence 
>> = next; if (tmo == 0)
>>        r = -ETIMEDOUT;
>>        break
>> } else if (tmo < 0) {
>>        r = tmo;
>>        break;
>> }
>>
>> That we restart the timeout for each wait looks like a rather problematic 
>> bug to me as well.
>>
>> Christian.
>>
>> Am 02.04.19 um 06:03 schrieb Deng, Emily:
>>> Maybe it will be better to add follow check, and change “if (r <= 0 || tmo 
>>> <= 0) " to "if (r <0 || tmo <= 0)".
>>>     r = dma_fence_wait_timeout(f, false, timeout);
>>>     if (r == 0) {
>>>             r = -ETIMEDOUT;
>>>             break;
>>>     } else if (r < 0) {
>>>             break;
>>>     }
>>>
>>> Best wishes
>>> Emily Deng
>>>
>>>
>>>> -----Original Message-----
>>>> From: amd-gfx <[email protected]> On Behalf Of
>>>> wentalou
>>>> Sent: Monday, April 1, 2019 4:59 PM
>>>> To: [email protected]
>>>> Cc: Lou, Wentao <[email protected]>
>>>> Subject: [PATCH] drm/amdgpu: amdgpu_device_recover_vram always
>>>> failed if only one node in shadow_list
>>>>
>>>> amdgpu_bo_restore_shadow would assign zero to r if succeeded.
>>>> r would remain zero if there is only one node in shadow_list.
>>>> current code would always return failure when r <= 0.
>>>>
>>>> Change-Id: Iae6880e7c78b71fde6a6754c69665c2e312a80a5
>>>> Signed-off-by: Wentao Lou <[email protected]>
>>>> ---
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 +++++-
>>>> 1 file changed, 5 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index c4c61e9..5cf21a4 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -3171,6 +3171,7 @@ static int amdgpu_device_recover_vram(struct
>>>> amdgpu_device *adev)
>>>>    struct dma_fence *fence = NULL, *next = NULL;
>>>>    struct amdgpu_bo *shadow;
>>>>    long r = 1, tmo;
>>>> +  bool single_shadow = false;
>>>>
>>>>    if (amdgpu_sriov_runtime(adev))
>>>>            tmo = msecs_to_jiffies(8000);
>>>> @@ -3194,10 +3195,12 @@ static int amdgpu_device_recover_vram(struct
>>>> amdgpu_device *adev)
>>>>                    r = dma_fence_wait_timeout(fence, false, tmo);
>>>>                    dma_fence_put(fence);
>>>>                    fence = next;
>>>> +                  single_shadow = false;
>>>>                    if (r <= 0)
>>>>                            break;
>>>>            } else {
>>>>                    fence = next;
>>>> +                  single_shadow = true;
>>>>            }
>>>>    }
>>>>    mutex_unlock(&adev->shadow_list_lock);
>>>> @@ -3206,7 +3209,8 @@ static int amdgpu_device_recover_vram(struct
>>>> amdgpu_device *adev)
>>>>            tmo = dma_fence_wait_timeout(fence, false, tmo);
>>>>    dma_fence_put(fence);
>>>>
>>>> -  if (r <= 0 || tmo <= 0) {
>>>> +  /* r would be zero even if amdgpu_bo_restore_shadow succeeded when
>>>> single shadow in list */
>>>> +  if (r < 0 || (r == 0 && !single_shadow) || tmo <= 0) {
>>>>            DRM_ERROR("recover vram bo from shadow failed\n");
>>>>            return -EIO;
>>>>    }
>>>> --
>>>> 2.7.4
>>>>
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> [email protected]
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>> _______________________________________________
>>> amd-gfx mailing list
>>> [email protected]
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH] drm/amdgpu: amdgpu_device_recover_vram always failed if only one node in shadow_list

Reply via email to