On 2/13/19 2:16 PM, Kazlauskas, Nicholas wrote:
> On 2/13/19 2:10 PM, Grodzovsky, Andrey wrote:
>> On 2/13/19 2:00 PM, Kazlauskas, Nicholas wrote:
>>> On 2/13/19 1:58 PM, Andrey Grodzovsky wrote:
>>>> When ring hang happens amdgpu_dm_commit_planes during flip is holding
>>>> the BO reserved and then stack waiting for fences to signal in
>>>> reservation_object_wait_timeout_rcu (which won't signal because there
>>>> was a hnag). Then when we try to shutdown display block during reset
>>>> recovery from drm_atomic_helper_suspend we also try to reserve the BO
>>>> from dm_plane_helper_cleanup_fb ending in deadlock.
>>>> Also remove useless WARN_ON
>>>>
>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzov...@amd.com>
>>>> ---
>>>>      drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 19 
>>>> +++++++++++++------
>>>>      1 file changed, 13 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
>>>> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>>>> index acc4ff8..f8dec36 100644
>>>> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>>>> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>>>> @@ -4802,14 +4802,21 @@ static void amdgpu_dm_commit_planes(struct 
>>>> drm_atomic_state *state,
>>>>                             */
>>>>                            abo = gem_to_amdgpu_bo(fb->obj[0]);
>>>>                            r = amdgpu_bo_reserve(abo, true);
>>>> -                  if (unlikely(r != 0)) {
>>>> +                  if (unlikely(r != 0))
>>>>                                    DRM_ERROR("failed to reserve buffer 
>>>> before flip\n");
>>>> -                          WARN_ON(1);
>>>> -                  }
>>>>      
>>>> -                  /* Wait for all fences on this FB */
>>>> -                  
>>>> WARN_ON(reservation_object_wait_timeout_rcu(abo->tbo.resv, true, false,
>>>> -                                                                          
>>>>     MAX_SCHEDULE_TIMEOUT) < 0);
>>>> +                  /*
>>>> +                   * Wait for all fences on this FB. Do limited wait to 
>>>> avoid
>>>> +                   * deadlock during GPU reset when this fence will not 
>>>> signal
>>>> +                   * but we hold reservation lock for the BO.
>>>> +                   */
>>>> +                  r = reservation_object_wait_timeout_rcu(abo->tbo.resv,
>>>> +                                                          true, false,
>>>> +                                                          
>>>> msecs_to_jiffies(5000));
>>>> +                  if (unlikely(r == 0))
>>>> +                          DRM_ERROR("Waiting for fences timed out.");
>>>> +
>>>> +
>>>>      
>>>>                            amdgpu_bo_get_tiling_flags(abo, &tiling_flags);
>>>>      
>>>>
>>> Is it safe that we're just continuing like this? It's probably better to
>>> just unreserve the buffer and continue to the next plane, no?
>>>
>>> Nicholas Kazlauskas
>> As far as I see it should be safe as you are simply flipping to a buffer
>> for which rendering hasn't finished (or stack actually in this case) so
>> you might see visual corruption but that the least of your problems if
>> after 5s the BO still not finalized for presentation, the system is
>> already probably in very bad shape. Also, in case we do want to  do
>> error handling we should also take care of  amdgpu_bo_reserve failure
>> just before that.
>>
>> Andrey
>>
>>
> Yeah, I guess this whole blocks needs to be cleaned up in that case.
> This is a good first step at least. Technically
> reservation_object_wait_timeout_rcu will return < 0 when it's been
> interrupted too as an error code but I guess that will just be silently
> ignored here.
>
> If you want you can change the condition to:
>
> if (unlikely(r >= 0))
>        DRM_ERROR("Waiting for FB fence failed: id=%d res=%d\n",
> plane->id, r);


Note that reservation_object_wait_timeout_rcu has a flag 'bool intr: if 
true, do interruptible wait', we set it to false since the code in case 
of flip runs form the kernel worker thread and not from IOCTL meaning we 
are not in user mode context and hence are not going to recieve user 
signals (cannot be interrupted). So the only values we can recieve here 
are either 0 for time out or val > 0 for wait that completed before time 
out value.

Andrey

>
> But with or without that change this patch is:
>
> Reviewed-by: Nicholas Kazlauskas <nicholas.kazlaus...@amd.com>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Reply via email to