But the fence hook to the PD's reservation only remembers the last one, and the last one will be fake signaled ...

Am I wrong on some concept?

/Monk

-----Original Message-----
From: Christian König [mailto:[email protected]]
Sent: 29 March 2018 19:47
To: Liu, Monk <[email protected]>; Koenig, Christian <[email protected]>; Deng, Emily <[email protected]>; [email protected]
Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang

> you will fake signal all fences eventually ,

That is not correct. We fake signal all *pending* fences, i.e. fences which are not yet pushed to the hardware ring. Fences/jobs which are already running on the hardware ring are not touched in any way.

Regards,
Christian.

On 29.03.2018 13:31, Liu, Monk wrote:
> This sdma job is running, and you will fake signal all fences
> eventually, which leads to this process's PD reservation object being
> freed, and amdgpu_bo_undef() can call amdgpu_gart_unbind() on the
> shadow buffer's ttm, so the sdma job hits a VMC page fault on vmid0.
>
> /Monk
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: 29 March 2018 19:24
> To: Liu, Monk <[email protected]>; Deng, Emily <[email protected]>; [email protected]
> Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang
>
> On 29.03.2018 13:14, Liu, Monk wrote:
>> Hi Christian,
>>
>>> This way we won't even start running the unmapping/clear-pte job in
>>> the first place.
>> What if there is already an unmapping/clear-pte job running before
>> you kill the app? E.g. the app is naturally releasing some resource,
>> and by coincidence you kill the app at the same time?
> At least nothing problematic. Since the job is already running we
> won't do anything with its fence.
>
> Christian.
>
>> /Monk
>>
>> -----Original Message-----
>> From: Christian König [mailto:[email protected]]
>> Sent: 29 March 2018 18:16
>> To: Liu, Monk <[email protected]>; Koenig, Christian
>> <[email protected]>; Deng, Emily <[email protected]>;
>> [email protected]
>> Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang
>>
>> Hi Monk,
>>
>> well that isn't a problem.
>>
>> The idea is that we first stop ALL the entities and then mark all
>> fences as signaled with an error. This way we won't even start
>> running the unmapping/clear-pte job in the first place.
>>
>> I mean, as I wrote: when the process is killed we should cancel ALL
>> still pending jobs of that process, including pending submissions
>> and page table updates.
>>
>> Regards,
>> Christian.
>>
>> On 29.03.2018 12:11, Liu, Monk wrote:
>>> First let's consider the shadow buffer case:
>>>
>>> After you signal all jobs with an error code, e.g. you signal an
>>> unmapping/clear-pte job on the sdma ring (it is running on sdma),
>>> the reservations are then all cleared. This way, during
>>> amdgpu_bo_undef() on the SHADOW BUFFER, the above sdma job would
>>> hit a VMC PAGE FAULT.
>>>
>>> /Monk
>>>
>>> -----Original Message-----
>>> From: Koenig, Christian
>>> Sent: 29 March 2018 17:05
>>> To: Liu, Monk <[email protected]>; Deng, Emily <[email protected]>; [email protected]
>>> Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang
>>>
>>> I think the core of the problem is that we don't abort all entities
>>> of the process at the same time.
>>>
>>> How about splitting drm_sched_entity_fini() into two functions?
>>>
>>> The first one does the waiting, removes the entity from the
>>> runqueue and returns an error when the process was killed.
>>>
>>> During destruction this one is called first for all contexts as
>>> well as for the entity for VM updates.
>>>
>>> The second one then goes over the entity and signals all jobs with
>>> an error code.
>>>
>>> This way no VM updates should be executed any longer after the
>>> process is killed (something which doesn't make sense anyway and
>>> just costs us time).
>>>
>>> Regards,
>>> Christian.
>>>
>>> On 29.03.2018 06:14, Liu, Monk wrote:
>>>>> 2) If a fence signaled and tries to clear some entity's
>>>>> dependency, we should set this entity guilty to prevent its job
>>>>> from really running, since the dependency is fake signaled.
>>>> Well, that is a clear NAK. It would mean that you mark things
>>>> like the X server or Wayland queue guilty because some client is
>>>> killed.
>>>>
>>>> And since unmapping/clear jobs don't have a guilty pointer, it
>>>> should actually not have any effect on the issue.
>>>>
>>>> [ML] Yeah, that's a good point. How about this way: if a fence is
>>>> fake signaled and tries to clear another entity's dependency, we
>>>> only allow that entity to be marked guilty if it shares the same
>>>> ctx (or even process?) as the entity of the job of that fake
>>>> signaled fence? This way, for a certain process, a fake signaled
>>>> GFX fence won't be able to wake another VCE/SDMA job.
>>>>
>>>> /Monk
>>>>
>>>> -----Original Message-----
>>>> From: Christian König [mailto:[email protected]]
>>>> Sent: 28 March 2018 19:57
>>>> To: Deng, Emily <[email protected]>; [email protected]
>>>> Cc: Liu, Monk <[email protected]>
>>>> Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang
>>>>
>>>> On 28.03.2018 10:07, Emily Deng wrote:
>>>>> issue:
>>>>> There are VMC page faults if the APP is force-killed during a
>>>>> 3dmark test. The cause is that in entity_fini we manually signal
>>>>> all the jobs in the entity's queue, which confuses the sync/dep
>>>>> mechanism:
>>>>>
>>>>> 1) A page fault occurred in sdma's clear job, which operates on
>>>>> the shadow buffer: the shadow buffer's GART table was cleaned by
>>>>> ttm_bo_release, since the fence in its reservation was fake
>>>>> signaled by entity_fini() in the case of SIGKILL received.
>>>>>
>>>>> 2) A page fault occurred in gfx's job because, during the
>>>>> lifetime of the gfx job, we manually fake signal all jobs from
>>>>> its entity in entity_fini(); thus the unmapping/clear-PTE job
>>>>> depending on those result fences is satisfied, and sdma starts
>>>>> clearing the PTEs, leading to the GFX page fault.
>>>> Nice catch, but the fixes won't work like this.
>>>>
>>>>> fix:
>>>>> 1) Should at least wait in entity_fini() for all already
>>>>> scheduled jobs to complete, if SIGKILL is the case.
>>>> Well, that is not a good idea, because when we kill a process we
>>>> actually want to tear down the task as fast as possible and not
>>>> wait for anything. That is actually the reason why we have this
>>>> handling.
>>>>
>>>>> 2) If a fence signaled and tries to clear some entity's
>>>>> dependency, we should set this entity guilty to prevent its job
>>>>> from really running, since the dependency is fake signaled.
>>>> Well, that is a clear NAK. It would mean that you mark things
>>>> like the X server or Wayland queue guilty because some client is
>>>> killed.
>>>>
>>>> And since unmapping/clear jobs don't have a guilty pointer, it
>>>> should actually not have any effect on the issue.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> related issue ticket:
>>>>> http://ontrack-internal.amd.com/browse/SWDEV-147564?filter=-1
>>>>>
>>>>> Signed-off-by: Monk Liu <[email protected]>
>>>>> ---
>>>>>  drivers/gpu/drm/scheduler/gpu_scheduler.c | 36 +++++++++++++++++++++++++++++++
>>>>>  1 file changed, 36 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>> index 2bd69c4..9b306d3 100644
>>>>> --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>> @@ -198,6 +198,28 @@ static bool drm_sched_entity_is_ready(struct drm_sched_entity *entity)
>>>>>  	return true;
>>>>>  }
>>>>>
>>>>> +static void drm_sched_entity_wait_otf_signal(struct drm_gpu_scheduler *sched,
>>>>> +					     struct drm_sched_entity *entity)
>>>>> +{
>>>>> +	struct drm_sched_job *last;
>>>>> +	signed long r;
>>>>> +
>>>>> +	spin_lock(&sched->job_list_lock);
>>>>> +	list_for_each_entry_reverse(last, &sched->ring_mirror_list, node)
>>>>> +		if (last->s_fence->scheduled.context == entity->fence_context) {
>>>>> +			dma_fence_get(&last->s_fence->finished);
>>>>> +			break;
>>>>> +		}
>>>>> +	spin_unlock(&sched->job_list_lock);
>>>>> +
>>>>> +	if (&last->node != &sched->ring_mirror_list) {
>>>>> +		r = dma_fence_wait_timeout(&last->s_fence->finished, false,
>>>>> +					   msecs_to_jiffies(500));
>>>>> +		if (r == 0)
>>>>> +			DRM_WARN("wait on the fly job timeout\n");
>>>>> +		dma_fence_put(&last->s_fence->finished);
>>>>> +	}
>>>>> +}
>>>>> +
>>>>>  /**
>>>>>   * Destroy a context entity
>>>>>   *
>>>>> @@ -238,6 +260,12 @@ void drm_sched_entity_fini(struct drm_gpu_scheduler *sched,
>>>>>  		entity->dependency = NULL;
>>>>>  	}
>>>>>
>>>>> +	/* Wait till all jobs from this entity have really finished,
>>>>> +	 * otherwise the fake signaling below would kickstart sdma's
>>>>> +	 * clear-PTE jobs and lead to a vm fault
>>>>> +	 */
>>>>> +	drm_sched_entity_wait_otf_signal(sched, entity);
>>>>> +
>>>>>  	while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
>>>>>  		struct drm_sched_fence *s_fence = job->s_fence;
>>>>>  		drm_sched_fence_scheduled(s_fence);
>>>>> @@ -255,6 +283,14 @@ static void drm_sched_entity_wakeup(struct dma_fence *f, struct dma_fence_cb *cb
>>>>>  {
>>>>>  	struct drm_sched_entity *entity =
>>>>>  		container_of(cb, struct drm_sched_entity, cb);
>>>>> +
>>>>> +	/* Set the entity guilty since its dependency is
>>>>> +	 * not really cleared but fake signaled (by SIGKILL
>>>>> +	 * or GPU recover)
>>>>> +	 */
>>>>> +	if (f->error && entity->guilty)
>>>>> +		atomic_set(entity->guilty, 1);
>>>>> +
>>>>>  	entity->dependency = NULL;
>>>>>  	dma_fence_put(f);
>>>>>  	drm_sched_wakeup(entity->sched);
>>> _______________________________________________
>>> amd-gfx mailing list
>>> [email protected]
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
