But the fence hook to the PD's reservation only remembers the last one, and the last one will be fake signaled ...

Am I wrong on some concept?

/Monk

-----Original Message-----
From: Christian König [mailto:[email protected]]
Sent: 29 March 2018 19:47
To: Liu, Monk <[email protected]>; Koenig, Christian <[email protected]>; Deng, Emily <[email protected]>; [email protected]
Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang

> you will fake signal all fences eventually ,

That is not correct. We fake signal all *pending* fences, i.e. fences which are not yet pushed to the hardware ring. Fences/jobs which are already running on the hardware ring are not touched in any way.

Regards,
Christian.

On 29.03.2018 13:31, Liu, Monk wrote:
> This sdma job is running, and you will fake signal all fences
> eventually, which leads to this process's PD reservation object being
> freed, and amdgpu_bo_undef() can call amdgpu_gart_unbind() on the
> shadow buffer's ttm, so the sdma job hits a VMC page fault on vmid0.
>
> /Monk
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: 29 March 2018 19:24
> To: Liu, Monk <[email protected]>; Deng, Emily <[email protected]>; [email protected]
> Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang
>
> On 29.03.2018 13:14, Liu, Monk wrote:
>> Hi Christian,
>>
>>> This way we won't even start running the unmapping/clear-pte job in
>>> the first place.
>> What if there is already an unmapping/clear-pte job running before
>> you kill the app? E.g. the app is naturally releasing some resource,
>> and by coincidence you kill the app at the same time?
> At least nothing problematic. Since the job is already running we
> won't do anything with its fence.
>
> Christian.
>
>> /Monk
>>
>> -----Original Message-----
>> From: Christian König [mailto:[email protected]]
>> Sent: 29 March 2018 18:16
>> To: Liu, Monk <[email protected]>; Koenig, Christian
>> <[email protected]>; Deng, Emily <[email protected]>;
>> [email protected]
>> Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang
>>
>> Hi Monk,
>>
>> well that isn't a problem.
>>
>> The idea is that we first stop ALL the entities and then mark all
>> fences as signaled with an error. This way we won't even start
>> running the unmapping/clear-pte job in the first place.
>>
>> I mean, as I wrote: when the process is killed we should cancel ALL
>> still pending jobs of that process, including pending submissions
>> and page table updates.
>>
>> Regards,
>> Christian.
>>
>> On 29.03.2018 12:11, Liu, Monk wrote:
>>> First let's consider the shadow buffer case:
>>>
>>> After you signal all jobs with an error code, e.g. you signal an
>>> unmapping/clear-pte job on the sdma ring (it is running on sdma),
>>> the reservations are then all cleared. This way, during
>>> amdgpu_bo_undef() on the SHADOW BUFFER, the above sdma job would
>>> hit a VMC PAGE FAULT.
>>>
>>> /Monk
>>>
>>> -----Original Message-----
>>> From: Koenig, Christian
>>> Sent: 29 March 2018 17:05
>>> To: Liu, Monk <[email protected]>; Deng, Emily <[email protected]>; [email protected]
>>> Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang
>>>
>>> I think the core of the problem is that we don't abort all entities
>>> of the process at the same time.
>>>
>>> How about splitting drm_sched_entity_fini() into two functions?
>>>
>>> The first one does the waiting, removes the entity from the
>>> runqueue and returns an error when the process was killed.
>>>
>>> During destruction this one is called first for all contexts as
>>> well as for the entity for VM updates.
>>>
>>> The second one then goes over the entity and signals all jobs with
>>> an error code.
>>>
>>> This way no VM updates should be executed any longer after the
>>> process is killed (something which doesn't make sense anyway and
>>> just costs us time).
>>>
>>> Regards,
>>> Christian.
>>>
>>> On 29.03.2018 06:14, Liu, Monk wrote:
>>>>> 2) If a fence signaled and tries to clear some entity's
>>>>> dependency, we should set this entity guilty to prevent its job
>>>>> from really running, since the dependency is fake signaled.
>>>> Well, that is a clear NAK. It would mean that you mark things
>>>> like the X server or Wayland queue guilty because some client is
>>>> killed.
>>>>
>>>> And since unmapping/clear jobs don't have a guilty pointer, it
>>>> should actually not have any effect on the issue.
>>>>
>>>> [ML] Yeah, that's a good point. How about this way: if a fence is
>>>> fake signaled and tries to clear another entity's dependency, we
>>>> only allow that entity to be marked guilty if it shares the same
>>>> ctx (or even process?) as the entity of the job of that fake
>>>> signaled fence? This way, for a certain process, a fake signaled
>>>> GFX fence won't be able to wake another VCE/SDMA job.
>>>>
>>>> /Monk
>>>>
>>>> -----Original Message-----
>>>> From: Christian König [mailto:[email protected]]
>>>> Sent: 28 March 2018 19:57
>>>> To: Deng, Emily <[email protected]>; [email protected]
>>>> Cc: Liu, Monk <[email protected]>
>>>> Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang
>>>>
>>>> On 28.03.2018 10:07, Emily Deng wrote:
>>>>> issue:
>>>>> There are VMC page faults if the APP is force-killed during a
>>>>> 3dmark test. The cause is that in entity_fini we manually signal
>>>>> all the jobs in the entity's queue, which confuses the sync/dep
>>>>> mechanism:
>>>>>
>>>>> 1) A page fault occurred in sdma's clear job, which operates on
>>>>> the shadow buffer: the shadow buffer's GART table was cleaned by
>>>>> ttm_bo_release, since the fence in its reservation was fake
>>>>> signaled by entity_fini() in the case of SIGKILL received.
>>>>>
>>>>> 2) A page fault occurred in gfx's job because, during the
>>>>> lifetime of the gfx job, we manually fake signal all jobs from
>>>>> its entity in entity_fini(); thus the unmapping/clear-PTE job
>>>>> depending on those result fences is satisfied, and sdma starts
>>>>> clearing the PTEs, leading to the GFX page fault.
>>>> Nice catch, but the fixes won't work like this.
>>>>
>>>>> fix:
>>>>> 1) Should at least wait in entity_fini() for all already
>>>>> scheduled jobs to complete, if SIGKILL is the case.
>>>> Well, that is not a good idea, because when we kill a process we
>>>> actually want to tear down the task as fast as possible and not
>>>> wait for anything. That is actually the reason why we have this
>>>> handling.
>>>>
>>>>> 2) If a fence signaled and tries to clear some entity's
>>>>> dependency, we should set this entity guilty to prevent its job
>>>>> from really running, since the dependency is fake signaled.
>>>> Well, that is a clear NAK. It would mean that you mark things
>>>> like the X server or Wayland queue guilty because some client is
>>>> killed.
>>>>
>>>> And since unmapping/clear jobs don't have a guilty pointer, it
>>>> should actually not have any effect on the issue.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> related issue ticket:
>>>>> http://ontrack-internal.amd.com/browse/SWDEV-147564?filter=-1
>>>>>
>>>>> Signed-off-by: Monk Liu <[email protected]>
>>>>> ---
>>>>>  drivers/gpu/drm/scheduler/gpu_scheduler.c | 36 +++++++++++++++++++++++++++++++
>>>>>  1 file changed, 36 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>> index 2bd69c4..9b306d3 100644
>>>>> --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>> @@ -198,6 +198,28 @@ static bool drm_sched_entity_is_ready(struct drm_sched_entity *entity)
>>>>>  	return true;
>>>>>  }
>>>>>
>>>>> +static void drm_sched_entity_wait_otf_signal(struct drm_gpu_scheduler *sched,
>>>>> +					     struct drm_sched_entity *entity)
>>>>> +{
>>>>> +	struct drm_sched_job *last;
>>>>> +	signed long r;
>>>>> +
>>>>> +	spin_lock(&sched->job_list_lock);
>>>>> +	list_for_each_entry_reverse(last, &sched->ring_mirror_list, node)
>>>>> +		if (last->s_fence->scheduled.context == entity->fence_context) {
>>>>> +			dma_fence_get(&last->s_fence->finished);
>>>>> +			break;
>>>>> +		}
>>>>> +	spin_unlock(&sched->job_list_lock);
>>>>> +
>>>>> +	if (&last->node != &sched->ring_mirror_list) {
>>>>> +		r = dma_fence_wait_timeout(&last->s_fence->finished, false,
>>>>> +					   msecs_to_jiffies(500));
>>>>> +		if (r == 0)
>>>>> +			DRM_WARN("wait on the fly job timeout\n");
>>>>> +		dma_fence_put(&last->s_fence->finished);
>>>>> +	}
>>>>> +}
>>>>> +
>>>>>  /**
>>>>>   * Destroy a context entity
>>>>>   *
>>>>> @@ -238,6 +260,12 @@ void drm_sched_entity_fini(struct drm_gpu_scheduler *sched,
>>>>>  		entity->dependency = NULL;
>>>>>  	}
>>>>>
>>>>> +	/* Wait till all jobs from this entity have really finished,
>>>>> +	 * otherwise the fake signaling below would kickstart sdma's
>>>>> +	 * clear-PTE jobs and lead to a vm fault
>>>>> +	 */
>>>>> +	drm_sched_entity_wait_otf_signal(sched, entity);
>>>>> +
>>>>>  	while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
>>>>>  		struct drm_sched_fence *s_fence = job->s_fence;
>>>>>  		drm_sched_fence_scheduled(s_fence);
>>>>> @@ -255,6 +283,14 @@ static void drm_sched_entity_wakeup(struct dma_fence *f, struct dma_fence_cb *cb
>>>>>  {
>>>>>  	struct drm_sched_entity *entity =
>>>>>  		container_of(cb, struct drm_sched_entity, cb);
>>>>> +
>>>>> +	/* Set the entity guilty since its dependency is
>>>>> +	 * not really cleared but fake signaled (by SIGKILL
>>>>> +	 * or GPU recover)
>>>>> +	 */
>>>>> +	if (f->error && entity->guilty)
>>>>> +		atomic_set(entity->guilty, 1);
>>>>> +
>>>>>  	entity->dependency = NULL;
>>>>>  	dma_fence_put(f);
>>>>>  	drm_sched_wakeup(entity->sched);
>>> _______________________________________________
>>> amd-gfx mailing list
>>> [email protected]
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
