Even we release the ctx as usual way, Can we guarantee the pde/pte and PRT/CSA are all alive (BO, mappings) when resubmitting the timeout job (assume this time out job can signal after the resubmit)?
You know App can submit a command a release all BO and free_ctx, close FD/VM, and exit very soon, it just doesn't wait for the fence signal BR Monk -----Original Message----- From: Christian König [mailto:[email protected]] Sent: Wednesday, May 03, 2017 8:50 PM To: Liu, Monk <[email protected]>; [email protected] Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished > and the ironic thing is I want to alive as well (especially CSA, PTR) Yes, and exactly that is the danger I was talking about. We messed up the tear down oder with that and try to access resources which are already freed when the job is now scheduled. I would rather say we should get completely rid of the ctx kref counting, that was a rather bad idea in the first place. Regards, Christian. Am 03.05.2017 um 11:36 schrieb Liu, Monk: > Since I get one more kref for ctx when creating jobs, so > amdgpu_ctx_mgr_fini(&fpriv->ctx_mgr) here won't actually waiting ... because > the " amdgpu_ctx_do_release" > Won't going to run (kref > 0 before all job signaled). > > That way amdgpu_driver_postclose_kms() can continue go on , So > actually " UVD and VCE handle, the PRT VAs, the CSA and even the whole > VM structure" won't be kept alive , and the ironic thing is I want to > alive as well (especially CSA, PTR) > > > BR Monk > > > > > -----Original Message----- > From: Christian König [mailto:[email protected]] > Sent: Wednesday, May 03, 2017 5:19 PM > To: Liu, Monk <[email protected]>; [email protected] > Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job > finished > >>> I'm afraid not: CSA is gone with the VM, and VM is gone after app close >>> our FD, I don't see amdgpu_vm_fini() is depended on context living or not >>> ... > See the teardown order in amdgpu_driver_postclose_kms(): >> amdgpu_ctx_mgr_fini(&fpriv->ctx_mgr); >> >> amdgpu_uvd_free_handles(adev, file_priv); >> amdgpu_vce_free_handles(adev, file_priv); >> >> amdgpu_vm_bo_rmv(adev, fpriv->prt_va); >> >> if (amdgpu_sriov_vf(adev)) { >> /* TODO: how to handle reserve failure */ >> BUG_ON(amdgpu_bo_reserve(adev->virt.csa_obj, false)); >> amdgpu_vm_bo_rmv(adev, fpriv->vm.csa_bo_va); >> fpriv->vm.csa_bo_va = NULL; >> amdgpu_bo_unreserve(adev->virt.csa_obj); >> } >> >> amdgpu_vm_fini(adev, &fpriv->vm); > amdgpu_ctx_mgr_fini() waits for scheduling to finish and releases all > contexts of the current fd. > > If we don't release the context here because some jobs are still executed we > need to keep the UVD and VCE handle, the PRT VAs, the CSA and even the whole > VM structure alive. > >> I'll see if dma_fence could solve my issue, but I wish you can give >> me your detail idea > Please take a look at David's idea of using the fence_context to find which > jobs and entity to skip, that is even better than mine about the fence status > and should be trivial to implement because all the data is already present we > just need to use it. > > Regards, > Christian. > > Am 03.05.2017 um 11:08 schrieb Liu, Monk: >> 1,My idea is that userspace should rather gather the feedback during the >> next command submission. This has the advantage that you don't need to keep >> userspace alive till all jobs are done. >> >>> No, we need to clean the hw ring (cherry-pick out guilty entities' job in >>> all rings) after gpu reset, and we need fackly signal all sched_fence in >>> the guity entity as well, and we need mark context as guilty so the next >>> IOCTL on it will return -ENODEV. >>> I don't understand how your idea can solve my request ... >> 2,You need to keep quite a bunch of stuff alive (VM, CSA) when you don't >> tear down the ctx immediately. >> >>> I'm afraid not: CSA is gone with the VM, and VM is gone after app close >>> our FD, I don't see amdgpu_vm_fini() is depended on context living or not >>> ... >> 3, struct fence was renamed to struct dma_fence on newer kernels and a >> status field added to exactly this purpose. >> >> The Intel guys did this because they ran into the exactly same problem. >> >>> I'll see if dma_fence could solve my issue, but I wish you can give >>> me your detail idea >> BR Monk >> >> >> >> -----Original Message----- >> From: Christian König [mailto:[email protected]] >> Sent: Wednesday, May 03, 2017 4:59 PM >> To: Liu, Monk <[email protected]>; [email protected] >> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job >> finished >> >>> 1, This is necessary otherwise how can I access entity pointer after >>> a job timedout >> No that isn't necessary. >> >> The problem with your idea is that you want to actively push the >> feedback/status from the job execution back to userspace when an >> error >> (timeout) happens. >> >> My idea is that userspace should rather gather the feedback during the next >> command submission. This has the advantage that you don't need to keep >> userspace alive till all jobs are done. >> >>> , and why it is dangerous ? >> You need to keep quite a bunch of stuff alive (VM, CSA) when you don't tear >> down the ctx immediately. >> >> We could split ctx tear down into freeing the resources and freeing the >> structure, but I think just gathering the information needed on CS is easier >> to do. >> >>> 2, what's the status field in the fences you were referring to ? I >>> need to judge if it could satisfy my requirement >> struct fence was renamed to struct dma_fence on newer kernels and a status >> field added to exactly this purpose. >> >> The Intel guys did this because they ran into the exactly same problem. >> >> Regards, >> Christian. >> >> Am 03.05.2017 um 05:30 schrieb Liu, Monk: >>> 1, This is necessary otherwise how can I access entity pointer after a job >>> timedout , and why it is dangerous ? >>> 2, what's the status field in the fences you were referring to ? I >>> need to judge if it could satisfy my requirement >>> >>> >>> >>> -----Original Message----- >>> From: Christian König [mailto:[email protected]] >>> Sent: Monday, May 01, 2017 10:48 PM >>> To: Liu, Monk <[email protected]>; [email protected] >>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job >>> finished >>> >>> Am 01.05.2017 um 09:22 schrieb Monk Liu: >>>> for TDR guilty context feature, we need access ctx/s_entity field >>>> member through sched job pointer,so ctx must keep alive till all >>>> job from it signaled. >>> NAK, that is unnecessary and quite dangerous. >>> >>> Instead we have the designed status field in the fences which should be >>> checked for that. >>> >>> Regards, >>> Christian. >>> >>>> Change-Id: Ib87e9502f7a5c8c054c7e56956d7f7ad75998e43 >>>> Signed-off-by: Monk Liu <[email protected]> >>>> --- >>>> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 6 +++++- >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 2 +- >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 9 +++++++++ >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 9 +++++++-- >>>> drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 6 ------ >>>> drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 1 + >>>> 6 files changed, 23 insertions(+), 10 deletions(-) >>>> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>>> index e330009..8e031d6 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>>> @@ -760,10 +760,12 @@ struct amdgpu_ib { >>>> uint32_t flags; >>>> }; >>>> >>>> +struct amdgpu_ctx; >>>> + >>>> extern const struct amd_sched_backend_ops amdgpu_sched_ops; >>>> >>>> int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs, >>>> - struct amdgpu_job **job, struct amdgpu_vm *vm); >>>> + struct amdgpu_job **job, struct amdgpu_vm *vm, struct >>>> +amdgpu_ctx *ctx); >>>> int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned >>>> size, >>>> struct amdgpu_job **job); >>>> >>>> @@ -802,6 +804,7 @@ struct amdgpu_ctx_mgr { >>>> >>>> struct amdgpu_ctx *amdgpu_ctx_get(struct amdgpu_fpriv *fpriv, >>>> uint32_t id); >>>> int amdgpu_ctx_put(struct amdgpu_ctx *ctx); >>>> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx); >>>> >>>> uint64_t amdgpu_ctx_add_fence(struct amdgpu_ctx *ctx, struct >>>> amdgpu_ring *ring, >>>> struct fence *fence); @@ -1129,6 +1132,7 >>>> @@ struct >>>> amdgpu_job { >>>> struct amdgpu_sync sync; >>>> struct amdgpu_ib *ibs; >>>> struct fence *fence; /* the hw fence */ >>>> + struct amdgpu_ctx *ctx; >>>> uint32_t preamble_status; >>>> uint32_t num_ibs; >>>> void *owner; >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c >>>> index 699f5fe..267fb65 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c >>>> @@ -234,7 +234,7 @@ int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, >>>> void *data) >>>> } >>>> } >>>> >>>> - ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm); >>>> + ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm, p->ctx); >>>> if (ret) >>>> goto free_all_kdata; >>>> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c >>>> index b4bbbb3..81438af 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c >>>> @@ -25,6 +25,13 @@ >>>> #include <drm/drmP.h> >>>> #include "amdgpu.h" >>>> >>>> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx) { >>>> + if (ctx) >>>> + kref_get(&ctx->refcount); >>>> + return ctx; >>>> +} >>>> + >>>> static int amdgpu_ctx_init(struct amdgpu_device *adev, struct >>>> amdgpu_ctx *ctx) >>>> { >>>> unsigned i, j; >>>> @@ -56,6 +63,8 @@ static int amdgpu_ctx_init(struct amdgpu_device *adev, >>>> struct amdgpu_ctx *ctx) >>>> rq, amdgpu_sched_jobs); >>>> if (r) >>>> goto failed; >>>> + >>>> + ctx->rings[i].entity.ptr_guilty = &ctx->guilty; /* kernel >>>> entity >>>> +doesn't have ptr_guilty */ >>>> } >>>> >>>> return 0; >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >>>> index 690ef3d..208da11 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >>>> @@ -40,7 +40,7 @@ static void amdgpu_job_timedout(struct amd_sched_job >>>> *s_job) >>>> } >>>> >>>> int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs, >>>> - struct amdgpu_job **job, struct amdgpu_vm *vm) >>>> + struct amdgpu_job **job, struct amdgpu_vm *vm, struct >>>> +amdgpu_ctx *ctx) >>>> { >>>> size_t size = sizeof(struct amdgpu_job); >>>> >>>> @@ -57,6 +57,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, >>>> unsigned num_ibs, >>>> (*job)->vm = vm; >>>> (*job)->ibs = (void *)&(*job)[1]; >>>> (*job)->num_ibs = num_ibs; >>>> + (*job)->ctx = amdgpu_ctx_kref_get(ctx); >>>> >>>> amdgpu_sync_create(&(*job)->sync); >>>> >>>> @@ -68,7 +69,7 @@ int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, >>>> unsigned size, >>>> { >>>> int r; >>>> >>>> - r = amdgpu_job_alloc(adev, 1, job, NULL); >>>> + r = amdgpu_job_alloc(adev, 1, job, NULL, NULL); >>>> if (r) >>>> return r; >>>> >>>> @@ -94,6 +95,10 @@ void amdgpu_job_free_resources(struct amdgpu_job *job) >>>> static void amdgpu_job_free_cb(struct amd_sched_job *s_job) >>>> { >>>> struct amdgpu_job *job = container_of(s_job, struct >>>> amdgpu_job, base); >>>> + struct amdgpu_ctx *ctx = job->ctx; >>>> + >>>> + if (ctx) >>>> + amdgpu_ctx_put(ctx); >>>> >>>> fence_put(job->fence); >>>> amdgpu_sync_free(&job->sync); diff --git >>>> a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>> index 6f4e31f..9100ca8 100644 >>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>> @@ -208,12 +208,6 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler >>>> *sched, >>>> if (!amd_sched_entity_is_initialized(sched, entity)) >>>> return; >>>> >>>> - /** >>>> - * The client will not queue more IBs during this fini, consume existing >>>> - * queued IBs >>>> - */ >>>> - wait_event(sched->job_scheduled, amd_sched_entity_is_idle(entity)); >>>> - >>>> amd_sched_rq_remove_entity(rq, entity); >>>> kfifo_free(&entity->job_queue); >>>> } >>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h >>>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h >>>> index 8cb41d3..ccbbcb0 100644 >>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h >>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h >>>> @@ -49,6 +49,7 @@ struct amd_sched_entity { >>>> >>>> struct fence *dependency; >>>> struct fence_cb cb; >>>> + bool *ptr_guilty; >>>> }; >>>> >>>> /** >>> _______________________________________________ >>> amd-gfx mailing list >>> [email protected] >>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >> _______________________________________________ >> amd-gfx mailing list >> [email protected] >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx > > _______________________________________________ > amd-gfx mailing list > [email protected] > https://lists.freedesktop.org/mailman/listinfo/amd-gfx _______________________________________________ amd-gfx mailing list [email protected] https://lists.freedesktop.org/mailman/listinfo/amd-gfx
