That should be way you said , but I didn't see the logic to assure that, If kref of one BO go downs to 0, this Bo will be destroy (amdgpu_bo_destory), I don't see what code to prevent this destroy invoked if the resv of this BO Still have fence not signaling, can you share this tricks ?
Because I found if a job hang, and after my code kick it out from scheduler, (I manually call amd_sched_fence_finished() on the sched_fence of the timedout job ), the page_dir won't get destroy .... But the BO created through GEM by app can be destroy as expected BR Monk -----Original Message----- From: Christian König [mailto:[email protected]] Sent: Wednesday, May 03, 2017 9:34 PM To: Liu, Monk <[email protected]>; [email protected] Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished > Can we guarantee the pde/pte and PRT/CSA are all alive (BO, mappings) when > resubmitting the timeout job (assume this time out job can signal after the > resubmit)? Yes, that's why we add all fences of each command submission to the PD/PT BOs. Regards, Christian. Am 03.05.2017 um 15:31 schrieb Liu, Monk: > Even we release the ctx as usual way, > Can we guarantee the pde/pte and PRT/CSA are all alive (BO, mappings) when > resubmitting the timeout job (assume this time out job can signal after the > resubmit)? > > You know App can submit a command a release all BO and free_ctx, close > FD/VM, and exit very soon, it just doesn't wait for the fence signal > > BR Monk > > -----Original Message----- > From: Christian König [mailto:[email protected]] > Sent: Wednesday, May 03, 2017 8:50 PM > To: Liu, Monk <[email protected]>; [email protected] > Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job > finished > >> and the ironic thing is I want to alive as well (especially CSA, PTR) > Yes, and exactly that is the danger I was talking about. We messed up the > tear down oder with that and try to access resources which are already freed > when the job is now scheduled. > > I would rather say we should get completely rid of the ctx kref counting, > that was a rather bad idea in the first place. > > Regards, > Christian. > > Am 03.05.2017 um 11:36 schrieb Liu, Monk: >> Since I get one more kref for ctx when creating jobs, so >> amdgpu_ctx_mgr_fini(&fpriv->ctx_mgr) here won't actually waiting ... because >> the " amdgpu_ctx_do_release" >> Won't going to run (kref > 0 before all job signaled). >> >> That way amdgpu_driver_postclose_kms() can continue go on , So >> actually " UVD and VCE handle, the PRT VAs, the CSA and even the >> whole VM structure" won't be kept alive , and the ironic thing is I >> want to alive as well (especially CSA, PTR) >> >> >> BR Monk >> >> >> >> >> -----Original Message----- >> From: Christian König [mailto:[email protected]] >> Sent: Wednesday, May 03, 2017 5:19 PM >> To: Liu, Monk <[email protected]>; [email protected] >> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job >> finished >> >>>> I'm afraid not: CSA is gone with the VM, and VM is gone after app close >>>> our FD, I don't see amdgpu_vm_fini() is depended on context living or not >>>> ... >> See the teardown order in amdgpu_driver_postclose_kms(): >>> amdgpu_ctx_mgr_fini(&fpriv->ctx_mgr); >>> >>> amdgpu_uvd_free_handles(adev, file_priv); >>> amdgpu_vce_free_handles(adev, file_priv); >>> >>> amdgpu_vm_bo_rmv(adev, fpriv->prt_va); >>> >>> if (amdgpu_sriov_vf(adev)) { >>> /* TODO: how to handle reserve failure */ >>> BUG_ON(amdgpu_bo_reserve(adev->virt.csa_obj, false)); >>> amdgpu_vm_bo_rmv(adev, fpriv->vm.csa_bo_va); >>> fpriv->vm.csa_bo_va = NULL; >>> amdgpu_bo_unreserve(adev->virt.csa_obj); >>> } >>> >>> amdgpu_vm_fini(adev, &fpriv->vm); >> amdgpu_ctx_mgr_fini() waits for scheduling to finish and releases all >> contexts of the current fd. >> >> If we don't release the context here because some jobs are still executed we >> need to keep the UVD and VCE handle, the PRT VAs, the CSA and even the whole >> VM structure alive. >> >>> I'll see if dma_fence could solve my issue, but I wish you can give >>> me your detail idea >> Please take a look at David's idea of using the fence_context to find which >> jobs and entity to skip, that is even better than mine about the fence >> status and should be trivial to implement because all the data is already >> present we just need to use it. >> >> Regards, >> Christian. >> >> Am 03.05.2017 um 11:08 schrieb Liu, Monk: >>> 1,My idea is that userspace should rather gather the feedback during the >>> next command submission. This has the advantage that you don't need to keep >>> userspace alive till all jobs are done. >>> >>>> No, we need to clean the hw ring (cherry-pick out guilty entities' job in >>>> all rings) after gpu reset, and we need fackly signal all sched_fence in >>>> the guity entity as well, and we need mark context as guilty so the next >>>> IOCTL on it will return -ENODEV. >>>> I don't understand how your idea can solve my request ... >>> 2,You need to keep quite a bunch of stuff alive (VM, CSA) when you don't >>> tear down the ctx immediately. >>> >>>> I'm afraid not: CSA is gone with the VM, and VM is gone after app close >>>> our FD, I don't see amdgpu_vm_fini() is depended on context living or not >>>> ... >>> 3, struct fence was renamed to struct dma_fence on newer kernels and a >>> status field added to exactly this purpose. >>> >>> The Intel guys did this because they ran into the exactly same problem. >>> >>>> I'll see if dma_fence could solve my issue, but I wish you can give >>>> me your detail idea >>> BR Monk >>> >>> >>> >>> -----Original Message----- >>> From: Christian König [mailto:[email protected]] >>> Sent: Wednesday, May 03, 2017 4:59 PM >>> To: Liu, Monk <[email protected]>; [email protected] >>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job >>> finished >>> >>>> 1, This is necessary otherwise how can I access entity pointer >>>> after a job timedout >>> No that isn't necessary. >>> >>> The problem with your idea is that you want to actively push the >>> feedback/status from the job execution back to userspace when an >>> error >>> (timeout) happens. >>> >>> My idea is that userspace should rather gather the feedback during the next >>> command submission. This has the advantage that you don't need to keep >>> userspace alive till all jobs are done. >>> >>>> , and why it is dangerous ? >>> You need to keep quite a bunch of stuff alive (VM, CSA) when you don't tear >>> down the ctx immediately. >>> >>> We could split ctx tear down into freeing the resources and freeing the >>> structure, but I think just gathering the information needed on CS is >>> easier to do. >>> >>>> 2, what's the status field in the fences you were referring to ? I >>>> need to judge if it could satisfy my requirement >>> struct fence was renamed to struct dma_fence on newer kernels and a status >>> field added to exactly this purpose. >>> >>> The Intel guys did this because they ran into the exactly same problem. >>> >>> Regards, >>> Christian. >>> >>> Am 03.05.2017 um 05:30 schrieb Liu, Monk: >>>> 1, This is necessary otherwise how can I access entity pointer after a job >>>> timedout , and why it is dangerous ? >>>> 2, what's the status field in the fences you were referring to ? I >>>> need to judge if it could satisfy my requirement >>>> >>>> >>>> >>>> -----Original Message----- >>>> From: Christian König [mailto:[email protected]] >>>> Sent: Monday, May 01, 2017 10:48 PM >>>> To: Liu, Monk <[email protected]>; [email protected] >>>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job >>>> finished >>>> >>>> Am 01.05.2017 um 09:22 schrieb Monk Liu: >>>>> for TDR guilty context feature, we need access ctx/s_entity field >>>>> member through sched job pointer,so ctx must keep alive till all >>>>> job from it signaled. >>>> NAK, that is unnecessary and quite dangerous. >>>> >>>> Instead we have the designed status field in the fences which should be >>>> checked for that. >>>> >>>> Regards, >>>> Christian. >>>> >>>>> Change-Id: Ib87e9502f7a5c8c054c7e56956d7f7ad75998e43 >>>>> Signed-off-by: Monk Liu <[email protected]> >>>>> --- >>>>> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 6 +++++- >>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 2 +- >>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 9 +++++++++ >>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 9 +++++++-- >>>>> drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 6 ------ >>>>> drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 1 + >>>>> 6 files changed, 23 insertions(+), 10 deletions(-) >>>>> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>>>> index e330009..8e031d6 100644 >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>>>> @@ -760,10 +760,12 @@ struct amdgpu_ib { >>>>> uint32_t flags; >>>>> }; >>>>> >>>>> +struct amdgpu_ctx; >>>>> + >>>>> extern const struct amd_sched_backend_ops amdgpu_sched_ops; >>>>> >>>>> int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs, >>>>> - struct amdgpu_job **job, struct amdgpu_vm *vm); >>>>> + struct amdgpu_job **job, struct amdgpu_vm *vm, struct >>>>> +amdgpu_ctx *ctx); >>>>> int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned >>>>> size, >>>>> struct amdgpu_job **job); >>>>> >>>>> @@ -802,6 +804,7 @@ struct amdgpu_ctx_mgr { >>>>> >>>>> struct amdgpu_ctx *amdgpu_ctx_get(struct amdgpu_fpriv *fpriv, >>>>> uint32_t id); >>>>> int amdgpu_ctx_put(struct amdgpu_ctx *ctx); >>>>> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx); >>>>> >>>>> uint64_t amdgpu_ctx_add_fence(struct amdgpu_ctx *ctx, struct >>>>> amdgpu_ring *ring, >>>>> struct fence *fence); @@ -1129,6 +1132,7 >>>>> @@ struct >>>>> amdgpu_job { >>>>> struct amdgpu_sync sync; >>>>> struct amdgpu_ib *ibs; >>>>> struct fence *fence; /* the hw fence */ >>>>> + struct amdgpu_ctx *ctx; >>>>> uint32_t preamble_status; >>>>> uint32_t num_ibs; >>>>> void *owner; >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c >>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c >>>>> index 699f5fe..267fb65 100644 >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c >>>>> @@ -234,7 +234,7 @@ int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, >>>>> void *data) >>>>> } >>>>> } >>>>> >>>>> - ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm); >>>>> + ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm, p->ctx); >>>>> if (ret) >>>>> goto free_all_kdata; >>>>> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c >>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c >>>>> index b4bbbb3..81438af 100644 >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c >>>>> @@ -25,6 +25,13 @@ >>>>> #include <drm/drmP.h> >>>>> #include "amdgpu.h" >>>>> >>>>> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx) { >>>>> + if (ctx) >>>>> + kref_get(&ctx->refcount); >>>>> + return ctx; >>>>> +} >>>>> + >>>>> static int amdgpu_ctx_init(struct amdgpu_device *adev, struct >>>>> amdgpu_ctx *ctx) >>>>> { >>>>> unsigned i, j; >>>>> @@ -56,6 +63,8 @@ static int amdgpu_ctx_init(struct amdgpu_device *adev, >>>>> struct amdgpu_ctx *ctx) >>>>> rq, amdgpu_sched_jobs); >>>>> if (r) >>>>> goto failed; >>>>> + >>>>> + ctx->rings[i].entity.ptr_guilty = &ctx->guilty; /* kernel >>>>> +entity doesn't have ptr_guilty */ >>>>> } >>>>> >>>>> return 0; >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >>>>> index 690ef3d..208da11 100644 >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >>>>> @@ -40,7 +40,7 @@ static void amdgpu_job_timedout(struct amd_sched_job >>>>> *s_job) >>>>> } >>>>> >>>>> int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs, >>>>> - struct amdgpu_job **job, struct amdgpu_vm *vm) >>>>> + struct amdgpu_job **job, struct amdgpu_vm *vm, struct >>>>> +amdgpu_ctx *ctx) >>>>> { >>>>> size_t size = sizeof(struct amdgpu_job); >>>>> >>>>> @@ -57,6 +57,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, >>>>> unsigned num_ibs, >>>>> (*job)->vm = vm; >>>>> (*job)->ibs = (void *)&(*job)[1]; >>>>> (*job)->num_ibs = num_ibs; >>>>> + (*job)->ctx = amdgpu_ctx_kref_get(ctx); >>>>> >>>>> amdgpu_sync_create(&(*job)->sync); >>>>> >>>>> @@ -68,7 +69,7 @@ int amdgpu_job_alloc_with_ib(struct amdgpu_device >>>>> *adev, unsigned size, >>>>> { >>>>> int r; >>>>> >>>>> - r = amdgpu_job_alloc(adev, 1, job, NULL); >>>>> + r = amdgpu_job_alloc(adev, 1, job, NULL, NULL); >>>>> if (r) >>>>> return r; >>>>> >>>>> @@ -94,6 +95,10 @@ void amdgpu_job_free_resources(struct amdgpu_job *job) >>>>> static void amdgpu_job_free_cb(struct amd_sched_job *s_job) >>>>> { >>>>> struct amdgpu_job *job = container_of(s_job, struct >>>>> amdgpu_job, base); >>>>> + struct amdgpu_ctx *ctx = job->ctx; >>>>> + >>>>> + if (ctx) >>>>> + amdgpu_ctx_put(ctx); >>>>> >>>>> fence_put(job->fence); >>>>> amdgpu_sync_free(&job->sync); diff --git >>>>> a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>> index 6f4e31f..9100ca8 100644 >>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>> @@ -208,12 +208,6 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler >>>>> *sched, >>>>> if (!amd_sched_entity_is_initialized(sched, entity)) >>>>> return; >>>>> >>>>> - /** >>>>> - * The client will not queue more IBs during this fini, consume existing >>>>> - * queued IBs >>>>> - */ >>>>> - wait_event(sched->job_scheduled, amd_sched_entity_is_idle(entity)); >>>>> - >>>>> amd_sched_rq_remove_entity(rq, entity); >>>>> kfifo_free(&entity->job_queue); >>>>> } >>>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h >>>>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h >>>>> index 8cb41d3..ccbbcb0 100644 >>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h >>>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h >>>>> @@ -49,6 +49,7 @@ struct amd_sched_entity { >>>>> >>>>> struct fence *dependency; >>>>> struct fence_cb cb; >>>>> + bool *ptr_guilty; >>>>> }; >>>>> >>>>> /** >>>> _______________________________________________ >>>> amd-gfx mailing list >>>> [email protected] >>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >>> _______________________________________________ >>> amd-gfx mailing list >>> [email protected] >>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >> _______________________________________________ >> amd-gfx mailing list >> [email protected] >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx > > _______________________________________________ > amd-gfx mailing list > [email protected] > https://lists.freedesktop.org/mailman/listinfo/amd-gfx _______________________________________________ amd-gfx mailing list [email protected] https://lists.freedesktop.org/mailman/listinfo/amd-gfx
