amdgpu:keep ctx alive till all job finished

Liu, Monk Wed, 03 May 2017 06:32:33 -0700

Even we release the ctx as usual way,
Can we guarantee the pde/pte and PRT/CSA are all alive (BO, mappings) when 
resubmitting the timeout job (assume this time out job can signal after the 
resubmit)?


You know App can submit a command a release all BO and free_ctx, close FD/VM, 
and exit very soon, it just doesn't wait for the  fence signal 

BR Monk

-----Original Message-----
From: Christian König [mailto:[email protected]] 
Sent: Wednesday, May 03, 2017 8:50 PM
To: Liu, Monk <[email protected]>; [email protected]
Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished

> and the ironic thing is I want to alive as well (especially CSA, PTR)
Yes, and exactly that is the danger I was talking about. We messed up the tear 
down oder with that and try to access resources which are already freed when 
the job is now scheduled.

I would rather say we should get completely rid of the ctx kref counting, that 
was a rather bad idea in the first place.

Regards,
Christian.

Am 03.05.2017 um 11:36 schrieb Liu, Monk:
> Since I get one more kref for ctx when creating jobs, so 
> amdgpu_ctx_mgr_fini(&fpriv->ctx_mgr) here won't actually waiting ... because 
> the " amdgpu_ctx_do_release"
> Won't going to run (kref > 0 before all job signaled).
>
> That way amdgpu_driver_postclose_kms() can continue go on , So 
> actually " UVD and VCE handle, the PRT VAs, the CSA and even the whole 
> VM structure" won't be kept alive , and the ironic thing is I want to 
> alive as well (especially CSA, PTR)
>
>
> BR Monk
>
>
>
>
> -----Original Message-----
> From: Christian König [mailto:[email protected]]
> Sent: Wednesday, May 03, 2017 5:19 PM
> To: Liu, Monk <[email protected]>; [email protected]
> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job 
> finished
>
>>> I'm afraid not:  CSA is gone with the VM, and VM is gone after app close 
>>> our FD, I don't see amdgpu_vm_fini() is depended on context living or not 
>>> ...
> See the teardown order in amdgpu_driver_postclose_kms():
>> amdgpu_ctx_mgr_fini(&fpriv->ctx_mgr);
>>
>>          amdgpu_uvd_free_handles(adev, file_priv);
>>          amdgpu_vce_free_handles(adev, file_priv);
>>
>>          amdgpu_vm_bo_rmv(adev, fpriv->prt_va);
>>
>>          if (amdgpu_sriov_vf(adev)) {
>>                  /* TODO: how to handle reserve failure */
>>                  BUG_ON(amdgpu_bo_reserve(adev->virt.csa_obj, false));
>>                  amdgpu_vm_bo_rmv(adev, fpriv->vm.csa_bo_va);
>>                  fpriv->vm.csa_bo_va = NULL;
>>                  amdgpu_bo_unreserve(adev->virt.csa_obj);
>>          }
>>
>>          amdgpu_vm_fini(adev, &fpriv->vm);
> amdgpu_ctx_mgr_fini() waits for scheduling to finish and releases all 
> contexts of the current fd.
>
> If we don't release the context here because some jobs are still executed we 
> need to keep the UVD and VCE handle, the PRT VAs, the CSA and even the whole 
> VM structure alive.
>
>> I'll see if dma_fence could solve my issue, but I wish you can give 
>> me your detail idea
> Please take a look at David's idea of using the fence_context to find which 
> jobs and entity to skip, that is even better than mine about the fence status 
> and should be trivial to implement because all the data is already present we 
> just need to use it.
>
> Regards,
> Christian.
>
> Am 03.05.2017 um 11:08 schrieb Liu, Monk:
>> 1,My idea is that userspace should rather gather the feedback during the 
>> next command submission. This has the advantage that you don't need to keep 
>> userspace alive till all jobs are done.
>>
>>> No, we need to clean the hw ring (cherry-pick out guilty entities' job in 
>>> all rings) after gpu reset, and we need fackly signal all sched_fence in 
>>> the guity entity as well, and we need mark context as guilty so the next 
>>> IOCTL on it will return -ENODEV.
>>> I don't understand how your idea can solve my request ...
>> 2,You need to keep quite a bunch of stuff alive (VM, CSA) when you don't 
>> tear down the ctx immediately.
>>
>>> I'm afraid not:  CSA is gone with the VM, and VM is gone after app close 
>>> our FD, I don't see amdgpu_vm_fini() is depended on context living or not 
>>> ...
>> 3, struct fence was renamed to struct dma_fence on newer kernels and a 
>> status field added to exactly this purpose.
>>
>> The Intel guys did this because they ran into the exactly same problem.
>>
>>> I'll see if dma_fence could solve my issue, but I wish you can give 
>>> me your detail idea
>> BR Monk
>>
>>
>>
>> -----Original Message-----
>> From: Christian König [mailto:[email protected]]
>> Sent: Wednesday, May 03, 2017 4:59 PM
>> To: Liu, Monk <[email protected]>; [email protected]
>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job 
>> finished
>>
>>> 1, This is necessary otherwise how can I access entity pointer after 
>>> a job timedout
>> No that isn't necessary.
>>
>> The problem with your idea is that you want to actively push the 
>> feedback/status from the job execution back to userspace when an 
>> error
>> (timeout) happens.
>>
>> My idea is that userspace should rather gather the feedback during the next 
>> command submission. This has the advantage that you don't need to keep 
>> userspace alive till all jobs are done.
>>
>>>     , and why it is dangerous ?
>> You need to keep quite a bunch of stuff alive (VM, CSA) when you don't tear 
>> down the ctx immediately.
>>
>> We could split ctx tear down into freeing the resources and freeing the 
>> structure, but I think just gathering the information needed on CS is easier 
>> to do.
>>
>>> 2, what's the status field in the fences you were referring to ? I 
>>> need to judge if it could satisfy my requirement
>> struct fence was renamed to struct dma_fence on newer kernels and a status 
>> field added to exactly this purpose.
>>
>> The Intel guys did this because they ran into the exactly same problem.
>>
>> Regards,
>> Christian.
>>
>> Am 03.05.2017 um 05:30 schrieb Liu, Monk:
>>> 1, This is necessary otherwise how can I access entity pointer after a job 
>>> timedout , and why it is dangerous ?
>>> 2, what's the status field in the fences you were referring to ? I 
>>> need to judge if it could satisfy my requirement
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Christian König [mailto:[email protected]]
>>> Sent: Monday, May 01, 2017 10:48 PM
>>> To: Liu, Monk <[email protected]>; [email protected]
>>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job 
>>> finished
>>>
>>> Am 01.05.2017 um 09:22 schrieb Monk Liu:
>>>> for TDR guilty context feature, we need access ctx/s_entity field 
>>>> member through sched job pointer,so ctx must keep alive till all 
>>>> job from it signaled.
>>> NAK, that is unnecessary and quite dangerous.
>>>
>>> Instead we have the designed status field in the fences which should be 
>>> checked for that.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> Change-Id: Ib87e9502f7a5c8c054c7e56956d7f7ad75998e43
>>>> Signed-off-by: Monk Liu <[email protected]>
>>>> ---
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu.h           | 6 +++++-
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 2 +-
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       | 9 +++++++++
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       | 9 +++++++--
>>>>      drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 6 ------
>>>>      drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 1 +
>>>>      6 files changed, 23 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> index e330009..8e031d6 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> @@ -760,10 +760,12 @@ struct amdgpu_ib {
>>>>            uint32_t                        flags;
>>>>      };
>>>>      
>>>> +struct amdgpu_ctx;
>>>> +
>>>>      extern const struct amd_sched_backend_ops amdgpu_sched_ops;
>>>>      
>>>>      int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>>>> -               struct amdgpu_job **job, struct amdgpu_vm *vm);
>>>> +               struct amdgpu_job **job, struct amdgpu_vm *vm, struct 
>>>> +amdgpu_ctx *ctx);
>>>>      int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned 
>>>> size,
>>>>                                 struct amdgpu_job **job);
>>>>      
>>>> @@ -802,6 +804,7 @@ struct amdgpu_ctx_mgr {
>>>>      
>>>>      struct amdgpu_ctx *amdgpu_ctx_get(struct amdgpu_fpriv *fpriv, 
>>>> uint32_t id);
>>>>      int amdgpu_ctx_put(struct amdgpu_ctx *ctx);
>>>> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx);
>>>>      
>>>>      uint64_t amdgpu_ctx_add_fence(struct amdgpu_ctx *ctx, struct 
>>>> amdgpu_ring *ring,
>>>>                                  struct fence *fence); @@ -1129,6 +1132,7 
>>>> @@ struct 
>>>> amdgpu_job {
>>>>            struct amdgpu_sync      sync;
>>>>            struct amdgpu_ib        *ibs;
>>>>            struct fence            *fence; /* the hw fence */
>>>> +  struct amdgpu_ctx *ctx;
>>>>            uint32_t                preamble_status;
>>>>            uint32_t                num_ibs;
>>>>            void                    *owner;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>> index 699f5fe..267fb65 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>> @@ -234,7 +234,7 @@ int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, 
>>>> void *data)
>>>>                    }
>>>>            }
>>>>      
>>>> -  ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm);
>>>> +  ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm, p->ctx);
>>>>            if (ret)
>>>>                    goto free_all_kdata;
>>>>      
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> index b4bbbb3..81438af 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> @@ -25,6 +25,13 @@
>>>>      #include <drm/drmP.h>
>>>>      #include "amdgpu.h"
>>>>      
>>>> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx) {
>>>> +  if (ctx)
>>>> +          kref_get(&ctx->refcount);
>>>> +  return ctx;
>>>> +}
>>>> +
>>>>      static int amdgpu_ctx_init(struct amdgpu_device *adev, struct 
>>>> amdgpu_ctx *ctx)
>>>>      {
>>>>            unsigned i, j;
>>>> @@ -56,6 +63,8 @@ static int amdgpu_ctx_init(struct amdgpu_device *adev, 
>>>> struct amdgpu_ctx *ctx)
>>>>                                              rq, amdgpu_sched_jobs);
>>>>                    if (r)
>>>>                            goto failed;
>>>> +
>>>> +          ctx->rings[i].entity.ptr_guilty = &ctx->guilty; /* kernel 
>>>> entity 
>>>> +doesn't have ptr_guilty */
>>>>            }
>>>>      
>>>>            return 0;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> index 690ef3d..208da11 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> @@ -40,7 +40,7 @@ static void amdgpu_job_timedout(struct amd_sched_job 
>>>> *s_job)
>>>>      }
>>>>      
>>>>      int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>>>> -               struct amdgpu_job **job, struct amdgpu_vm *vm)
>>>> +               struct amdgpu_job **job, struct amdgpu_vm *vm, struct 
>>>> +amdgpu_ctx *ctx)
>>>>      {
>>>>            size_t size = sizeof(struct amdgpu_job);
>>>>      
>>>> @@ -57,6 +57,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, 
>>>> unsigned num_ibs,
>>>>            (*job)->vm = vm;
>>>>            (*job)->ibs = (void *)&(*job)[1];
>>>>            (*job)->num_ibs = num_ibs;
>>>> +  (*job)->ctx = amdgpu_ctx_kref_get(ctx);
>>>>      
>>>>            amdgpu_sync_create(&(*job)->sync);
>>>>      
>>>> @@ -68,7 +69,7 @@ int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, 
>>>> unsigned size,
>>>>      {
>>>>            int r;
>>>>      
>>>> -  r = amdgpu_job_alloc(adev, 1, job, NULL);
>>>> +  r = amdgpu_job_alloc(adev, 1, job, NULL, NULL);
>>>>            if (r)
>>>>                    return r;
>>>>      
>>>> @@ -94,6 +95,10 @@ void amdgpu_job_free_resources(struct amdgpu_job *job)
>>>>      static void amdgpu_job_free_cb(struct amd_sched_job *s_job)
>>>>      {
>>>>            struct amdgpu_job *job = container_of(s_job, struct 
>>>> amdgpu_job, base);
>>>> +  struct amdgpu_ctx *ctx = job->ctx;
>>>> +
>>>> +  if (ctx)
>>>> +          amdgpu_ctx_put(ctx);
>>>>      
>>>>            fence_put(job->fence);
>>>>            amdgpu_sync_free(&job->sync); diff --git 
>>>> a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>> index 6f4e31f..9100ca8 100644
>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>> @@ -208,12 +208,6 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler 
>>>> *sched,
>>>>            if (!amd_sched_entity_is_initialized(sched, entity))
>>>>                    return;
>>>>      
>>>> -  /**
>>>> -   * The client will not queue more IBs during this fini, consume existing
>>>> -   * queued IBs
>>>> -  */
>>>> -  wait_event(sched->job_scheduled, amd_sched_entity_is_idle(entity));
>>>> -
>>>>            amd_sched_rq_remove_entity(rq, entity);
>>>>            kfifo_free(&entity->job_queue);
>>>>      }
>>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>> index 8cb41d3..ccbbcb0 100644
>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>> @@ -49,6 +49,7 @@ struct amd_sched_entity {
>>>>      
>>>>            struct fence                    *dependency;
>>>>            struct fence_cb                 cb;
>>>> +  bool *ptr_guilty;
>>>>      };
>>>>      
>>>>      /**
>>> _______________________________________________
>>> amd-gfx mailing list
>>> [email protected]
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>> _______________________________________________
>> amd-gfx mailing list
>> [email protected]
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
> _______________________________________________
> amd-gfx mailing list
> [email protected]
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


_______________________________________________
amd-gfx mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished

Reply via email to