The order of job->base.s_fence and preventing concurrent CS was guaranteed by
the VM PD being reserved, that is now no longer the case and we need a new lock
for protection.
I thought of this as well, but the UMD/app submits its jobs from one thread per context (if it submits jobs from multiple threads on one context, the submission order is clearly not important to the UMD/app). So even after we unreserve the PD's resv lock,
we still only have one thread doing the cs_ioctl on the *current ctx*.
So why doesn't that work now? Thanks!
BR Monk
-----Original Message-----
From: Christian König [mailto:[email protected]]
Sent: 11 September 2017 17:13
To: Liu, Monk <[email protected]>; [email protected]
Subject: Re: [PATCH] drm/amdgpu:resolv deadlock between reset and cs_ioctl
Am 11.09.2017 um 11:02 schrieb Monk Liu:
need to unreserve ttm bo before "cs_add_fence" and "entity_push_job"
otherwise there will be deadlock between "recover_vram_from_shadow"
and previous two routines on the ttm bo's resv lock.
Change-Id: I9c18b677e474ce5b45a9cc24da3fa255d77b3d44
Signed-off-by: Monk Liu <[email protected]>
Yeah, I was thinking about this as well. Problem is this won't work without
further changes.
---
drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 35 ++++++++++++++++++----------------
1 file changed, 19 insertions(+), 16 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index c1f1ae9..823ac12 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -736,26 +736,12 @@ static int amdgpu_cs_sync_rings(struct amdgpu_cs_parser *p)
/**
* cs_parser_fini() - clean parser states
* @parser: parser structure holding parsing context.
- * @error: error number
- *
- * If error is set than unvalidate buffer, otherwise just free memory
- * used by parsing context.
**/
-static void amdgpu_cs_parser_fini(struct amdgpu_cs_parser *parser, int error, bool backoff)
+static void amdgpu_cs_parser_fini(struct amdgpu_cs_parser *parser)
{
struct amdgpu_fpriv *fpriv = parser->filp->driver_priv;
unsigned i;
- if (!error) {
- amdgpu_vm_move_pt_bos_in_lru(parser->adev, &fpriv->vm);
-
- ttm_eu_fence_buffer_objects(&parser->ticket,
- &parser->validated,
- parser->fence);
- } else if (backoff) {
- ttm_eu_backoff_reservation(&parser->ticket,
- &parser->validated);
- }
fence_put(parser->fence);
if (parser->ctx)
@@ -1075,6 +1061,16 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
job->owner = p->filp;
job->fence_ctx = entity->fence_context;
p->fence = fence_get(&job->base.s_fence->finished);
+
+ /* hook sched fence to all BOs' reservation in validated list
+ * and unreserve them.
+ *
+ * we unreserve at here is because otherwise
+ * there'll be deadlock between ctx_add_fence/sched_entity_push_job
+ * and gpu_reset routine's recover_bo_from_shadow on PD/PTEs' ttm bo lock
+ */
+ ttm_eu_fence_buffer_objects(&p->ticket, &p->validated, p->fence);
The order of job->base.s_fence and preventing concurrent CS was guaranteed by
the VM PD being reserved, that is now no longer the case and we need a new lock
for protection.
I suggest to add another lock to the context which we grab in
amdgpu_ctx_get() and release in amdgpu_ctx_put().
Regards,
Christian.
+
r = amdgpu_ctx_add_fence(p->ctx, ring, p->fence, &seq);
if (r) {
fence_put(p->fence);
@@ -1095,6 +1091,7 @@ int amdgpu_cs_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
{
struct amdgpu_device *adev = dev->dev_private;
union drm_amdgpu_cs *cs = data;
+ struct amdgpu_fpriv *fpriv;
struct amdgpu_cs_parser parser = {};
bool reserved_buffers = false;
int i, r;
@@ -1104,6 +1101,7 @@ int amdgpu_cs_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
parser.adev = adev;
parser.filp = filp;
+ fpriv = filp->driver_priv;
r = amdgpu_cs_parser_init(&parser, data);
if (r) {
@@ -1141,7 +1139,12 @@ int amdgpu_cs_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
r = amdgpu_cs_submit(&parser, cs);
out:
- amdgpu_cs_parser_fini(&parser, r, reserved_buffers);
+ if (!r)
+ amdgpu_vm_move_pt_bos_in_lru(parser.adev, &fpriv->vm);
+ else if (reserved_buffers)
+ ttm_eu_backoff_reservation(&parser.ticket, &parser.validated);
+
+ amdgpu_cs_parser_fini(&parser);
return r;
}
_______________________________________________
amd-gfx mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/amd-gfx