Re: [Intel-gfx] [PATCH] drm/i915: Hide one invalid cancellation bug in i915_switch_context()

Chris Wilson Mon, 21 Dec 2015 04:59:42 -0800

On Mon, Dec 21, 2015 at 01:28:17PM +0100, Daniel Vetter wrote:
> On Thu, Dec 17, 2015 at 06:18:05PM +0000, Chris Wilson wrote:
> > As we add the VMA to the request early, it may be cancelled during
> > execbuf reservation. This will leave the context object pointing to a
> > dangling request; i915_wait_request() simply skips the wait and so we
> > may unbind the object whilst it is still active.
> > 
> > We can partially prevent such atrocity by doing the RCS context
> > initialisation earlier. This ensures that one callsite from blowing up
> > (and for igt this is a frequent culprit due to how the stressful batches
> > are submitted).
> > 
> > Signed-off-by: Chris Wilson <[email protected]>
> > Cc: Daniel Vetter <[email protected]>
> > Cc: [email protected]
> > ---
> >  drivers/gpu/drm/i915/i915_gem_context.c | 29 +++++++++++++++++------------
> >  1 file changed, 17 insertions(+), 12 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_gem_context.c 
> > b/drivers/gpu/drm/i915/i915_gem_context.c
> > index 900ffd044db8..657686e6492f 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_context.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> > @@ -657,7 +657,6 @@ static int do_switch(struct drm_i915_gem_request *req)
> >     struct drm_i915_private *dev_priv = ring->dev->dev_private;
> >     struct intel_context *from = ring->last_context;
> >     u32 hw_flags = 0;
> > -   bool uninitialized = false;
> >     int ret, i;
> >  
> >     if (from != NULL && ring == &dev_priv->ring[RCS]) {
> > @@ -764,6 +763,15 @@ static int do_switch(struct drm_i915_gem_request *req)
> >                     to->remap_slice &= ~(1<<i);
> >     }
> >  
> > +   if (!to->legacy_hw_ctx.initialized) {
> > +           if (ring->init_context) {
> > +                   ret = ring->init_context(req);
> > +                   if (ret)
> > +                           goto unpin_out;
> > +           }
> > +           to->legacy_hw_ctx.initialized = true;
> > +   }
> > +
> >     /* The backing object for the context is done after switching to the
> >      * *next* context. Therefore we cannot retire the previous context until
> >      * the next context has already started running. In fact, the below code
> > @@ -772,6 +780,14 @@ static int do_switch(struct drm_i915_gem_request *req)
> >      */
> >     if (from != NULL) {
> >             from->legacy_hw_ctx.rcs_state->base.read_domains = 
> > I915_GEM_DOMAIN_INSTRUCTION;
> > +           /* XXX Note very well this is dangerous!
> > +            * We are pinning this object using this request as our
> > +            * active reference. However, this request may yet be cancelled
> > +            * during the execbuf dispatch, leaving us waiting on a
> > +            * dangling request. Waiting upon this dangling request is
> > +            * ignored, which means that we may unbind the context whilst
> > +            * the GPU is still writing to the backing storage.
> > +            */
> >             
> > i915_vma_move_to_active(i915_gem_obj_to_ggtt(from->legacy_hw_ctx.rcs_state),
> >  req);
> 
> Hm, since this is just a partial fix, what about instead marking any
> request that has been put to use already in move_to_active. And then when
> we try to cancel it from execbuf notice that and not cancel it? Fixes both
> bugs and makes the entire thing a bit more robust since it allows us to
> throw stuff at a request without ordering constraints.


Hmm, it leaks a bit furter than execbuffer. For example, we need to
submit the request to maintain our state tracking correctly for the
semaphores and whatnot for incomplete CS flips.

From inspection, we need:

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 730a6d2f5163..c33dd6f59064 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2708,12 +2708,9 @@ int i915_gpu_idle(struct drm_device *dev)
                                return PTR_ERR(req);
 
                        ret = i915_switch_context(req);
-                       if (ret) {
-                               i915_gem_request_cancel(req);
-                               return ret;
-                       }
-
                        i915_add_request_no_flush(req);
+                       if (ret)
+                               return ret;
                }
 
                ret = intel_engine_idle(ring);
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c 
b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index be2302f8a0cf..2496dc0948e1 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -1306,7 +1306,6 @@ execbuf_submit(struct i915_execbuffer_params *params,
        trace_i915_gem_ring_dispatch(params->request, params->dispatch_flags);
 
        i915_gem_execbuffer_move_to_active(vmas, params->request);
-       i915_gem_execbuffer_retire_commands(params);
 
        return 0;
 }
@@ -1603,8 +1602,10 @@ i915_gem_do_execbuffer(struct drm_device *dev, void 
*data,
        }
 
        ret = i915_gem_request_add_to_client(params->request, file);
-       if (ret)
-               goto err_request;
+       if (ret) {
+               i915_gem_request_cancel(params->request);
+               goto err_batch_unpin;
+       }
 
        /*
         * Save assorted stuff away to pass through to *_submission().
@@ -1620,15 +1621,7 @@ i915_gem_do_execbuffer(struct drm_device *dev, void 
*data,
        params->ctx                     = ctx;
 
        ret = execbuf_submit(params, args, &eb->vmas);
-
-err_request:
-       /*
-        * If the request was created but not successfully submitted then it
-        * must be freed again. If it was submitted then it is being tracked
-        * on the active request list and no clean up is required here.
-        */
-       if (ret)
-               i915_gem_request_cancel(params->request);
+       i915_gem_execbuffer_retire_commands(params);
 
 err_batch_unpin:
        /*
diff --git a/drivers/gpu/drm/i915/intel_display.c 
b/drivers/gpu/drm/i915/intel_display.c
index f73c06387ac3..c01765dea0c9 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -11671,7 +11671,7 @@ cleanup_unpin:
        intel_unpin_fb_obj(fb, crtc->primary->state);
 cleanup_request:
        if (request)
-               i915_gem_request_cancel(request);
+               i915_add_request_no_flush(request);
 cleanup_pending:
        atomic_dec(&intel_crtc->unpin_work_count);
        mutex_unlock(&dev->struct_mutex);

And then everytime we need to consider, can I actually cancel this request?
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

Re: [Intel-gfx] [PATCH] drm/i915: Hide one invalid cancellation bug in i915_switch_context()

Reply via email to