Re: [Intel-gfx] [PATCH] drm/i915: Hide one invalid cancellation bug in i915_switch_context()

2015-12-22 Thread Chris Wilson
On Mon, Dec 21, 2015 at 12:59:01PM +, Chris Wilson wrote:
> Hmm, it leaks a bit furter than execbuffer. For example, we need to
> submit the request to maintain our state tracking correctly for the
> semaphores and whatnot for incomplete CS flips.
> 
> From inspection, we need:

And there's more from engine->init_context().
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915: Hide one invalid cancellation bug in i915_switch_context()

2015-12-21 Thread Chris Wilson
On Mon, Dec 21, 2015 at 01:28:17PM +0100, Daniel Vetter wrote:
> On Thu, Dec 17, 2015 at 06:18:05PM +, Chris Wilson wrote:
> > As we add the VMA to the request early, it may be cancelled during
> > execbuf reservation. This will leave the context object pointing to a
> > dangling request; i915_wait_request() simply skips the wait and so we
> > may unbind the object whilst it is still active.
> > 
> > We can partially prevent such atrocity by doing the RCS context
> > initialisation earlier. This ensures that one callsite from blowing up
> > (and for igt this is a frequent culprit due to how the stressful batches
> > are submitted).
> > 
> > Signed-off-by: Chris Wilson 
> > Cc: Daniel Vetter 
> > Cc: sta...@vger.kernel.org
> > ---
> >  drivers/gpu/drm/i915/i915_gem_context.c | 29 +
> >  1 file changed, 17 insertions(+), 12 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_gem_context.c 
> > b/drivers/gpu/drm/i915/i915_gem_context.c
> > index 900ffd044db8..657686e6492f 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_context.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> > @@ -657,7 +657,6 @@ static int do_switch(struct drm_i915_gem_request *req)
> > struct drm_i915_private *dev_priv = ring->dev->dev_private;
> > struct intel_context *from = ring->last_context;
> > u32 hw_flags = 0;
> > -   bool uninitialized = false;
> > int ret, i;
> >  
> > if (from != NULL && ring == _priv->ring[RCS]) {
> > @@ -764,6 +763,15 @@ static int do_switch(struct drm_i915_gem_request *req)
> > to->remap_slice &= ~(1< > }
> >  
> > +   if (!to->legacy_hw_ctx.initialized) {
> > +   if (ring->init_context) {
> > +   ret = ring->init_context(req);
> > +   if (ret)
> > +   goto unpin_out;
> > +   }
> > +   to->legacy_hw_ctx.initialized = true;
> > +   }
> > +
> > /* The backing object for the context is done after switching to the
> >  * *next* context. Therefore we cannot retire the previous context until
> >  * the next context has already started running. In fact, the below code
> > @@ -772,6 +780,14 @@ static int do_switch(struct drm_i915_gem_request *req)
> >  */
> > if (from != NULL) {
> > from->legacy_hw_ctx.rcs_state->base.read_domains = 
> > I915_GEM_DOMAIN_INSTRUCTION;
> > +   /* XXX Note very well this is dangerous!
> > +* We are pinning this object using this request as our
> > +* active reference. However, this request may yet be cancelled
> > +* during the execbuf dispatch, leaving us waiting on a
> > +* dangling request. Waiting upon this dangling request is
> > +* ignored, which means that we may unbind the context whilst
> > +* the GPU is still writing to the backing storage.
> > +*/
> > 
> > i915_vma_move_to_active(i915_gem_obj_to_ggtt(from->legacy_hw_ctx.rcs_state),
> >  req);
> 
> Hm, since this is just a partial fix, what about instead marking any
> request that has been put to use already in move_to_active. And then when
> we try to cancel it from execbuf notice that and not cancel it? Fixes both
> bugs and makes the entire thing a bit more robust since it allows us to
> throw stuff at a request without ordering constraints.

Hmm, it leaks a bit furter than execbuffer. For example, we need to
submit the request to maintain our state tracking correctly for the
semaphores and whatnot for incomplete CS flips.

From inspection, we need:

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 730a6d2f5163..c33dd6f59064 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2708,12 +2708,9 @@ int i915_gpu_idle(struct drm_device *dev)
return PTR_ERR(req);
 
ret = i915_switch_context(req);
-   if (ret) {
-   i915_gem_request_cancel(req);
-   return ret;
-   }
-
i915_add_request_no_flush(req);
+   if (ret)
+   return ret;
}
 
ret = intel_engine_idle(ring);
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c 
b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index be2302f8a0cf..2496dc0948e1 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -1306,7 +1306,6 @@ execbuf_submit(struct i915_execbuffer_params *params,
trace_i915_gem_ring_dispatch(params->request, params->dispatch_flags);
 
i915_gem_execbuffer_move_to_active(vmas, params->request);
-   i915_gem_execbuffer_retire_commands(params);
 
return 0;
 }
@@ -1603,8 +1602,10 @@ i915_gem_do_execbuffer(struct drm_device 

Re: [Intel-gfx] [PATCH] drm/i915: Hide one invalid cancellation bug in i915_switch_context()

2015-12-21 Thread Daniel Vetter
On Thu, Dec 17, 2015 at 06:18:05PM +, Chris Wilson wrote:
> As we add the VMA to the request early, it may be cancelled during
> execbuf reservation. This will leave the context object pointing to a
> dangling request; i915_wait_request() simply skips the wait and so we
> may unbind the object whilst it is still active.
> 
> We can partially prevent such atrocity by doing the RCS context
> initialisation earlier. This ensures that one callsite from blowing up
> (and for igt this is a frequent culprit due to how the stressful batches
> are submitted).
> 
> Signed-off-by: Chris Wilson 
> Cc: Daniel Vetter 
> Cc: sta...@vger.kernel.org
> ---
>  drivers/gpu/drm/i915/i915_gem_context.c | 29 +
>  1 file changed, 17 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c 
> b/drivers/gpu/drm/i915/i915_gem_context.c
> index 900ffd044db8..657686e6492f 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -657,7 +657,6 @@ static int do_switch(struct drm_i915_gem_request *req)
>   struct drm_i915_private *dev_priv = ring->dev->dev_private;
>   struct intel_context *from = ring->last_context;
>   u32 hw_flags = 0;
> - bool uninitialized = false;
>   int ret, i;
>  
>   if (from != NULL && ring == _priv->ring[RCS]) {
> @@ -764,6 +763,15 @@ static int do_switch(struct drm_i915_gem_request *req)
>   to->remap_slice &= ~(1<   }
>  
> + if (!to->legacy_hw_ctx.initialized) {
> + if (ring->init_context) {
> + ret = ring->init_context(req);
> + if (ret)
> + goto unpin_out;
> + }
> + to->legacy_hw_ctx.initialized = true;
> + }
> +
>   /* The backing object for the context is done after switching to the
>* *next* context. Therefore we cannot retire the previous context until
>* the next context has already started running. In fact, the below code
> @@ -772,6 +780,14 @@ static int do_switch(struct drm_i915_gem_request *req)
>*/
>   if (from != NULL) {
>   from->legacy_hw_ctx.rcs_state->base.read_domains = 
> I915_GEM_DOMAIN_INSTRUCTION;
> + /* XXX Note very well this is dangerous!
> +  * We are pinning this object using this request as our
> +  * active reference. However, this request may yet be cancelled
> +  * during the execbuf dispatch, leaving us waiting on a
> +  * dangling request. Waiting upon this dangling request is
> +  * ignored, which means that we may unbind the context whilst
> +  * the GPU is still writing to the backing storage.
> +  */
>   
> i915_vma_move_to_active(i915_gem_obj_to_ggtt(from->legacy_hw_ctx.rcs_state), 
> req);

Hm, since this is just a partial fix, what about instead marking any
request that has been put to use already in move_to_active. And then when
we try to cancel it from execbuf notice that and not cancel it? Fixes both
bugs and makes the entire thing a bit more robust since it allows us to
throw stuff at a request without ordering constraints.
-Daniel

>   /* As long as MI_SET_CONTEXT is serializing, ie. it flushes the
>* whole damn pipeline, we don't need to explicitly mark the
> @@ -787,21 +803,10 @@ static int do_switch(struct drm_i915_gem_request *req)
>   i915_gem_context_unreference(from);
>   }
>  
> - uninitialized = !to->legacy_hw_ctx.initialized;
> - to->legacy_hw_ctx.initialized = true;
> -
>  done:
>   i915_gem_context_reference(to);
>   ring->last_context = to;
>  
> - if (uninitialized) {
> - if (ring->init_context) {
> - ret = ring->init_context(req);
> - if (ret)
> - DRM_ERROR("ring init context: %d\n", ret);
> - }
> - }
> -
>   return 0;
>  
>  unpin_out:
> -- 
> 2.6.4
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


[Intel-gfx] [PATCH] drm/i915: Hide one invalid cancellation bug in i915_switch_context()

2015-12-17 Thread Chris Wilson
As we add the VMA to the request early, it may be cancelled during
execbuf reservation. This will leave the context object pointing to a
dangling request; i915_wait_request() simply skips the wait and so we
may unbind the object whilst it is still active.

We can partially prevent such atrocity by doing the RCS context
initialisation earlier. This ensures that one callsite from blowing up
(and for igt this is a frequent culprit due to how the stressful batches
are submitted).

Signed-off-by: Chris Wilson 
Cc: Daniel Vetter 
Cc: sta...@vger.kernel.org
---
 drivers/gpu/drm/i915/i915_gem_context.c | 29 +
 1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c 
b/drivers/gpu/drm/i915/i915_gem_context.c
index 900ffd044db8..657686e6492f 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -657,7 +657,6 @@ static int do_switch(struct drm_i915_gem_request *req)
struct drm_i915_private *dev_priv = ring->dev->dev_private;
struct intel_context *from = ring->last_context;
u32 hw_flags = 0;
-   bool uninitialized = false;
int ret, i;
 
if (from != NULL && ring == _priv->ring[RCS]) {
@@ -764,6 +763,15 @@ static int do_switch(struct drm_i915_gem_request *req)
to->remap_slice &= ~(1<legacy_hw_ctx.initialized) {
+   if (ring->init_context) {
+   ret = ring->init_context(req);
+   if (ret)
+   goto unpin_out;
+   }
+   to->legacy_hw_ctx.initialized = true;
+   }
+
/* The backing object for the context is done after switching to the
 * *next* context. Therefore we cannot retire the previous context until
 * the next context has already started running. In fact, the below code
@@ -772,6 +780,14 @@ static int do_switch(struct drm_i915_gem_request *req)
 */
if (from != NULL) {
from->legacy_hw_ctx.rcs_state->base.read_domains = 
I915_GEM_DOMAIN_INSTRUCTION;
+   /* XXX Note very well this is dangerous!
+* We are pinning this object using this request as our
+* active reference. However, this request may yet be cancelled
+* during the execbuf dispatch, leaving us waiting on a
+* dangling request. Waiting upon this dangling request is
+* ignored, which means that we may unbind the context whilst
+* the GPU is still writing to the backing storage.
+*/

i915_vma_move_to_active(i915_gem_obj_to_ggtt(from->legacy_hw_ctx.rcs_state), 
req);
/* As long as MI_SET_CONTEXT is serializing, ie. it flushes the
 * whole damn pipeline, we don't need to explicitly mark the
@@ -787,21 +803,10 @@ static int do_switch(struct drm_i915_gem_request *req)
i915_gem_context_unreference(from);
}
 
-   uninitialized = !to->legacy_hw_ctx.initialized;
-   to->legacy_hw_ctx.initialized = true;
-
 done:
i915_gem_context_reference(to);
ring->last_context = to;
 
-   if (uninitialized) {
-   if (ring->init_context) {
-   ret = ring->init_context(req);
-   if (ret)
-   DRM_ERROR("ring init context: %d\n", ret);
-   }
-   }
-
return 0;
 
 unpin_out:
-- 
2.6.4

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx