Re: [Intel-gfx] [PATCH] drm/i915: Do not overwrite the request with zero on reallocation

Daniel Vetter Fri, 05 Aug 2016 09:17:45 -0700

On Fri, Aug 05, 2016 at 04:13:28PM +0100, Chris Wilson wrote:
> When using RCU lookup for the request, commit 0eafec6d3244 ("drm/i915:
> Enable lockless lookup of request tracking via RCU"), we acknowledge that
> we may race with another thread that could have reallocated the request.
> In order for the first thread not to blow up, the second thread must not
> clear the request completed before overwriting it. In the RCU lookup, we
> allow for the engine/seqno to be replaced but we do not allow for it to
> be zeroed.


First few remarks:
- Commit message definitely needs to explain the tradeoff between avoiding
  the memset and just making req->engine lookup a bit safer for _rcu like
  below:


diff --git a/drivers/gpu/drm/i915/i915_gem_request.h b/drivers/gpu/drm/i915/i915_gem_request.h
index 6002adc43523..e55492ba20ec 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.h
+++ b/drivers/gpu/drm/i915/i915_gem_request.h
@@ -244,6 +244,26 @@ i915_gem_request_started(const struct drm_i915_gem_request *req)
 }
 
 static inline bool
+i915_gem_request_completed_rcu(const struct drm_i915_gem_request *req)
+{
+       struct intel_engine_cs *engine = READ_ONCE(req->engine);
+
+       /* When we peek at a request solely under rcu protection, without
+        * holding a full reference, the request might be in the process of
+        * getting freed and reallocated. Make sure we don't stumble over a NULL
+        * engine in that case.
+        *
+        * If we are hitting this race it means that the old request has been
+        * released, which only happens once it has completed.
+        */
+       if (!engine)
+               return true;
+
+       return i915_seqno_passed(intel_engine_get_seqno(engine),
+                                req->fence.seqno);
+}
+
+static inline bool
 i915_gem_request_completed(const struct drm_i915_gem_request *req)
 {
        return i915_seqno_passed(intel_engine_get_seqno(req->engine),
@@ -384,7 +404,7 @@ i915_gem_active_peek_rcu(const struct i915_gem_active *active)
        struct drm_i915_gem_request *request;
 
        request = rcu_dereference(active->request);
-       if (!request || i915_gem_request_completed(request))
+       if (!request || i915_gem_request_completed_rcu(request))
                return NULL;
 
        return request;
@@ -459,7 +479,7 @@ __i915_gem_active_get_rcu(const struct i915_gem_active *active)
                struct drm_i915_gem_request *request;
 
                request = rcu_dereference(active->request);
-               if (!request || i915_gem_request_completed(request))
+               if (!request || i915_gem_request_completed_rcu(request))
                        return NULL;
 
                request = i915_gem_request_get_rcu(request);

I'd go as far as putting this as an alternative fix into the changelog.
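
To make the alternative above concrete outside the kernel tree, here is a minimal userspace sketch of the same NULL-tolerant completion check. Everything prefixed with sketch_ is a hypothetical stand-in, not driver code: C11 atomics model READ_ONCE(), and sketch_seqno_passed() assumes the usual signed-difference comparison for wrapping 32-bit seqnos.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins for the driver structures. */
struct sketch_engine {
	_Atomic uint32_t hw_seqno;	/* last seqno the "hardware" completed */
};

struct sketch_request {
	struct sketch_engine *_Atomic engine;	/* may read as NULL mid-reuse */
	uint32_t seqno;
};

/* Wraparound-safe "a is at or after b" on 32-bit sequence numbers. */
static inline bool sketch_seqno_passed(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) >= 0;
}

static inline bool sketch_request_completed_rcu(struct sketch_request *req)
{
	struct sketch_engine *engine;

	assert(req != NULL);

	/* Load the pointer exactly once so the NULL check and the use agree. */
	engine = atomic_load_explicit(&req->engine, memory_order_relaxed);

	/* A NULL engine means the slot is being recycled, so the old
	 * request must already have completed. */
	if (!engine)
		return true;

	return sketch_seqno_passed(
		atomic_load_explicit(&engine->hw_seqno, memory_order_relaxed),
		req->seqno);
}
```

The cost is one extra load and branch on every peek, which is the overhead being weighed against dropping the memset.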

- We need a big honking comment somewhere (probably right above the
  kmem_cache_alloc) why this is not zalloc. Proposal:

        /* Reallocation can race with rcu-protected request lookup. The
         * request lookup code does eventually acquire a full reference,
         * but before that it has a fast-path to peek at the request
         * completion. We must make sure that that code can't fall over
         * a request in the process of getting reinitialized here. Since
         * it's a pure optimization, data integrity is not important; the
         * only risk is in chasing NULL pointers. Currently this is only
         * request->engine which must not be cleared.
         *
         * Alternative fix would be to make the request peeking more
         * robust, but that's overhead. Also, requests get reallocated a
         * lot, so avoiding the memset makes sense. Hence this is not
         * allocated with kmem_cache_zalloc, which is a rare exception in
         * the i915 driver.
         *
         * BEWARE: Everything must be correctly initialized or set to
         * NULL!
         */
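
For illustration only, a single-threaded userspace sketch of the discipline that comment asks for: a recycled slot is reinitialized field by field, so engine goes straight from the old valid pointer to the new one and is never transiently NULL. The struct and function names are hypothetical; the three pointer fields only mirror the ones the patch resets and are not the real request layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_engine {
	uint32_t hw_seqno;
};

/* Hypothetical, heavily trimmed request layout. */
struct sketch_request {
	struct sketch_engine *engine;	/* must never be cleared on reuse */
	uint32_t seqno;
	void *wait_tsk;
	void *previous_context;
	void *file_priv;
};

/* (Re)initialize a slot that may still hold a previous request's contents. */
static void sketch_request_init(struct sketch_request *req,
				struct sketch_engine *engine, uint32_t seqno)
{
	assert(engine != NULL);

	/* No memset(): a concurrent peek sees the old or the new engine. */
	req->engine = engine;
	req->seqno = seqno;

	/* Everything a zeroing allocator used to clear is now set by hand. */
	req->wait_tsk = NULL;
	req->previous_context = NULL;
	req->file_priv = NULL;
}
```

A real implementation would also need the stores to be visible to lockless readers in a safe order; this sketch only shows which fields have to be written explicitly once the zeroing allocation is gone.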
> 
> Fixes: 0eafec6d3244 ("drm/i915: Enable lockless lookup of request...")
> Signed-off-by: Chris Wilson <[email protected]>
> Cc: "Goel, Akash" <[email protected]>
> Cc: Daniel Vetter <[email protected]>
> Cc: Joonas Lahtinen <[email protected]>
> ---
>  drivers/gpu/drm/i915/i915_gem_request.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
> index b317a672040f..7529b6b5deda 100644
> --- a/drivers/gpu/drm/i915/i915_gem_request.c
> +++ b/drivers/gpu/drm/i915/i915_gem_request.c
> @@ -355,7 +355,7 @@ i915_gem_request_alloc(struct intel_engine_cs *engine,
>       if (req && i915_gem_request_completed(req))
>               i915_gem_request_retire(req);
>  
> -     req = kmem_cache_zalloc(dev_priv->requests, GFP_KERNEL);
> +     req = kmem_cache_alloc(dev_priv->requests, GFP_KERNEL);
>       if (!req)
>               return ERR_PTR(-ENOMEM);
>  
> @@ -375,6 +375,10 @@ i915_gem_request_alloc(struct intel_engine_cs *engine,
>       req->engine = engine;
>       req->ctx = i915_gem_context_get(ctx);
>  
> +     req->signaling.wait.tsk = NULL;

Do we need to reinit this? The important bit is that we remove ourselves
from the rb tree, and we do that in intel_engine_remove_wait.

> +     req->previous_context = NULL;

Should we move that into the retire function where we call the lrc unpin?

> +     req->file_priv = NULL;

We already clear this in remove_from_client.

Admittedly didn't do a full audit whether those are all we need yet.
-Daniel

> +
>       /*
>        * Reserve space in the ring buffer for all the commands required to
>        * eventually emit this request. This is to guarantee that the
> -- 
> 2.8.1
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/intel-gfx