Re: [Intel-gfx] [PATCH v3] drm/i915: Fix premature release of request's reusable memory

2023-07-31 Thread Andi Shyti
Hi Janusz,

On Thu, Jul 20, 2023 at 11:35:44AM +0200, Janusz Krzysztofik wrote:
> Infinite waits for completion of GPU activity have been observed in CI,
> mostly inside __i915_active_wait(), triggered by igt@gem_barrier_race or
> igt@perf@stress-open-close.  Root cause analysis, based on ftrace dumps
> generated with a lot of extra trace_printk() calls added to the code,
> revealed loops of request dependencies being accidentally built,
> preventing the requests from being processed, each waiting for completion
> of another one's activity.
> 
> After we substitute a new request for the last active one tracked on a
> timeline, we set up a dependency of our new request to wait on completion
> of current activity of that previous one.  While doing that, we must keep
> the old request in memory until we have used its attributes to set up that
> await dependency; otherwise we may end up setting up the await dependency
> on an unrelated request that has already reused the memory of the old,
> already released one.  Combined with perf adding consecutive kernel
> context remote requests to different user context timelines, unresolvable
> loops of await dependencies can be built, leading to infinite waits.
> 
> We obtain a pointer to the previous request to wait upon when we
> substitute it with a pointer to our new request in an active tracker,
> e.g. in intel_timeline.last_request.  In some processing paths we protect
> that old request from being freed before we use it by getting a reference
> to it under RCU protection, but in others, e.g. __i915_request_commit()
> -> __i915_request_add_to_timeline() -> __i915_request_ensure_ordering(),
> we don't.  However, since the requests' memory is SLAB_TYPESAFE_BY_RCU,
> that RCU protection alone is not sufficient against reuse of the memory.
> 
> We could protect i915_request's memory from being prematurely reused by
> calling its release function via call_rcu() and consistently using
> rcu_read_lock() around accesses, as proposed in v1.  However, that
> approach leads to a significant (up to 10x) increase in utilization of
> the i915_request SLAB cache.  Another potential approach is to take a
> reference to the previous active fence.
> 
> When updating an active fence tracker, we first lock the new fence,
> replace the pointer to the current active fence with the new one, and
> only then lock the substituted fence.  With this approach, there is a
> time window between the substitution and the lock in which the request
> can be concurrently released by an interrupt handler and its memory
> reused, so we may end up locking and returning a new, unrelated request.
> 
> Always get a reference to the current active fence first, before
> replacing it with the new one.  With the current fence thus protected
> from premature release and reuse, lock it and replace it with the new
> one, but only if it has been neither signalled by a potential concurrent
> interrupt nor replaced by a potential concurrent thread; otherwise retry,
> starting from getting a reference to the new current one.  Adjust users
> to no longer take a reference to the previous active fence themselves,
> and to always put the reference returned by __i915_active_fence_set()
> when no longer needed.
> 
> v3: Fix lockdep splat reports and other issues caused by incorrect use of
> try_cmpxchg() (use (cmpxchg() != prev) instead)
> v2: Protect request's memory by getting a reference to it in favor of
> delegating its release to call_rcu() (Chris)
> 
> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/8211
> Fixes: df9f85d8582e ("drm/i915: Serialise i915_active_fence_set() with itself")
> Suggested-by: Chris Wilson 
> Signed-off-by: Janusz Krzysztofik 
> Cc:  # v5.6+

pushed to drm-intel-gt-next... thank you!

Andi


Re: [Intel-gfx] [PATCH v3] drm/i915: Fix premature release of request's reusable memory

2023-07-28 Thread Andi Shyti
Hi Janusz,

On Thu, Jul 20, 2023 at 11:35:44AM +0200, Janusz Krzysztofik wrote:
> Infinite waits for completion of GPU activity have been observed in CI,
> mostly inside __i915_active_wait(), triggered by igt@gem_barrier_race or
> igt@perf@stress-open-close.  Root cause analysis, based on ftrace dumps
> generated with a lot of extra trace_printk() calls added to the code,
> revealed loops of request dependencies being accidentally built,
> preventing the requests from being processed, each waiting for completion
> of another one's activity.
> 
> After we substitute a new request for the last active one tracked on a
> timeline, we set up a dependency of our new request to wait on completion
> of current activity of that previous one.  While doing that, we must keep
> the old request in memory until we have used its attributes to set up that
> await dependency; otherwise we may end up setting up the await dependency
> on an unrelated request that has already reused the memory of the old,
> already released one.  Combined with perf adding consecutive kernel
> context remote requests to different user context timelines, unresolvable
> loops of await dependencies can be built, leading to infinite waits.
> 
> We obtain a pointer to the previous request to wait upon when we
> substitute it with a pointer to our new request in an active tracker,
> e.g. in intel_timeline.last_request.  In some processing paths we protect
> that old request from being freed before we use it by getting a reference
> to it under RCU protection, but in others, e.g. __i915_request_commit()
> -> __i915_request_add_to_timeline() -> __i915_request_ensure_ordering(),
> we don't.  However, since the requests' memory is SLAB_TYPESAFE_BY_RCU,
> that RCU protection alone is not sufficient against reuse of the memory.
> 
> We could protect i915_request's memory from being prematurely reused by
> calling its release function via call_rcu() and consistently using
> rcu_read_lock() around accesses, as proposed in v1.  However, that
> approach leads to a significant (up to 10x) increase in utilization of
> the i915_request SLAB cache.  Another potential approach is to take a
> reference to the previous active fence.
> 
> When updating an active fence tracker, we first lock the new fence,
> replace the pointer to the current active fence with the new one, and
> only then lock the substituted fence.  With this approach, there is a
> time window between the substitution and the lock in which the request
> can be concurrently released by an interrupt handler and its memory
> reused, so we may end up locking and returning a new, unrelated request.
> 
> Always get a reference to the current active fence first, before
> replacing it with the new one.  With the current fence thus protected
> from premature release and reuse, lock it and replace it with the new
> one, but only if it has been neither signalled by a potential concurrent
> interrupt nor replaced by a potential concurrent thread; otherwise retry,
> starting from getting a reference to the new current one.  Adjust users
> to no longer take a reference to the previous active fence themselves,
> and to always put the reference returned by __i915_active_fence_set()
> when no longer needed.
> 
> v3: Fix lockdep splat reports and other issues caused by incorrect use of
> try_cmpxchg() (use (cmpxchg() != prev) instead)
> v2: Protect request's memory by getting a reference to it in favor of
> delegating its release to call_rcu() (Chris)
> 
> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/8211
> Fixes: df9f85d8582e ("drm/i915: Serialise i915_active_fence_set() with itself")
> Suggested-by: Chris Wilson 
> Signed-off-by: Janusz Krzysztofik 
> Cc:  # v5.6+

thanks for the offline clarification on this! It's another good
catch of yours :)

Reviewed-by: Andi Shyti  

Thank you!
Andi


[Intel-gfx] [PATCH v3] drm/i915: Fix premature release of request's reusable memory

2023-07-20 Thread Janusz Krzysztofik
Infinite waits for completion of GPU activity have been observed in CI,
mostly inside __i915_active_wait(), triggered by igt@gem_barrier_race or
igt@perf@stress-open-close.  Root cause analysis, based on ftrace dumps
generated with a lot of extra trace_printk() calls added to the code,
revealed loops of request dependencies being accidentally built,
preventing the requests from being processed, each waiting for completion
of another one's activity.

After we substitute a new request for the last active one tracked on a
timeline, we set up a dependency of our new request to wait on completion
of current activity of that previous one.  While doing that, we must keep
the old request in memory until we have used its attributes to set up that
await dependency; otherwise we may end up setting up the await dependency
on an unrelated request that has already reused the memory of the old,
already released one.  Combined with perf adding consecutive kernel
context remote requests to different user context timelines, unresolvable
loops of await dependencies can be built, leading to infinite waits.

We obtain a pointer to the previous request to wait upon when we
substitute it with a pointer to our new request in an active tracker,
e.g. in intel_timeline.last_request.  In some processing paths we protect
that old request from being freed before we use it by getting a reference
to it under RCU protection, but in others, e.g. __i915_request_commit()
-> __i915_request_add_to_timeline() -> __i915_request_ensure_ordering(),
we don't.  However, since the requests' memory is SLAB_TYPESAFE_BY_RCU,
that RCU protection alone is not sufficient against reuse of the memory.
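
To see why, consider the lookup pattern that SLAB_TYPESAFE_BY_RCU demands,
sketched below with an illustrative helper name and a bare fence slot; the
kernel wraps this exact loop as dma_fence_get_rcu_safe():

#include <linux/dma-fence.h>
#include <linux/rcupdate.h>

/* Hypothetical helper, for illustration only: safely pin the fence
 * currently tracked in @slot when its memory may be recycled at any
 * time by SLAB_TYPESAFE_BY_RCU reuse. */
static struct dma_fence *fence_get_stable(struct dma_fence __rcu **slot)
{
	struct dma_fence *fence;

	rcu_read_lock();
	do {
		fence = rcu_dereference(*slot);
		if (!fence)
			break;

		/* A zero refcount means the object is already being
		 * released; its memory may soon host another request. */
		if (!dma_fence_get_rcu(fence))
			continue;

		/* Re-read the slot: only if it still holds the same
		 * pointer does our reference pin the fence we saw. */
		if (fence == rcu_dereference(*slot))
			break;

		dma_fence_put(fence);	/* raced with reuse, retry */
	} while (1);
	rcu_read_unlock();

	return fence;
}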

We could protect i915_request's memory from being prematurely reused by
calling its release function via call_rcu() and consistently using
rcu_read_lock() around accesses, as proposed in v1.  However, that
approach leads to a significant (up to 10x) increase in utilization of
the i915_request SLAB cache.  Another potential approach is to take a
reference to the previous active fence.
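
A rough sketch of that rejected v1 direction, with illustrative function
names around the driver's slab_requests cache, would be:

/* Sketch of the rejected v1 approach: free the request's memory only
 * after a grace period, so rcu_read_lock() alone would pin it.
 * Illustrative only, not the actual v1 patch. */
static void i915_fence_free_rcu(struct rcu_head *rcu)
{
	struct i915_request *rq = container_of(rcu, typeof(*rq), fence.rcu);

	kmem_cache_free(slab_requests, rq);
}

static void i915_fence_release_deferred(struct dma_fence *fence)
{
	struct i915_request *rq = container_of(fence, typeof(*rq), fence);

	/* Every slab slot now lingers for a grace period before it can
	 * be recycled -- the source of the up-to-10x cache growth. */
	call_rcu(&rq->fence.rcu, i915_fence_free_rcu);
}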

When updating an active fence tracker, we first lock the new fence,
replace the pointer to the current active fence with the new one, and
only then lock the substituted fence.  With this approach, there is a
time window between the substitution and the lock in which the request
can be concurrently released by an interrupt handler and its memory
reused, so we may end up locking and returning a new, unrelated request.
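
In code terms, the problematic pre-patch ordering of
__i915_active_fence_set() looks roughly like this (callback-list
bookkeeping condensed, not the literal function body):

	spin_lock_irqsave(fence->lock, flags);		/* lock the new fence */
	prev = xchg(__active_fence_slot(active), fence);	/* publish it */
	/*
	 * RACE WINDOW: nothing pins "prev" here.  An interrupt can signal
	 * and release it, and SLAB_TYPESAFE_BY_RCU may hand its memory to
	 * a new, unrelated request before the lock below is taken.
	 */
	if (prev) {
		spin_lock_nested(prev->lock, SINGLE_DEPTH_NESTING);
		__list_del_entry(&active->cb.node);
		spin_unlock(prev->lock);
	}
	list_add_tail(&active->cb.node, &fence->cb_list);
	spin_unlock_irqrestore(fence->lock, flags);

	return prev;	/* may already name reused memory */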

Always get a reference to the current active fence first, before
replacing it with the new one.  With the current fence thus protected
from premature release and reuse, lock it and replace it with the new
one, but only if it has been neither signalled by a potential concurrent
interrupt nor replaced by a potential concurrent thread; otherwise retry,
starting from getting a reference to the new current one.  Adjust users
to no longer take a reference to the previous active fence themselves,
and to always put the reference returned by __i915_active_fence_set()
when no longer needed.
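
A minimal sketch of that replacement loop, written against a bare fence
slot and eliding the driver's callback bookkeeping (so illustrative, not
the literal __i915_active_fence_set() body):

/* Sketch only: pin the current fence before publishing the new one,
 * and install the new fence only while "prev" is still both current
 * and unsignalled; any concurrent signal or swap forces a retry. */
static struct dma_fence *
active_fence_replace(struct dma_fence __rcu **slot, struct dma_fence *fence)
{
	struct dma_fence *prev;
	unsigned long flags;

	spin_lock_irqsave(fence->lock, flags);
retry:
	rcu_read_lock();
	prev = dma_fence_get_rcu_safe(slot);	/* reference pins prev */
	rcu_read_unlock();

	if (prev) {
		spin_lock_nested(prev->lock, SINGLE_DEPTH_NESTING);
		/* A concurrent signal clears the slot, and a concurrent
		 * thread may install another fence: in either case the
		 * slot no longer holds "prev" and we must start over. */
		if (cmpxchg(slot, prev, fence) != prev) {
			spin_unlock(prev->lock);
			dma_fence_put(prev);
			goto retry;
		}
		spin_unlock(prev->lock);
	} else if (cmpxchg(slot, NULL, fence) != NULL) {
		goto retry;
	}
	spin_unlock_irqrestore(fence->lock, flags);

	return prev;	/* caller must dma_fence_put() when done */
}

The returned reference is what the adjusted callers now have to put, as
seen in i915_active_add_request() in the diff below.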

v3: Fix lockdep splat reports and other issues caused by incorrect use of
try_cmpxchg() (use (cmpxchg() != prev) instead; see the illustrative
contrast below the changelog)
v2: Protect request's memory by getting a reference to it in favor of
delegating its release to call_rcu() (Chris)
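
The contrast behind the v3 note, in illustrative form: try_cmpxchg()
writes the value it finds back into its "expected" argument on failure,
so "prev" would silently start naming a fence we hold neither a reference
to nor a lock on, while comparing cmpxchg()'s return value leaves "prev"
intact for the unlock and put in the retry path:

	/* Buggy shape (v2): on failure, try_cmpxchg() clobbers "prev". */
	if (!try_cmpxchg(slot, &prev, fence))
		goto retry;	/* retry path now unlocks/puts the wrong fence */

	/* Fixed shape (v3): "prev" is preserved across the failed exchange. */
	if (cmpxchg(slot, prev, fence) != prev)
		goto retry;	/* still holding our reference and lock on prev */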

Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/8211
Fixes: df9f85d8582e ("drm/i915: Serialise i915_active_fence_set() with itself")
Suggested-by: Chris Wilson 
Signed-off-by: Janusz Krzysztofik 
Cc:  # v5.6+
---
 drivers/gpu/drm/i915/i915_active.c  | 99 -
 drivers/gpu/drm/i915/i915_request.c | 11 
 2 files changed, 81 insertions(+), 29 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_active.c b/drivers/gpu/drm/i915/i915_active.c
index 8ef93889061a6..5ec293011d990 100644
--- a/drivers/gpu/drm/i915/i915_active.c
+++ b/drivers/gpu/drm/i915/i915_active.c
@@ -449,8 +449,11 @@ int i915_active_add_request(struct i915_active *ref, struct i915_request *rq)
 		}
 	} while (unlikely(is_barrier(active)));
 
-	if (!__i915_active_fence_set(active, fence))
+	fence = __i915_active_fence_set(active, fence);
+	if (!fence)
 		__i915_active_acquire(ref);
+	else
+		dma_fence_put(fence);
 
 out:
 	i915_active_release(ref);
@@ -469,13 +472,9 @@ __i915_active_set_fence(struct i915_active *ref,
 		return NULL;
 	}
 
-	rcu_read_lock();
 	prev = __i915_active_fence_set(active, fence);
-	if (prev)
-		prev = dma_fence_get_rcu(prev);
-	else
+	if (!prev)
 		__i915_active_acquire(ref);
-	rcu_read_unlock();
 
 	return prev;
 }
@@ -1019,10 +1018,11 @@ void i915_request_add_active_barriers(struct i915_request *rq)
  *
  * Records the new @fence as the last active fence along its timeline in
  * this active tracker, moving the tracking callbacks from the previous
  * fence onto this one. Returns the