On 15/10/2019 13:19, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2019-10-15 13:15:49)

On 14/10/2019 23:05, Chris Wilson wrote:
Normally, we rely on our hangcheck to prevent persistent batches from
hogging the GPU. However, if the user disables hangcheck, this mechanism
breaks down. Despite our insistence that this is unsafe, the users are
equally insistent that they want to use endless batches and will disable
the hangcheck mechanism. We are looking at perhaps replacing hangcheck
with a softer mechanism, that sends a pulse down the engine to check if
it is well. We can use the same preemptive pulse to flush an active
persistent context off the GPU upon context close, preventing resources
being lost and unkillable requests remaining on the GPU after process
termination. To avoid changing the ABI and accidentally breaking
existing userspace, we make the persistence of a context explicit and
enable it by default (matching current ABI). Userspace can opt out of
persistent mode (forcing requests to be cancelled when the context is
closed, either explicitly or through process termination) via a
context parameter. To
facilitate existing use-cases of disabling hangcheck, if the modparam is
disabled (i915.enable_hangcheck=0), we disable persistence mode by
default. (Note, one of the outcomes for supporting endless mode will
be the removal of hangchecking, at which point opting into persistent
mode will be mandatory, or perhaps the default, controlled by
cgroups.)
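
(For illustration only, not part of the patch: a minimal userspace
sketch of the opt-out, assuming the I915_CONTEXT_PARAM_PERSISTENCE
parameter added by the i915_drm.h chunk of this series; the helper
name is hypothetical.)

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <drm/i915_drm.h>

    /* Opt a context out of persistence: once set, closing the context
     * (explicitly or via process termination) cancels its outstanding
     * requests rather than letting them run to completion.
     */
    static int context_set_nonpersistent(int fd, uint32_t ctx_id)
    {
            struct drm_i915_gem_context_param p = {
                    .ctx_id = ctx_id,
                    .param = I915_CONTEXT_PARAM_PERSISTENCE,
                    .value = 0, /* 0 = cancel on close, 1 = persist */
            };

            /* fd is an open i915 DRM fd; ctx_id comes from
             * DRM_IOCTL_I915_GEM_CONTEXT_CREATE.
             */
            return ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &p);
    }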

v2: Check for hangchecking at context termination, so that we are not
left with undying contexts from a crafty user.
v3: Force context termination even if forced-preemption is disabled.

Testcase: igt/gem_ctx_persistence
Signed-off-by: Chris Wilson <ch...@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahti...@linux.intel.com>
Cc: Michał Winiarski <michal.winiar...@intel.com>
Cc: Jon Bloomfield <jon.bloomfi...@intel.com>
Reviewed-by: Jon Bloomfield <jon.bloomfi...@intel.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c   | 189 ++++++++++++++++++
 drivers/gpu/drm/i915/gem/i915_gem_context.h   |  15 ++
 .../gpu/drm/i915/gem/i915_gem_context_types.h |   1 +
 .../gpu/drm/i915/gem/selftests/mock_context.c |   2 +
 include/uapi/drm/i915_drm.h                   |  15 ++
 5 files changed, 222 insertions(+)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index 5d8221c7ba83..49f37bba5693 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -70,6 +70,7 @@
 #include <drm/i915_drm.h>
 
 #include "gt/intel_lrc_reg.h"
+#include "gt/intel_engine_heartbeat.h"
 #include "gt/intel_engine_user.h"
 
 #include "i915_gem_context.h"
@@ -269,6 +270,135 @@ void i915_gem_context_release(struct kref *ref)
 		schedule_work(&gc->free_work);
 }
+static inline struct i915_gem_engines *
+__context_engines_static(const struct i915_gem_context *ctx)
+{
+     return rcu_dereference_protected(ctx->engines, true);
+}
+
+static bool __reset_engine(struct intel_engine_cs *engine)
+{
+     struct intel_gt *gt = engine->gt;
+     bool success = false;
+
+     if (!intel_has_reset_engine(gt))
+             return false;
+
+     if (!test_and_set_bit(I915_RESET_ENGINE + engine->id,
+                           &gt->reset.flags)) {
+             success = intel_engine_reset(engine, NULL) == 0;
+             clear_and_wake_up_bit(I915_RESET_ENGINE + engine->id,
+                                   &gt->reset.flags);
+     }
+
+     return success;
+}
+
+static void __reset_context(struct i915_gem_context *ctx,
+                         struct intel_engine_cs *engine)
+{
+     intel_gt_handle_error(engine->gt, engine->mask, 0,
+                           "context closure in %s", ctx->name);
+}
+
+static bool __cancel_engine(struct intel_engine_cs *engine)
+{
+     /*
+      * Send a "high priority pulse" down the engine to cause the
+      * current request to be momentarily preempted. (If it fails to
+      * be preempted, it will be reset). As we have marked our context
+      * as banned, any incomplete request, including any running, will
+      * be skipped following the preemption.
+      *
+      * If there is no hangchecking (one of the reasons why we try to
+      * cancel the context) and no forced preemption, there may be no
+      * means by which we reset the GPU and evict the persistent hog.
+      * Ergo if we are unable to inject a preemptive pulse that can
+      * kill the banned context, we fallback to doing a local reset
+      * instead.
+      */
+     if (CONFIG_DRM_I915_PREEMPT_TIMEOUT && !intel_engine_pulse(engine))
+             return true;
+
+     /* If we are unable to send a pulse, try resetting this engine. */
+     return __reset_engine(engine);
+}

An open question from the last round is how likely forced preemption
is to be compiled out, in which case the code will always fall back to
an immediate reset, even if the workload would otherwise have
preempted just fine. (Given the dangers, as you explained, of a reset
hitting something unrelated.)

Could you always compile in forced preemption so it could be used on
context close? I am thinking: allow it to be disabled via sysfs on its
own, but on context close have it active for the context being closed,
so that if the pulse does not work it can kick in and reset. That
sounds nicer than resorting straight to an engine reset, which, as you
described, can impact someone innocent.
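
A rough sketch of that idea, purely illustrative and not part of this
patch (the per-engine engine->props.preempt_timeout_ms knob is
hypothetical here):

    /* Keep preempt-timeout support always built in; sysfs may zero
     * the per-engine timeout for normal scheduling, but force a
     * non-zero timeout for the pulse sent at context close, so a
     * non-preemptible hog is reset by forced preemption rather than
     * by an unconditional engine reset.
     */
    static bool __cancel_engine_forced(struct intel_engine_cs *engine)
    {
            unsigned long saved = engine->props.preempt_timeout_ms;
            bool ret;

            if (!saved) /* re-arm the timeout just for this pulse */
                    engine->props.preempt_timeout_ms =
                            CONFIG_DRM_I915_PREEMPT_TIMEOUT;

            ret = intel_engine_pulse(engine) == 0;

            engine->props.preempt_timeout_ms = saved;
            return ret;
    }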

I was thinking that those who care about interruptions would not
tolerate even having a preempt timer running. I would say it's
masochism and would not expect it to be disabled, but I also say the
same thing about RT.

In short, I made it optional at compiletime just because I could.

Okay, it can be improved later if need be, since nothing I am
suggesting is an ABI change.

Reviewed-by: Tvrtko Ursulin <tvrtko.ursu...@intel.com>

Regards,

Tvrtko