> -----Original Message-----
> From: Chris Wilson <[email protected]>
> Sent: Wednesday, August 7, 2019 6:23 AM
> To: [email protected]
> Cc: Joonas Lahtinen <[email protected]>; Winiarski, Michal
> <[email protected]>; Bloomfield, Jon <[email protected]>
> Subject: Re: [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close
> 
> Quoting Chris Wilson (2019-08-06 14:47:25)
> > @@ -433,6 +482,8 @@ __create_context(struct drm_i915_private *i915)
> >
> >         i915_gem_context_set_bannable(ctx);
> >         i915_gem_context_set_recoverable(ctx);
> > +       if (i915_modparams.enable_hangcheck)
> > +               i915_gem_context_set_persistence(ctx);
> 
> I am not fond of this, but from a pragmatic point of view, this does
> prevent the issue Jon raised: HW resources being pinned indefinitely
> past process termination.
> 
> I don't like it because we cannot perform the operation cleanly
> everywhere, it requires preemption first and foremost (with a cooperating
> submission backend) and per-engine reset. The alternative is to try and
> do a full GPU reset if the context is still active. For the sake of the
> issue raised, I think that (full reset on older HW) is required.
> 
> That we are baking in a change of ABI due to an unsafe modparam is meh.
> There are a few more corner cases to deal with before endless just
> works. But since it is being used in the wild, I'm not sure we can wait
> for userspace to opt-in, or wait for cgroups. However, since users are
> being encouraged to disable hangcheck, should we extend the concept of
> persistence to also mean "survives hangcheck"? -- though it should be a
> separate parameter, and I'm not sure how exactly to protect it from the
> heartbeat reset without giving gross privileges to the context. (CPU
> isolation is nicer from the pov where we can just give up and not even
> worry about the engine. If userspace can request isolation, it has the
> privilege to screw up.)
> -Chris

Ok, so your concern is supporting non-persistence on older hardware that lacks 
preemption and per-engine reset. Why is that strictly required? Can't we simply 
make it dependent on the features needed to do it well, and if your hardware 
cannot, then the advice is not to disable hangcheck? I'm doubtful that anyone 
would attempt an HPC-type workload on an IVB.
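
As an illustration only, gating non-persistence on hardware features could 
look roughly like the sketch below. Every name in it (hw_caps, ctx, 
can_cancel_on_close, ctx_set_persistence) is hypothetical and made up for this 
example; it is not the actual i915 code, and the choice of -ENODEV as the 
error is an assumption as well.

```c
#include <stdbool.h>
#include <errno.h>

/* Hypothetical capability flags; not the real i915 structures. */
struct hw_caps {
	bool has_preemption;   /* backend can preempt a running context */
	bool has_engine_reset; /* a single engine can be reset on its own */
};

/* Hypothetical context state, reduced to the one flag under discussion. */
struct ctx {
	bool persistent; /* true: requests outlive the closing of the context */
};

/* Cancelling a context cleanly on close needs both features. */
static bool can_cancel_on_close(const struct hw_caps *caps)
{
	return caps->has_preemption && caps->has_engine_reset;
}

/*
 * Returns 0 on success, or -ENODEV (an assumed error code) when
 * non-persistence is requested on hardware that cannot do it well.
 * Asking for persistence is always allowed, since that is the
 * existing behaviour.
 */
static int ctx_set_persistence(struct ctx *ctx, const struct hw_caps *caps,
			       bool persistent)
{
	if (!persistent && !can_cancel_on_close(caps))
		return -ENODEV;

	ctx->persistent = persistent;
	return 0;
}
```

With something of this shape, a context on hardware lacking either feature 
simply stays persistent, and the advice for that hardware remains not to 
disable hangcheck.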

I'm not sure I understand your "survives hangcheck" idea. You mean instead of 
simply disabling hangcheck, just opt in to persistence and have that also 
prevent hangcheck? Isn't that the wrong way around, since persistence is what 
is happening today?

_______________________________________________
Intel-gfx mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
