On 18-03-31 12:00:16, Chris Wilson wrote:
Quoting Kenneth Graunke (2018-03-30 19:20:57)
On Friday, March 30, 2018 7:40:13 AM PDT Chris Wilson wrote:
> For i915, we are proposing to use a quality-of-service parameter in
> addition to that of just a priority that usurps everyone. Due to our HW,
> preemption may not be immediate and will be forced to wait until an
> uncooperative process hits an arbitration point. To prevent that unduly
> impacting the privileged RealTime context, we back up the preemption
> request with a timeout to reset the GPU and forcibly evict the GPU hog
> in order to execute the new context.
I am strongly against exposing this in general. Performing a GPU reset
in the middle of a batch can completely screw up whatever application
was running. If the application is using robustness extensions, we may
be forced to return GL_DEVICE_LOST, causing the application to have to
recreate their entire GL context and start over. If not, we may try to
let them limp on(*) - and hope they didn't get too badly damaged by some
of their commands not executing, or executing twice (if the kernel tries
to resubmit it). But it may very well cause the app to misrender, or
Yes, I think the revulsion has been universal. However, as a
quality-of-service guarantee, I can understand the appeal. The
difference is that instead of allowing a DoS for 6s or so as we
currently allow, we allow that to be specified by the context. As it
does allow one context to impact another, I want it locked down to
privileged processes. I have been using CAP_SYS_ADMIN as the potential
to do harm is even greater than exploiting the weak scheduler by
I'm not terribly worried about this on our hardware for 3d. Today, there is
exactly one case I think where this would happen, if you have a sufficiently
long running shader on a sufficiently large triangle.
The concern I have is about compute where I think we don't do preemption nearly
This seems like a crazy plan to me. Scheduling has never been allowed
to just kill random processes.
That's not strictly true, as processes have their limits which if they
exceed they will be killed. On the CPU preemption is much better, the
issue of unyielding processes is pretty much limited to the kernel, where
we can run the NMI watchdog to kill broken code.
If you ever hit that case, then your
customers will see random application crashes, glitches, GPU hangs,
and be pretty unhappy with the result. And not because something was
broken, but because somebody was impatient and an app was a bit slow.
Yes, that is their decision. Kill random apps so that their
uber-critical interface updates the clock.
If you have work that is so mission critical, maybe you shouldn't run it
on the same machine as one that runs applications which you care so
little about that you're willing to watch them crash and burn. Don't
run the entertainment system on the flight computer, so to speak.
You are not the first to say that ;)
mesa-dev mailing list