Re: [Intel-gfx] drm/i915: Watchdog timeout: IRQ handler for gen8+

Antonio Argenziano Wed, 16 Jan 2019 09:42:07 -0800


On 16/01/19 08:15, Tvrtko Ursulin wrote:

On 11/01/2019 21:28, John Harrison wrote:
On 1/11/2019 09:31, Antonio Argenziano wrote:
On 11/01/19 00:22, Tvrtko Ursulin wrote:
On 11/01/2019 00:47, Antonio Argenziano wrote:
On 07/01/19 08:58, Tvrtko Ursulin wrote:
On 07/01/2019 13:57, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2019-01-07 13:43:29)
On 07/01/2019 11:58, Tvrtko Ursulin wrote:

[snip]
Note about future interaction with preemption: Preemption could happen
in a command sequence prior to watchdog counter getting disabled,
resulting in watchdog being triggered following preemption (e.g. when watchdog had been enabled in the low priority batch). The driver will
need to explicitly disable the watchdog counter as part of the
preemption sequence.
Does the series take care of preemption?
I did not find that it does.
Oh. I hoped that the watchdog was saved as part of the context... Then despite preemption, the timeout would resume from where we left off as
soon as it was back on the gpu.
If the timeout remaining was context saved it would be much simpler (at
least on first glance), please say it is.
I made my comments going only by the text from the commit message and the absence of any preemption special handling.
Having read the spec, the situation seems like this:
* Watchdog control and threshold register are context saved and restored.
* On a context switch watchdog counter is reset to zero and automatically disabled until enabled by a context restore or explicitly.
So it sounds like the commit message could be wrong that special handling is needed from this direction. But read till the end on the restriction listed.
* Watchdog counter is reset to zero and is not accumulated across multiple submissions of the same context (due to preemption).
I read this as - after preemption a context gets a new full timeout allocation. Or in other words, if a context is preempted N times, its cumulative watchdog timeout will be N * set value.
This could be theoretically exploitable to bypass the timeout. If a client sets up two contexts with prio -1 and -2, and keeps submitting periodical no-op batches against prio -1 context, while prio -2 is its own hog, then prio -2 context defeats the watchdog timer. I think.. would appreciate if someone challenged this conclusion.
I think you are right that is a possibility but, is that a problem? The client can just not set the threshold to bypass the timeout. Also because you need the hanging batch to be simply preemptible, you cannot disrupt any work from another client that is higher priority. This is
But I think a higher priority client can have the same effect on the lower priority one purely by accident, no?
As a real world example, user kicks off a background transcoding job, which happens to use prio -2, and uses the watchdog timer.
At the same time user watches a video from a player of normal priority. This causes periodic, say 24Hz, preemption events, due to frame decoding activity on the same engine as the transcoding client.
Does this defeat the watchdog timer for the former is the question? Then the questions of can we do something about it and whether it really isn't a problem?
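To put numbers on the 24Hz example above, here is a minimal sketch (a hypothetical model, not i915 code) that compares a counter that restarts from zero on every resubmission, as the spec text quoted earlier describes, with a counter that keeps accumulating across preemptions. The function names and the 100 ms threshold used below are assumptions for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model: the counter restarts from zero each time the context
 * is resubmitted, so only a single uninterrupted slice can reach the
 * threshold, no matter how many slices the workload gets in total.
 */
bool resetting_watchdog_fires(uint64_t threshold_us, uint64_t slice_us,
                              unsigned int slices)
{
    return slices > 0 && slice_us >= threshold_us;
}

/*
 * Hypothetical model: the counter is preserved across preemption, so the
 * total active time of the workload is what counts.
 */
bool accumulating_watchdog_fires(uint64_t threshold_us, uint64_t slice_us,
                                 unsigned int slices)
{
    return (uint64_t)slices * slice_us >= threshold_us;
}
```

With preemption at 24Hz each slice is roughly 41666 us. Under these assumptions a stuck batch with a 100000 us threshold never trips the resetting counter, however many slices it receives, while the accumulating counter trips during the third slice.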
I guess it depends if you consider that timeout as the maximum lifespan a workload can have or max contiguous active time.
I believe the intended purpose of the watchdog is to prevent broken bitstreams hanging the transcoder/player. That is, it is a form of error detection used by the media driver to handle bad user input. So if there is a way for the watchdog to be extended indefinitely under normal situations, that would be a problem. It means the transcoder will not detect the broken input data in a timely manner and effectively hang rather than skip over to the next packet. And note that broken input data can be caused by something as innocent as a dropped packet due to high network contention. No need for any malicious activity at all.
My understanding of the intended purpose is the same. And it would be a very useful feature.

I'm not familiar enough with the application but, in the scenario above, what if the batch that is being preempted is not stuck but just nice enough to be preempted enough times so that it wouldn't complete in the given wall clock time but would be fast enough by itself?

Chris mentioned the other day that until hardware is fixed to context save/restore the watchdog counter this could simply be implemented using timers. And I have to say I agree. Shouldn't be too hard to prototype it using hrtimers - start on context in, stop on context out and kick forward on user interrupts. More or less.
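The accounting behind "start on context in, stop on context out and kick forward on user interrupts" could look roughly like the userspace sketch below. It is not kernel code: every name in it is hypothetical, timestamps are passed in explicitly, and expiry is polled rather than delivered by a timer callback. It also assumes that "kick forward" means granting a fresh budget once a request signals completion, which is only one reading of that phrase. In a driver the context-in and context-out hooks would presumably arm and cancel an hrtimer (hrtimer_start() / hrtimer_cancel()) for the remaining budget.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software watchdog state for one context. */
struct sw_watchdog {
    uint64_t budget_ns;  /* allowed active time per request */
    uint64_t used_ns;    /* active time accumulated so far */
    uint64_t started_ns; /* timestamp of the last context-in */
    bool running;        /* true while the context is on the engine */
};

void wd_init(struct sw_watchdog *wd, uint64_t budget_ns)
{
    wd->budget_ns = budget_ns;
    wd->used_ns = 0;
    wd->started_ns = 0;
    wd->running = false;
}

/* Context scheduled in: start counting from now. */
void wd_context_in(struct sw_watchdog *wd, uint64_t now_ns)
{
    wd->started_ns = now_ns;
    wd->running = true;
}

/* Context scheduled out (e.g. preempted): bank the time used so far. */
void wd_context_out(struct sw_watchdog *wd, uint64_t now_ns)
{
    if (wd->running) {
        wd->used_ns += now_ns - wd->started_ns;
        wd->running = false;
    }
}

/* Assumed meaning of "kick forward": a request completed, so reset usage. */
void wd_user_interrupt(struct sw_watchdog *wd, uint64_t now_ns)
{
    wd->used_ns = 0;
    if (wd->running)
        wd->started_ns = now_ns;
}

/* True once banked time plus the current slice reaches the budget. */
bool wd_expired(const struct sw_watchdog *wd, uint64_t now_ns)
{
    uint64_t used = wd->used_ns;

    if (wd->running)
        used += now_ns - wd->started_ns;
    return used >= wd->budget_ns;
}
```

Because the banked time survives wd_context_out(), a context that is preempted repeatedly still runs out of budget, unlike the hardware counter that resets on every context switch.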

Would this implement the feature on the driver side just like it would for the HW? I mean have the same IOCTL and silently discard workload that hit the timeout. Also, would it discard batches while they are in the queue (not active)?


Antonio

Then if the cost of these hrtimer manipulations wouldn't show in profiles significantly we would have a solution. At least in execlists mode. :) But in parallel we could file a feature request to fix the hardware implementation and then could just switch the timer "backend" from hrtimers to GPU.
Regards,

Tvrtko

_______________________________________________
Intel-gfx mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
