guc: do not capture error state on exiting context

John Harrison Thu, 29 Sep 2022 09:50:18 -0700

On 9/29/2022 01:22, Tvrtko Ursulin wrote:

On 28/09/2022 19:27, John Harrison wrote:
On 9/28/2022 00:19, Tvrtko Ursulin wrote:
On 27/09/2022 22:36, Ceraolo Spurio, Daniele wrote:
On 9/27/2022 12:45 AM, Tvrtko Ursulin wrote:
On 27/09/2022 07:49, Andrzej Hajda wrote:
On 27.09.2022 01:34, Ceraolo Spurio, Daniele wrote:
On 9/26/2022 3:44 PM, Andi Shyti wrote:
Hi Andrzej,
On Mon, Sep 26, 2022 at 11:54:09PM +0200, Andrzej Hajda wrote:
Capturing error state is time consuming (up to 350ms on DG2), so it should be avoided if possible. Context reset triggered by context removal is a good example.
With this patch multiple igt tests will not timeout and should run faster.
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1551
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/3952
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5891
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6268
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6281
Signed-off-by: Andrzej Hajda <[email protected]>
fine for me:

Reviewed-by: Andi Shyti <[email protected]>

Just to be on the safe side, can we also have the ack from any of
the GuC folks? Daniele, John?

Andi
---
drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 22ba66e48a9b01..cb58029208afe1 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4425,7 +4425,8 @@ static void guc_handle_context_reset(struct intel_guc *guc,
      trace_intel_context_reset(ce);
        if (likely(!intel_context_is_banned(ce))) {
-        capture_error_state(guc, ce);
+        if (!intel_context_is_exiting(ce))
+            capture_error_state(guc, ce);
I am not sure here - if we have a persistent context which caused a GPU hang I'd expect we'd still want error capture.
What causes the reset in the affected IGTs? Always preemption timeout?
guc_context_replay(ce);
You definitely don't want to replay requests of a context that is going away.
My intention was to just avoid error capture, but that's even better, only condition change:
-        if (likely(!intel_context_is_banned(ce))) {
+       if (likely(intel_context_is_schedulable(ce)))  {
Yes that helper was intended to be used for contexts which should not be scheduled post exit or ban.
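For readers following along, the proposed condition change can be modelled outside the kernel roughly as below. This is a stand-alone sketch, not the driver code: `struct ctx_model`, the `CTX_*` flags and the two counters are hypothetical stand-ins, and it assumes that "schedulable" means neither exiting nor banned, as the helper is described in this thread.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the context state flags discussed above. */
enum { CTX_BANNED = 1u << 0, CTX_EXITING = 1u << 1 };

struct ctx_model {
	unsigned int flags;
	int captures;	/* how many times error state was captured */
	int replays;	/* how many times requests were replayed */
};

/* Assumption: schedulable means neither banned nor exiting. */
static bool ctx_is_schedulable(const struct ctx_model *ce)
{
	return !(ce->flags & (CTX_BANNED | CTX_EXITING));
}

/* Model of the reset handler using the proposed condition. */
static void handle_context_reset(struct ctx_model *ce)
{
	if (ctx_is_schedulable(ce)) {
		ce->captures++;	/* stands in for capture_error_state() */
		ce->replays++;	/* stands in for guc_context_replay() */
	}
	/* Otherwise the context is going away: no capture, no replay. */
}
```

With this shape, a context that is exiting or banned skips both the capture and the replay, which is the behaviour both reviewers ask for above.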
Daniele - you say there are some misses in the GuC backend. Should most, or even all in intel_guc_submission.c be converted to use intel_context_is_schedulable? My idea indeed was that "ban" should be a level up from the backends. Backend should only distinguish between "should I run this or not", and not the reason.
I think that all of them should be updated, but I'd like Matt B to confirm as he's more familiar with the code than me.
Right, that sounds plausible to me as well.
One thing I forgot to mention - the only place where backend can care between "schedulable" and "banned" is when it picks the preempt timeout for non-schedulable contexts. This is to only apply the strict 1ms to banned (so bad or naughty contexts), while the ones which are exiting cleanly get the full preempt timeout as otherwise configured. This solves the ugly user experience quirk where GPU resets/errors were logged upon exit/Ctrl-C of a well behaving application (using non-persistent contexts). Hopefully GuC can match that behaviour so customers stay happy.
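The timeout selection described in that paragraph could look something like the following. This is an illustrative model under stated assumptions, not the execlists or GuC code: the `CTX_*` flags, the macro and `pick_preempt_timeout_ms()` are hypothetical names, and only the 1ms value comes from the text above.

```c
/* Hypothetical flags, mirroring the "banned" vs "exiting" distinction. */
enum { CTX_BANNED = 1u << 0, CTX_EXITING = 1u << 1 };

#define BANNED_PREEMPT_TIMEOUT_MS 1u	/* the strict 1ms mentioned above */

/*
 * Pick the preempt timeout used when kicking a context off the engine.
 * configured_ms is whatever the engine is otherwise configured with.
 */
static unsigned int pick_preempt_timeout_ms(unsigned int flags,
					    unsigned int configured_ms)
{
	if (flags & CTX_BANNED)
		return BANNED_PREEMPT_TIMEOUT_MS;	/* bad context: be strict */
	return configured_ms;	/* exiting cleanly, or still schedulable */
}
```

For example, a cleanly exiting context on an engine configured with 640ms would get the full 640ms, while a banned one would get 1ms regardless of the configured value.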
Regards,

Tvrtko
The whole revoke vs ban thing seems broken to me.
First of all, if the user hits Ctrl+C we need to kill the context off immediately. That is a fundamental customer requirement. Render and compute engines have a 7.5s pre-emption timeout. The user should not have to wait 7.5s for a context to be removed from the system when they have explicitly killed it themselves. Even the regular timeout of 640ms is borderline a long time to wait. And note that there is an ongoing request/requirement to increase that to 1900ms.
Under what circumstances would a user expect anything sensible to happen after a Ctrl+C in terms of things finishing their rendering and display nice pretty images? They killed the app. They want it dead. We should be getting it off the hardware as quickly as possible. If you are really concerned about resets causing collateral damage then maybe bump the termination timeout from 1ms up to 10ms, maybe at most 100ms. If an app is 'well behaved' then it should cleanly exit within 10ms. But if it is bad (which is almost certainly the case if the user is manually and explicitly killing it) then it needs to be killed because it is not going to gracefully exit.
Right.. I had it like that initially (lower timeout - I think 20ms or so, patch history on the mailing list would know for sure), but then simplified it after review feedback to avoid adding another timeout value.
So it's not at all about any expectation that something should actually finish to any sort of completion/success. It is primarily about not logging an error message when there is no error. Thing to keep in mind is that error messages are a big deal in some cultures. In addition to that, avoiding needless engine resets is a good thing as well.

But not calling the error capture code on a banned context is a trivial change. I don't see why it is so complicated to just suppress that part of the clean up.

Previously the execlists backend was over eager and only allowed for 1ms for such contexts to exit. If the context was banned sure - that means it was a bad context which was causing many hangs already. But if the context was a clean one I argue there is no point in doing an engine reset.
So if you want, I think it is okay to re-introduce a secondary timeout.
Or if you have an idea on how to avoid the error messages / GPU resets when "friendly" contexts exit in some other way, that is also something to discuss.

Well, yes. Just don't call the error capture code for a banned context. That's the only bit that prints out any GPU hang error messages. If you don't call that, the user won't know that anything has happened.

Secondly, the whole persistence thing is a total mess, completely broken and intended to be massively simplified. See the internal task for it. In short, the plan is that all contexts will be immediately killed when the last DRM file handle is closed. Persistence is only valid between the time the per context file handle is closed and the time the master DRM handle is closed. Whereas, non-persistent contexts get killed as soon as the per context handle is closed. There is absolutely no connection to heartbeats or other irrelevant operations.
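The simplified lifetime rule described in that paragraph could be written as a single predicate. This is only an illustration of the plan as stated here; the struct, its field names and `ctx_should_be_killed()` are hypothetical, and nothing below reflects existing driver code.

```c
#include <stdbool.h>

/* Hypothetical model of the state the simplified rule depends on. */
struct ctx_lifetime {
	bool persistent;		/* context was created as persistent */
	bool ctx_handle_closed;		/* per context handle has been closed */
	bool drm_file_closed;		/* last (master) DRM file handle closed */
};

/* True when the context should be killed under the simplified plan. */
static bool ctx_should_be_killed(const struct ctx_lifetime *c)
{
	if (c->drm_file_closed)
		return true;	/* everything dies with the DRM file */
	if (c->ctx_handle_closed && !c->persistent)
		return true;	/* non-persistent: dies with its own handle */
	return false;		/* persistent contexts outlive their handle */
}
```

Note that the predicate takes no heartbeat or backend state as input, which is the point of the simplification being argued for.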
The change we are discussing is not about persistence, but for the persistence itself - I am not sure it is completely broken and if, or when, the internal task will result with anything being attempted. In the meantime we had unhappy customers for more than a year. So do we tell them "please wait for a few years more until some internal task with no clear timeline or anyone assigned maybe gets looked at"?

Persistence is totally broken for any post-execlist platform. It fundamentally relies upon code deep within the execlist backend that cannot be done with any other backend - GuC, DRM, anything that comes in the future, ... Pretty much any IGT with 'persistence' (or 'no-hangcheck') in the name is failing for GuC because of this.

Daniel Vetter's view is that any connection to a submission backend, heartbeat, or indeed anything other than file handle closure is horrendous over complication and must be removed.

The task is theoretically at the top of my todo list. But I keep getting large high priority interrupts and never manage to work on it :(. If you are feeling bored, then please pick it up. You would massively improve our DG2 pass rates...

So in my view, the best option is to revert the ban vs revoke patch. It is creating bugs. It is making persistence more complex not simpler. It harms the user experience.
I am not aware of the bugs, even less so that it is harming user experience!?

This whole thread is because there are bugs. E.g. the fact that the GuC backend did not get properly updated to cope with the new distinction of ban vs revoke. The fact that compute contexts now take 7.5s to kill via Ctrl+C. And if the user has disabled the pre-emption timeout completely then Ctrl+C just won't work at all.

Bugs are limited to the GuC backend or in general? My CI runs were clean so maybe test cases are lacking. Is it just a case of s/intel_context_is_banned/intel_context_is_schedulable/ in there to fix it?
Again, the change was not about persistence. It is the opposite - allowing non-persistent contexts to exit cleanly.

If the code being added says 'if(persistent) X; else Y;' then it is about persistence and it is making the whole persistence problem worse.

If the original problem was simply that error captures were being done on Ctrl+C then the fix is simple. Don't capture for a banned context. There is no need for all the rest of the revoke patch.
Error capture was not part of the original story so it may be a completely orthogonal topic that we are discussing it in this thread.

Then I'm lost. What was the purpose of the original change? According to the commit message, the whole point of introducing revoke was to suppress the error capture on a Ctrl+C wasn't it? - "logging engine resets during normal operation not desirable".


John


Regards,

Tvrtko

Re: [Intel-gfx] [PATCH] drm/i915/guc: do not capture error state on exiting context
