a for compute workloads

John Harrison Fri, 25 Feb 2022 10:01:35 -0800

On 2/25/2022 09:39, Tvrtko Ursulin wrote:

On 25/02/2022 17:11, John Harrison wrote:
On 2/25/2022 08:36, Tvrtko Ursulin wrote:
On 24/02/2022 20:02, John Harrison wrote:
On 2/23/2022 04:00, Tvrtko Ursulin wrote:
On 23/02/2022 02:22, John Harrison wrote:
On 2/22/2022 01:53, Tvrtko Ursulin wrote:
On 18/02/2022 21:33, john.c.harri...@intel.com wrote:
From: John Harrison <john.c.harri...@intel.com>
Compute workloads are inherently not pre-emptible on current hardware. Thus the pre-emption timeout was disabled as a workaround to prevent unwanted resets. Instead, the hang detection was left to the heartbeat and its (longer) timeout. This is undesirable with GuC submission as the heartbeat is a full GT reset rather than a per engine reset and so is much more destructive. Instead, just bump the pre-emption timeout
Can we have a feature request to allow asking GuC for an engine reset?
For what purpose?
To allow "stopped heartbeat" to reset the engine, however..
GuC manages the scheduling of contexts across engines. With virtual engines, the KMD has no knowledge of which engine a context might be executing on. Even without virtual engines, the KMD still has no knowledge of which context is currently executing on any given engine at any given time.
There is a reason why hang detection should be left to the entity that is doing the scheduling. Any other entity is second guessing at best.
The reason for keeping the heartbeat around even when GuC submission is enabled is for the case where the KMD/GuC have got out of sync with each other somehow or GuC itself has just crashed. I.e. when no submission at all is working and we need to reset the GuC itself and start over.
.. I wasn't really up to speed to know/remember heartbeats are nerfed already in GuC mode.
Not sure what you mean by that claim. Engine resets are handled by GuC because GuC handles the scheduling. You can't do the former if you aren't doing the latter. However, the heartbeat is still present and is still the watchdog by which engine resets are triggered. As per the rest of the submission process, the hang detection and recovery is split between i915 and GuC.
I meant that "stopped heartbeat on engine XXX" can only do a full GPU reset on GuC.
I mean that there is no 'stopped heartbeat on engine XXX' when i915 is not handling the recovery part of the process.
Hmmmm?

static void
reset_engine(struct intel_engine_cs *engine, struct i915_request *rq)
{
    if (IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM))
        show_heartbeat(rq, engine);

    if (intel_engine_uses_guc(engine))
        /*
         * GuC itself is toast or GuC's hang detection
         * is disabled. Either way, need to find the
         * hang culprit manually.
         */
        intel_guc_find_hung_context(engine);

    intel_gt_handle_error(engine->gt, engine->mask,
                  I915_ERROR_CAPTURE,
                  "stopped heartbeat on %s",
                  engine->name);
}
How is there no "stopped heartbeat" in GuC mode? From this code it certainly looks like there is.

Only when the GuC is toast and it is no longer an engine reset but a full GT reset that is required. So technically, it is not a 'stopped heartbeat on engine XXX', it is 'stopped heartbeat on GT#'.

You say below heartbeats are going in GuC mode. Now I totally don't understand how they are going but there is allegedly no "stopped heartbeat".

Because if GuC is handling the detection and recovery then i915 will not reach that point. GuC will do the engine reset and start scheduling the next context before the heartbeat period expires. So the notification will be a G2H about a specific context being reset rather than the i915 notification about a stopped heartbeat.

    intel_gt_handle_error(engine->gt, engine->mask,
                  I915_ERROR_CAPTURE,
                  "stopped heartbeat on %s",
                  engine->name);

intel_gt_handle_error:

    /*
     * Try engine reset when available. We fall back to full reset if
     * single reset fails.
     */
    if (!intel_uc_uses_guc_submission(&gt->uc) &&
        intel_has_reset_engine(gt) && !intel_gt_is_wedged(gt)) {
        local_bh_disable();
        for_each_engine_masked(engine, gt, engine_mask, tmp) {
You said "However, the heartbeat is still present and is still the watchdog by which engine resets are triggered", now I don't know what you meant by this. It actually triggers a single engine reset in GuC mode? Where in code does that happen if this block above shows it not taking the engine reset path?
i915 sends down the per engine pulse.
GuC schedules the pulse
GuC attempts to pre-empt the currently active context
GuC detects the pre-emption timeout
GuC resets the engine
The fundamental process is exactly the same as in execlist mode. It's just that the above blocks of code (calls to intel_gt_handle_error and such) are now inside the GuC not i915.
Without the heartbeat going ping, there would be no context switching and thus no pre-emption, no pre-emption timeout and so no hang and reset recovery. And GuC cannot send pulses by itself - it has no ability to generate context workloads. So we need i915 to send the pings and to gradually raise their priority. But the back half of the heartbeat code is now inside the GuC. It will simply never reach the i915 side timeout if GuC is doing the recovery (unless the heartbeat's final period is too short). We should only reach the i915 side timeout if GuC itself is toast. At which point we need the full GT reset to recover the GuC.
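The split described above can be sketched as a toy model in C. This is only an illustration of the timing argument, not i915 or GuC code: the enum, the function name and the millisecond parameters are all hypothetical, and the model assumes a hung context is already on the engine when the final, highest-priority pulse is submitted.

```c
#include <stdbool.h>

/* Hypothetical outcome labels for this sketch; not i915 or GuC identifiers. */
enum hang_recovery {
    RECOVERY_GUC_ENGINE_RESET,  /* GuC hit its pre-emption timeout first */
    RECOVERY_KMD_FULL_GT_RESET  /* the i915 heartbeat expired first */
};

/*
 * Toy model: whichever timeout expires first decides who performs the
 * recovery. The timeouts are plain numbers here; the real values come
 * from driver configuration that this sketch does not model.
 */
static enum hang_recovery
model_hang_recovery(bool guc_alive,
                    unsigned int preempt_timeout_ms,
                    unsigned int final_heartbeat_period_ms)
{
    /* A dead GuC never schedules the pulse, so only the heartbeat can fire. */
    if (!guc_alive)
        return RECOVERY_KMD_FULL_GT_RESET;

    /* GuC resets the engine only if it times out before the heartbeat does. */
    if (preempt_timeout_ms < final_heartbeat_period_ms)
        return RECOVERY_GUC_ENGINE_RESET;

    /* The heartbeat's final period was too short: i915 fires first. */
    return RECOVERY_KMD_FULL_GT_RESET;
}
```

With made-up numbers, model_hang_recovery(true, 640, 2500) gives the GuC engine reset, while a final heartbeat period shorter than the pre-emption timeout, or a dead GuC, gives the full GT reset.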
If workload is not preempting and reset does not work, like engine is truly stuck, does the current code hit "stopped heartbeat" or not in GuC mode?

Hang on, where did 'reset does not work' come into this?

If GuC is alive and the hardware is not broken then no, it won't. That's the whole point. GuC does the detection and recovery. The KMD will never reach 'stopped heartbeat'.

If the hardware is broken and the reset does not work then GuC will send a 'failed reset' notification to the KMD. The KMD treats that as a major error and immediately does a full GT reset. So there is still no 'stopped heartbeat'.

If GuC has died (or a KMD bug has caused sufficient confusion to make it think the GuC has died) then yes, you will reach that code. But in that case it is not an engine reset that is required, it is a full GT reset including a reset of the GuC.
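The three cases above can be summarised as a small decision table. The following is a minimal sketch in C with hypothetical names throughout (none of the enums, the struct or the function exist in i915); it only encodes what the KMD would observe and do in each case as described in this thread.

```c
#include <stdbool.h>

/* What the KMD observes; hypothetical labels, not real G2H message names. */
enum kmd_event {
    KMD_SEES_CONTEXT_RESET_G2H,  /* GuC reset the engine and reported the context */
    KMD_SEES_FAILED_RESET_G2H,   /* GuC tried an engine reset and it failed */
    KMD_SEES_STOPPED_HEARTBEAT   /* nothing came from GuC; the heartbeat expired */
};

/* What the KMD then does about resets. */
enum kmd_action {
    KMD_NO_FURTHER_RESET,        /* GuC already recovered the engine */
    KMD_FULL_GT_RESET            /* reset the whole GT, including the GuC */
};

struct kmd_view {
    enum kmd_event event;
    enum kmd_action action;
};

static struct kmd_view classify_hang(bool guc_alive, bool engine_reset_works)
{
    struct kmd_view v;

    if (!guc_alive) {
        /* GuC is toast: only the i915 side heartbeat timeout can fire. */
        v.event = KMD_SEES_STOPPED_HEARTBEAT;
        v.action = KMD_FULL_GT_RESET;
    } else if (!engine_reset_works) {
        /* Broken hardware: treated as a major error by the KMD. */
        v.event = KMD_SEES_FAILED_RESET_G2H;
        v.action = KMD_FULL_GT_RESET;
    } else {
        /* Normal case: GuC detects the hang and recovers by itself. */
        v.event = KMD_SEES_CONTEXT_RESET_G2H;
        v.action = KMD_NO_FURTHER_RESET;
    }
    return v;
}
```

Only the first branch corresponds to the 'stopped heartbeat' path, and in that branch the action is a full GT reset rather than an engine reset.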


Regards,

Tvrtko

Re: [Intel-gfx] [PATCH 0/3] Improve anti-pre-emption w/a for compute workloads
