Re: [Intel-gfx] [PATCH 0/3] Improve anti-pre-emption w/a for compute workloads

2022-03-02 Thread John Harrison

On 3/2/2022 03:21, Tvrtko Ursulin wrote:

On 28/02/2022 19:17, John Harrison wrote:

On 2/28/2022 07:32, Tvrtko Ursulin wrote:

On 25/02/2022 19:03, John Harrison wrote:

On 2/25/2022 10:29, Tvrtko Ursulin wrote:

On 25/02/2022 18:01, John Harrison wrote:

On 2/25/2022 09:39, Tvrtko Ursulin wrote:

On 25/02/2022 17:11, John Harrison wrote:

On 2/25/2022 08:36, Tvrtko Ursulin wrote:

On 24/02/2022 20:02, John Harrison wrote:

On 2/23/2022 04:00, Tvrtko Ursulin wrote:

On 23/02/2022 02:22, John Harrison wrote:

On 2/22/2022 01:53, Tvrtko Ursulin wrote:

On 18/02/2022 21:33, john.c.harri...@intel.com wrote:

From: John Harrison 

Compute workloads are inherently not pre-emptible on 
current hardware.
Thus the pre-emption timeout was disabled as a workaround 
to prevent
unwanted resets. Instead, the hang detection was left to 
the heartbeat
and its (longer) timeout. This is undesirable with GuC 
submission as
the heartbeat is a full GT reset rather than a per engine 
reset and so
is much more destructive. Instead, just bump the 
pre-emption timeout


Can we have a feature request to allow asking GuC for an 
engine reset?

For what purpose?


To allow "stopped heartbeat" to reset the engine, however..

GuC manages the scheduling of contexts across engines. With 
virtual engines, the KMD has no knowledge of which engine a 
context might be executing on. Even without virtual 
engines, the KMD still has no knowledge of which context is 
currently executing on any given engine at any given time.


There is a reason why hang detection should be left to the 
entity that is doing the scheduling. Any other entity is 
second guessing at best.


The reason for keeping the heartbeat around even when GuC 
submission is enabled is for the case where the KMD/GuC 
have got out of sync with either other somehow or GuC 
itself has just crashed. I.e. when no submission at all is 
working and we need to reset the GuC itself and start over.


.. I wasn't really up to speed to know/remember heartbeats 
are nerfed already in GuC mode.
Not sure what you mean by that claim. Engine resets are 
handled by GuC because GuC handles the scheduling. You can't 
do the former if you aren't doing the latter. However, the 
heartbeat is still present and is still the watchdog by which 
engine resets are triggered. As per the rest of the 
submission process, the hang detection and recovery is split 
between i915 and GuC.


I meant that "stopped heartbeat on engine XXX" can only do a 
full GPU reset on GuC.
I mean that there is no 'stopped heartbeat on engine XXX' when 
i915 is not handling the recovery part of the process.


H?

static void
reset_engine(struct intel_engine_cs *engine, struct i915_request 
*rq)

{
if (IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM))
    show_heartbeat(rq, engine);

if (intel_engine_uses_guc(engine))
    /*
 * GuC itself is toast or GuC's hang detection
 * is disabled. Either way, need to find the
 * hang culprit manually.
 */
    intel_guc_find_hung_context(engine);

intel_gt_handle_error(engine->gt, engine->mask,
  I915_ERROR_CAPTURE,
  "stopped heartbeat on %s",
  engine->name);
}

How there is no "stopped hearbeat" in guc mode? From this code 
it certainly looks there is.
Only when the GuC is toast and it is no longer an engine reset 
but a full GT reset that is required. So technically, it is not a 
'stopped heartbeat on engine XXX' it is 'stopped heartbeat on GT#'.




You say below heartbeats are going in GuC mode. Now I totally 
don't understand how they are going but there is allegedly no 
"stopped hearbeat".
Because if GuC is handling the detection and recovery then i915 
will not reach that point. GuC will do the engine reset and start 
scheduling the next context before the heartbeat period expires. 
So the notification will be a G2H about a specific context being 
reset rather than the i915 notification about a stopped heartbeat.






intel_gt_handle_error(engine->gt, engine->mask,
  I915_ERROR_CAPTURE,
  "stopped heartbeat on %s",
  engine->name);

intel_gt_handle_error:

/*
 * Try engine reset when available. We fall back to full 
reset if

 * single reset fails.
 */
if (!intel_uc_uses_guc_submission(>uc) &&
    intel_has_reset_engine(gt) && !intel_gt_is_wedged(gt)) {
    local_bh_disable();
    for_each_engine_masked(engine, gt, engine_mask, tmp) {

You said "However, the heartbeat is still present and is still 
the watchdog by which engine resets are triggered", now I 
don't know what you meant by this. It actually triggers a 
single engine reset in GuC mode? Where in code does that 
happen if this block above shows it not taking the engine 
reset path?

i915 sends down the per engine pulse.
GuC schedules the pulse
GuC attempts to pre-empt the currently active context
GuC detects the pre-emption timeout
GuC resets the 

Re: [Intel-gfx] [PATCH 0/3] Improve anti-pre-emption w/a for compute workloads

2022-03-02 Thread Tvrtko Ursulin



On 28/02/2022 19:17, John Harrison wrote:

On 2/28/2022 07:32, Tvrtko Ursulin wrote:

On 25/02/2022 19:03, John Harrison wrote:

On 2/25/2022 10:29, Tvrtko Ursulin wrote:

On 25/02/2022 18:01, John Harrison wrote:

On 2/25/2022 09:39, Tvrtko Ursulin wrote:

On 25/02/2022 17:11, John Harrison wrote:

On 2/25/2022 08:36, Tvrtko Ursulin wrote:

On 24/02/2022 20:02, John Harrison wrote:

On 2/23/2022 04:00, Tvrtko Ursulin wrote:

On 23/02/2022 02:22, John Harrison wrote:

On 2/22/2022 01:53, Tvrtko Ursulin wrote:

On 18/02/2022 21:33, john.c.harri...@intel.com wrote:

From: John Harrison 

Compute workloads are inherently not pre-emptible on 
current hardware.
Thus the pre-emption timeout was disabled as a workaround 
to prevent
unwanted resets. Instead, the hang detection was left to 
the heartbeat
and its (longer) timeout. This is undesirable with GuC 
submission as
the heartbeat is a full GT reset rather than a per engine 
reset and so
is much more destructive. Instead, just bump the 
pre-emption timeout


Can we have a feature request to allow asking GuC for an 
engine reset?

For what purpose?


To allow "stopped heartbeat" to reset the engine, however..

GuC manages the scheduling of contexts across engines. With 
virtual engines, the KMD has no knowledge of which engine a 
context might be executing on. Even without virtual engines, 
the KMD still has no knowledge of which context is currently 
executing on any given engine at any given time.


There is a reason why hang detection should be left to the 
entity that is doing the scheduling. Any other entity is 
second guessing at best.


The reason for keeping the heartbeat around even when GuC 
submission is enabled is for the case where the KMD/GuC have 
got out of sync with either other somehow or GuC itself has 
just crashed. I.e. when no submission at all is working and 
we need to reset the GuC itself and start over.


.. I wasn't really up to speed to know/remember heartbeats are 
nerfed already in GuC mode.
Not sure what you mean by that claim. Engine resets are handled 
by GuC because GuC handles the scheduling. You can't do the 
former if you aren't doing the latter. However, the heartbeat 
is still present and is still the watchdog by which engine 
resets are triggered. As per the rest of the submission 
process, the hang detection and recovery is split between i915 
and GuC.


I meant that "stopped heartbeat on engine XXX" can only do a 
full GPU reset on GuC.
I mean that there is no 'stopped heartbeat on engine XXX' when 
i915 is not handling the recovery part of the process.


H?

static void
reset_engine(struct intel_engine_cs *engine, struct i915_request *rq)
{
if (IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM))
    show_heartbeat(rq, engine);

if (intel_engine_uses_guc(engine))
    /*
 * GuC itself is toast or GuC's hang detection
 * is disabled. Either way, need to find the
 * hang culprit manually.
 */
    intel_guc_find_hung_context(engine);

intel_gt_handle_error(engine->gt, engine->mask,
  I915_ERROR_CAPTURE,
  "stopped heartbeat on %s",
  engine->name);
}

How there is no "stopped hearbeat" in guc mode? From this code it 
certainly looks there is.
Only when the GuC is toast and it is no longer an engine reset but 
a full GT reset that is required. So technically, it is not a 
'stopped heartbeat on engine XXX' it is 'stopped heartbeat on GT#'.




You say below heartbeats are going in GuC mode. Now I totally 
don't understand how they are going but there is allegedly no 
"stopped hearbeat".
Because if GuC is handling the detection and recovery then i915 
will not reach that point. GuC will do the engine reset and start 
scheduling the next context before the heartbeat period expires. So 
the notification will be a G2H about a specific context being reset 
rather than the i915 notification about a stopped heartbeat.






intel_gt_handle_error(engine->gt, engine->mask,
  I915_ERROR_CAPTURE,
  "stopped heartbeat on %s",
  engine->name);

intel_gt_handle_error:

/*
 * Try engine reset when available. We fall back to full 
reset if

 * single reset fails.
 */
if (!intel_uc_uses_guc_submission(>uc) &&
    intel_has_reset_engine(gt) && !intel_gt_is_wedged(gt)) {
    local_bh_disable();
    for_each_engine_masked(engine, gt, engine_mask, tmp) {

You said "However, the heartbeat is still present and is still 
the watchdog by which engine resets are triggered", now I don't 
know what you meant by this. It actually triggers a single 
engine reset in GuC mode? Where in code does that happen if this 
block above shows it not taking the engine reset path?

i915 sends down the per engine pulse.
GuC schedules the pulse
GuC attempts to pre-empt the currently active context
GuC detects the pre-emption timeout
GuC resets the engine

The fundamental process is exactly 

Re: [Intel-gfx] [PATCH 0/3] Improve anti-pre-emption w/a for compute workloads

2022-02-28 Thread John Harrison

On 2/28/2022 07:32, Tvrtko Ursulin wrote:

On 25/02/2022 19:03, John Harrison wrote:

On 2/25/2022 10:29, Tvrtko Ursulin wrote:

On 25/02/2022 18:01, John Harrison wrote:

On 2/25/2022 09:39, Tvrtko Ursulin wrote:

On 25/02/2022 17:11, John Harrison wrote:

On 2/25/2022 08:36, Tvrtko Ursulin wrote:

On 24/02/2022 20:02, John Harrison wrote:

On 2/23/2022 04:00, Tvrtko Ursulin wrote:

On 23/02/2022 02:22, John Harrison wrote:

On 2/22/2022 01:53, Tvrtko Ursulin wrote:

On 18/02/2022 21:33, john.c.harri...@intel.com wrote:

From: John Harrison 

Compute workloads are inherently not pre-emptible on 
current hardware.
Thus the pre-emption timeout was disabled as a workaround 
to prevent
unwanted resets. Instead, the hang detection was left to 
the heartbeat
and its (longer) timeout. This is undesirable with GuC 
submission as
the heartbeat is a full GT reset rather than a per engine 
reset and so
is much more destructive. Instead, just bump the 
pre-emption timeout


Can we have a feature request to allow asking GuC for an 
engine reset?

For what purpose?


To allow "stopped heartbeat" to reset the engine, however..

GuC manages the scheduling of contexts across engines. With 
virtual engines, the KMD has no knowledge of which engine a 
context might be executing on. Even without virtual engines, 
the KMD still has no knowledge of which context is currently 
executing on any given engine at any given time.


There is a reason why hang detection should be left to the 
entity that is doing the scheduling. Any other entity is 
second guessing at best.


The reason for keeping the heartbeat around even when GuC 
submission is enabled is for the case where the KMD/GuC have 
got out of sync with either other somehow or GuC itself has 
just crashed. I.e. when no submission at all is working and 
we need to reset the GuC itself and start over.


.. I wasn't really up to speed to know/remember heartbeats are 
nerfed already in GuC mode.
Not sure what you mean by that claim. Engine resets are handled 
by GuC because GuC handles the scheduling. You can't do the 
former if you aren't doing the latter. However, the heartbeat 
is still present and is still the watchdog by which engine 
resets are triggered. As per the rest of the submission 
process, the hang detection and recovery is split between i915 
and GuC.


I meant that "stopped heartbeat on engine XXX" can only do a 
full GPU reset on GuC.
I mean that there is no 'stopped heartbeat on engine XXX' when 
i915 is not handling the recovery part of the process.


H?

static void
reset_engine(struct intel_engine_cs *engine, struct i915_request *rq)
{
if (IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM))
    show_heartbeat(rq, engine);

if (intel_engine_uses_guc(engine))
    /*
 * GuC itself is toast or GuC's hang detection
 * is disabled. Either way, need to find the
 * hang culprit manually.
 */
    intel_guc_find_hung_context(engine);

intel_gt_handle_error(engine->gt, engine->mask,
  I915_ERROR_CAPTURE,
  "stopped heartbeat on %s",
  engine->name);
}

How there is no "stopped hearbeat" in guc mode? From this code it 
certainly looks there is.
Only when the GuC is toast and it is no longer an engine reset but 
a full GT reset that is required. So technically, it is not a 
'stopped heartbeat on engine XXX' it is 'stopped heartbeat on GT#'.




You say below heartbeats are going in GuC mode. Now I totally 
don't understand how they are going but there is allegedly no 
"stopped hearbeat".
Because if GuC is handling the detection and recovery then i915 
will not reach that point. GuC will do the engine reset and start 
scheduling the next context before the heartbeat period expires. So 
the notification will be a G2H about a specific context being reset 
rather than the i915 notification about a stopped heartbeat.






intel_gt_handle_error(engine->gt, engine->mask,
  I915_ERROR_CAPTURE,
  "stopped heartbeat on %s",
  engine->name);

intel_gt_handle_error:

/*
 * Try engine reset when available. We fall back to full 
reset if

 * single reset fails.
 */
if (!intel_uc_uses_guc_submission(>uc) &&
    intel_has_reset_engine(gt) && !intel_gt_is_wedged(gt)) {
    local_bh_disable();
    for_each_engine_masked(engine, gt, engine_mask, tmp) {

You said "However, the heartbeat is still present and is still 
the watchdog by which engine resets are triggered", now I don't 
know what you meant by this. It actually triggers a single 
engine reset in GuC mode? Where in code does that happen if this 
block above shows it not taking the engine reset path?

i915 sends down the per engine pulse.
GuC schedules the pulse
GuC attempts to pre-empt the currently active context
GuC detects the pre-emption timeout
GuC resets the engine

The fundamental process is exactly the same as in execlist mode. 
It's just 

Re: [Intel-gfx] [PATCH 0/3] Improve anti-pre-emption w/a for compute workloads

2022-02-28 Thread Tvrtko Ursulin



On 25/02/2022 19:03, John Harrison wrote:

On 2/25/2022 10:29, Tvrtko Ursulin wrote:

On 25/02/2022 18:01, John Harrison wrote:

On 2/25/2022 09:39, Tvrtko Ursulin wrote:

On 25/02/2022 17:11, John Harrison wrote:

On 2/25/2022 08:36, Tvrtko Ursulin wrote:

On 24/02/2022 20:02, John Harrison wrote:

On 2/23/2022 04:00, Tvrtko Ursulin wrote:

On 23/02/2022 02:22, John Harrison wrote:

On 2/22/2022 01:53, Tvrtko Ursulin wrote:

On 18/02/2022 21:33, john.c.harri...@intel.com wrote:

From: John Harrison 

Compute workloads are inherently not pre-emptible on current 
hardware.
Thus the pre-emption timeout was disabled as a workaround to 
prevent
unwanted resets. Instead, the hang detection was left to the 
heartbeat
and its (longer) timeout. This is undesirable with GuC 
submission as
the heartbeat is a full GT reset rather than a per engine 
reset and so
is much more destructive. Instead, just bump the pre-emption 
timeout


Can we have a feature request to allow asking GuC for an 
engine reset?

For what purpose?


To allow "stopped heartbeat" to reset the engine, however..

GuC manages the scheduling of contexts across engines. With 
virtual engines, the KMD has no knowledge of which engine a 
context might be executing on. Even without virtual engines, 
the KMD still has no knowledge of which context is currently 
executing on any given engine at any given time.


There is a reason why hang detection should be left to the 
entity that is doing the scheduling. Any other entity is second 
guessing at best.


The reason for keeping the heartbeat around even when GuC 
submission is enabled is for the case where the KMD/GuC have 
got out of sync with either other somehow or GuC itself has 
just crashed. I.e. when no submission at all is working and we 
need to reset the GuC itself and start over.


.. I wasn't really up to speed to know/remember heartbeats are 
nerfed already in GuC mode.
Not sure what you mean by that claim. Engine resets are handled 
by GuC because GuC handles the scheduling. You can't do the 
former if you aren't doing the latter. However, the heartbeat is 
still present and is still the watchdog by which engine resets 
are triggered. As per the rest of the submission process, the 
hang detection and recovery is split between i915 and GuC.


I meant that "stopped heartbeat on engine XXX" can only do a full 
GPU reset on GuC.
I mean that there is no 'stopped heartbeat on engine XXX' when i915 
is not handling the recovery part of the process.


H?

static void
reset_engine(struct intel_engine_cs *engine, struct i915_request *rq)
{
if (IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM))
    show_heartbeat(rq, engine);

if (intel_engine_uses_guc(engine))
    /*
 * GuC itself is toast or GuC's hang detection
 * is disabled. Either way, need to find the
 * hang culprit manually.
 */
    intel_guc_find_hung_context(engine);

intel_gt_handle_error(engine->gt, engine->mask,
  I915_ERROR_CAPTURE,
  "stopped heartbeat on %s",
  engine->name);
}

How there is no "stopped hearbeat" in guc mode? From this code it 
certainly looks there is.
Only when the GuC is toast and it is no longer an engine reset but a 
full GT reset that is required. So technically, it is not a 'stopped 
heartbeat on engine XXX' it is 'stopped heartbeat on GT#'.




You say below heartbeats are going in GuC mode. Now I totally don't 
understand how they are going but there is allegedly no "stopped 
hearbeat".
Because if GuC is handling the detection and recovery then i915 will 
not reach that point. GuC will do the engine reset and start 
scheduling the next context before the heartbeat period expires. So 
the notification will be a G2H about a specific context being reset 
rather than the i915 notification about a stopped heartbeat.






intel_gt_handle_error(engine->gt, engine->mask,
  I915_ERROR_CAPTURE,
  "stopped heartbeat on %s",
  engine->name);

intel_gt_handle_error:

/*
 * Try engine reset when available. We fall back to full reset if
 * single reset fails.
 */
if (!intel_uc_uses_guc_submission(>uc) &&
    intel_has_reset_engine(gt) && !intel_gt_is_wedged(gt)) {
    local_bh_disable();
    for_each_engine_masked(engine, gt, engine_mask, tmp) {

You said "However, the heartbeat is still present and is still the 
watchdog by which engine resets are triggered", now I don't know 
what you meant by this. It actually triggers a single engine reset 
in GuC mode? Where in code does that happen if this block above 
shows it not taking the engine reset path?

i915 sends down the per engine pulse.
GuC schedules the pulse
GuC attempts to pre-empt the currently active context
GuC detects the pre-emption timeout
GuC resets the engine

The fundamental process is exactly the same as in execlist mode. 
It's just that the above blocks of code (calls to 

Re: [Intel-gfx] [PATCH 0/3] Improve anti-pre-emption w/a for compute workloads

2022-02-25 Thread John Harrison

On 2/25/2022 10:29, Tvrtko Ursulin wrote:

On 25/02/2022 18:01, John Harrison wrote:

On 2/25/2022 09:39, Tvrtko Ursulin wrote:

On 25/02/2022 17:11, John Harrison wrote:

On 2/25/2022 08:36, Tvrtko Ursulin wrote:

On 24/02/2022 20:02, John Harrison wrote:

On 2/23/2022 04:00, Tvrtko Ursulin wrote:

On 23/02/2022 02:22, John Harrison wrote:

On 2/22/2022 01:53, Tvrtko Ursulin wrote:

On 18/02/2022 21:33, john.c.harri...@intel.com wrote:

From: John Harrison 

Compute workloads are inherently not pre-emptible on current 
hardware.
Thus the pre-emption timeout was disabled as a workaround to 
prevent
unwanted resets. Instead, the hang detection was left to the 
heartbeat
and its (longer) timeout. This is undesirable with GuC 
submission as
the heartbeat is a full GT reset rather than a per engine 
reset and so
is much more destructive. Instead, just bump the pre-emption 
timeout


Can we have a feature request to allow asking GuC for an 
engine reset?

For what purpose?


To allow "stopped heartbeat" to reset the engine, however..

GuC manages the scheduling of contexts across engines. With 
virtual engines, the KMD has no knowledge of which engine a 
context might be executing on. Even without virtual engines, 
the KMD still has no knowledge of which context is currently 
executing on any given engine at any given time.


There is a reason why hang detection should be left to the 
entity that is doing the scheduling. Any other entity is second 
guessing at best.


The reason for keeping the heartbeat around even when GuC 
submission is enabled is for the case where the KMD/GuC have 
got out of sync with either other somehow or GuC itself has 
just crashed. I.e. when no submission at all is working and we 
need to reset the GuC itself and start over.


.. I wasn't really up to speed to know/remember heartbeats are 
nerfed already in GuC mode.
Not sure what you mean by that claim. Engine resets are handled 
by GuC because GuC handles the scheduling. You can't do the 
former if you aren't doing the latter. However, the heartbeat is 
still present and is still the watchdog by which engine resets 
are triggered. As per the rest of the submission process, the 
hang detection and recovery is split between i915 and GuC.


I meant that "stopped heartbeat on engine XXX" can only do a full 
GPU reset on GuC.
I mean that there is no 'stopped heartbeat on engine XXX' when i915 
is not handling the recovery part of the process.


H?

static void
reset_engine(struct intel_engine_cs *engine, struct i915_request *rq)
{
if (IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM))
    show_heartbeat(rq, engine);

if (intel_engine_uses_guc(engine))
    /*
 * GuC itself is toast or GuC's hang detection
 * is disabled. Either way, need to find the
 * hang culprit manually.
 */
    intel_guc_find_hung_context(engine);

intel_gt_handle_error(engine->gt, engine->mask,
  I915_ERROR_CAPTURE,
  "stopped heartbeat on %s",
  engine->name);
}

How there is no "stopped hearbeat" in guc mode? From this code it 
certainly looks there is.
Only when the GuC is toast and it is no longer an engine reset but a 
full GT reset that is required. So technically, it is not a 'stopped 
heartbeat on engine XXX' it is 'stopped heartbeat on GT#'.




You say below heartbeats are going in GuC mode. Now I totally don't 
understand how they are going but there is allegedly no "stopped 
hearbeat".
Because if GuC is handling the detection and recovery then i915 will 
not reach that point. GuC will do the engine reset and start 
scheduling the next context before the heartbeat period expires. So 
the notification will be a G2H about a specific context being reset 
rather than the i915 notification about a stopped heartbeat.






intel_gt_handle_error(engine->gt, engine->mask,
  I915_ERROR_CAPTURE,
  "stopped heartbeat on %s",
  engine->name);

intel_gt_handle_error:

/*
 * Try engine reset when available. We fall back to full reset if
 * single reset fails.
 */
if (!intel_uc_uses_guc_submission(>uc) &&
    intel_has_reset_engine(gt) && !intel_gt_is_wedged(gt)) {
    local_bh_disable();
    for_each_engine_masked(engine, gt, engine_mask, tmp) {

You said "However, the heartbeat is still present and is still the 
watchdog by which engine resets are triggered", now I don't know 
what you meant by this. It actually triggers a single engine reset 
in GuC mode? Where in code does that happen if this block above 
shows it not taking the engine reset path?

i915 sends down the per engine pulse.
GuC schedules the pulse
GuC attempts to pre-empt the currently active context
GuC detects the pre-emption timeout
GuC resets the engine

The fundamental process is exactly the same as in execlist mode. 
It's just that the above blocks of code (calls to 
intel_gt_handle_error and such) are now inside the 

Re: [Intel-gfx] [PATCH 0/3] Improve anti-pre-emption w/a for compute workloads

2022-02-25 Thread Tvrtko Ursulin



On 25/02/2022 18:01, John Harrison wrote:

On 2/25/2022 09:39, Tvrtko Ursulin wrote:

On 25/02/2022 17:11, John Harrison wrote:

On 2/25/2022 08:36, Tvrtko Ursulin wrote:

On 24/02/2022 20:02, John Harrison wrote:

On 2/23/2022 04:00, Tvrtko Ursulin wrote:

On 23/02/2022 02:22, John Harrison wrote:

On 2/22/2022 01:53, Tvrtko Ursulin wrote:

On 18/02/2022 21:33, john.c.harri...@intel.com wrote:

From: John Harrison 

Compute workloads are inherently not pre-emptible on current 
hardware.
Thus the pre-emption timeout was disabled as a workaround to 
prevent
unwanted resets. Instead, the hang detection was left to the 
heartbeat
and its (longer) timeout. This is undesirable with GuC 
submission as
the heartbeat is a full GT reset rather than a per engine reset 
and so
is much more destructive. Instead, just bump the pre-emption 
timeout


Can we have a feature request to allow asking GuC for an engine 
reset?

For what purpose?


To allow "stopped heartbeat" to reset the engine, however..

GuC manages the scheduling of contexts across engines. With 
virtual engines, the KMD has no knowledge of which engine a 
context might be executing on. Even without virtual engines, the 
KMD still has no knowledge of which context is currently 
executing on any given engine at any given time.


There is a reason why hang detection should be left to the entity 
that is doing the scheduling. Any other entity is second guessing 
at best.


The reason for keeping the heartbeat around even when GuC 
submission is enabled is for the case where the KMD/GuC have got 
out of sync with either other somehow or GuC itself has just 
crashed. I.e. when no submission at all is working and we need to 
reset the GuC itself and start over.


.. I wasn't really up to speed to know/remember heartbeats are 
nerfed already in GuC mode.
Not sure what you mean by that claim. Engine resets are handled by 
GuC because GuC handles the scheduling. You can't do the former if 
you aren't doing the latter. However, the heartbeat is still 
present and is still the watchdog by which engine resets are 
triggered. As per the rest of the submission process, the hang 
detection and recovery is split between i915 and GuC.


I meant that "stopped heartbeat on engine XXX" can only do a full 
GPU reset on GuC.
I mean that there is no 'stopped heartbeat on engine XXX' when i915 
is not handling the recovery part of the process.


H?

static void
reset_engine(struct intel_engine_cs *engine, struct i915_request *rq)
{
if (IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM))
    show_heartbeat(rq, engine);

if (intel_engine_uses_guc(engine))
    /*
 * GuC itself is toast or GuC's hang detection
 * is disabled. Either way, need to find the
 * hang culprit manually.
 */
    intel_guc_find_hung_context(engine);

intel_gt_handle_error(engine->gt, engine->mask,
  I915_ERROR_CAPTURE,
  "stopped heartbeat on %s",
  engine->name);
}

How there is no "stopped hearbeat" in guc mode? From this code it 
certainly looks there is.
Only when the GuC is toast and it is no longer an engine reset but a 
full GT reset that is required. So technically, it is not a 'stopped 
heartbeat on engine XXX' it is 'stopped heartbeat on GT#'.




You say below heartbeats are going in GuC mode. Now I totally don't 
understand how they are going but there is allegedly no "stopped 
hearbeat".
Because if GuC is handling the detection and recovery then i915 will not 
reach that point. GuC will do the engine reset and start scheduling the 
next context before the heartbeat period expires. So the notification 
will be a G2H about a specific context being reset rather than the i915 
notification about a stopped heartbeat.






intel_gt_handle_error(engine->gt, engine->mask,
  I915_ERROR_CAPTURE,
  "stopped heartbeat on %s",
  engine->name);

intel_gt_handle_error:

/*
 * Try engine reset when available. We fall back to full reset if
 * single reset fails.
 */
if (!intel_uc_uses_guc_submission(>uc) &&
    intel_has_reset_engine(gt) && !intel_gt_is_wedged(gt)) {
    local_bh_disable();
    for_each_engine_masked(engine, gt, engine_mask, tmp) {

You said "However, the heartbeat is still present and is still the 
watchdog by which engine resets are triggered", now I don't know 
what you meant by this. It actually triggers a single engine reset 
in GuC mode? Where in code does that happen if this block above 
shows it not taking the engine reset path?

i915 sends down the per engine pulse.
GuC schedules the pulse
GuC attempts to pre-empt the currently active context
GuC detects the pre-emption timeout
GuC resets the engine

The fundamental process is exactly the same as in execlist mode. It's 
just that the above blocks of code (calls to intel_gt_handle_error 
and such) are now inside the GuC not i915.


Without the heartbeat 

Re: [Intel-gfx] [PATCH 0/3] Improve anti-pre-emption w/a for compute workloads

2022-02-25 Thread John Harrison

On 2/25/2022 09:39, Tvrtko Ursulin wrote:

On 25/02/2022 17:11, John Harrison wrote:

On 2/25/2022 08:36, Tvrtko Ursulin wrote:

On 24/02/2022 20:02, John Harrison wrote:

On 2/23/2022 04:00, Tvrtko Ursulin wrote:

On 23/02/2022 02:22, John Harrison wrote:

On 2/22/2022 01:53, Tvrtko Ursulin wrote:

On 18/02/2022 21:33, john.c.harri...@intel.com wrote:

From: John Harrison 

Compute workloads are inherently not pre-emptible on current 
hardware.
Thus the pre-emption timeout was disabled as a workaround to 
prevent
unwanted resets. Instead, the hang detection was left to the 
heartbeat
and its (longer) timeout. This is undesirable with GuC 
submission as
the heartbeat is a full GT reset rather than a per engine reset 
and so
is much more destructive. Instead, just bump the pre-emption 
timeout


Can we have a feature request to allow asking GuC for an engine 
reset?

For what purpose?


To allow "stopped heartbeat" to reset the engine, however..

GuC manages the scheduling of contexts across engines. With 
virtual engines, the KMD has no knowledge of which engine a 
context might be executing on. Even without virtual engines, the 
KMD still has no knowledge of which context is currently 
executing on any given engine at any given time.


There is a reason why hang detection should be left to the entity 
that is doing the scheduling. Any other entity is second guessing 
at best.


The reason for keeping the heartbeat around even when GuC 
submission is enabled is for the case where the KMD/GuC have got 
out of sync with either other somehow or GuC itself has just 
crashed. I.e. when no submission at all is working and we need to 
reset the GuC itself and start over.


.. I wasn't really up to speed to know/remember heartbeats are 
nerfed already in GuC mode.
Not sure what you mean by that claim. Engine resets are handled by 
GuC because GuC handles the scheduling. You can't do the former if 
you aren't doing the latter. However, the heartbeat is still 
present and is still the watchdog by which engine resets are 
triggered. As per the rest of the submission process, the hang 
detection and recovery is split between i915 and GuC.


I meant that "stopped heartbeat on engine XXX" can only do a full 
GPU reset on GuC.
I mean that there is no 'stopped heartbeat on engine XXX' when i915 
is not handling the recovery part of the process.


H?

static void
reset_engine(struct intel_engine_cs *engine, struct i915_request *rq)
{
if (IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM))
    show_heartbeat(rq, engine);

if (intel_engine_uses_guc(engine))
    /*
 * GuC itself is toast or GuC's hang detection
 * is disabled. Either way, need to find the
 * hang culprit manually.
 */
    intel_guc_find_hung_context(engine);

intel_gt_handle_error(engine->gt, engine->mask,
  I915_ERROR_CAPTURE,
  "stopped heartbeat on %s",
  engine->name);
}

How there is no "stopped hearbeat" in guc mode? From this code it 
certainly looks there is.
Only when the GuC is toast and it is no longer an engine reset but a 
full GT reset that is required. So technically, it is not a 'stopped 
heartbeat on engine XXX' it is 'stopped heartbeat on GT#'.




You say below heartbeats are going in GuC mode. Now I totally don't 
understand how they are going but there is allegedly no "stopped 
hearbeat".
Because if GuC is handling the detection and recovery then i915 will not 
reach that point. GuC will do the engine reset and start scheduling the 
next context before the heartbeat period expires. So the notification 
will be a G2H about a specific context being reset rather than the i915 
notification about a stopped heartbeat.






intel_gt_handle_error(engine->gt, engine->mask,
  I915_ERROR_CAPTURE,
  "stopped heartbeat on %s",
  engine->name);

intel_gt_handle_error:

/*
 * Try engine reset when available. We fall back to full reset if
 * single reset fails.
 */
if (!intel_uc_uses_guc_submission(>uc) &&
    intel_has_reset_engine(gt) && !intel_gt_is_wedged(gt)) {
    local_bh_disable();
    for_each_engine_masked(engine, gt, engine_mask, tmp) {

You said "However, the heartbeat is still present and is still the 
watchdog by which engine resets are triggered", now I don't know 
what you meant by this. It actually triggers a single engine reset 
in GuC mode? Where in code does that happen if this block above 
shows it not taking the engine reset path?

i915 sends down the per engine pulse.
GuC schedules the pulse
GuC attempts to pre-empt the currently active context
GuC detects the pre-emption timeout
GuC resets the engine

The fundamental process is exactly the same as in execlist mode. It's 
just that the above blocks of code (calls to intel_gt_handle_error 
and such) are now inside the GuC not i915.


Without the heartbeat going ping, there would be no context switching 

Re: [Intel-gfx] [PATCH 0/3] Improve anti-pre-emption w/a for compute workloads

2022-02-25 Thread Tvrtko Ursulin



On 25/02/2022 17:11, John Harrison wrote:

On 2/25/2022 08:36, Tvrtko Ursulin wrote:

On 24/02/2022 20:02, John Harrison wrote:

On 2/23/2022 04:00, Tvrtko Ursulin wrote:

On 23/02/2022 02:22, John Harrison wrote:

On 2/22/2022 01:53, Tvrtko Ursulin wrote:

On 18/02/2022 21:33, john.c.harri...@intel.com wrote:

From: John Harrison 

Compute workloads are inherently not pre-emptible on current 
hardware.

Thus the pre-emption timeout was disabled as a workaround to prevent
unwanted resets. Instead, the hang detection was left to the 
heartbeat

and its (longer) timeout. This is undesirable with GuC submission as
the heartbeat is a full GT reset rather than a per engine reset 
and so

is much more destructive. Instead, just bump the pre-emption timeout


Can we have a feature request to allow asking GuC for an engine 
reset?

For what purpose?


To allow "stopped heartbeat" to reset the engine, however..

GuC manages the scheduling of contexts across engines. With virtual 
engines, the KMD has no knowledge of which engine a context might 
be executing on. Even without virtual engines, the KMD still has no 
knowledge of which context is currently executing on any given 
engine at any given time.


There is a reason why hang detection should be left to the entity 
that is doing the scheduling. Any other entity is second guessing 
at best.


The reason for keeping the heartbeat around even when GuC 
submission is enabled is for the case where the KMD/GuC have got 
out of sync with either other somehow or GuC itself has just 
crashed. I.e. when no submission at all is working and we need to 
reset the GuC itself and start over.


.. I wasn't really up to speed to know/remember heartbeats are 
nerfed already in GuC mode.
Not sure what you mean by that claim. Engine resets are handled by 
GuC because GuC handles the scheduling. You can't do the former if 
you aren't doing the latter. However, the heartbeat is still present 
and is still the watchdog by which engine resets are triggered. As 
per the rest of the submission process, the hang detection and 
recovery is split between i915 and GuC.


I meant that "stopped heartbeat on engine XXX" can only do a full GPU 
reset on GuC.
I mean that there is no 'stopped heartbeat on engine XXX' when i915 is 
not handling the recovery part of the process.


H?

static void
reset_engine(struct intel_engine_cs *engine, struct i915_request *rq)
{
if (IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM))
show_heartbeat(rq, engine);

if (intel_engine_uses_guc(engine))
/*
 * GuC itself is toast or GuC's hang detection
 * is disabled. Either way, need to find the
 * hang culprit manually.
 */
intel_guc_find_hung_context(engine);

intel_gt_handle_error(engine->gt, engine->mask,
  I915_ERROR_CAPTURE,
  "stopped heartbeat on %s",
  engine->name);
}

How there is no "stopped hearbeat" in guc mode? From this code it certainly 
looks there is.

You say below heartbeats are going in GuC mode. Now I totally don't understand how they 
are going but there is allegedly no "stopped hearbeat".



intel_gt_handle_error(engine->gt, engine->mask,
  I915_ERROR_CAPTURE,
  "stopped heartbeat on %s",
  engine->name);

intel_gt_handle_error:

/*
 * Try engine reset when available. We fall back to full reset if
 * single reset fails.
 */
if (!intel_uc_uses_guc_submission(>uc) &&
    intel_has_reset_engine(gt) && !intel_gt_is_wedged(gt)) {
    local_bh_disable();
    for_each_engine_masked(engine, gt, engine_mask, tmp) {

You said "However, the heartbeat is still present and is still the 
watchdog by which engine resets are triggered", now I don't know what 
you meant by this. It actually triggers a single engine reset in GuC 
mode? Where in code does that happen if this block above shows it not 
taking the engine reset path?

i915 sends down the per engine pulse.
GuC schedules the pulse
GuC attempts to pre-empt the currently active context
GuC detects the pre-emption timeout
GuC resets the engine

The fundamental process is exactly the same as in execlist mode. It's 
just that the above blocks of code (calls to intel_gt_handle_error and 
such) are now inside the GuC not i915.


Without the heartbeat going ping, there would be no context switching 
and thus no pre-emption, no pre-emption timeout and so no hang and reset 
recovery. And GuC cannot sent pulses by itself - it has no ability to 
generate context workloads. So we need i915 to send the pings and to 
gradually raise their priority. But the back half of the heartbeat code 
is now inside the GuC. It will simply never reach the i915 side timeout 
if GuC is doing the recovery (unless the heartbeat's final period is too 
short). We should only reach the i915 

Re: [Intel-gfx] [PATCH 0/3] Improve anti-pre-emption w/a for compute workloads

2022-02-25 Thread John Harrison

On 2/25/2022 08:36, Tvrtko Ursulin wrote:

On 24/02/2022 20:02, John Harrison wrote:

On 2/23/2022 04:00, Tvrtko Ursulin wrote:

On 23/02/2022 02:22, John Harrison wrote:

On 2/22/2022 01:53, Tvrtko Ursulin wrote:

On 18/02/2022 21:33, john.c.harri...@intel.com wrote:

From: John Harrison 

Compute workloads are inherently not pre-emptible on current 
hardware.

Thus the pre-emption timeout was disabled as a workaround to prevent
unwanted resets. Instead, the hang detection was left to the 
heartbeat

and its (longer) timeout. This is undesirable with GuC submission as
the heartbeat is a full GT reset rather than a per engine reset 
and so

is much more destructive. Instead, just bump the pre-emption timeout


Can we have a feature request to allow asking GuC for an engine 
reset?

For what purpose?


To allow "stopped heartbeat" to reset the engine, however..

GuC manages the scheduling of contexts across engines. With virtual 
engines, the KMD has no knowledge of which engine a context might 
be executing on. Even without virtual engines, the KMD still has no 
knowledge of which context is currently executing on any given 
engine at any given time.


There is a reason why hang detection should be left to the entity 
that is doing the scheduling. Any other entity is second guessing 
at best.


The reason for keeping the heartbeat around even when GuC 
submission is enabled is for the case where the KMD/GuC have got 
out of sync with either other somehow or GuC itself has just 
crashed. I.e. when no submission at all is working and we need to 
reset the GuC itself and start over.


.. I wasn't really up to speed to know/remember heartbeats are 
nerfed already in GuC mode.
Not sure what you mean by that claim. Engine resets are handled by 
GuC because GuC handles the scheduling. You can't do the former if 
you aren't doing the latter. However, the heartbeat is still present 
and is still the watchdog by which engine resets are triggered. As 
per the rest of the submission process, the hang detection and 
recovery is split between i915 and GuC.


I meant that "stopped heartbeat on engine XXX" can only do a full GPU 
reset on GuC.
I mean that there is no 'stopped heartbeat on engine XXX' when i915 is 
not handling the recovery part of the process.





intel_gt_handle_error(engine->gt, engine->mask,
  I915_ERROR_CAPTURE,
  "stopped heartbeat on %s",
  engine->name);

intel_gt_handle_error:

/*
 * Try engine reset when available. We fall back to full reset if
 * single reset fails.
 */
if (!intel_uc_uses_guc_submission(>uc) &&
    intel_has_reset_engine(gt) && !intel_gt_is_wedged(gt)) {
    local_bh_disable();
    for_each_engine_masked(engine, gt, engine_mask, tmp) {

You said "However, the heartbeat is still present and is still the 
watchdog by which engine resets are triggered", now I don't know what 
you meant by this. It actually triggers a single engine reset in GuC 
mode? Where in code does that happen if this block above shows it not 
taking the engine reset path?

i915 sends down the per engine pulse.
GuC schedules the pulse
GuC attempts to pre-empt the currently active context
GuC detects the pre-emption timeout
GuC resets the engine

The fundamental process is exactly the same as in execlist mode. It's 
just that the above blocks of code (calls to intel_gt_handle_error and 
such) are now inside the GuC not i915.


Without the heartbeat going ping, there would be no context switching 
and thus no pre-emption, no pre-emption timeout and so no hang and reset 
recovery. And GuC cannot sent pulses by itself - it has no ability to 
generate context workloads. So we need i915 to send the pings and to 
gradually raise their priority. But the back half of the heartbeat code 
is now inside the GuC. It will simply never reach the i915 side timeout 
if GuC is doing the recovery (unless the heartbeat's final period is too 
short). We should only reach the i915 side timeout if GuC itself is 
toast. At which point we need the full GT reset to recover the GuC.


John.




I am not sure it was the best way since full reset penalizes 
everyone for one hanging engine. Better question would be why leave 
heartbeats around to start with with GuC? If you want to use it to 
health check GuC, as you say, maybe just send a CT message and 
expect replies? Then full reset would make sense. It also achieves 
the goal of not seconding guessing the submission backend you raise.
Adding yet another hang detection mechanism is yet more complication 
and is unnecessary when we already have one that works well enough. 
As above, the heartbeat is still required for sending the pulses that 
cause pre-emptions and so let GuC detect hangs. It also provides a 
fallback against a dead GuC by default. So why invent yet another wheel?


Lets first clarify the previous block to make sure there aren't any 
misunderstandings there.


Regards,

Tvrtko

Like it 

Re: [Intel-gfx] [PATCH 0/3] Improve anti-pre-emption w/a for compute workloads

2022-02-25 Thread Tvrtko Ursulin



On 24/02/2022 20:02, John Harrison wrote:

On 2/23/2022 04:00, Tvrtko Ursulin wrote:

On 23/02/2022 02:22, John Harrison wrote:

On 2/22/2022 01:53, Tvrtko Ursulin wrote:

On 18/02/2022 21:33, john.c.harri...@intel.com wrote:

From: John Harrison 

Compute workloads are inherently not pre-emptible on current hardware.
Thus the pre-emption timeout was disabled as a workaround to prevent
unwanted resets. Instead, the hang detection was left to the heartbeat
and its (longer) timeout. This is undesirable with GuC submission as
the heartbeat is a full GT reset rather than a per engine reset and so
is much more destructive. Instead, just bump the pre-emption timeout


Can we have a feature request to allow asking GuC for an engine reset?

For what purpose?


To allow "stopped heartbeat" to reset the engine, however..

GuC manages the scheduling of contexts across engines. With virtual 
engines, the KMD has no knowledge of which engine a context might be 
executing on. Even without virtual engines, the KMD still has no 
knowledge of which context is currently executing on any given engine 
at any given time.


There is a reason why hang detection should be left to the entity 
that is doing the scheduling. Any other entity is second guessing at 
best.


The reason for keeping the heartbeat around even when GuC submission 
is enabled is for the case where the KMD/GuC have got out of sync 
with either other somehow or GuC itself has just crashed. I.e. when 
no submission at all is working and we need to reset the GuC itself 
and start over.


.. I wasn't really up to speed to know/remember heartbeats are nerfed 
already in GuC mode.
Not sure what you mean by that claim. Engine resets are handled by GuC 
because GuC handles the scheduling. You can't do the former if you 
aren't doing the latter. However, the heartbeat is still present and is 
still the watchdog by which engine resets are triggered. As per the rest 
of the submission process, the hang detection and recovery is split 
between i915 and GuC.


I meant that "stopped heartbeat on engine XXX" can only do a full GPU reset on 
GuC.

intel_gt_handle_error(engine->gt, engine->mask,
  I915_ERROR_CAPTURE,
  "stopped heartbeat on %s",
  engine->name);

intel_gt_handle_error:

/*
 * Try engine reset when available. We fall back to full reset if
 * single reset fails.
 */
if (!intel_uc_uses_guc_submission(>uc) &&
intel_has_reset_engine(gt) && !intel_gt_is_wedged(gt)) {
local_bh_disable();
for_each_engine_masked(engine, gt, engine_mask, tmp) {

You said "However, the heartbeat is still present and is still the watchdog by which 
engine resets are triggered", now I don't know what you meant by this. It actually 
triggers a single engine reset in GuC mode? Where in code does that happen if this block 
above shows it not taking the engine reset path?

I am not sure it was the best way since full reset penalizes everyone 
for one hanging engine. Better question would be why leave heartbeats 
around to start with with GuC? If you want to use it to health check 
GuC, as you say, maybe just send a CT message and expect replies? Then 
full reset would make sense. It also achieves the goal of not 
seconding guessing the submission backend you raise.
Adding yet another hang detection mechanism is yet more complication and 
is unnecessary when we already have one that works well enough. As 
above, the heartbeat is still required for sending the pulses that cause 
pre-emptions and so let GuC detect hangs. It also provides a fallback 
against a dead GuC by default. So why invent yet another wheel?


Lets first clarify the previous block to make sure there aren't any 
misunderstandings there.

Regards,

Tvrtko

Like it is now, and the need for this series demonstrates it, the 
whole thing has a pretty poor "impedance" match. Not even sure what 
intel_guc_find_hung_context is doing in intel_engine_hearbeat.c - why 
is that not in intel_gt_handle_error at least? Why is hearbeat code 
special and other callers of intel_gt_handle_error don't need it?
There is no guilty context if the reset was triggered via debugfs or 
similar. And as stated ad nauseam, i915 is no longer handling the 
scheduling and so cannot make assumptions about what is or is not 
running on the hardware at any given time. And obviously, if the reset 
initiated by GuC itself then i915 should not be second guessing the 
guilty context as the GuC notification has already told us who was 
responsible.


And to be clear, the 'poor impedance match' is purely because we don't 
have mid-thread pre-emption and so need a stupidly huge timeout on 
compute capable engines. Whereas, we don't want a total heatbeat timeout 
of a minute or more. That is the impedance mis-match. If the 640ms was 
acceptable for RCS then none of this hacky timeout 

Re: [Intel-gfx] [PATCH 0/3] Improve anti-pre-emption w/a for compute workloads

2022-02-24 Thread John Harrison

On 2/23/2022 04:00, Tvrtko Ursulin wrote:

On 23/02/2022 02:22, John Harrison wrote:

On 2/22/2022 01:53, Tvrtko Ursulin wrote:

On 18/02/2022 21:33, john.c.harri...@intel.com wrote:

From: John Harrison 

Compute workloads are inherently not pre-emptible on current hardware.
Thus the pre-emption timeout was disabled as a workaround to prevent
unwanted resets. Instead, the hang detection was left to the heartbeat
and its (longer) timeout. This is undesirable with GuC submission as
the heartbeat is a full GT reset rather than a per engine reset and so
is much more destructive. Instead, just bump the pre-emption timeout


Can we have a feature request to allow asking GuC for an engine reset?

For what purpose?


To allow "stopped heartbeat" to reset the engine, however..

GuC manages the scheduling of contexts across engines. With virtual 
engines, the KMD has no knowledge of which engine a context might be 
executing on. Even without virtual engines, the KMD still has no 
knowledge of which context is currently executing on any given engine 
at any given time.


There is a reason why hang detection should be left to the entity 
that is doing the scheduling. Any other entity is second guessing at 
best.


The reason for keeping the heartbeat around even when GuC submission 
is enabled is for the case where the KMD/GuC have got out of sync 
with either other somehow or GuC itself has just crashed. I.e. when 
no submission at all is working and we need to reset the GuC itself 
and start over.


.. I wasn't really up to speed to know/remember heartbeats are nerfed 
already in GuC mode.
Not sure what you mean by that claim. Engine resets are handled by GuC 
because GuC handles the scheduling. You can't do the former if you 
aren't doing the latter. However, the heartbeat is still present and is 
still the watchdog by which engine resets are triggered. As per the rest 
of the submission process, the hang detection and recovery is split 
between i915 and GuC.





I am not sure it was the best way since full reset penalizes everyone 
for one hanging engine. Better question would be why leave heartbeats 
around to start with with GuC? If you want to use it to health check 
GuC, as you say, maybe just send a CT message and expect replies? Then 
full reset would make sense. It also achieves the goal of not 
seconding guessing the submission backend you raise.
Adding yet another hang detection mechanism is yet more complication and 
is unnecessary when we already have one that works well enough. As 
above, the heartbeat is still required for sending the pulses that cause 
pre-emptions and so let GuC detect hangs. It also provides a fallback 
against a dead GuC by default. So why invent yet another wheel?




Like it is now, and the need for this series demonstrates it, the 
whole thing has a pretty poor "impedance" match. Not even sure what 
intel_guc_find_hung_context is doing in intel_engine_hearbeat.c - why 
is that not in intel_gt_handle_error at least? Why is hearbeat code 
special and other callers of intel_gt_handle_error don't need it?
There is no guilty context if the reset was triggered via debugfs or 
similar. And as stated ad nauseam, i915 is no longer handling the 
scheduling and so cannot make assumptions about what is or is not 
running on the hardware at any given time. And obviously, if the reset 
initiated by GuC itself then i915 should not be second guessing the 
guilty context as the GuC notification has already told us who was 
responsible.


And to be clear, the 'poor impedance match' is purely because we don't 
have mid-thread pre-emption and so need a stupidly huge timeout on 
compute capable engines. Whereas, we don't want a total heatbeat timeout 
of a minute or more. That is the impedance mis-match. If the 640ms was 
acceptable for RCS then none of this hacky timeout algorithm mush would 
be needed.


John.




Regards,

Tvrtko




Re: [Intel-gfx] [PATCH 0/3] Improve anti-pre-emption w/a for compute workloads

2022-02-23 Thread Tvrtko Ursulin




On 23/02/2022 02:22, John Harrison wrote:

On 2/22/2022 01:53, Tvrtko Ursulin wrote:

On 18/02/2022 21:33, john.c.harri...@intel.com wrote:

From: John Harrison 

Compute workloads are inherently not pre-emptible on current hardware.
Thus the pre-emption timeout was disabled as a workaround to prevent
unwanted resets. Instead, the hang detection was left to the heartbeat
and its (longer) timeout. This is undesirable with GuC submission as
the heartbeat is a full GT reset rather than a per engine reset and so
is much more destructive. Instead, just bump the pre-emption timeout


Can we have a feature request to allow asking GuC for an engine reset?

For what purpose?


To allow "stopped heartbeat" to reset the engine, however..

GuC manages the scheduling of contexts across engines. With virtual 
engines, the KMD has no knowledge of which engine a context might be 
executing on. Even without virtual engines, the KMD still has no 
knowledge of which context is currently executing on any given engine at 
any given time.


There is a reason why hang detection should be left to the entity that 
is doing the scheduling. Any other entity is second guessing at best.


The reason for keeping the heartbeat around even when GuC submission is 
enabled is for the case where the KMD/GuC have got out of sync with 
either other somehow or GuC itself has just crashed. I.e. when no 
submission at all is working and we need to reset the GuC itself and 
start over.


.. I wasn't really up to speed to know/remember heartbeats are nerfed 
already in GuC mode.


I am not sure it was the best way since full reset penalizes everyone 
for one hanging engine. Better question would be why leave heartbeats 
around to start with with GuC? If you want to use it to health check 
GuC, as you say, maybe just send a CT message and expect replies? Then 
full reset would make sense. It also achieves the goal of not seconding 
guessing the submission backend you raise.


Like it is now, and the need for this series demonstrates it, the whole 
thing has a pretty poor "impedance" match. Not even sure what 
intel_guc_find_hung_context is doing in intel_engine_hearbeat.c - why is 
that not in intel_gt_handle_error at least? Why is hearbeat code special 
and other callers of intel_gt_handle_error don't need it?


Regards,

Tvrtko


Re: [Intel-gfx] [PATCH 0/3] Improve anti-pre-emption w/a for compute workloads

2022-02-22 Thread John Harrison

On 2/22/2022 01:53, Tvrtko Ursulin wrote:

On 18/02/2022 21:33, john.c.harri...@intel.com wrote:

From: John Harrison 

Compute workloads are inherently not pre-emptible on current hardware.
Thus the pre-emption timeout was disabled as a workaround to prevent
unwanted resets. Instead, the hang detection was left to the heartbeat
and its (longer) timeout. This is undesirable with GuC submission as
the heartbeat is a full GT reset rather than a per engine reset and so
is much more destructive. Instead, just bump the pre-emption timeout


Can we have a feature request to allow asking GuC for an engine reset?

For what purpose?

GuC manages the scheduling of contexts across engines. With virtual 
engines, the KMD has no knowledge of which engine a context might be 
executing on. Even without virtual engines, the KMD still has no 
knowledge of which context is currently executing on any given engine at 
any given time.


There is a reason why hang detection should be left to the entity that 
is doing the scheduling. Any other entity is second guessing at best.


The reason for keeping the heartbeat around even when GuC submission is 
enabled is for the case where the KMD/GuC have got out of sync with 
either other somehow or GuC itself has just crashed. I.e. when no 
submission at all is working and we need to reset the GuC itself and 
start over.


John.




Regards,

Tvrtko


to a big value. Also, update the heartbeat to allow such a long
pre-emption delay in the final heartbeat period.

Signed-off-by: John Harrison 


John Harrison (3):
   drm/i915/guc: Limit scheduling properties to avoid overflow
   drm/i915/gt: Make the heartbeat play nice with long pre-emption
 timeouts
   drm/i915: Improve long running OCL w/a for GuC submission

  drivers/gpu/drm/i915/gt/intel_engine_cs.c | 37 +--
  .../gpu/drm/i915/gt/intel_engine_heartbeat.c  | 16 
  drivers/gpu/drm/i915/gt/sysfs_engines.c   | 14 +++
  drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  9 +
  4 files changed, 73 insertions(+), 3 deletions(-)





Re: [Intel-gfx] [PATCH 0/3] Improve anti-pre-emption w/a for compute workloads

2022-02-22 Thread Tvrtko Ursulin



On 18/02/2022 21:33, john.c.harri...@intel.com wrote:

From: John Harrison 

Compute workloads are inherently not pre-emptible on current hardware.
Thus the pre-emption timeout was disabled as a workaround to prevent
unwanted resets. Instead, the hang detection was left to the heartbeat
and its (longer) timeout. This is undesirable with GuC submission as
the heartbeat is a full GT reset rather than a per engine reset and so
is much more destructive. Instead, just bump the pre-emption timeout


Can we have a feature request to allow asking GuC for an engine reset?

Regards,

Tvrtko


to a big value. Also, update the heartbeat to allow such a long
pre-emption delay in the final heartbeat period.

Signed-off-by: John Harrison 


John Harrison (3):
   drm/i915/guc: Limit scheduling properties to avoid overflow
   drm/i915/gt: Make the heartbeat play nice with long pre-emption
 timeouts
   drm/i915: Improve long running OCL w/a for GuC submission

  drivers/gpu/drm/i915/gt/intel_engine_cs.c | 37 +--
  .../gpu/drm/i915/gt/intel_engine_heartbeat.c  | 16 
  drivers/gpu/drm/i915/gt/sysfs_engines.c   | 14 +++
  drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  9 +
  4 files changed, 73 insertions(+), 3 deletions(-)