Re: [Intel-gfx] [RFCv2 00/12] TDR/watchdog timeout support for gen8 (edit: fixed coverletter)

Tomas Elf Fri, 24 Jul 2015 04:14:44 -0700

On 21/07/2015 15:51, Tomas Elf wrote:

This patch series introduces the following features:


* Feature 1: TDR (Timeout Detection and Recovery) for gen8 execlist mode.

TDR is an umbrella term for anything that goes into detecting and recovering 
from GPU hangs and is a term more widely used outside of the upstream driver.
This feature introduces an extensible framework that currently supports gen8 
but that can be easily extended to support gen7 as well (which is already the 
case in GMIN but unfortunately in a not quite upstreamable form). The code 
contained in this submission represents the essentials of what is currently in 
GMIN merged with what is currently in upstream (as of the time when this work 
commenced a few months back).

This feature adds a new hang recovery path that takes care of engine recovery
only, alongside the legacy full GPU reset path. Aside from adding support for
per-engine recovery, this feature also introduces rules for when to promote a
potential per-engine reset to a legacy, full GPU reset.

The hang checker now integrates with the error handler in a slightly different 
way in that it allows hang recovery on multiple engines at the same time by 
passing an engine flag mask to the error handler where flags representing all 
of the hung engines are set. This allows us to schedule hang recovery once for 
all currently hung engines instead of one hang recovery per detected engine 
hang. Previously, when only full GPU reset was supported, the number of hung
engines did not matter: whether one or four engines were hung at any given
point, the outcome was the same, namely a reset of the whole GPU. Now the
behaviour depends on which engines are hung, since each engine is reset
separately from all the others. We therefore have to think about this in terms
of scheduling cost and recovery latency (see the open question below).
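
To illustrate the engine flag mask idea, here is a minimal sketch. It is not
the code in this series: the engine ids, ENGINE_FLAG(), handle_error() and
hangcheck_pass() are all hypothetical names made up for the example. It only
shows how one hang check pass can collect every hung engine into a mask and
schedule recovery once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical engine ids for this sketch, not the driver's definitions. */
enum engine_id { RCS, VCS, BCS, VECS, NUM_ENGINES };

#define ENGINE_FLAG(id) (1u << (id))

/* Number of times hang recovery has been scheduled. */
static unsigned int recovery_scheduled;

/* Stand-in for the error handler: one call covers every engine in the mask. */
static void handle_error(uint32_t engine_mask)
{
	assert(engine_mask != 0 && engine_mask < ENGINE_FLAG(NUM_ENGINES));
	recovery_scheduled++;
}

/* One hang check pass: collect all hung engines, then schedule recovery once. */
static uint32_t hangcheck_pass(const bool hung[NUM_ENGINES])
{
	uint32_t mask = 0;

	for (int id = 0; id < NUM_ENGINES; id++) {
		if (hung[id])
			mask |= ENGINE_FLAG(id);
	}

	if (mask)
		handle_error(mask);

	return mask;
}
```

With two engines hung in the same pass, the handler is still invoked only
once, with both flags set in the mask.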

OPEN QUESTIONS:

        1. Do we want to investigate the possibility of per-engine hang
        detection? In the current upstream driver there is only one work queue
        that handles the hang checker and everything from initial hang
        detection to final hang recovery runs in this thread. This makes sense
        if you're only supporting one form of hang recovery - using full GPU
        reset and nothing tied to any particular engine. However, as part
        of this patch series we're changing that by introducing per-engine
        hang recovery. It could make sense to introduce multiple work
        queues - one per engine - to run multiple hang checking threads in
        parallel.

        This would potentially allow savings in terms of recovery latency since
        we don't have to scan all engines every time the hang checker is
        scheduled and the error handler does not have to scan all engines every
        time it is scheduled. Instead, we could implement one work queue per
        engine that would invoke the hang checker that only checks _that_
        particular engine and then the error handler is invoked for _that_
        particular engine. If one engine has hung the latency for getting to
        the hang recovery path for that particular engine would be (Time For
        Hang Checking One Engine) + (Time For Error Handling One Engine) rather
        than the time it takes to do hang checking for all engines + the time
        it takes to do error handling for all engines that have been detected
        as hung (which in the worst case would be all engines). There would
        potentially be as many hang checker and error handling threads going on
        concurrently as there are engines in the hardware but they would all be
        running in parallel without any significant locking. The first time
        where any thread needs exclusive access to the driver is at the point
        of the actual hang recovery but the time it takes to get there would
        theoretically be lower and the time it actually takes to do per-engine
        hang recovery is quite a lot lower than the time it takes to actually
        detect a hang reliably.

        How much we would save by such a change still needs to be analysed and
        compared against the current single-thread model but it makes sense
        from a theoretical design point of view.
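
The latency argument above can be written down as a small model. This is only
a sketch under simplifying assumptions: every engine is assumed to cost the
same to check and to handle, and the function names and any numbers fed into
them are placeholders, not measurements.

```c
#include <assert.h>

/*
 * Worst-case time to reach recovery with a single hang checking thread:
 * every engine is checked, then every hung engine is handled.
 * Assumes, for simplicity, that each engine costs the same.
 */
static unsigned int single_thread_latency_us(unsigned int num_engines,
					     unsigned int num_hung,
					     unsigned int check_us,
					     unsigned int handle_us)
{
	assert(num_hung <= num_engines);
	return num_engines * check_us + num_hung * handle_us;
}

/* With one work queue per engine only the hung engine itself is visited. */
static unsigned int per_engine_latency_us(unsigned int check_us,
					  unsigned int handle_us)
{
	return check_us + handle_us;
}
```

In this model the per-engine figure does not grow with the number of engines,
while the single-thread figure does. Whether that difference matters in
practice is exactly what still needs to be analysed.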

* Feature 2: Watchdog Timeout (a.k.a. "media engine reset") for gen8.

This feature allows userland applications to control whether or not individual 
batch buffers should have a first-level, fine-grained, hardware-based hang 
detection mechanism on top of the ordinary, software-based periodic hang 
checker that is already in the driver. The advantage over relying solely on the 
current software-based hang checker is that the watchdog timeout mechanism is 
about 1000x quicker and more precise. Since it is not a full driver-level hang
detection mechanism but only targets one individual batch buffer at a time,
it can afford to be that quick without risking an increase in false positive
hang detection.

This feature includes the following changes:

a) Watchdog timeout interrupt service routine for handling watchdog interrupts 
and connecting these to per-engine hang recovery.

b) Injection of watchdog timer enablement/cancellation instructions 
before/after the batch buffer start instruction in the ring buffer so that 
watchdog timeout is connected to the submission of an individual batch buffer.

c) Extension of the DRM batch buffer interface, exposing the watchdog timeout
feature to userland. Two open source groups in VPG are currently in the
process of integrating support for this feature, which should make it
possible in principle to upstream this extension.
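
Item b) can be pictured with the following sketch. The opcodes, the ring
layout and the emit_batch() helper are hypothetical and chosen for
illustration only; the real gen8 command encodings are not reproduced here.
The point is just that the watchdog start and cancel commands bracket the
batch buffer start, so the timer covers that one batch buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes; the real command encodings are not shown here. */
enum cmd {
	CMD_WATCHDOG_START = 1,
	CMD_BATCH_BUFFER_START,
	CMD_WATCHDOG_CANCEL,
};

#define RING_DWORDS 16

struct ring {
	uint32_t buf[RING_DWORDS];
	size_t tail;
};

static void ring_emit(struct ring *ring, uint32_t dword)
{
	assert(ring->tail < RING_DWORDS);
	ring->buf[ring->tail++] = dword;
}

/*
 * Emit one batch buffer submission. When the caller asked for watchdog
 * protection, bracket the batch buffer start with start/cancel commands
 * so that the timer only covers this one batch buffer.
 * Returns the number of dwords written.
 */
static size_t emit_batch(struct ring *ring, uint32_t batch_addr,
			 bool use_watchdog)
{
	size_t start = ring->tail;

	if (use_watchdog)
		ring_emit(ring, CMD_WATCHDOG_START);

	ring_emit(ring, CMD_BATCH_BUFFER_START);
	ring_emit(ring, batch_addr);

	if (use_watchdog)
		ring_emit(ring, CMD_WATCHDOG_CANCEL);

	return ring->tail - start;
}
```

A submission without the watchdog flag emits only the batch buffer start and
its address, so batch buffers that do not opt in are unaffected.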

There is currently full watchdog timeout support for gen7 in GMIN and it is 
quite similar to the gen8 implementation so there is nothing obvious that 
prevents us from upstreaming that code along with the gen8 code. However, 
watchdog timeout is fully dependent on the per-engine hang recovery path and 
that is not part of this code submission for gen7. Therefore watchdog timeout 
support for gen7 has been excluded until per-engine hang recovery support for
gen7 has landed upstream.

As part of this submission we've had to reinstate the work queue that was 
previously in place between the error handler and the hang recovery path. The 
reason for this is that the per-engine recovery path is called directly from 
the interrupt handler in the case of watchdog timeout. In that situation 
there's no way of grabbing the struct_mutex, which is a requirement for the 
hang recovery path. Therefore, by reinstating the work queue we provide a 
unified execution context for the hang recovery code that allows the hang 
recovery code to grab whatever locks it needs without sacrificing interrupt 
latency too much or sleeping indefinitely in hard interrupt context.
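
The split between interrupt context and the work queue can be modelled as
follows. This is a single-threaded sketch with made-up names (watchdog_irq(),
recovery_work() and the flags are assumptions, not driver code). The booleans
merely stand in for "we are in hard interrupt context" and "struct_mutex is
held" to make the rule visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool in_hard_irq;           /* models "running in interrupt context" */
static bool struct_mutex_held;     /* models the driver-wide mutex */
static uint32_t pending_engines;   /* engines waiting for recovery */
static uint32_t recovered_engines; /* engines that have been recovered */

/* A sleeping lock may only be taken outside of interrupt context. */
static void lock_struct_mutex(void)
{
	assert(!in_hard_irq);
	assert(!struct_mutex_held);
	struct_mutex_held = true;
}

static void unlock_struct_mutex(void)
{
	struct_mutex_held = false;
}

/* Interrupt handler: only records the request, never takes the mutex. */
static void watchdog_irq(uint32_t engine_flag)
{
	in_hard_irq = true;
	pending_engines |= engine_flag;
	in_hard_irq = false;
}

/* Work item, run later in process context where locking is allowed. */
static void recovery_work(void)
{
	uint32_t engines = pending_engines;

	pending_engines = 0;

	lock_struct_mutex();
	recovered_engines |= engines;
	unlock_struct_mutex();
}
```

The interrupt handler returns quickly after noting which engine timed out.
The actual recovery, including all locking, happens when the work item runs.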

* Feature 3: Context Submission Status Consistency checking

Something that becomes apparent when you run long-duration operations tests
with concurrent rendering processes and intermittently injected hangs is that
the GPU seems to forget to send context completion interrupts to the
driver under some circumstances. This means that the driver sometimes
gets stuck on a context that never seems to finish, while the hardware
has in fact completed it and is waiting for more work.

The problem with this is that the per-engine hang recovery path relies on 
context resubmission to kick off the hardware again following an engine reset.
This can only be done safely if the hardware and driver share the same opinion 
about the current state. Therefore we've extended the periodic hang checker to 
check for context submission state inconsistencies aside from the hang checking 
it already does.

If such a state is detected it is assumed (based on experience) that a context 
completion interrupt has been lost somehow. If this state persists for some 
time an attempt to correct it is made by faking the presumably lost context 
completion interrupt by manually calling the execlist interrupt handler, which 
is normally called from the main interrupt handler cued by a received context 
event interrupt. Just because an interrupt goes missing does not mean that the 
context status buffer (CSB) does not get appropriately updated by the hardware, 
which means that we can expect to find all the recent changes to the context 
states for each engine captured there.

If there are outstanding context status changes stored there, the faked
context event interrupt will allow the interrupt handler to act on them. In the
case of lost context completion interrupts this will prompt the driver to
remove the already completed context from the execlist queue and move on to the
next pending piece of work, thereby eliminating the inconsistency.
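
The mechanism described above can be sketched as follows. The structure
layout, the CSB size and the function names are assumptions made for the
example and do not mirror the driver. "Hardware" always writes its completion
event to the CSB, but the interrupt that normally triggers the handler may be
lost. Calling the handler by hand then picks up the outstanding event.

```c
#include <assert.h>
#include <stdbool.h>

#define CSB_ENTRIES 6 /* hypothetical buffer size for this sketch */

enum csb_event { CSB_NONE, CSB_CONTEXT_COMPLETE };

struct engine {
	enum csb_event csb[CSB_ENTRIES]; /* written by the "hardware" */
	unsigned int csb_write;          /* next slot the hardware writes */
	unsigned int csb_read;           /* next slot the driver reads */
	unsigned int queued_contexts;    /* contexts the driver thinks are in flight */
};

/*
 * Stand-in for the execlist interrupt handler: drain all unread CSB
 * entries and retire completed contexts. Returns the number of events
 * acted on.
 */
static unsigned int handle_context_events(struct engine *e)
{
	unsigned int handled = 0;

	assert(e->csb_read < CSB_ENTRIES && e->csb_write < CSB_ENTRIES);

	while (e->csb_read != e->csb_write) {
		if (e->csb[e->csb_read] == CSB_CONTEXT_COMPLETE &&
		    e->queued_contexts > 0)
			e->queued_contexts--;
		e->csb_read = (e->csb_read + 1) % CSB_ENTRIES;
		handled++;
	}

	return handled;
}

/* The hardware completes a context; irq_delivered=false models a lost interrupt. */
static void hw_complete_context(struct engine *e, bool irq_delivered)
{
	e->csb[e->csb_write] = CSB_CONTEXT_COMPLETE;
	e->csb_write = (e->csb_write + 1) % CSB_ENTRIES;

	if (irq_delivered)
		handle_context_events(e);
}
```

After a lost interrupt the driver still believes one context is in flight.
A direct call to handle_context_events() plays the role of the faked
interrupt and brings the driver's view back in line with the CSB.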

* Feature 4: Debugfs extensions for per-engine hang recovery and TDR/watchdog 
trace points.

GITHUB REPOSITORY:
https://github.com/telf/TDR_watchdog_RFC_1.git

RFC VERSION 1 BRANCH:
20150608_TDR_upstream_adaptation_single-thread_hangchecking_RFC_delivery_sendmail_1

RFC VERSION 2 BRANCH:
20150604_TDR_upstream_adaptation_single-thread_hangchecking_RFCv2_delivery_sendmail_1

CHANGES IN VERSION 2
--------------------
Version 2 of this RFC series addresses design concerns that Chris Wilson,
Daniel Vetter et al. had with the first version of this RFC series. Below is a
summary of all the changes made between versions:

* [1/12] drm/i915: Early exit from semaphore_waits_for for execlist mode:
        Turned the execlist mode check into a ringbuffer NULL check to make it
        more submission mode agnostic and less of a layering violation.

* [2/12] drm/i915: Make i915_gem_reset_ring_status() public
        Replaces the old patch in RFCv1: "drm/i915: Add reset stats entry point
        for per-engine reset"

* [3/12] drm/i915: Adding TDR / per-engine reset support for gen8:

        1. Simply use the previously private function
        i915_gem_reset_ring_status() from the engine hang recovery path to set
        active/pending context status. This replicates the same behaviour as in
        full GPU reset but for a single, targeted engine.

        2. Removed all additional uevents for both full GPU reset and per-engine
        reset. Adapted uevent behaviour to the new per-engine hang recovery
        mode in that it will only send one uevent regardless of which form of
        recovery is employed. If a per-engine reset is attempted first then
        one uevent will be dispatched. If that recovery mode fails and the
        hang is promoted to a full GPU reset, no further uevents will be
        dispatched at that point.

        3. Removed the 2*HUNG hang threshold from i915_hangcheck_elapsed in
        order to not make the hang detection algorithm too complicated. This
        threshold was introduced to compensate for the possibility that hang
        recovery might be delayed due to inconsistent context submission status
        that would prevent per-engine hang recovery from happening. In a later
        patch we introduce faked context event interrupts and inconsistency
        rectification at the onset of per-engine hang recovery instead of
        relying on the hang checker to do this for us. Therefore, since we do
        not delay and defer to future hang detections, we never allow hangs to
        go unaddressed beyond the HUNG threshold and there is no need
        for any further thresholds.

        4. Tidied up the TDR context resubmission path in intel_lrc.c. Reduced
        the amount of duplication by relying entirely on the normal unqueue
        function. Added a new parameter to the unqueue function that takes
        into consideration if the unqueue call is for a first-time context
        submission or a resubmission and adapts the handling of elsp_submitted
        accordingly. The reason for this is that for context resubmission we
        don't expect any further interrupts for the submission or the following
        context completion. A more elegant way of handling this would be to
        phase out elsp_submitted altogether, however that's part of a
        LRC/execlist cleanup effort that is happening independently of this
        RFC. For now we make this change as simple as possible with as few
        non-TDR-related side-effects as possible.

* [4/12] drm/i915: Extending i915_gem_check_wedge to check engine reset in 
progress:
        Removed unwarranted changes made to i915_gem_ring_throttle()

* [7/12] drm/i915: Fake lost context interrupts through forced CSB check
        Removed the context submission status consistency pre-check from
        i915_hangcheck_elapsed() and turned it into a pre-check to per-engine
        reset.

        The following describes the change in philosophy in how context
        submission state inconsistencies are detected:

        Previously we would let the periodic hang checker ensure that there
        were no context submission status inconsistencies on any engine, at any
        point. If an inconsistency was detected in the per-engine hang recovery
        path we would back off and defer to the next hang check since
        per-engine hang recovery is not effective during inconsistent context
        submission states.

        What we do in this new version is to move the consistency pre-check
        from the hang checker to the earliest point in the per-engine hang
        recovery path. If we detect an inconsistency at that point we fake a
        potentially lost context event interrupt by forcing a CSB check. If
        there are outstanding events in the CSB these will be acted upon and
        hopefully that will bring the driver up to speed with the hardware. If
        the CSB check did not amount to anything it is concluded that the
        inconsistency is unresolvable and the per-engine hang recovery fails
        and promotes to full GPU reset instead.

        In the hang checker-based consistency checking we would check for the
        inconsistency a number of times to make sure the detected state was
        stable before attempting to rectify the situation. This is possible
        since hang checking is a recurring event. Having moved the consistency
        checking to the recovery path instead (i.e. a one-time, fire &
        forget-style event) it is assumed that the hang detection that brought
        on the hang recovery has detected a stable hang and therefore, if an
        inconsistency is detected at that point, the inconsistency must be
        stable and not the result of a momentary context state transition.
        Therefore, unlike in the hang checker case, at the very first
        indication of an inconsistent context submission status the interrupt
        is faked speculatively. If outstanding CSB events are found it is
        determined that the hang was in fact just a context submission status
        inconsistency and no hang recovery is done. If the inconsistency cannot
        be resolved the per-engine hang recovery is failed and the hang is
        promoted to full GPU reset instead.
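
The decision flow of the new pre-check can be summarised in a short sketch.
The type and function names are hypothetical and the engine state is reduced
to three fields. It only mirrors the outcomes described above, not the actual
implementation.

```c
#include <assert.h>
#include <stdbool.h>

enum recovery_result {
	RECOVERY_NOT_NEEDED,       /* it was only a lost interrupt */
	ENGINE_RESET_DONE,         /* per-engine reset succeeded */
	PROMOTE_TO_FULL_GPU_RESET, /* per-engine recovery failed */
};

/* Reduced, hypothetical view of one engine at the start of recovery. */
struct engine_state {
	bool consistent;              /* driver and hardware agree on context state */
	unsigned int csb_outstanding; /* unprocessed CSB events */
	bool engine_reset_ok;         /* whether the engine reset would succeed */
};

static enum recovery_result engine_recovery(struct engine_state *s)
{
	assert(s);

	if (!s->consistent) {
		/* Fake the possibly lost interrupt by forcing a CSB check. */
		if (s->csb_outstanding == 0)
			return PROMOTE_TO_FULL_GPU_RESET; /* unresolvable */

		/* Outstanding events were acted upon; no reset required. */
		s->csb_outstanding = 0;
		s->consistent = true;
		return RECOVERY_NOT_NEEDED;
	}

	return s->engine_reset_ok ? ENGINE_RESET_DONE
				  : PROMOTE_TO_FULL_GPU_RESET;
}
```

The pre-check runs exactly once per recovery attempt, which is why an
inconsistency with no outstanding CSB events goes straight to full GPU reset
rather than being retried.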

* [8/12] drm/i915: Debugfs interface for per-engine hang recovery
        1. After review comments by Chris Wilson we're dropping the dual-mode
        parameter value interpretation in i915_wedged_set(). In this version we
        only accept engine id flag masks that contain the engine id flags of all
        currently hung engines. Full GPU reset is most easily requested by
        passing an all zero engine id flag mask.

        2. Moved TDR-specific engine metrics like number of detected engine
        hangs and number of per-engine resets into i915_hangcheck_info() from
        i915_hangcheck_read().
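
The single-mode interpretation from item 1 amounts to the following. This is
a sketch only: wedged_request(), the engine count and the rejection of
unknown bits are assumptions for the example and are not taken from
i915_wedged_set() itself.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_ENGINES 4 /* hypothetical engine count for this sketch */
#define ALL_ENGINES ((1u << NUM_ENGINES) - 1)

enum recovery_request { REQUEST_FULL_GPU_RESET, REQUEST_ENGINE_RESET };

/*
 * Interpret the value written to the debugfs entry: an all zero mask
 * requests a full GPU reset, anything else names the hung engines.
 * Refusing bits outside the known engines is an assumption of this sketch.
 */
static enum recovery_request wedged_request(uint64_t engine_mask)
{
	assert((engine_mask & ~(uint64_t)ALL_ENGINES) == 0);

	return engine_mask ? REQUEST_ENGINE_RESET : REQUEST_FULL_GPU_RESET;
}
```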

* [9/12] drm/i915: TDR/watchdog trace points
        As a consequence of the faking context event interrupt commit being
        moved from the hang checker to the per-engine recovery path we no longer
        check context submission status from the hang checker. Therefore there
        is no need to provide submission status of the currently running context
        to the trace_i915_tdr_hang_check() event.

* [10/12] drm/i915: Fix __i915_wait_request() behaviour during hang detection
* [11/12] drm/i915: Port of Added scheduler support to __wait_request() calls
        NEW: Added to address the way that __i915_wait_request()
        behaves in the face of hang detections and hang recovery.

* [12/12] drm/i915: Extended error state with TDR count, watchdog count and 
engine reset count
        NEW: Adds per-engine TDR statistics to captured error state.


Tomas Elf (12):
   drm/i915: Early exit from semaphore_waits_for for execlist mode.
   drm/i915: Make i915_gem_reset_ring_status() public
   drm/i915: Adding TDR / per-engine reset support for gen8.
   drm/i915: Extending i915_gem_check_wedge to check engine reset in
     progress
   drm/i915: Reinstate hang recovery work queue.
   drm/i915: Watchdog timeout support for gen8.
   drm/i915: Fake lost context interrupts through forced CSB check.
   drm/i915: Debugfs interface for per-engine hang recovery.
   drm/i915: TDR/watchdog trace points.
   drm/i915: Port of Added scheduler support to __wait_request() calls
   drm/i915: Fix __i915_wait_request() behaviour during hang detection.
   drm/i915: Extended error state with TDR count, watchdog count and
     engine reset count

  drivers/gpu/drm/i915/i915_debugfs.c     |  76 +++-
  drivers/gpu/drm/i915/i915_dma.c         |  78 ++++
  drivers/gpu/drm/i915/i915_drv.c         | 257 +++++++++++
  drivers/gpu/drm/i915/i915_drv.h         |  79 +++-
  drivers/gpu/drm/i915/i915_gem.c         | 128 ++++--
  drivers/gpu/drm/i915/i915_gpu_error.c   |   8 +-
  drivers/gpu/drm/i915/i915_irq.c         | 292 +++++++++++--
  drivers/gpu/drm/i915/i915_params.c      |  10 +
  drivers/gpu/drm/i915/i915_reg.h         |  13 +
  drivers/gpu/drm/i915/i915_trace.h       | 308 +++++++++++++-
  drivers/gpu/drm/i915/intel_display.c    |   2 +-
  drivers/gpu/drm/i915/intel_lrc.c        | 729 ++++++++++++++++++++++++++++++--
  drivers/gpu/drm/i915/intel_lrc.h        |  16 +-
  drivers/gpu/drm/i915/intel_lrc_tdr.h    |  39 ++
  drivers/gpu/drm/i915/intel_ringbuffer.c |  87 +++-
  drivers/gpu/drm/i915/intel_ringbuffer.h |  95 +++++
  drivers/gpu/drm/i915/intel_uncore.c     | 203 +++++++++
  include/uapi/drm/i915_drm.h             |   5 +-
  18 files changed, 2313 insertions(+), 112 deletions(-)
  create mode 100644 drivers/gpu/drm/i915/intel_lrc_tdr.h


Summarising follow-up on IRC:

Chris Wilson is ok with RFCv2 from a design point-of-view. We're moving on to
the detailed code review, which will commence once I get back from my holiday
in three weeks and once I've prepared a properly split-up patch series that is
suitable for code reviewing.


Thanks,
Tomas

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx
