Re: [Intel-gfx] [PATCH 01/17] drm/i915/execlists: Always clear pending requests on reset

2019-08-01 Thread Chris Wilson
Quoting Andi Shyti (2019-08-01 09:08:11)
> Hi Chris,
> 
> > If we skip the reset as we found the engine inactive at the time of the
> > reset, we still need to clear the residual inflight & pending request
> > bookkeeping to reflect the current state of HW.
> 
> [...]
> 
> > Where not having cleared the pending array on reset, it persists
> > indefinitely.
> 
> Does this fix Bug 44(*)?

It shouldn't, since the stall is not on the engine that is being reset.

> > - /*
> > -  * Catch up with any missed context-switch interrupts.
> > -  *
> > -  * Ideally we would just read the remaining CSB entries now that we
> > -  * know the gpu is idle. However, the CSB registers are sometimes^W
> > -  * often trashed across a GPU reset! Instead we have to rely on
> > -  * guessing the missed context-switch events by looking at what
> > -  * requests were completed.
> > -  */
> > - execlists_cancel_port_requests(execlists);
> > -
> 
> Why did this end up here in the first place?

History, that's where it started, then code got rearranged around it.
-Chris
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

Re: [Intel-gfx] [PATCH 01/17] drm/i915/execlists: Always clear pending requests on reset

2019-08-01 Thread Andi Shyti
Hi Chris,

> If we skip the reset as we found the engine inactive at the time of the
> reset, we still need to clear the residual inflight & pending request
> bookkeeping to reflect the current state of HW.

[...]

> Where not having cleared the pending array on reset, it persists
> indefinitely.

Does this fix Bug 44(*)?

> -
> - /*
> -  * Catch up with any missed context-switch interrupts.
> -  *
> -  * Ideally we would just read the remaining CSB entries now that we
> -  * know the gpu is idle. However, the CSB registers are sometimes^W
> -  * often trashed across a GPU reset! Instead we have to rely on
> -  * guessing the missed context-switch events by looking at what
> -  * requests were completed.
> -  */
> - execlists_cancel_port_requests(execlists);
> -

Why did this end up here in the first place?

Reviewed-by: Andi Shyti 

Andi

(*) https://bugs.freedesktop.org/show_bug.cgi?id=44

>   if (!rq) {
>   ce->ring->head = ce->ring->tail;
>   goto out_replay;
> @@ -2356,6 +2344,7 @@ static void __execlists_reset(struct intel_engine_cs *engine, bool stalled)
>  
>  unwind:
>   /* Push back any incomplete requests for replay after the reset. */
> + execlists_cancel_port_requests(execlists);
>   __unwind_incomplete_requests(engine);
>  }
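
The effect of moving the call under the unwind label can be sketched with a
toy model. This is not the driver code: the toy_* names, the two-slot arrays
and the engine_active flag are all hypothetical stand-ins, and the only
assumption carried over from the patch is that the old flow skipped the
cancel step when the engine was found inactive, while the new flow always
reaches it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PORTS 2

/* Hypothetical stand-in for the inflight/pending bookkeeping. */
struct toy_execlists {
	unsigned int inflight[TOY_PORTS];	/* 0 means the slot is empty */
	unsigned int pending[TOY_PORTS];
};

/* Clear both arrays so they match idle hardware. */
static void toy_cancel_port_requests(struct toy_execlists *el)
{
	memset(el->inflight, 0, sizeof(el->inflight));
	memset(el->pending, 0, sizeof(el->pending));
}

static bool toy_has_pending(const struct toy_execlists *el)
{
	for (size_t i = 0; i < TOY_PORTS; i++)
		if (el->pending[i])
			return true;
	return false;
}

/* Old ordering: an inactive engine jumps past the cancel step. */
static void toy_reset_old(struct toy_execlists *el, bool engine_active)
{
	if (!engine_active)
		goto unwind;

	toy_cancel_port_requests(el);
	/* the reset of the active context would follow here */

unwind:
	/* incomplete requests would be pushed back for replay here */
	return;
}

/* New ordering: the cancel step sits under the label, so it always runs. */
static void toy_reset_new(struct toy_execlists *el, bool engine_active)
{
	if (!engine_active)
		goto unwind;

	/* the reset of the active context would happen here */

unwind:
	toy_cancel_port_requests(el);
	/* incomplete requests would be pushed back for replay here */
}
```

In this model, calling toy_reset_old() with engine_active == false leaves a
stale pending[0] entry behind, which is the state the hangcheck dump reports
as "Pending[0] ... signaled" after the reset; toy_reset_new() clears it on
both paths.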

[Intel-gfx] [PATCH 01/17] drm/i915/execlists: Always clear pending requests on reset

2019-07-30 Thread Chris Wilson
If we skip the reset as we found the engine inactive at the time of the
reset, we still need to clear the residual inflight & pending request
bookkeeping to reflect the current state of HW.

Otherwise, we may end up stuck in a loop like:

<7> [416.490346] hangcheck rcs0
<7> [416.490371] hangcheck  Awake? 1
<7> [416.490376] hangcheck  Hangcheck: 8003 ms ago
<7> [416.490380] hangcheck  Reset count: 0 (global 0)
<7> [416.490383] hangcheck  Requests:
<7> [416.491210] hangcheck  RING_START: 0x0017b000
<7> [416.491983] hangcheck  RING_HEAD:  0x0048
<7> [416.491992] hangcheck  RING_TAIL:  0x0048
<7> [416.492006] hangcheck  RING_CTL:   0x
<7> [416.492037] hangcheck  RING_MODE:  0x0200 [idle]
<7> [416.492044] hangcheck  RING_IMR: 
<7> [416.492809] hangcheck  ACTHD:  0x_9ca00048
<7> [416.492824] hangcheck  BBADDR: 0x_1004
<7> [416.492838] hangcheck  DMA_FADDR: 0x_
<7> [416.492845] hangcheck  IPEIR: 0x
<7> [416.492852] hangcheck  IPEHR: 0x
<7> [416.492863] hangcheck  Execlist status: 0x00018001 , entries 12
<7> [416.492869] hangcheck  Execlist CSB read 1, write 1, tasklet queued? no (enabled)
<7> [416.492938] hangcheck  Pending[0] ring:{start:0017b000, hwsp:fedf9000, seqno:00016fd6}, rq:  20ffa:16fd6!+  prio=-4094 @ 8307ms: signaled
<7> [416.492972] hangcheck  Queue priority hint: -4093
<7> [416.492979] hangcheck  Q  20ffa:16fd8-  prio=-4093 @ 8307ms: [i915]
<7> [416.492985] hangcheck  Q  20ffa:16fda  prio=-4094 @ 8307ms: [i915]
<7> [416.492990] hangcheck  Q  20ffa:16fdc  prio=-4094 @ 8307ms: [i915]
<7> [416.492996] hangcheck  Q  20ffa:16fde  prio=-4094 @ 8307ms: [i915]
<7> [416.493001] hangcheck  Q  20ffa:16fe0  prio=-4094 @ 8307ms: [i915]
<7> [416.493007] hangcheck  Q  20ffa:16fe2  prio=-4094 @ 8307ms: [i915]
<7> [416.493013] hangcheck  Q  20ffa:16fe4  prio=-4094 @ 8307ms: [i915]
<7> [416.493021] hangcheck  ...skipping 21 queued requests...
<7> [416.493027] hangcheck  Q  20ffa:17010  prio=-4094 @ 8307ms: [i915]
<7> [416.493081] hangcheck HWSP:
<7> [416.493089] hangcheck []      
  
<7> [416.493094] hangcheck *
<7> [416.493100] hangcheck [0040] 10008002  1018  1018 
 1001 
<7> [416.493106] hangcheck [0060] 1018  1001  1018 
 1001 
<7> [416.493111] hangcheck *
<7> [416.493117] hangcheck [00a0]      
  0001
<7> [416.493123] hangcheck [00c0]      
  
<7> [416.493127] hangcheck *
<7> [416.493132] hangcheck Idle? no
<6> [416.512124] i915 :00:02.0: GPU HANG: ecode 11:0:0x, hang on rcs0
<6> [416.512205] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
<6> [416.512207] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
<6> [416.512208] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
<6> [416.512210] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
<6> [416.512212] [drm] GPU crash dump saved to /sys/class/drm/card0/error
<5> [416.513602] i915 :00:02.0: Resetting rcs0 for hang on rcs0
<7> [424.489258] hangcheck rcs0
<7> [424.489263] hangcheck  Awake? 1
<7> [424.489267] hangcheck  Hangcheck: 5954 ms ago
<7> [424.489271] hangcheck  Reset count: 1 (global 0)
<7> [424.489274] hangcheck  Requests:
<7> [424.490128] hangcheck  RING_START: 0x
<7> [424.490870] hangcheck  RING_HEAD:  0x
<7> [424.490877] hangcheck  RING_TAIL:  0x
<7> [424.490887] hangcheck  RING_CTL:   0x
<7> [424.490897] hangcheck  RING_MODE:  0x0200 [idle]
<7> [424.490904] hangcheck  RING_IMR: 
<7> [424.490917] hangcheck  ACTHD:  0x_
<7> [424.490930] hangcheck  BBADDR: 0x_
<7> [424.490943] hangcheck  DMA_FADDR: 0x_
<7> [424.490950] hangcheck  IPEIR: 0x
<7> [424.490956] hangcheck  IPEHR: 0x
<7> [424.490968] hangcheck  Execlist status: 0x0001 , entries 12
<7> [424.490972] hangcheck  Execlist CSB read 11, write 11, tasklet queued? no (enabled)
<7> [424.490983] hangcheck  Pending[0] ring:{start:0017b000, hwsp:fedf9000, seqno:00016fd6}, rq:  20ffa:16fd6!+  prio=-4094 @ 16305ms: signaled
<7> [424.490989] hangcheck  Queue priority hint: -4093
<7> [424.490996] hangcheck  Q  20ffa:16fd8-  prio=-4093 @ 16305ms: [i915]
<7> [424.491001] hangcheck  Q  20ffa:16fda  prio=-4094 @ 16305ms: [i915]