Re: [Intel-gfx] [PATCH] drm/i915/guc: Don't read SOFT_SCRATCH(15) on MMIO error

2018-07-12 Thread Michal Wajdeczko
On Thu, 12 Jul 2018 17:31:14 +0200, Chris Wilson  
 wrote:



Quoting Chris Wilson (2018-05-29 15:54:12)

Quoting Michal Wajdeczko (2018-05-28 18:16:18)
> SOFT_SCRATCH(15) is used by GuC for sending MMIO GuC events to host  
and
> those events are now handled by  
intel_guc_to_host_event_handler_mmio().

>
> We should not try to read it on MMIO action error as 1) we may be  
using

> different set of registers for GuC MMIO communication, and 2) GuC may
> use CTB mechanism for sending events to host.

Ok.

> While here, upgrade error message to DRM_ERROR.

Does the error help? What do you want to convey to the user? For error
handling, we want to propagate the result back anyway for the caller has
to decide what to do next.


Good news! We see the error in BAT,

[  542.138479] i915: unknown parameter 'enable_guc_loading' ignored
[  542.138483] i915: unknown parameter 'enable_guc_submission' ignored
[  542.138485] Setting dangerous option enable_guc - tainting kernel
[  542.138488] Setting dangerous option live_selftests - tainting kernel
[  542.173291] [drm:intel_guc_send_mmio [i915]] *ERROR* MMIO: GuC action  
0x10 failed with error -5 0xf000f000

[  542.367055] i915: probe of :00:02.0 failed with error -25

And Michał reminded me this wasn't the first time...

commit feb06c151fade9ecaa3dd410d792cce26e8b10de
Author: Michał Winiarski 
Date:   Mon Mar 19 10:53:47 2018 +0100

drm/i915/guc: Demote GuC error messages
   We're using those functions in selftests, and the callers are expected
to do the error handling anyways. Let's demote all GuC actions and
doorbell creation to DEBUG_DRIVER.

So do we kindly ask Michał to resubmit his fix?


There are more places where DRM_ERROR is used after detection GuC error
(see intel_guc_send_ct as example, more to show up shortly)

I would rather prefer to add GUC_ERROR macro that could be tweaked
under SELFTEST config and runtime flags to demote unwanted errors:

#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
#define GUC_ERROR(...) \
do { \
if (unlikely(i915_selftest.mock || i915_selftest.live)) \
DRM_DEBUG_DRIVER(__VA_ARGS__); \
else \
DRM_ERROR(__VA_ARGS__);
} while (0);
#else
#define GUC_ERROR(...) DRM_ERROR(__VA_ARGS__)
#endif

/Michal
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915/guc: Don't read SOFT_SCRATCH(15) on MMIO error

2018-07-12 Thread Chris Wilson
Quoting Chris Wilson (2018-05-29 15:54:12)
> Quoting Michal Wajdeczko (2018-05-28 18:16:18)
> > SOFT_SCRATCH(15) is used by GuC for sending MMIO GuC events to host and
> > those events are now handled by intel_guc_to_host_event_handler_mmio().
> > 
> > We should not try to read it on MMIO action error as 1) we may be using
> > different set of registers for GuC MMIO communication, and 2) GuC may
> > use CTB mechanism for sending events to host.
> 
> Ok.
>  
> > While here, upgrade error message to DRM_ERROR.
> 
> Does the error help? What do you want to convey to the user? For error
> handling, we want to propagate the result back anyway for the caller has
> to decide what to do next.

Good news! We see the error in BAT,

[  542.138479] i915: unknown parameter 'enable_guc_loading' ignored
[  542.138483] i915: unknown parameter 'enable_guc_submission' ignored
[  542.138485] Setting dangerous option enable_guc - tainting kernel
[  542.138488] Setting dangerous option live_selftests - tainting kernel
[  542.173291] [drm:intel_guc_send_mmio [i915]] *ERROR* MMIO: GuC action 0x10 
failed with error -5 0xf000f000
[  542.367055] i915: probe of :00:02.0 failed with error -25

And Michał reminded me this wasn't the first time...

commit feb06c151fade9ecaa3dd410d792cce26e8b10de
Author: Michał Winiarski 
Date:   Mon Mar 19 10:53:47 2018 +0100

drm/i915/guc: Demote GuC error messages

We're using those functions in selftests, and the callers are expected
to do the error handling anyways. Let's demote all GuC actions and
doorbell creation to DEBUG_DRIVER.

So do we kindly ask Michał to resubmit his fix?
-Chris
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915/guc: Don't read SOFT_SCRATCH(15) on MMIO error

2018-05-31 Thread Chris Wilson
Quoting Chris Wilson (2018-05-31 19:13:32)
> Quoting Michal Wajdeczko (2018-05-28 18:16:18)
> > SOFT_SCRATCH(15) is used by GuC for sending MMIO GuC events to host and
> > those events are now handled by intel_guc_to_host_event_handler_mmio().
> > 
> > We should not try to read it on MMIO action error as 1) we may be using
> > different set of registers for GuC MMIO communication, and 2) GuC may
> > use CTB mechanism for sending events to host.
> > 
> > While here, upgrade error message to DRM_ERROR.
> > 
> > Signed-off-by: Michal Wajdeczko 
> > Cc: Michel Thierry 
> > Cc: Joonas Lahtinen 
> > Cc: Chris Wilson 
> 
> I'm still not totally sold on having the DRM_ERROR here improves
> debugging; it doesn't do anything to improve error handling, but
> 
> Reviewed-by: Chris Wilson 
> 
> nevertheless.

And pushed. Thank you for the patch,
-Chris
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915/guc: Don't read SOFT_SCRATCH(15) on MMIO error

2018-05-31 Thread Chris Wilson
Quoting Michal Wajdeczko (2018-05-28 18:16:18)
> SOFT_SCRATCH(15) is used by GuC for sending MMIO GuC events to host and
> those events are now handled by intel_guc_to_host_event_handler_mmio().
> 
> We should not try to read it on MMIO action error as 1) we may be using
> different set of registers for GuC MMIO communication, and 2) GuC may
> use CTB mechanism for sending events to host.
> 
> While here, upgrade error message to DRM_ERROR.
> 
> Signed-off-by: Michal Wajdeczko 
> Cc: Michel Thierry 
> Cc: Joonas Lahtinen 
> Cc: Chris Wilson 

I'm still not totally sold on having the DRM_ERROR here improves
debugging; it doesn't do anything to improve error handling, but

Reviewed-by: Chris Wilson 

nevertheless.
-Chris
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915/guc: Don't read SOFT_SCRATCH(15) on MMIO error

2018-05-29 Thread Michal Wajdeczko
On Tue, 29 May 2018 17:17:02 +0200, Chris Wilson  
 wrote:



Quoting Michal Wajdeczko (2018-05-29 16:10:44)

On Tue, 29 May 2018 16:54:12 +0200, Chris Wilson
 wrote:

> Quoting Michal Wajdeczko (2018-05-28 18:16:18)
>> SOFT_SCRATCH(15) is used by GuC for sending MMIO GuC events to host  
and
>> those events are now handled by  
intel_guc_to_host_event_handler_mmio().

>>
>> We should not try to read it on MMIO action error as 1) we may be  
using

>> different set of registers for GuC MMIO communication, and 2) GuC may
>> use CTB mechanism for sending events to host.
>
> Ok.
>
>> While here, upgrade error message to DRM_ERROR.
>
> Does the error help? What do you want to convey to the user? For error
> handling, we want to propagate the result back anyway for the caller  
has

> to decide what to do next.

We are propagating error code to the caller, but since any error from  
the

GuC is unexpected, we should rather always log it and don't rely on the
caller or drm debug for that. Note that in case of CTB we also log  
received

errors using DRM_ERROR (see intel_guc_send_ct).


But whose error? Ours or the hw? We expect hw errors, or should ;)


well, it can be any i915/FW/HW - hard to tell without other full logs..



But mostly from the pov of the message, is this the right information to
flag as the error or does the caller have better context?


Only caller can easily provide additional info related for failed command
(such as index/address that was rejected by FW) that could help diagnose
the problem, but in case FW/HW errors it does not matter.

At this point, we can only identify request/action ID that has failed.
But that's better than nothing.

/Michal
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915/guc: Don't read SOFT_SCRATCH(15) on MMIO error

2018-05-29 Thread Chris Wilson
Quoting Michal Wajdeczko (2018-05-29 16:10:44)
> On Tue, 29 May 2018 16:54:12 +0200, Chris Wilson  
>  wrote:
> 
> > Quoting Michal Wajdeczko (2018-05-28 18:16:18)
> >> SOFT_SCRATCH(15) is used by GuC for sending MMIO GuC events to host and
> >> those events are now handled by intel_guc_to_host_event_handler_mmio().
> >>
> >> We should not try to read it on MMIO action error as 1) we may be using
> >> different set of registers for GuC MMIO communication, and 2) GuC may
> >> use CTB mechanism for sending events to host.
> >
> > Ok.
> >
> >> While here, upgrade error message to DRM_ERROR.
> >
> > Does the error help? What do you want to convey to the user? For error
> > handling, we want to propagate the result back anyway for the caller has
> > to decide what to do next.
> 
> We are propagating error code to the caller, but since any error from the
> GuC is unexpected, we should rather always log it and don't rely on the
> caller or drm debug for that. Note that in case of CTB we also log received
> errors using DRM_ERROR (see intel_guc_send_ct).

But whose error? Ours or the hw? We expect hw errors, or should ;)

But mostly from the pov of the message, is this the right information to
flag as the error or does the caller have better context?
-Chris
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915/guc: Don't read SOFT_SCRATCH(15) on MMIO error

2018-05-29 Thread Michal Wajdeczko
On Tue, 29 May 2018 16:54:12 +0200, Chris Wilson  
 wrote:



Quoting Michal Wajdeczko (2018-05-28 18:16:18)

SOFT_SCRATCH(15) is used by GuC for sending MMIO GuC events to host and
those events are now handled by intel_guc_to_host_event_handler_mmio().

We should not try to read it on MMIO action error as 1) we may be using
different set of registers for GuC MMIO communication, and 2) GuC may
use CTB mechanism for sending events to host.


Ok.


While here, upgrade error message to DRM_ERROR.


Does the error help? What do you want to convey to the user? For error
handling, we want to propagate the result back anyway for the caller has
to decide what to do next.


We are propagating error code to the caller, but since any error from the
GuC is unexpected, we should rather always log it and don't rely on the
caller or drm debug for that. Note that in case of CTB we also log received
errors using DRM_ERROR (see intel_guc_send_ct).

Michal
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915/guc: Don't read SOFT_SCRATCH(15) on MMIO error

2018-05-29 Thread Chris Wilson
Quoting Michal Wajdeczko (2018-05-28 18:16:18)
> SOFT_SCRATCH(15) is used by GuC for sending MMIO GuC events to host and
> those events are now handled by intel_guc_to_host_event_handler_mmio().
> 
> We should not try to read it on MMIO action error as 1) we may be using
> different set of registers for GuC MMIO communication, and 2) GuC may
> use CTB mechanism for sending events to host.

Ok.
 
> While here, upgrade error message to DRM_ERROR.

Does the error help? What do you want to convey to the user? For error
handling, we want to propagate the result back anyway for the caller has
to decide what to do next.
-Chris
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx