Re: [PATCH] drm/i915/guc: Correct capture of EIR register on hang

2024-02-27 Thread Teres Alexis, Alan Previn
On Fri, 2024-02-23 at 12:32 -0800, john.c.harri...@intel.com wrote:
> From: John Harrison 
alan:snip

> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> @@ -51,6 +51,7 @@
> { RING_ESR(0),  0,  0, "ESR" }, \
> { RING_DMA_FADD(0), 0,  0, "RING_DMA_FADD_LDW" }, \
> { RING_DMA_FADD_UDW(0), 0,  0, "RING_DMA_FADD_UDW" }, \
> +   { RING_EIR(0),  0,  0, "EIR" }, \
> { RING_IPEIR(0),    0,  0, "IPEIR" }, \
> { RING_IPEHR(0),    0,  0, "IPEHR" }, \
> { RING_INSTPS(0),   0,  0, "INSTPS" }, \
> @@ -80,9 +81,6 @@
> { GEN8_RING_PDP_LDW(0, 3),  0,  0, "PDP3_LDW" }, \
> { GEN8_RING_PDP_UDW(0, 3),  0,  0, "PDP3_UDW" }
>  
> -#define COMMON_BASE_HAS_EU \
> -   { EIR,  0,  0, "EIR" }
> -
alan:snip

alan: Thanks for catching this one.
Reviewed-by: Alan Previn 


Re: [PATCH v9 0/2] Resolve suspend-resume racing with GuC destroy-context-worker

2024-01-02 Thread Teres Alexis, Alan Previn
On Wed, 2023-12-27 at 20:55 -0800, Teres Alexis, Alan Previn wrote:
> This series is the result of debugging issues root caused to
> races between the GuC's destroyed_worker_func being triggered
> vs repeating suspend-resume cycles with concurrent delayed
> fence signals for engine-freeing.
> 
alan:snip.

alan: I did not receive the CI-premerge email where the following was reported:
  IGT changes
  Possible regressions
  igt@i915_selftest@live@gt_pm:
  shard-rkl: PASS -> DMESG-FAIL
After going through the error in dmesg and the code, I am confident this failure is
not related to the series. This selftest calls rdmsrl functions (which don't do
any requests / GuC submissions) but gets back a power reading of zero (the
reported bug).
So this is unrelated.


Hi @"Vivi, Rodrigo" , just an FYI note that after the last
requested rebase, BAT has passed twice in a row, so I am confident the failures
on rev7 and prior were unrelated and that this series is ready for merging.
Thanks again for all your help and patience - this was a long one  :)


Re: [PATCH v8 2/2] drm/i915/guc: Close deregister-context race against CT-loss

2023-12-27 Thread Teres Alexis, Alan Previn
On Tue, 2023-12-26 at 10:11 -0500, Vivi, Rodrigo wrote:
> On Wed, Dec 20, 2023 at 11:08:59PM +0000, Teres Alexis, Alan Previn wrote:
> > On Wed, 2023-12-13 at 16:23 -0500, Vivi, Rodrigo wrote:
alan:snip

> > 
> > 
> > alan: Thanks Rodrigo for the RB last week, just a quick update:
> > 
> > I can't reproduce the BAT failures, which seem to be intermittent
> > across platform and test - however, a noticeable number of failures
> > do keep occurring on i915_selftest @live @requests, where the
> > last test leaked a wakeref and the failing test hangs waiting
> > for the gt to idle before starting its test.
> > 
> > I have to debug this further, although from code inspection it
> > looks unrelated to the patches in this series.
> > Hopefully it's a different issue.
> 
> Yeap, likely not related. Anyway, I'm sorry for not merging
> this sooner. Could you please send a rebased version? This
> one is not applying cleanly anymore.

Hi Rodrigo, I will rebase it as soon as I do a bit more testing.
I realized I was using a slightly older GuC, and with the newer GuC I am
seeing all kinds of failures, but they are trending as not an issue with this series.



Re: [PATCH v8 2/2] drm/i915/guc: Close deregister-context race against CT-loss

2023-12-20 Thread Teres Alexis, Alan Previn
On Wed, 2023-12-13 at 16:23 -0500, Vivi, Rodrigo wrote:
> On Tue, Dec 12, 2023 at 08:57:16AM -0800, Alan Previn wrote:
> > If we are at the end of suspend or very early in resume
> > its possible an async fence signal (via rcu_call) is triggered
> > to free_engines which could lead us to the execution of
> > the context destruction worker (after a prior worker flush).
alan:snip
> 
> > Thus, do an unroll in guc_lrc_desc_unpin and
> > deregister_destroyed_contexts if guc_lrc_desc_unpin fails due to CT send failure.
> > When unrolling, keep the context in the GuC's destroy-list so
> > it can get picked up on the next destroy worker invocation
> > (if suspend aborted) or get fully purged as part of a GuC
> > sanitization (end of suspend) or a reset flow.
> > 
> > Signed-off-by: Alan Previn 
> > Signed-off-by: Anshuman Gupta 
> > Tested-by: Mousumi Jana 
> > Acked-by: Daniele Ceraolo Spurio 
> 
> Thanks for all the explanations, patience and great work!
> 
> Reviewed-by: Rodrigo Vivi 

alan: Thanks Rodrigo for the RB last week, just a quick update:

I can't reproduce the BAT failures, which seem to be intermittent
across platform and test - however, a noticeable number of failures
do keep occurring on i915_selftest @live @requests, where the
last test leaked a wakeref and the failing test hangs waiting
for the gt to idle before starting its test.

I have to debug this further, although from code inspection it
looks unrelated to the patches in this series.
Hopefully it's a different issue.


Re: [PATCH v7 2/2] drm/i915/guc: Close deregister-context race against CT-loss

2023-11-30 Thread Teres Alexis, Alan Previn
> As far as I can tell, it's only if we started resetting / wedging right after
> this queued worker got started.

alan: hope Daniele can proofread my tracing and confirm whether I got it right.


Re: [PATCH v7 2/2] drm/i915/guc: Close deregister-context race against CT-loss

2023-11-30 Thread Teres Alexis, Alan Previn
On Thu, 2023-11-30 at 16:18 -0500, Vivi, Rodrigo wrote:
> On Wed, Nov 29, 2023 at 04:20:13PM -0800, Alan Previn wrote:
alan:snip
> > +
> > if (unlikely(disabled)) {
> > release_guc_id(guc, ce);
> > __guc_context_destroy(ce);
> > -   return;
> > +   return 0;
> 
> is success the right return case here?
alan: yes: we may discover "disabled == true" if submission_disabled
found that the gt is wedged. I don't believe such a case will happen as
part of flushing destroyed_worker_func during suspend, but it may occur
as part of regular runtime context closing that just happens to
get queued in the middle of a gt-reset/wedging process. In such a case,
the reset-prepare code will sanitize everything, including cleaning up
the pending-destruction-contexts link-list. So either we pick it up
from here and dump it ... or reset picks it up first and dumps it there
(where both dumpings only occur if GuC got disabled first).


Supplemental: How regular context cleanup leads to the same path -->

i915_sw_fence_notify -> engines_notify -> free_engines_rcu ->
intel_context_put -> kref_put(&ce->ref, ..) -> ce->ops->destroy ->

(where ce->ops = engine->cops and engine->cops = guc_context_ops)
*and, guc_context_ops->destroy == guc_context_destroy so ->

ce->ops->destroy -> guc_context_destroy ->
queue_work(.., &guc->submission_state.destroyed_worker);
-> [eventually] -> the same guc_lrc_desc_unpin above

However, with the additional "if (!intel_guc_is_ready(guc))" in destroyed_worker_func
as part of this patch, hitting this "disabled==true" case will be even less likely.
As far as I can tell, it's only if we started resetting / wedging right after this
queued worker got started. (I ran out of time to check if reset can ever occur
from within the same system_unbound_wq, but then I recall we also have CI using
debugfs to force wedging for select (/all?) igt tests, so I suspect it can still
happen in parallel.) NOTE: the checking of "is disabled" is not new
code - it's the original code.
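To make the unroll flow described above easier to follow, here is a minimal userspace model of the pattern (mock names and stand-in primitives, not the actual i915 code): the context state is flipped optimistically, then rolled back, with the context left on the destroy list, if the H2G send fails.

```c
#include <stdbool.h>

struct mock_ctx {
	bool registered;
	bool destroyed;
	bool on_destroy_list;
};

/* Stand-in for deregister_context(); fails when the CT channel is down. */
static int send_deregister_h2g(bool ct_up)
{
	return ct_up ? 0 : -1;
}

/*
 * Stand-in for the guc_lrc_desc_unpin() flow: transition the context
 * state optimistically, and unroll it (keeping the context queued on
 * the destroy list) if the H2G send fails, so a later worker pass or a
 * GuC sanitization can finish the job.
 */
static int mock_lrc_desc_unpin(struct mock_ctx *ce, bool ct_up)
{
	ce->destroyed = true;
	ce->registered = false;

	if (send_deregister_h2g(ct_up)) {
		/* Unroll: restore state, leave on the list for the next pass. */
		ce->destroyed = false;
		ce->registered = true;
		ce->on_destroy_list = true;
		return -1;
	}
	ce->on_destroy_list = false;
	return 0;
}
```

The same context can therefore go through the unpin path more than once without losing state.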

...alan

P.S.- oh man, that took a lot of code tracing as I can't remember these paths by heart.



Re: [PATCH v2 1/1] drm/i915/pxp: Add missing tag for Wa_14019159160

2023-11-29 Thread Teres Alexis, Alan Previn
On Wed, 2023-11-29 at 13:13 -0800, Teres Alexis, Alan Previn wrote:
> On Mon, 2023-11-27 at 15:24 -0500, Vivi, Rodrigo wrote:
> > On Wed, Nov 22, 2023 at 12:30:03PM -0800, Alan Previn wrote:
> alan:snip
> alan: thanks for reviewing, and apologies for replying late.
> 
> > >   /*
> > > -  * On MTL and newer platforms, protected contexts require setting
> > > -  * the LRC run-alone bit or else the encryption will not happen.
> > > +  * Wa_14019159160 - Case 2: mtl
> > > +  * On some platforms, protected contexts require setting
> > > +  * the LRC run-alone bit or else the encryption/decryption will not happen.
> > > +  * NOTE: Case 2 only applies to PXP use-case of said workaround.
> > >*/
> > 
> > hmm, interesting enough, on the wa description I read that it is incomplete for MTL/ARL
> > and something about a fuse bit. We should probably chase for some clarification in the HSD?!
> alan: yes, I went through the HSD description with the architect and we both agreed
> on how it reads from the KMD's perspective. At that time, the checking of the fuse
> by the KMD could be optimized out for Case-2-PXP: if the fuse was set a certain way,
> the KMD can skip setting the bit in the lrc because hw will enforce runalone in pxp
> mode irrespective of what the KMD does, but if the fuse was not set that way, the
> KMD needed to set runalone in the lrc. So for case 2, it's simpler to just always
> turn it on when the context is protected. That said, I realized the wording of the
> HSDs has changed since the original patch was implemented, so I will reclarify
> offline and reply back. NOTE: I believe John Harrison is working on case-3 through
> a separate series.
alan: based on conversations with the architect, a different WA number, 14019725520,
has the details of case-2 that explain the relationship to the new fuse.
However, as before, it can be optimized as explained above: PXP contexts
need to always set this bit to ensure encryption/decryption engages.
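As a rough sketch of that case-2 simplification (purely illustrative bit position and names, not the real LRC layout): rather than reading the fuse to decide whether hardware will enforce run-alone, the run-alone bit is set unconditionally for protected contexts, which is a safe superset of what the fuse check would require.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit position - for illustration only. */
#define MOCK_CTX_CTRL_RUNALONE	(1u << 7)

/*
 * Always set the run-alone bit for protected (PXP) contexts,
 * regardless of how the fuse is programmed: if hw enforces
 * run-alone anyway, setting the bit is harmless; if it does
 * not, the bit is required for encryption/decryption to engage.
 */
static uint32_t mock_lrc_ctx_ctrl(uint32_t ctx_ctrl, bool is_protected)
{
	if (is_protected)
		ctx_ctrl |= MOCK_CTX_CTRL_RUNALONE;
	return ctx_ctrl;
}
```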




Re: [PATCH v2 1/1] drm/i915/pxp: Add missing tag for Wa_14019159160

2023-11-29 Thread Teres Alexis, Alan Previn
On Mon, 2023-11-27 at 15:24 -0500, Vivi, Rodrigo wrote:
> On Wed, Nov 22, 2023 at 12:30:03PM -0800, Alan Previn wrote:
alan:snip
alan: thanks for reviewing, and apologies for replying late.

> > /*
> > -* On MTL and newer platforms, protected contexts require setting
> > -* the LRC run-alone bit or else the encryption will not happen.
> > +* Wa_14019159160 - Case 2: mtl
> > +* On some platforms, protected contexts require setting
> > +* the LRC run-alone bit or else the encryption/decryption will not happen.
> > +* NOTE: Case 2 only applies to PXP use-case of said workaround.
> >  */
> 
> hmm, interesting enough, on the wa description I read that it is incomplete for MTL/ARL
> and something about a fuse bit. We should probably chase for some clarification in the HSD?!
alan: yes, I went through the HSD description with the architect and we both agreed
on how it reads from the KMD's perspective. At that time, the checking of the fuse
by the KMD could be optimized out for Case-2-PXP: if the fuse was set a certain way,
the KMD can skip setting the bit in the lrc because hw will enforce runalone in pxp
mode irrespective of what the KMD does, but if the fuse was not set that way, the
KMD needed to set runalone in the lrc. So for case 2, it's simpler to just always
turn it on when the context is protected. That said, I realized the wording of the
HSDs has changed since the original patch was implemented, so I will reclarify
offline and reply back. NOTE: I believe John Harrison is working on case-3 through
a separate series.

> 
> > -   if (GRAPHICS_VER_FULL(ce->engine->i915) >= IP_VER(12, 70) &&
> > -   (ce->engine->class == COMPUTE_CLASS || ce->engine->class == RENDER_CLASS)) {
> > +   if (IS_METEORLAKE(ce->engine->i915) && (ce->engine->class == COMPUTE_CLASS ||
> > +   ce->engine->class == RENDER_CLASS)) {
> 
> This check now excludes the ARL with the same IP, no?!
alan: yeah - re-revved this part just now.
> 
> > rcu_read_lock();
> > gem_ctx = rcu_dereference(ce->gem_context);
> > if (gem_ctx)
> > 
> > base-commit: 5429d55de723544dfc0630cf39d96392052b27a1
> > -- 
> > 2.39.0
> > 



Re: [Intel-gfx] [PATCH v5] drm/i915/pxp: Add drm_dbgs for critical PXP events.

2023-11-29 Thread Teres Alexis, Alan Previn
On Fri, 2023-11-24 at 08:30 +, Tvrtko Ursulin wrote:
> On 22/11/2023 19:15, Alan Previn wrote:
alan:snip
alan: thanks for reviewing.
> > if (iir & GEN12_DISPLAY_STATE_RESET_COMPLETE_INTERRUPT)
> > -   pxp->session_events |= PXP_TERMINATION_COMPLETE;
> > +   pxp->session_events |= PXP_TERMINATION_COMPLETE | PXP_EVENT_TYPE_IRQ;
> 
> This looked to be doing more than adding debug logs, but then no one is 
> actually consuming this new flag?!
alan: see the hunk below; inside pxp_session_work, the drm_dbg prints
the "events" that came from above. We need this debug message to uniquely
recognize that a context's pxp keys got invalidated in response to
a KCR IRQ (as opposed to other known flows that are reported by other
means, like banning hanging contexts, or as part of suspend, etc.).
NOTE: pxp->session_events does get set in at least one other path
outside the IRQ (which triggers teardown but does not come from KCR hw).
This flag is solely for the debug message. Hope this works.
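A small model of that point (mock flag values mirroring the shape of the quoted defines, not the driver code): the IRQ-origin bit only feeds the debug string, while the handling decision is driven by the other bits.

```c
#include <stdint.h>
#include <stdio.h>

#define MOCK_TERMINATION_REQUEST	(1u << 0)
#define MOCK_TERMINATION_COMPLETE	(1u << 1)
#define MOCK_INVAL_REQUIRED		(1u << 2)
#define MOCK_EVENT_TYPE_IRQ		(1u << 3)

/*
 * The IRQ-origin bit rides along in the event mask purely so the
 * debug print can distinguish KCR-IRQ-driven teardowns; the return
 * value (whether there is work to do) only looks at the other bits.
 */
static int mock_handle_events(uint32_t events, char *dbg, size_t n)
{
	snprintf(dbg, n, "PXP: processing event-flags 0x%08x", events);
	return !!(events & (MOCK_TERMINATION_REQUEST |
			    MOCK_TERMINATION_COMPLETE |
			    MOCK_INVAL_REQUIRED));
}
```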

> 
> Regards,
> 
> Tvrtko
> 
> >   
> > if (pxp->session_events)
> > queue_work(system_unbound_wq, &pxp->session_work);
> > diff --git a/drivers/gpu/drm/i915/pxp/intel_pxp_session.c 
> > b/drivers/gpu/drm/i915/pxp/intel_pxp_session.c
> > index 0a3e66b0265e..091c86e03d1a 100644
> > --- a/drivers/gpu/drm/i915/pxp/intel_pxp_session.c
> > +++ b/drivers/gpu/drm/i915/pxp/intel_pxp_session.c
> > @@ -137,8 +137,10 @@ void intel_pxp_terminate(struct intel_pxp *pxp, bool 
> > post_invalidation_needs_res
> >   static void pxp_terminate_complete(struct intel_pxp *pxp)
> >   {
> > /* Re-create the arb session after teardown handle complete */
> > -   if (fetch_and_zero(&pxp->hw_state_invalidated))
> > +   if (fetch_and_zero(&pxp->hw_state_invalidated)) {
> > +   drm_dbg(&pxp->ctrl_gt->i915->drm, "PXP: creating arb_session after invalidation");
> > pxp_create_arb_session(pxp);
> > +   }
> >   
> > complete_all(>termination);
> >   }
> > @@ -157,6 +159,8 @@ static void pxp_session_work(struct work_struct *work)
> > if (!events)
> > return;
> >   
> > +   drm_dbg(&gt->i915->drm, "PXP: processing event-flags 0x%08x", events);
> > +
> > if (events & PXP_INVAL_REQUIRED)
> > intel_pxp_invalidate(pxp);
> >   
> > diff --git a/drivers/gpu/drm/i915/pxp/intel_pxp_types.h 
> > b/drivers/gpu/drm/i915/pxp/intel_pxp_types.h
> > index 7e11fa8034b2..07864b584cf4 100644
> > --- a/drivers/gpu/drm/i915/pxp/intel_pxp_types.h
> > +++ b/drivers/gpu/drm/i915/pxp/intel_pxp_types.h
> > @@ -124,6 +124,7 @@ struct intel_pxp {
> >   #define PXP_TERMINATION_REQUEST  BIT(0)
> >   #define PXP_TERMINATION_COMPLETE BIT(1)
> >   #define PXP_INVAL_REQUIRED   BIT(2)
> > +#define PXP_EVENT_TYPE_IRQ   BIT(3)
> >   };
> >   
> >   #endif /* __INTEL_PXP_TYPES_H__ */
> > 
> > base-commit: 5429d55de723544dfc0630cf39d96392052b27a1



Re: [PATCH v6 2/2] drm/i915/guc: Close deregister-context race against CT-loss

2023-11-29 Thread Teres Alexis, Alan Previn
On Mon, 2023-11-27 at 16:51 -0500, Vivi, Rodrigo wrote:

alan: Firstly, thanks for taking the time to review this, knowing you have a 
lot on your plate right now.
> 
alan:snip
> > @@ -3301,19 +3315,38 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce)
> > /* Seal race with Reset */
> > spin_lock_irqsave(&ce->guc_state.lock, flags);
> > disabled = submission_disabled(guc);
> > -   if (likely(!disabled)) {
> > -   __intel_gt_pm_get(gt);
> > -   set_context_destroyed(ce);
> > -   clr_context_registered(ce);
> > -   }
> > -   spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> 
> you are now holding this spin lock for too long...
alan: my intention was to ensure that an async H2G request to change this
gucid-context state wouldn't come in while we are in the middle of attempting
to send the context-destruction H2G, only to later realize it needed to be
unrolled (the corner-case race we are solving here, as discovered on a
customer platform). But after discussing with Daniele and John, they
agreed that we should move the unlock back to before the deregister and then
take the lock again inside the unrolling code (if deregister_context fails).
alan:snip
> 
> > -   deregister_context(ce, ce->guc_id.id);
> > +   /* GuC is active, lets destroy this context,
> 
> for multi-line comments you need to start with '/*' only
> and start the real comment below, like:
alan: my bad, will fix.
> *
>  * GuC is active, ...
> > +* but at this point we can still be racing with
> > +* suspend, so we undo everything if the H2G fails
> > +*/
> > +
> > +   /* Change context state to destroyed and get gt-pm */
> > +   __intel_gt_pm_get(gt);
> > +   set_context_destroyed(ce);
> > +   clr_context_registered(ce);
> > +
> > +   ret = deregister_context(ce, ce->guc_id.id);
> > +   if (ret) {
> > +   /* Undo the state change and put gt-pm if that failed */
> > +   set_context_registered(ce);
> > +   clr_context_destroyed(ce);
> > +   /*
> > +* Don't use the might_sleep / ASYNC version of put because
> > +* CT loss in deregister_context could mean an ongoing
> > +* reset or suspend flow. Immediately put before the unlock
> > +*/
> > +   __intel_wakeref_put(&gt->wakeref, 0);
> 
> interesting enough you use the '__' version to bypass the might_sleep(),
> but also sending 0 as argument what might lead you in the mutex_lock inside
> the spin lock area, what is not allowed.
alan: so one thing to note: the motivation for an alternative unlock was only
driven by the fact that we were holding that spinlock. However, as per the review
comment and response on the spin lock above, if we move the pm-put
outside the spin lock, we can call intel_wakeref_put_async (which will not
actually trigger a delayed task, because this function (guc_lrc_desc_unpin) starts
with GEM_BUG_ON(!intel_gt_pm_is_awake(gt)), which means it would only decrement
the ref count).
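A tiny mock of that reasoning (illustrative names, not the driver's wakeref implementation): because the caller is guaranteed to hold at least one other reference, the put here only decrements the count and can never reach the sleeping last-put path.

```c
#include <stdbool.h>

struct mock_wakeref {
	int count;
	bool released;	/* would mean running the (possibly sleeping) last-put path */
};

/*
 * Models an async-style put: decrement, and only run the last-put
 * path if the count actually hits zero. When the caller already
 * holds another reference (count >= 2 on entry), the last-put path
 * is unreachable, so no sleeping work is ever scheduled.
 */
static void mock_wakeref_put_async(struct mock_wakeref *wf)
{
	if (--wf->count == 0)
		wf->released = true;
}
```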
alan:snip
> > +   spin_lock_irqsave(&guc->submission_state.lock, flags);
> > +   list_add_tail(&ce->destroyed_link,
> > + &guc->submission_state.destroyed_contexts);
> > +   spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> > +   /* Bail now since the list might never be emptied if h2gs fail */
> 
> For this GuC interaction part I'd like to get an ack from John Harrison please.
alan: okay - will do. I might alternatively tap Daniele, since he knows the
GuC just as well.
> 
alan:snip

> > @@ -3392,6 +3440,17 @@ static void destroyed_worker_func(struct work_struct *w)
> > struct intel_gt *gt = guc_to_gt(guc);
> > int tmp;
> >  
> > +   /*
> > +* In rare cases we can get here via async context-free fence-signals that
> > +* come very late in suspend flow or very early in resume flows. In these
> > +* cases, GuC won't be ready but just skipping it here is fine as these
> > +* pending-destroy-contexts get destroyed totally at GuC reset time at the
> > +* end of suspend.. OR.. this worker can be picked up later on the next
> > +* context destruction trigger after resume-completes
> > +*/
> > +   if (!intel_guc_is_ready(guc))
> > +   return;
> 
> is this racy?
alan: this is to reduce raciness. This worker function eventually calls
deregister_destroyed_contexts, which calls the guc_lrc_desc_unpin that does
all the work we discussed above (taking locks, taking refs, sending the h2g,
potentially unrolling). So this check is an optimization to skip all of that
without going through the locking. So yes, it's a bit racy, but it's trying to
reduce raciness. NOTE1: without this line of code, in theory everything would
still work fine, with the driver potentially experiencing more unroll cases
in those thousands of cycles. NOTE2: this check using intel_guc_is_ready
is done in many places in the driver, knowing that it doesn't take any locks.
Of 
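The worker early-out can be modeled like so (mock names, userspace stand-in for the kernel worker): when GuC is not ready, pending contexts simply remain queued, either for purging at GuC reset or for the next worker invocation after resume.

```c
#include <stdbool.h>

struct mock_guc {
	bool ready;	/* models intel_guc_is_ready() */
	int pending;	/* contexts on the destroyed list */
};

/* Models deregister_destroyed_contexts(): drains the list via H2Gs. */
static void mock_deregister_destroyed(struct mock_guc *guc)
{
	guc->pending = 0;
}

/*
 * Models destroyed_worker_func() with the early-out discussed above:
 * skipping when not ready is benign because nothing is lost - the
 * queued contexts are handled later or purged at GuC sanitization.
 */
static void mock_destroyed_worker(struct mock_guc *guc)
{
	if (!guc->ready)
		return;
	mock_deregister_destroyed(guc);
}
```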

Re: [Intel-gfx] [PATCH v3 1/1] drm/i915/pxp: Add missing tag for Wa_14019159160

2023-11-29 Thread Teres Alexis, Alan Previn
On Tue, 2023-11-28 at 10:03 -0800, Roper, Matthew D wrote:
> On Mon, Nov 27, 2023 at 12:11:50PM -0800, Alan Previn wrote:
> > Add missing tag for "Wa_14019159160 - Case 2" (for existing
> > PXP code that ensures the run-alone mode bit is set to allow
> > PXP decryption).
alan:snip.
alan: thanks for reviewing.
> > -   if (GRAPHICS_VER_FULL(ce->engine->i915) >= IP_VER(12, 70) &&
> > +   if (GRAPHICS_VER_FULL(ce->engine->i915) == IP_VER(12, 70) &&
> 
> The workaround database lists this as being needed on both 12.70 and
> 12.71.  Should this be a
> 
> IS_GFX_GT_IP_RANGE(gt, IP_VER(12, 70), IP_VER(12, 71))
alan: sure - will do this.
> 
> check instead?
> 
> The workaround is also listed in the database as applying to DG2; is
> this "case 2" subset of the workaround not relevant to that platform?
alan: I don't believe so - not for the pxp case 2.



Re: [PATCH v1 1/1] drm/i915/pxp: Bail early in pxp tee backend on first teardown error

2023-11-16 Thread Teres Alexis, Alan Previn
On Thu, 2023-11-16 at 15:20 -0800, Teres Alexis, Alan Previn wrote:
> For Gen12 when using the mei-pxp tee backend transport, if we are coming
> up from a cold boot or from a resume (not runtime resume), we can
> optionally quicken the very first session cleanup that would occur
> as part of starting up a default PXP session. Typically a cleanup
> from a cold start is expected to be quick, so we can use a shorter
> timeout and skip retries (when getting non-success on calling the
> backend transport for intel_pxp_tee_end_arb_fw_session).
alan:snip

> @@ -387,10 +389,14 @@ void intel_pxp_tee_end_arb_fw_session(struct intel_pxp *pxp, u32 session_id)
>   ret = intel_pxp_tee_io_message(pxp,
>  &msg_in, sizeof(msg_in),
>  &msg_out, sizeof(msg_out),
> -NULL);
> +NULL, pxp->hw_state_coldstart ?
> +PXP_TRANSPORT_TIMEOUT_FAST_MS : PXP_TRANSPORT_TIMEOUT_MS);
>  
> - /* Cleanup coherency between GT and Firmware is critical, so try again if it fails */
> - if ((ret || msg_out.header.status != 0x0) && ++trials < 3)
> + /*
> +  * Cleanup coherency between GT and Firmware is critical, so try again if it
> +  * fails, unless we are performing a cold-start reset
> +  */
> + if ((ret || msg_out.header.status != 0x0) && !pxp->hw_state_coldstart && ++trials < 3)
alan: Take note that I am working offline with sister teams to perform some
end-to-end conformance testing with more comprehensive OS stacks before we can
verify that this optimization doesn't break any existing use-cases.
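The retry/timeout policy described above can be sketched as follows (mock names and timeout values, not the mei-pxp code): cold start means one attempt with the fast timeout; otherwise the normal timeout applies with up to three attempts.

```c
#include <stdbool.h>

#define MOCK_TIMEOUT_MS		5000	/* illustrative "normal" transport timeout */
#define MOCK_TIMEOUT_FAST_MS	50	/* illustrative shorter cold-start value */

/*
 * Models the session-teardown policy: pick the timeout based on
 * cold-start state, and only retry (up to 3 total attempts) when
 * not coming up from a cold start.
 */
static int mock_end_arb_session(bool coldstart,
				int (*send)(int timeout_ms, void *data),
				void *data)
{
	int trials = 0, ret;

	do {
		ret = send(coldstart ? MOCK_TIMEOUT_FAST_MS :
				       MOCK_TIMEOUT_MS, data);
	} while (ret && !coldstart && ++trials < 3);

	return ret;
}

/* Test helper: fails the first two calls, then succeeds. */
static int mock_send_flaky(int timeout_ms, void *data)
{
	int *calls = data;
	(void)timeout_ms;
	return (*calls)++ < 2 ? -1 : 0;
}
```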


Re: [PATCH v1 1/1] drm/i915/gt: Dont wait forever when idling in suspend

2023-11-14 Thread Teres Alexis, Alan Previn
On Tue, 2023-11-14 at 08:22 -0800, Teres Alexis, Alan Previn wrote:
> When suspending, add a timeout when calling
> intel_gt_pm_wait_for_idle, else if we have a leaked
> wakeref (which would be indicative of a bug elsewhere
> in the driver), the driver will at least exit the
> suspend-resume cycle after the kernel detects the held
> reference and prints a message to abort suspending,
> instead of hanging in the kernel forever, which would
> then require a serial connection or ramoops dump to debug further.
NOTE: this patch originates from Patch #3 of this other series
https://patchwork.freedesktop.org/series/121916/ (rev 5 and prior)
and it was decided to move it out as its own patch, since this
patch is trying to improve general debuggability as opposed
to resolving the bug being fixed in the above series.
alan:snip

> +int intel_wakeref_wait_for_idle(struct intel_wakeref *wf, int timeout_ms)
>  {
> - int err;
> + int err = 0;
>  
>   might_sleep();
>  
> - err = wait_var_event_killable(&wf->wakeref,
> -   !intel_wakeref_is_active(wf));
> + if (!timeout_ms)
> + err = wait_var_event_killable(&wf->wakeref,
> +   !intel_wakeref_is_active(wf));
> + else if (wait_var_event_timeout(&wf->wakeref,
> + !intel_wakeref_is_active(wf),
> + msecs_to_jiffies(timeout_ms)) < 1)
> + err = -ETIMEDOUT;
> +
alan: paraphrasing feedback from Tvrtko on the series this patch originated
from: it would be a good idea to add error-injection into this timeout to
ensure we don't have any other subsystem that could inadvertently leak an
rpm wakeref (and to catch such bugs pre-merge in the future).




Re: [Intel-gfx] [PATCH v4 3/3] drm/i915/gt: Timeout when waiting for idle in suspending

2023-11-14 Thread Teres Alexis, Alan Previn
On Tue, 2023-11-14 at 17:52 +, Tvrtko Ursulin wrote:
> On 14/11/2023 17:37, Teres Alexis, Alan Previn wrote:
> > On Tue, 2023-11-14 at 17:27 +, Tvrtko Ursulin wrote:
> > > On 13/11/2023 17:57, Teres Alexis, Alan Previn wrote:
> > > > On Wed, 2023-10-25 at 13:58 +0100, Tvrtko Ursulin wrote:
> > > > > On 04/10/2023 18:59, Teres Alexis, Alan Previn wrote:
> > > > > > On Thu, 2023-09-28 at 13:46 +0100, Tvrtko Ursulin wrote:
> > > > > > > On 27/09/2023 17:36, Teres Alexis, Alan Previn wrote:
> 
alan:snip

> > alan: So I did trace back the gt->wakeref before I posted these patches and
> > see that within these runtime get/put calls, I believe the first 'get' leads
> > to __intel_wakeref_get_first, which calls intel_runtime_pm_get via the rpm_get
> > helper and eventually executes a pm_runtime_get_sync(rpm->kdev); (hanging off
> > i915_device). (Naturally there is a corresponding mirror for the '_put_last'.)
> > So this means the first-get and last-put let the kernel know. That's why when
> > I tested this patch, it did actually cause the suspend to abort from the kernel
> > side, and the kernel would print a message indicating i915 was the one that
> > didn't release all refs.
> 
> Ah that would be much better then.
> 
> Do you know if everything gets resumed/restored correctly in that case 
> or we would need some additional work to maybe early exit from callers 
> of wait_for_suspend()?
alan: So assuming we are still discussing a "potential new
future leaked-wakeref bug" (i.e. putting aside the fact that
Patch #1 + #2 resolve this specific series' bug): based on the
previous testing we did, after this timeout-bail trigger,
the suspend flow bails and gt/guc operation does actually continue
as normal. However, it's been a long time since we tested this, so
I am not sure how accidental new future bugs might play into this
assumption, especially if some other subsystem leaked the rpm
wakeref but that subsystem did NOT get reset the way GuC is reset
at the end of suspend.

> 
> What I would also ask is to see if something like injecting a probing 
> failure is feasible, so we can have this new timeout exit path 
> constantly/regularly tested in CI.
alan: That's a good idea. In line with this, I would like to point out that
rev6 of this series has been posted, but I removed this Patch #3. However, I did
post Patch #3 as a standalone patch here:
https://patchwork.freedesktop.org/series/126414/
as I anticipate this patch will truly help with future issue debuggability.

That said, I shall post a review comment on that patch with your suggestion to add
an injected probe error for the suspend-resume flow and follow up on that one
separately.

> 
> Regards,
> 
> Tvrtko
> 
> > alan: Anyways, I have pulled this patch out of rev6 of this series and created a
> > separate standalone patch for this patch #3 that we can review independently.
> > 



Re: [Intel-gfx] [PATCH v4 3/3] drm/i915/gt: Timeout when waiting for idle in suspending

2023-11-14 Thread Teres Alexis, Alan Previn
On Tue, 2023-11-14 at 12:36 -0500, Vivi, Rodrigo wrote:
> On Tue, Nov 14, 2023 at 05:27:18PM +, Tvrtko Ursulin wrote:
> > 
> > On 13/11/2023 17:57, Teres Alexis, Alan Previn wrote:
> > > On Wed, 2023-10-25 at 13:58 +0100, Tvrtko Ursulin wrote:
> > > > On 04/10/2023 18:59, Teres Alexis, Alan Previn wrote:
> > > > > On Thu, 2023-09-28 at 13:46 +0100, Tvrtko Ursulin wrote:
> > > > > > On 27/09/2023 17:36, Teres Alexis, Alan Previn wrote:
> 
> That's a good point.
> Well, current code is already bad and buggy on suspend-resume. We could get
> suspend stuck forever without any clue of what happened.
> At least the proposed patch add some gt_warn. But, yes, the right thing
> to do should be entirely abort the suspend in case of timeout, besides the
> gt_warn.
alan: yes - that was the whole idea for Patch #3. Only after adding such
code did we have much better debuggability on real-world production platforms
+ configs that may not have a serial port or ramoops dump by default.

Btw, as per previous comments by Tvrtko - which I agreed with - I have
moved this one single patch into a separate patch on its own.
See here: https://patchwork.freedesktop.org/series/126414/
(I also maintained the RB from you, Rodrigo, because the patch did not change.)
And yes, the gt_warn is still in place :)


Re: [Intel-gfx] [PATCH v4 3/3] drm/i915/gt: Timeout when waiting for idle in suspending

2023-11-14 Thread Teres Alexis, Alan Previn
On Tue, 2023-11-14 at 17:27 +, Tvrtko Ursulin wrote:
> On 13/11/2023 17:57, Teres Alexis, Alan Previn wrote:
> > On Wed, 2023-10-25 at 13:58 +0100, Tvrtko Ursulin wrote:
> > > On 04/10/2023 18:59, Teres Alexis, Alan Previn wrote:
> > > > On Thu, 2023-09-28 at 13:46 +0100, Tvrtko Ursulin wrote:
> > > > > On 27/09/2023 17:36, Teres Alexis, Alan Previn wrote:
alan: snip
> 
> > alan: I won't say it's NOT fixing anything - I am saying it's not fixing
> > this specific bug where we have the outstanding G2H from a context destruction
> > operation that got dropped. I am okay with dropping this patch in the next rev
> > but shall post a separate stand-alone version of Patch #3 - because I believe
> > all other i915 subsystems that take runtime-pm's DO NOT wait forever when
> > entering suspend - but GT is doing this. This means if there ever was a bug
> > introduced, it would require serial port or ramoops collection to debug. So I
> > think such a patch, despite not fixing this specific bug, will be very helpful
> > for debuggability of future issues. After all, it's better to fail our suspend
> > when we have a bug rather than to hang the kernel forever.
> 
> Issue I have is that I am not seeing how it fails the suspend when 
> nothing is passed out from *void* wait_suspend(gt). To me it looks the 
> suspend just carries on. And if so, it sounds dangerous to allow it to 
> do that with any future/unknown bugs in the suspend sequence. Existing 
> timeout wedges the GPU (and all that entails). New code just says "meh 
> I'll just carry on regardless".

alan: So I did trace back the gt->wakeref before I posted these patches and
see that within these runtime get/put calls, I believe the first 'get' leads
to __intel_wakeref_get_first, which calls intel_runtime_pm_get via the rpm_get
helper and eventually executes a pm_runtime_get_sync(rpm->kdev); (hanging off
i915_device). (Naturally there is a corresponding mirror for the '_put_last'.)
So this means the first-get and last-put let the kernel know. That's why when
I tested this patch, it did actually cause the suspend to abort from the kernel
side, and the kernel would print a message indicating i915 was the one that
didn't release all refs.

alan: Anyways, I have pulled this patch out of rev6 of this series and created a
separate standalone patch for this patch #3 that we can review independently.



Re: [Intel-gfx] [PATCH v3] drm/i915: Skip pxp init if gt is wedged

2023-11-13 Thread Teres Alexis, Alan Previn
On Mon, 2023-11-13 at 14:49 -0800, Zhanjun Dong wrote:
> The gt wedged could be triggered by missing guc firmware file, HW not
> working, etc. Once triggered, it means all gt usage is dead, therefore we
> can't enable pxp under this fatal error condition.
> 
> 
alan:skip
alan: this looks good (as per our offline review/discussion):
we don't mess with the current driver startup sequence (i.e. pxp
failures can never pull down the driver probing and will not
generate warnings or errors). Also, if something does break for PXP,
we only do a drm_debug if the failure returned is not -ENODEV
(since -ENODEV can happen in the majority of cases with
legacy products or with non-PXP kernel configs):

Reviewed-by: Alan Previn 


Re: [PATCH] drm/i915: Initialize residency registers earlier

2023-11-13 Thread Teres Alexis, Alan Previn
On Mon, 2023-10-30 at 16:45 -0700, Belgaumkar, Vinay wrote:
alan:skip
> +++ b/drivers/gpu/drm/i915/gt/intel_rc6.c
> @@ -608,11 +608,13 @@ void intel_rc6_init(struct intel_rc6 *rc6)
>   /* Disable runtime-pm until we can save the GPU state with rc6 pctx */
>   rpm_get(rc6);
>  
> - if (!rc6_supported(rc6))
> - return;
> -
>   rc6_res_reg_init(rc6);
>  
> + if (!rc6_supported(rc6)) {
> + rpm_put(rc6);
> + return;
> + }
> +
>   if (IS_CHERRYVIEW(i915))
>   err = chv_rc6_init(rc6);
>   else if (IS_VALLEYVIEW(i915))

alan: as far as the bug this patch is addressing (i.e. ensuring that
intel_rc6_print_residency has valid rc6.res_reg values for correct dprc
debugfs output when rc6 is disabled) and releasing the rpm ref, this looks
good to me.

However, when looking at the other code flows around intel_rc6_init/fini
and intel_rc6_enable/disable, i must point out that the calls to rpm_get
and rpm_put from these functions don't seem to be designed with proper
mirroring. For example, during driver startup, intel_rc6_init (which is
called by intel_gt_pm_init) calls rpm_get at the start but doesn't call
rpm_put before it returns. Back up the callstack in intel_gt_init, after
intel_gt_pm_init, a couple of subsystems get initialized before
intel_gt_resume is called - which in turn calls intel_rc6_enable, which
does the rpm_put at its end. However, before that get and put, i see
several error paths that trigger cleanups (leading eventually to driver
load failure), and i think some cases are completely missing the rpm_put
that intel_rc6_init took. Additionally, the function names rc6_init and
__get_rc6 inside i915_pmu.c seem confusing, although static. I wish those
were named pmu_rc6_init, __pmu_rc6_init, etc.
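As a sketch of the mirrored structure argued for here (purely illustrative names; the real functions live in intel_rc6.c), every exit path out of an init-style function should reach the put that matches the get taken at entry:

```c
#include <assert.h>

static int rpm;                          /* stand-in runtime-pm refcount */
static void rpm_get(void) { rpm++; }
static void rpm_put(void) { rpm--; }

/*
 * demo_init models the desired shape: the rpm ref taken at entry is
 * released on every exit path, including the not-supported early-out
 * and the error path, so no caller has to remember to unwind it.
 */
static int demo_init(int supported, int init_ok)
{
	int err = 0;

	rpm_get();
	/* residency registers would be initialized here unconditionally */
	if (!supported)
		goto out;                /* early exit still hits the put */
	if (!init_ok) {
		err = -1;
		goto out;                /* error path also unwinds */
	}
out:
	rpm_put();
	return err;
}
```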

Anyways, as per offline conversation, we are not trying to solve every
bug and design gap in this patch but just one specific bug fix. So as
per the agreed condition that we create a separate internal issue
to address this "lack of a clean mirrored-function design of rpm_get/put
across the rc6 startup sequences", here is my rb:

Reviewed-by: Alan Previn 



Re: [Intel-gfx] [PATCH v4 3/3] drm/i915/gt: Timeout when waiting for idle in suspending

2023-11-13 Thread Teres Alexis, Alan Previn
On Wed, 2023-10-25 at 13:58 +0100, Tvrtko Ursulin wrote:
> On 04/10/2023 18:59, Teres Alexis, Alan Previn wrote:
> > On Thu, 2023-09-28 at 13:46 +0100, Tvrtko Ursulin wrote:
> > > On 27/09/2023 17:36, Teres Alexis, Alan Previn wrote:
alan:snip
> > > It is not possible to wait for lost G2H in something like
> > > intel_uc_suspend() and simply declare "bad things happened" if it times
> > > out there, and forcibly clean it all up? (Which would include releasing
> > > all the abandoned pm refs, so this patch wouldn't be needed.)
> > > 
> > alan: I'm not sure if intel_uc_suspend should be held up by gt-level wakeref
> > check unless huc/guc/gsc-uc are the only ones ever taking a gt wakeref.
> > 
> > As we already know, what we do know from a uc-perspective:
> > -  ensure the outstanding guc related workers is flushed which we didnt 
> > before
> > (addressed by patch #1).
> > - any further late H2G-SchedDisable is not leaking wakerefs when calling H2G
> > and not realizing it failed (addressed by patch #2).
> > - (we already), "forcibly clean it all up" at the end of the 
> > intel_uc_suspend
> > when we do the guc reset and cleanup all guc-ids. (pre-existing upstream 
> > code)
> > - we previously didnt have a coherrent guarantee that "this is the end" 
> > i.e. no
> > more new request after intel_uc_suspend. I mean by code logic, we thought 
> > we did
> > (thats why intel_uc_suspend ends wth a guc reset), but we now know 
> > otherwise.
> > So we that fix by adding the additional rcu_barrier (also part of patch #2).
> 
> It is not clear to me from the above if that includes cleaning up the 
> outstanding CT replies or no. But anyway..
alan: Apologies, i should have made it clearer by citing the actual function
names when calling out the steps of interest. If you look at the function
intel_gt_suspend_late, it calls intel_uc_suspend, which in turn calls
intel_guc_suspend, which does a soft reset onto the guc firmware - so after
that we can assume all outstanding G2Hs are done. Going back up the stack,
intel_gt_suspend_late finally calls gt_sanitize, which calls
intel_uc_reset_prepare, which calls intel_guc_submission_reset_prepare,
which calls scrub_guc_desc_for_outstanding_g2h - and that does what we are
looking for, for all types of outstanding G2H. As per prior review comments,
besides closing the race condition, we were missing an rcu_barrier (which
let new requests come in very late). We have resolved both in patch #2.
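The ordering above can be modeled with stubs (the step names mirror the real call chain in the driver, but the bodies here are trivial stand-ins):

```c
#include <assert.h>
#include <string.h>

#define MAX_STEPS 8
static const char *trace[MAX_STEPS];
static int ntrace;
static void step(const char *s) { trace[ntrace++] = s; }

/* stubs mirroring the call chain described above */
static void intel_guc_suspend_model(void)     { step("guc-soft-reset"); }
static void intel_uc_suspend_model(void)      { step("flush-workers"); intel_guc_suspend_model(); }
static void scrub_outstanding_g2h_model(void) { step("scrub-g2h"); }
static void gt_sanitize_model(void)           { scrub_outstanding_g2h_model(); }

static void intel_gt_suspend_late_model(void)
{
	intel_uc_suspend_model();  /* soft reset: no new G2H after this */
	gt_sanitize_model();       /* then scrub whatever was outstanding */
}
```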

> > That said, patch-3 is NOT fixing a bug in guc -its about "if we ever have
> > a future racy gt-wakeref late-leak somewhere - no matter which subsystem
> > took it (guc is not the only subsystem taking gt wakerefs), we at
> > least don't hang forever in this code. Ofc, based on that, even without
> > patch-3 i am confident the issue is resolved anyway.
> > So we could just drop patch-3 is you prefer?
> 
> .. given this it does sound to me that if you are confident patch 3 
> isn't fixing anything today that it should be dropped.
alan: I won't say it's NOT fixing anything - i am saying it's not fixing
this specific bug, where we have an outstanding G2H from a context
destruction operation that got dropped. I am okay with dropping this patch
in the next rev but shall post a separate standalone version of patch 3 -
because I believe all other i915 subsystems that take runtime-pm refs DO
NOT wait forever when entering suspend, but GT is doing exactly that. This
means that if such a bug were ever introduced, it would require serial-port
or ramoops collection to debug. So i think such a patch, despite not fixing
this specific bug, will be very helpful for debuggability of future issues.
After all, it's better to fail our suspend when we have a bug rather than
to hang the kernel forever.
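A minimal sketch of that "bounded wait instead of hang" behaviour (assumed shape only; intel_gt_pm_wait_timeout_for_idle in the actual series takes a gt and a millisecond timeout):

```c
#include <assert.h>
#include <errno.h>

/*
 * Poll a wakeref count for a bounded number of iterations and report
 * -ETIMEDOUT instead of waiting forever; the suspend path can then warn
 * and bail rather than hang the kernel. Purely illustrative.
 */
static int wait_timeout_for_idle(const int *wakerefs, int timeout_ticks)
{
	int t;

	for (t = 0; t < timeout_ticks; t++) {
		if (*wakerefs == 0)
			return 0;        /* idle: suspend may proceed */
		/* the real code would sleep/reschedule between polls */
	}
	return -ETIMEDOUT;               /* leaked ref: report and bail */
}
```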





Re: [Intel-gfx] [PATCH] drm/i915: Skip pxp init if gt is wedged

2023-10-31 Thread Teres Alexis, Alan Previn
On Fri, 2023-10-27 at 10:13 +0300, Jani Nikula wrote:
> On Thu, 26 Oct 2023, Zhanjun Dong  wrote:
> 
alan:snip
> I'll note that nobody checks intel_pxp_init() return status, so this
> silently skips PXP.
> 
> BR,
> Jani.

alan:snip
> > +   if (intel_gt_is_wedged(gt))
> > +   return -ENODEV;
> > +

alan: wondering if we can add a drm_dbg in the caller of intel_pxp_init and
use a unique return value for the case of gt_is_wedged (for example, -ENXIO).
As we know, gt being wedged at startup basically means all gt usage is dead
and therefore we can't enable PXP (along with everything else that needs
submission / guc / etc). With a drm_dbg in the caller that prints that
return value, it helps us differentiate between gt-is-wedged vs the
platform config not supporting PXP. However, this would mean new drm-debug
'noise' for platforms that i915 just doesn't support PXP on at all - which
would be okay if we don't use drm_warn or drm_err and use a 'softer'
message like "PXP skipped with %d".
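A sketch of what that nit would look like (the return values and the message text here are suggestions, not the merged code):

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* distinct codes: wedged gt vs platform without PXP support */
static int pxp_init_model(int wedged, int supported)
{
	if (wedged)
		return -ENXIO;           /* gt wedged: all gt usage is dead */
	if (!supported)
		return -ENODEV;          /* legacy product / non-PXP config */
	return 0;
}

/* the caller emits one soft debug line instead of a warn/err */
static int pxp_init_caller_model(int wedged, int supported, char *dbg, size_t n)
{
	int ret = pxp_init_model(wedged, supported);

	if (ret)
		snprintf(dbg, n, "PXP skipped with %d", ret);
	return ret;
}
```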

Please treat the above comment as a "nit" - i.e. the existing patch is good
enough for me (after addressing Jani's request for more commit message
info). ...alan




Re: [Intel-gfx] [PATCH v4 3/3] drm/i915/gt: Timeout when waiting for idle in suspending

2023-10-04 Thread Teres Alexis, Alan Previn
On Thu, 2023-09-28 at 13:46 +0100, Tvrtko Ursulin wrote:
> On 27/09/2023 17:36, Teres Alexis, Alan Previn wrote:
> > Thanks for taking the time to review this Tvrtko, replies inline below.
alan:snip

> > > 
> > > Main concern is that we need to be sure there are no possible
> > > ill-effects, like letting the GPU/GuC scribble on some memory we
> > > unmapped (or will unmap), having let the suspend continue after timing
> > > out, and not perhaps doing the forced wedge like wait_for_suspend() does
> > > on the existing timeout path.
> > alan: this will not happen because the held wakeref is never force-released
> > after the timeout - so what happens is the kernel would bail the suspend.
> 
> How does it know to fail the suspend when there is no error code 
> returned with this timeout? Maybe a stupid question.. my knowledge of 
> suspend-resume paths was not great even before I forgot it all.
alan: Tvrtko, you and I both, sir. (Apologies for the tardy response - yet
again a busy week.)
So i did trace back the gt->wakeref before i posted these patches and
discovered that, within these runtime get/put calls, the first 'get' leads to
__intel_wakeref_get_first, which calls intel_runtime_pm_get via the rpm_get
helper and eventually executes pm_runtime_get_sync(rpm->kdev) (hanging off
i915). (Ofc, there is a corresponding path for '_put_last'.) Non-first /
non-last calls only increase the counter for the gt... but a missing
last-put means the kernel knows i915 hasn't 'put' everything.

alan:snip
> > 
> > Recap: so in both cases (original vs this patch), if we had a buggy 
> > gt-wakeref leak,
> > we dont get invalid guc-accesses, but without this patch, we wait forever,
> > and with this patch, we get some messages and eventually bail the suspend.
> 
> It is not possible to wait for lost G2H in something like 
> intel_uc_suspend() and simply declare "bad things happened" if it times 
> out there, and forcibly clean it all up? (Which would include releasing 
> all the abandoned pm refs, so this patch wouldn't be needed.)
> 
alan: I'm not sure if intel_uc_suspend should be held up by a gt-level
wakeref check, unless huc/guc/gsc-uc are the only ones ever taking a gt
wakeref.

What we do know from a uc perspective:
- ensure the outstanding guc-related workers are flushed, which we didn't
do before (addressed by patch #1).
- any further late H2G-SchedDisable no longer leaks wakerefs when calling
H2G without realizing it failed (addressed by patch #2).
- we already "forcibly clean it all up" at the end of intel_uc_suspend,
when we do the guc reset and clean up all guc-ids (pre-existing upstream
code).
- we previously didn't have a coherent guarantee that "this is the end",
i.e. no more new requests after intel_uc_suspend. I mean, by code logic, we
thought we did (that's why intel_uc_suspend ends with a guc reset), but we
now know otherwise. So we fix that by adding the additional rcu_barrier
(also part of patch #2).

That said, patch 3 is NOT fixing a bug in guc - it's about "if we ever have
a future racy gt-wakeref late-leak somewhere - no matter which subsystem
took it (guc is not the only subsystem taking gt wakerefs) - we at least
don't hang forever in this code". Ofc, based on that, even without patch 3
i am confident the issue is resolved anyway.
So we could just drop patch 3 if you prefer?

...alan


Re: [PATCH v4 2/3] drm/i915/guc: Close deregister-context race against CT-loss

2023-10-04 Thread Teres Alexis, Alan Previn
On Wed, 2023-10-04 at 06:34 +, Gupta, Anshuman wrote:
> 
> > -Original Message-
> > From: Teres Alexis, Alan Previn  > @@ -289,6 +289,13 @@ int intel_gt_resume(struct intel_gt *gt)
> > 
> >  static void wait_for_suspend(struct intel_gt *gt)  {
> > +   /*
> > +* On rare occasions, we've observed the fence completion trigger
> > +* free_engines asynchronously via rcu_call. Ensure those are done.
> > +* This path is only called on suspend, so it's an acceptable cost.
> > +*/
> > +   rcu_barrier();
> Let's add the barrier after the end of prepare suspend and at start of late 
> suspend.
> To make sure we don't have any async destroy from any user request or any 
> internal  kmd request during i915 suspend?
> Br,
> Anshuman Gupta.
alan: some thoughts: actually, wait_for_suspend is being called from both
intel_gt_suspend_prepare and intel_gt_suspend_late, so putting the barrier
in the above location would get hit for both steps.
However, because wait_for_suspend may optionally wedge the system if
outstanding requests were stuck for more than I915_GT_SUSPEND_IDLE_TIMEOUT,
wouldn't it be better to add the barrier before that check (i.e. in the
above location) as opposed to after the return from wait_for_suspend at the
end of suspend_prepare - which would defeat the purpose of even checking
intel_gt_wait_for_idle from within wait_for_suspend (which is existing
code)? Perhaps we need to get one last test on this?
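The placement argument can be reduced to this toy model (illustrative only): deferred rcu_call frees still hold gt refs, so the barrier has to run before the idleness judgement, not after it:

```c
#include <assert.h>

static int gt_refs;       /* refs currently held against the gt */
static int pending_free;  /* deferred free_engines callbacks, each holding a ref */

/* flushing the deferred callbacks drops their refs */
static void rcu_barrier_model(void)
{
	gt_refs -= pending_free;
	pending_free = 0;
}

static int wait_for_suspend_model(void)
{
	rcu_barrier_model();             /* drain async frees first ... */
	return gt_refs == 0 ? 0 : -1;    /* ... then judge idleness */
}
```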



Re: [Intel-gfx] [PATCH v4 3/3] drm/i915/gt: Timeout when waiting for idle in suspending

2023-09-27 Thread Teres Alexis, Alan Previn
Thanks for taking the time to review this Tvrtko, replies inline below.

On Wed, 2023-09-27 at 10:02 +0100, Tvrtko Ursulin wrote:
> On 26/09/2023 20:05, Alan Previn wrote:
> > When suspending, add a timeout when calling
> > intel_gt_pm_wait_for_idle else if we have a lost
> > G2H event that holds a wakeref (which would be
> > indicative of a bug elsewhere in the driver),
> > driver will at least complete the suspend-resume
> > cycle, (albeit not hitting all the targets for
> > low power hw counters), instead of hanging in the kernel.
alan:snip

> >   {
> > +   int timeout_ms = CONFIG_DRM_I915_MAX_REQUEST_BUSYWAIT ? : 1;
> 
> CONFIG_DRM_I915_MAX_REQUEST_BUSYWAIT is in ns so assigning it to _ms is 
> a bit to arbitrary.
> 
> Why not the existing I915_GT_SUSPEND_IDLE_TIMEOUT for instance?
alan: good catch, my bad. However, I915_GT_SUSPEND_IDLE_TIMEOUT is only
half a second, which i feel is too quick. I guess i could create a new
#define or use a multiple of I915_GT_SUSPEND_IDLE_TIMEOUT?
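The unit mismatch being pointed out is just a ns-vs-ms conversion; a hedged sketch of what a correct derivation might look like (the fallback value is arbitrary here):

```c
#include <assert.h>

#define NSEC_PER_MSEC 1000000L

/*
 * Convert a nanosecond busywait tunable into a millisecond timeout,
 * rounding up and falling back when the config is zero/disabled.
 */
static long busywait_ns_to_timeout_ms(long cfg_ns, long fallback_ms)
{
	if (cfg_ns <= 0)
		return fallback_ms;
	return (cfg_ns + NSEC_PER_MSEC - 1) / NSEC_PER_MSEC;
}
```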

> > /*
> >  * On rare occasions, we've observed the fence completion trigger
> >  * free_engines asynchronously via rcu_call. Ensure those are done.
> > @@ -308,7 +309,10 @@ static void wait_for_suspend(struct intel_gt *gt)
> > intel_gt_retire_requests(gt);
> > }
> >   
> > -   intel_gt_pm_wait_for_idle(gt);
> > +   /* we are suspending, so we shouldn't be waiting forever */
> > +   if (intel_gt_pm_wait_timeout_for_idle(gt, timeout_ms) == -ETIMEDOUT)
> > +   gt_warn(gt, "bailing from %s after %d milisec timeout\n",
> > +   __func__, timeout_ms);
> 
> Does the timeout in intel_gt_pm_wait_timeout_for_idle always comes in 
> pair with the timeout first in intel_gt_wait_for_idle?
alan: Not necessarily... IIRC in nearly all cases the first wait call
completes in time (i.e. no more ongoing work) and the second wait only
waits if the last bit of work 'just finished'; then that second wait may
take a touch longer because of possible async gt-pm-put calls (which
appear in several places in the driver as part of regular runtime
operation, including request completion). NOTE: these are not high-end
workloads, so the
> 
> Also, is the timeout here hit from the intel_gt_suspend_prepare, 
> intel_gt_suspend_late, or can be both?
> 
> Main concern is that we need to be sure there are no possible 
> ill-effects, like letting the GPU/GuC scribble on some memory we 
> unmapped (or will unmap), having let the suspend continue after timing 
> out, and not perhaps doing the forced wedge like wait_for_suspend() does 
> on the existing timeout path.
alan: this will not happen because the held wakeref is never force-released
after the timeout - so what happens is the kernel would bail the suspend.
> 
> Would it be possible to handle the lost G2H events directly in the 
> respective component instead of here? Like apply the timeout during the 
> step which explicitly idles the CT for suspend (presumably that 
> exists?), and so cleanup from there once declared a lost event.
alan: So yes, we don't solve the problem with this patch - patch #2 is
addressing the root cause - and verification is still ongoing, because it's
hard to thoroughly test / reproduce (i.e. patch #2 could get re-rev'd).

The intent of this patch is to be informed that the gt-pm wait failed in
prep for suspending, so the kernel will eventually bail the suspend -
that's all. Without this timed-out version of gt-pm-wait, if we did have a
leaky gt-wakeref, the kernel would hang here forever and we would need a
serial console or ramoops to eventually discover the kernel's cpu-hung
error with a call-stack pointing to this location. So the goal here is to
help speed up the future debug process (if, let's say, some future code
change with another async gt-pm-put came along and caused new problems
after patch #2 resolved this one).

Recap: in both cases (original vs this patch), if we had a buggy gt-wakeref
leak, we don't get invalid guc-accesses; but without this patch we wait
forever, and with this patch we get some messages and eventually bail the
suspend.

alan:snip



Re: [PATCH v6 1/3] drm/i915/pxp/mtl: Update pxp-firmware response timeout

2023-09-17 Thread Teres Alexis, Alan Previn
On Sat, 2023-09-16 at 10:25 +0800, lkp wrote:
> Hi Alan,
> 
> kernel test robot noticed the following build errors:
> 
> [auto build test ERROR on cf1e91e884bb1113c653e654e9de1754fc1d4488]
> 
> aAll errors (new ones prefixed by >>):
> 
> 
alan:snip
alan: missed building with the PXP config after that last change below -
will re-rev this.

>drivers/gpu/drm/i915/pxp/intel_pxp_gsccs.c: In function 
> 'gsccs_send_message':
> > > drivers/gpu/drm/i915/pxp/intel_pxp_gsccs.c:115:52: error: 
> > > 'GSC_REPLY_LATENCY_MS' undeclared (first use in this function); did you 
> > > mean 'GSC_HECI_REPLY_LATENCY_MS'?
>  115 |
> GSC_REPLY_LATENCY_MS);
>  |
> ^~~~
>  |
> GSC_HECI_REPLY_LATENCY_MS
>drivers/gpu/drm/i915/pxp/intel_pxp_gsccs.c:115:52: note: each undeclared 
> identifier is reported only once for each function it appears in
> 




Re: [PATCH v3] drm/i915/pxp: Add drm_dbgs for critical PXP events.

2023-09-15 Thread Teres Alexis, Alan Previn
On Fri, 2023-09-15 at 13:15 -0700, Teres Alexis, Alan Previn wrote:
> Debugging PXP issues can't even begin without understanding precedding
> sequence of important events. Add drm_dbg into the most important PXP events.
> 
>  v3 : - move gt_dbg to after mutex block in function
> i915_gsc_proxy_component_bind. (Vivaik)
>  v2 : - remove __func__ since drm_dbg covers that (Jani).
>   - add timeout dbg of the restart from front-end (Alan).
> 
alan:snip

> +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_proxy.c
> @@ -322,6 +322,7 @@ static int i915_gsc_proxy_component_bind(struct device 
> *i915_kdev,
>   gsc->proxy.component = data;
>   gsc->proxy.component->mei_dev = mei_kdev;
>   mutex_unlock(>proxy.mutex);
> + gt_dbg(gt, "GSC proxy mei component bound\n");

forgot to include the RB from Vivaik (given on condition of fixing the
above hunk), from rev2 on dri-devel:
https://lists.freedesktop.org/archives/dri-devel/2023-September/422858.html :

Reviewed-by: Balasubrawmanian, Vivaik  


Re: [PATCH v5 3/3] drm/i915/lrc: User PXP contexts requires runalone bit in lrc

2023-09-15 Thread Teres Alexis, Alan Previn
On Sat, 2023-09-09 at 15:38 -0700, Teres Alexis, Alan Previn wrote:
> On Meteorlake onwards, HW specs require that all user contexts that
> run on render or compute engines and require PXP must enforce
> run-alone bit in lrc. Add this enforcement for protected contexts.
alan:snip
> 
> Signed-off-by: Alan Previn 
> @@ -860,6 +881,8 @@ static void init_common_regs(u32 * const regs,
>   if (GRAPHICS_VER(engine->i915) < 11)
>   ctl |= _MASKED_BIT_DISABLE(CTX_CTRL_ENGINE_CTX_SAVE_INHIBIT |
>  CTX_CTRL_RS_CTX_ENABLE);
> + if (ctx_needs_runalone(ce))
> + ctl |= _MASKED_BIT_ENABLE(BIT(7));
>   regs[CTX_CONTEXT_CONTROL] = ctl;
>  
>   regs[CTX_TIMESTAMP] = ce->stats.runtime.last;
alan: to align the intel-gfx ml: Vivaik reviewed this on dri-devel at
https://lists.freedesktop.org/archives/dri-devel/2023-September/422875.html
(snippet below); thus, i will re-rev this (with the other fixes in this series).

>> Can we please get the bit defined in intel_engine_regs.h with a define 
>> instead of a number identification?
>> 
>> Review completed conditional to the above fix.
>> 
>> Reviewed-by: Balasubrawmanian, Vivaik 
>>  
>> <mailto:vivaik.balasubrawmanian at intel.com>




Re: [PATCH v5 2/3] drm/i915/pxp/mtl: Update pxp-firmware packet size

2023-09-15 Thread Teres Alexis, Alan Previn
On Sat, 2023-09-09 at 15:38 -0700, Teres Alexis, Alan Previn wrote:
> Update the GSC-fw input/output HECI packet size to match
> updated internal fw specs.
> 
> Signed-off-by: Alan Previn 
> ---
>  drivers/gpu/drm/i915/pxp/intel_pxp_cmd_interface_43.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/pxp/intel_pxp_cmd_interface_43.h 
> b/drivers/gpu/drm/i915/pxp/intel_pxp_cmd_interface_43.h
> index 0165d38fbead..e017a7d952e9 100644
> --- a/drivers/gpu/drm/i915/pxp/intel_pxp_cmd_interface_43.h
> +++ b/drivers/gpu/drm/i915/pxp/intel_pxp_cmd_interface_43.h
> @@ -14,8 +14,8 @@
>  #define PXP43_CMDID_NEW_HUC_AUTH 0x003F /* MTL+ */
>  #define PXP43_CMDID_INIT_SESSION 0x0036
>  
> -/* PXP-Packet sizes for MTL's GSCCS-HECI instruction */
> -#define PXP43_MAX_HECI_INOUT_SIZE (SZ_32K)
> +/* PXP-Packet sizes for MTL's GSCCS-HECI instruction is spec'd at 65K before 
> page alignment*/
> +#define PXP43_MAX_HECI_INOUT_SIZE (PAGE_ALIGNED(SZ_64K + SZ_1K))
>  
>  /* PXP-Packet size for MTL's NEW_HUC_AUTH instruction */
>  #define PXP43_HUC_AUTH_INOUT_SIZE (SZ_4K)
Vivaik replied with an RB on dri-devel:
https://lists.freedesktop.org/archives/dri-devel/2023-September/422862.html
We connected offline and agreed that his RB can remain standing on
condition that i fix PAGE_ALIGNED -> PAGE_ALIGN.
Thanks Vivaik for reviewing.
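For reference, the difference between the two macros can be checked in userspace (macros re-derived here under a 4K-page assumption; the kernel's own definitions live in include/linux/mm.h):

```c
#include <assert.h>

#define PAGE_SIZE_SK 4096UL
/* PAGE_ALIGN rounds a size up to the next page boundary ... */
#define PAGE_ALIGN_SK(x)   (((x) + PAGE_SIZE_SK - 1) & ~(PAGE_SIZE_SK - 1))
/* ... while PAGE_ALIGNED is only a true/false predicate */
#define PAGE_ALIGNED_SK(x) (((x) & (PAGE_SIZE_SK - 1)) == 0)
#define SZ_64K_SK 0x10000UL
#define SZ_1K_SK  0x400UL
```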



Re: [PATCH v5 1/3] drm/i915/pxp/mtl: Update pxp-firmware response timeout

2023-09-15 Thread Teres Alexis, Alan Previn
On Sat, 2023-09-09 at 15:38 -0700, Teres Alexis, Alan Previn wrote:
> Update the max GSC-fw response time to match updated internal
> fw specs. Because this response time is an SLA on the firmware,
> not inclusive of i915->GuC->HW handoff latency, when submitting
> requests to the GSC fw via intel_gsc_uc_heci_cmd_submit helpers,
alan:snip
Vivaik replied with an RB on dri-devel:
https://lists.freedesktop.org/archives/dri-devel/2023-September/422861.html
We connected offline and agreed that his RB can remain standing on
condition that i fix PAGE_ALIGNED -> PAGE_ALIGN.
Thanks Vivaik for reviewing.


Re: [PATCH v5 1/3] drm/i915/pxp/mtl: Update pxp-firmware response timeout

2023-09-15 Thread Teres Alexis, Alan Previn
On Sat, 2023-09-09 at 15:38 -0700, Teres Alexis, Alan Previn wrote:
> Update the max GSC-fw response time to match updated internal
> fw specs. Because this response time is an SLA on the firmware,
> not inclusive of i915->GuC->HW handoff latency, when submitting
> requests to the GSC fw via intel_gsc_uc_heci_cmd_submit helpers,
> start the count after the request hits the GSC command streamer.
> Also, move GSC_REPLY_LATENCY_MS definition from pxp header to
> intel_gsc_uc_heci_cmd_submit.h since its for any GSC HECI packet.
> 
> Signed-off-by: Alan Previn 
> ---
>  .../i915/gt/uc/intel_gsc_uc_heci_cmd_submit.c | 20 +--
>  .../i915/gt/uc/intel_gsc_uc_heci_cmd_submit.h |  6 ++
>  drivers/gpu/drm/i915/pxp/intel_pxp_gsccs.h| 11 ++
>  3 files changed, 31 insertions(+), 6 deletions(-)
alan: snip


> index 09d3fbdad05a..5ae5c5d9608b 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.h
> @@ -12,6 +12,12 @@ struct i915_vma;
>  struct intel_context;
>  struct intel_gsc_uc;
>  
> +#define GSC_HECI_REPLY_LATENCY_MS 350
> +/*
> + * Max FW response time is 350ms, but this should be counted from the time 
> the
> + * command has hit the GSC-CS hardware, not the preceding handoff to GuC CTB.
> + */
alan: we continue to face timeout issues - so i am increasing this to ~500ms
to absorb other hw/sw system latencies. This also matches what the gsc-proxy
code was doing - so i can use the same macro for that other code path.



Re: [PATCH v5 2/3] drm/i915/pxp/mtl: Update pxp-firmware packet size

2023-09-15 Thread Teres Alexis, Alan Previn
On Sat, 2023-09-09 at 15:38 -0700, Teres Alexis, Alan Previn wrote:
> Update the GSC-fw input/output HECI packet size to match
> updated internal fw specs.
> 
> Signed-off-by: Alan Previn 
> 
alan:snip

> -/* PXP-Packet sizes for MTL's GSCCS-HECI instruction */
> -#define PXP43_MAX_HECI_INOUT_SIZE (SZ_32K)
> +/* PXP-Packet sizes for MTL's GSCCS-HECI instruction is spec'd at 65K before 
> page alignment*/
> +#define PXP43_MAX_HECI_INOUT_SIZE (PAGE_ALIGNED(SZ_64K + SZ_1K))
alan: silly ctrl-c/v bug on my part - should be PAGE_ALIGN, not ALIGNED
>  
>  /* PXP-Packet size for MTL's NEW_HUC_AUTH instruction */
>  #define PXP43_HUC_AUTH_INOUT_SIZE (SZ_4K)



Re: [PATCH v1 1/1] drm/i915/pxp: Add drm_dbgs for critical PXP events.

2023-09-13 Thread Teres Alexis, Alan Previn
On Mon, 2023-09-11 at 12:26 +0300, Jani Nikula wrote:
> On Wed, 06 Sep 2023, Alan Previn  wrote:
> > Debugging PXP issues can't even begin without understanding precedding
> > sequence of events. Add drm_dbg into the most important PXP events.
> > 
> > Signed-off-by: Alan Previn 
alan:snip

> 
> > +
> > +   drm_dbg(>ctrl_gt->i915->drm, "PXP: %s invoked", __func__);
> 
> drm_dbg already covers __func__ (via __builtin_return_address(0) in
> __drm_dev_dbg), it's redundant.
> 
> Ditto for all added debugs below.

My bad - yup - will fix them.
Thanks for taking time to review this patch.
...alan
> 



Re: [PATCH v4 2/3] drm/i915/pxp/mtl: Update pxp-firmware packet size

2023-09-06 Thread Teres Alexis, Alan Previn
On Wed, 2023-09-06 at 17:15 -0700, Teres Alexis, Alan Previn wrote:
> Update the GSC-fw input/output HECI packet size to match
> updated internal fw specs.
alan:snip

> +++ b/drivers/gpu/drm/i915/pxp/intel_pxp_cmd_interface_43.h
> @@ -14,8 +14,8 @@
> 


> +/* PXP-Packet sizes for MTL's GSCCS-HECI instruction is 65K*/
> +#define PXP43_MAX_HECI_INOUT_SIZE (SZ_64K + SZ_1K)
alan: my mistake - i didn't fix this before posting - it should have been
PAGE_ALIGN(65K)




Re: [PATCH v2 2/3] drm/i915/guc: Close deregister-context race against CT-loss

2023-08-28 Thread Teres Alexis, Alan Previn
Additional update from the most recent testing.

When relying solely on guc_lrc_desc_unpin getting a failure from
deregister_context as the means for identifying that we are in the
"deregister-context-vs-suspend-late" race, it is too late a location to
handle this safely. This is because one of the first things
destroyed_worker_func does is take a gt pm wakeref - which triggers the
gt_unpark function, which does a whole bunch of other flows including
triggering more workers and taking additional refs. That said, it's best
to not even call deregister_destroyed_contexts from the worker when
!intel_guc_is_ready (ct-is-disabled).

...alan


On Fri, 2023-08-25 at 11:54 -0700, Teres Alexis, Alan Previn wrote:
> just a follow up note-to-self:
> 
> On Tue, 2023-08-15 at 12:08 -0700, Teres Alexis, Alan Previn wrote:
> > On Tue, 2023-08-15 at 09:56 -0400, Vivi, Rodrigo wrote:
> > > On Mon, Aug 14, 2023 at 06:12:09PM -0700, Alan Previn wrote:
> > > > 
> [snip]
> 
> in guc_submission_send_busy_loop, we are incrementing the following
> that needs to be decremented if the function fails.
> 
> atomic_inc(>outstanding_submission_g2h);
> 
> also, it seems that even with thie unroll design - we are still
> leaking a wakeref elsewhere. this is despite a cleaner redesign of
> flows in function "guc_lrc_desc_unpin"
> (discussed earlier that wasnt very readible).
> 
> will re-rev today but will probably need more follow ups
> tracking that one more leaking gt-wakeref (one in thousands-cycles)
> but at least now we are not hanging mid-suspend.. we bail from suspend
> with useful kernel messages.
> 
> 
> 
> 



Re: [PATCH v2 2/3] drm/i915/guc: Close deregister-context race against CT-loss

2023-08-25 Thread Teres Alexis, Alan Previn
just a follow up note-to-self:

On Tue, 2023-08-15 at 12:08 -0700, Teres Alexis, Alan Previn wrote:
> On Tue, 2023-08-15 at 09:56 -0400, Vivi, Rodrigo wrote:
> > On Mon, Aug 14, 2023 at 06:12:09PM -0700, Alan Previn wrote:
> > > 
[snip]

in guc_submission_send_busy_loop, we are incrementing the following
that needs to be decremented if the function fails.

atomic_inc(>outstanding_submission_g2h);

also, it seems that even with the unroll design we are still leaking a
wakeref elsewhere - this is despite a cleaner redesign of the flows in
function "guc_lrc_desc_unpin" (discussed earlier as not being very
readable).

will re-rev today but will probably need more follow-ups tracking that one
more leaking gt-wakeref (one in thousands of cycles) - but at least now we
are not hanging mid-suspend; we bail from suspend with useful kernel
messages.
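The note-to-self above amounts to this unroll pattern (a sketch; the real counter is guc->outstanding_submission_g2h):

```c
#include <assert.h>

static int outstanding_g2h;

/*
 * The counter is bumped before the send because a successful H2G will
 * produce a G2H reply; on failure it must be unrolled, otherwise the
 * suspend path waits on a reply that will never arrive.
 */
static int send_busy_loop_model(int send_ok)
{
	outstanding_g2h++;
	if (!send_ok) {
		outstanding_g2h--;   /* unroll: no H2G went out */
		return -1;
	}
	return 0;
}
```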






Re: [PATCH v2 1/3] drm/i915/guc: Flush context destruction worker at suspend

2023-08-25 Thread Teres Alexis, Alan Previn
Thanks again Rodrigo for reviewing, and apologies for my tardy replies.
We are still testing on shipping platforms; these latest patches seem to
have reduced the frequency and solved the "system hangs" while suspending,
but it's still causing issues, so we continue to debug. (The issue is that
it's running thousands of cycles overnight on multiple systems, which
takes time.)

[snip]

> > @@ -693,6 +693,8 @@ void intel_uc_suspend(struct intel_uc *uc)
> > return;
> > }
> >  
> > +   intel_guc_submission_flush_work(guc);
> > +
> 
> what happens if a new job comes exactly here?
> This still sounds a bit racy, although this already looks
> much cleaner than the previous version.
alan: intel_uc_suspend is called from suspend-late or on init failure,
and the very next function, intel_guc_suspend, will soft-reset the guc
and eventually nuke it. This is the most appropriate "latest possible"
place to put this.
> 
> > with_intel_runtime_pm(_to_gt(uc)->i915->runtime_pm, wakeref) {
> > err = intel_guc_suspend(guc);
> > if (err)






Re: [PATCH v3 1/3] drm/i915/pxp/mtl: Update pxp-firmware response timeout

2023-08-15 Thread Teres Alexis, Alan Previn
On Tue, 2023-08-15 at 13:29 -0700, Teres Alexis, Alan Previn wrote:
> Update the max GSC-fw response time to match updated internal
> fw specs. Because this response time is an SLA on the firmware,
> not inclusive of i915->GuC->HW handoff latency, when submitting
> requests to the GSC fw via intel_gsc_uc_heci_cmd_submit_nonpriv,
> start the count after the request hits the GSC command streamer.
> 
> Signed-off-by: Alan Previn 
> ---
>  drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.c | 3 +++
>  drivers/gpu/drm/i915/pxp/intel_pxp_gsccs.h| 6 +++---
>  2 files changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.c
> index 89ed5ee9cded..ae45855594ac 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.c
> @@ -186,6 +186,9 @@ intel_gsc_uc_heci_cmd_submit_nonpriv(struct intel_gsc_uc 
> *gsc,
>   i915_request_add(rq);
>  
>   if (!err) {
> + if (wait_for(i915_request_started(rq), 200))
> + drm_dbg(_uc_to_gt(gsc)->i915->drm,
> + "Delay in gsc-heci-non-priv submission to 
> gsccs-hw");
alan: offline discussion with Daniele; Daniele provided the following
review comments: we should add this wait-check for both priv and non-priv,
but we should increase the timeout to be more than the guaranteed fw
response time of one other message (since we have a total of 2 contexts
that could be sending messages concurrently at any time, including this
one)... so maybe a timeout of the new GSC_REPLY_LATENCY_MS + 150.
More importantly, he highlighted the fact that we should wait for
request-started AND ensure there was no error in the request status.
[snip]
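Daniele's suggestion can be sketched as follows (a userspace model; the budget constant and the request fields are stand-ins for the i915 request state):

```c
#include <assert.h>
#include <errno.h>

struct req_model {
	int started_at;   /* tick at which the request hits the GSC-CS, -1 = never */
	int error;        /* nonzero if the request completed with an error */
};

/* wait until the request starts or errors out, within a bounded budget */
static int wait_started_or_error(const struct req_model *rq, int budget)
{
	int t;

	for (t = 0; t <= budget; t++) {
		if (rq->error)
			return rq->error;                 /* bail early on error */
		if (rq->started_at >= 0 && t >= rq->started_at)
			return 0;                         /* reached the streamer */
	}
	return -ETIMEDOUT;                                /* still queued: log it */
}
```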


Re: [PATCH v2 2/3] drm/i915/guc: Close deregister-context race against CT-loss

2023-08-15 Thread Teres Alexis, Alan Previn
On Tue, 2023-08-15 at 09:56 -0400, Vivi, Rodrigo wrote:
> On Mon, Aug 14, 2023 at 06:12:09PM -0700, Alan Previn wrote:
> > If we are at the end of suspend or very early in resume
> > its possible an async fence signal could lead us to the
> > execution of the context destruction worker (after the
> > prior worker flush).

[snip]
> 
> > @@ -3199,10 +3206,20 @@ static inline void guc_lrc_desc_unpin(struct 
> > intel_context *ce)
> > if (unlikely(disabled)) {
> > release_guc_id(guc, ce);
> > __guc_context_destroy(ce);
> > -   return;
> > +   return 0;
> > +   }
> > +
> > +   if (deregister_context(ce, ce->guc_id.id)) {
> > +   /* Seal race with concurrent suspend by unrolling */
> > +   spin_lock_irqsave(&ce->guc_state.lock, flags);
> > +   set_context_registered(ce);
> > +   clr_context_destroyed(ce);
> > +   intel_gt_pm_put(gt);
> 
> how sure we are this is not calling unbalanced puts?
alan: in this function (guc_lrc_desc_unpin), the summarized flow is:

check-status-stuff
if guc-is-not-disabled, take-pm, change ctx-state-bits
[implied else] if guc-is-disabled, scrub-ctx and return

thus deregister_context is only called if we didn't exit, i.e. when
guc-is-not-disabled, i.e. after the pm was taken.
> could we wrap this in some existent function to make this clear?
alan: yeah - not so readable as it is now - let me tweak this function and make it
cleaner

> 
> > +   spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > +   return -ENODEV;
> > }
> >  
> > -   deregister_context(ce, ce->guc_id.id);
> > +   return 0;
> >  }
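The unroll being discussed - take the pm ref and flip the context state bits first, then restore both if deregistration fails - keeps every get/put balanced. A minimal user-space sketch of just that pattern; the fake_* names are hypothetical, not the i915 symbols:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct fake_ctx {
	int pm_refcount;  /* models the GT wakeref count */
	bool registered;  /* models the context-registered state bit */
	bool destroyed;   /* models the context-destroyed state bit */
};

/* Models deregister_context(): fails when the CT link is down. */
static int fake_deregister(struct fake_ctx *ce, bool ct_link_up)
{
	return ct_link_up ? 0 : -ENODEV;
}

static int unpin_with_unroll(struct fake_ctx *ce, bool ct_link_up)
{
	ce->pm_refcount++;      /* models intel_gt_pm_get() */
	ce->registered = false; /* models clr_context_registered() */
	ce->destroyed = true;   /* models set_context_destroyed() */

	if (fake_deregister(ce, ct_link_up)) {
		/* Seal the race: restore the state bits, drop the ref we took. */
		ce->registered = true;
		ce->destroyed = false;
		ce->pm_refcount--; /* models intel_gt_pm_put() */
		return -ENODEV;
	}
	return 0;
}
```

Whichever branch is taken, the refcount delta is either exactly one held reference (success, released later by the deregistration completion) or zero (failure unrolled), which is the balance property the review is probing.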



Re: [PATCH v2 3/3] drm/i915/gt: Timeout when waiting for idle in suspending

2023-08-15 Thread Teres Alexis, Alan Previn
Thanks Rodrigo - agreed on everything below - will re-rev.

On Tue, 2023-08-15 at 09:51 -0400, Vivi, Rodrigo wrote:
> On Mon, Aug 14, 2023 at 06:12:10PM -0700, Alan Previn wrote:
> > When suspending, add a timeout when calling
> > intel_gt_pm_wait_for_idle else if we have a lost
> > G2H event that holds a wakeref (which would be
> > 
[snip]

> > @@ -301,7 +303,10 @@ static void wait_for_suspend(struct intel_gt *gt)
> > intel_gt_retire_requests(gt);
> > }
> >  
> > -   intel_gt_pm_wait_for_idle(gt);
> > +   /* we are suspending, so we shouldn't be waiting forever */
> > +   if (intel_gt_pm_wait_timeout_for_idle(gt, timeout_ms) == -ETIME)
> 
> you forgot to change the error code here..^
> 
> but maybe we don't even need this here and a simple
> if (intel_gt_pm_wait_timeout_for_idle(gt, timeout_ms)) should be enough
> since the error from the killable one is unlikely and the only place
> I error I could check on that path would be a catastrophic -ERESTARTSYS.
> 
> but up to you.
alan: my bad - I'll fix it - but i agree with not needing to check the failure 
type.
and I'll keep the error the same ("bailing from...")
[snip]

> > +static inline int intel_gt_pm_wait_timeout_for_idle(struct intel_gt *gt, int timeout_ms)
> > +{
> > +   return intel_wakeref_wait_for_idle(&gt->wakeref, timeout_ms);
> >  }
> 
> I was going to ask why you created a single use function here, but then I
> noticed the above one. So it makes sense.
> Then I was going to ask why in here you didn't use the same change of
> timeout = 0 in the existent function like you used below, but then I
> noticed that the above function is called in multiple places and the
> patch with this change is much cleaner and the function is static inline
> so your approach was good here.
alan: yes that was my exact reason - thus no impact across other callers.
[snip]


> Please add a documentation for this function making sure you have the 
> following
> mentions:
alan: good catch -will do.

> 
> /**
> [snip]
> * @timeout_ms: Timeout in ms, 0 means never timeout.
> *
> * Returns 0 on success, -ETIMEDOUT upon a timeout, or the unlikely
> * error propagation from wait_var_event_killable if timeout_ms is 0.
> */
> 
> with the return error fixed above and the documentation in place:
> 
> Reviewed-by: Rodrigo Vivi 
> 
> > -int intel_wakeref_wait_for_idle(struct intel_wakeref *wf)
> > +int intel_wakeref_wait_for_idle(struct intel_wakeref *wf, int timeout_ms)
> >  {
> > -   int err;
> > 
[snip]


Re: [PATCH v2 0/3] Resolve suspend-resume racing with GuC destroy-context-worker

2023-08-14 Thread Teres Alexis, Alan Previn
On Mon, 2023-08-14 at 18:12 -0700, Teres Alexis, Alan Previn wrote:
> This series is the result of debugging issues root caused to
> races between the GuC's destroyed_worker_func being triggered
> vs repeating suspend-resume cycles with concurrent delayed
> fence signals for engine-freeing.
alan: forgot credit:
Tested-by: Mousumi Jana 

alan:snip.
> 
> 
> Alan Previn (3):
>   drm/i915/guc: Flush context destruction worker at suspend
>   drm/i915/guc: Close deregister-context race against CT-loss
>   drm/i915/gt: Timeout when waiting for idle in suspending
> 
>  drivers/gpu/drm/i915/gt/intel_engine_cs.c |  2 +-
>  drivers/gpu/drm/i915/gt/intel_gt_pm.c |  7 ++-
>  drivers/gpu/drm/i915/gt/intel_gt_pm.h |  7 ++-
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 45 +--
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.h |  2 +
>  drivers/gpu/drm/i915/gt/uc/intel_uc.c |  2 +
>  drivers/gpu/drm/i915/intel_wakeref.c  | 14 --
>  drivers/gpu/drm/i915/intel_wakeref.h  |  5 ++-
>  8 files changed, 71 insertions(+), 13 deletions(-)
> 
> 
> base-commit: 85f20fb339f05ec4221bb295c13e46061c5c566f



Re: [Intel-gfx] [PATCH v1 1/3] drm/i915/guc: Flush context destruction worker at suspend

2023-08-14 Thread Teres Alexis, Alan Previn
> 

> > Rodrigo: And why here and not some upper layer? like in prepare
> alan: wait_for_suspend does both the checking for idle as well as the 
> potential
> wedging if guc or hw has hung at this late state. if i call from the upper
> layer before this wait_for_suspend, it might be too early because the
> wait-for-idle could experience workloads completing and new
> context-deregistrations
> being queued. If i call it from upper layer after wait_for_suspend, then it 
> would
> be unnecessary if wait_for_suspend decided to nuke everything. Hmmm.. but i 
> guess
> the latter could work too - since the nuke case would have emptied out the 
> link-list
> of that worker and so it would either run and do nothing or would not even be 
> queued.
> Would you rather i go that way? (i'll recheck with my team mates for 
> i-dotting/t-crossing
> discussion.

actually, after going up a layer, i realize the right place might be to insert
late stage worker-flushing into intel_uc_suspend (called from 
intel_gt_suspend_late)
which is also where the gsc worker is flushed. This will also mean we don't 
need to
create intel_uc_suspend_prepare for new plumbing.


Re: [PATCH v1] drm/i915/pxp/mtl: Update gsc-heci cmd size and timeout

2023-08-10 Thread Teres Alexis, Alan Previn
On Fri, 2023-07-07 at 11:34 -0700, Teres Alexis, Alan Previn wrote:
> Update the max GSC-HECI packet size and the max firmware
> response timeout to match internal fw specs.
> 
> Signed-off-by: Alan Previn 

I'm going to re-rev this and change the subject slightly to "Update gsc-heci 
cmd submission
to align with spec". The two changes in this patch will be included but the 
cmd-packet size
is now to be increased to 64K. Also, the counting of the timeout needs to start 
from when
the request has hit the cmd streamer hardware (not from when the PXP subsystem
has thrown it
over the wall to the GuC). Although this latter change would imply a longer 
timeout period,
the change to observe this longer timeout should be applicable to code that is 
actually
triggering a PXP session creation/teardown.

In addition to that, we also need to update the LRC common defaults to include forcing
the runalone bit for PXP contexts (as that is the updated hardware spec'd
expectation).

...alan


 



Re: [PATCH v1 2/3] drm/i915/guc: Close deregister-context race against CT-loss

2023-08-09 Thread Teres Alexis, Alan Previn
On Wed, 2023-08-02 at 16:35 -0700, Teres Alexis, Alan Previn wrote:
> If we are at the end of suspend or very early in resume
> its possible an async fence signal could lead us to the
> execution of the context destruction worker (after the
> prior worker flush).
> 
alan:snip
>  
>  static void __guc_context_destroy(struct intel_context *ce)
> @@ -3270,7 +3287,20 @@ static void deregister_destroyed_contexts(struct 
> intel_guc *guc)
>   if (!ce)
>   break;
>  
> - guc_lrc_desc_unpin(ce);
> + if (guc_lrc_desc_unpin(ce)) {
> + /*
> +  * This means GuC's CT link severed mid-way which only happens
> +  * in suspend-resume corner cases. In this case, put the
> +  * context back into the destroyed_contexts list which will
> +  * get picked up on the next context deregistration event or
> +  * purged in a GuC sanitization event (reset/unload/wedged/...).
> +  */
> + spin_lock_irqsave(&guc->submission_state.lock, flags);
> + list_add_tail(&ce->destroyed_link,
> +   &guc->submission_state.destroyed_contexts);
alan: i completely missed the fact this new code is sitting within a
while (!list_empty(&guc->submission_state.destroyed_contexts)) block
so putting it back will cause it to while-loop forever.

will fix and rerev.

> + spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> + }
> +
>   }
>  }
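The hazard Alan flags above - re-appending a failed entry to the same list that a `while (!list_empty(...))` loop is draining - can be shown with a toy singly-linked list. One way to bound the loop is to park failures on a separate list; this is an illustrative sketch of the hazard and one workaround, not the fix that was actually merged:

```c
#include <assert.h>
#include <stddef.h>

struct node { struct node *next; int fails; };

/* Pop from the head of a NULL-terminated singly linked list. */
static struct node *pop(struct node **head)
{
	struct node *n = *head;
	if (n)
		*head = n->next;
	return n;
}

static void push(struct node **head, struct node *n)
{
	n->next = *head;
	*head = n;
}

/*
 * Drain *head, deregistering each node. Failed nodes are parked on
 * *failed instead of being re-added to *head: re-adding them to *head
 * would make the while-not-empty loop spin forever on the same entry.
 */
static int drain_destroyed(struct node **head, struct node **failed)
{
	int done = 0;
	struct node *n;

	while ((n = pop(head)) != NULL) {
		if (n->fails)
			push(failed, n); /* re-queue elsewhere, not on *head */
		else
			done++;
	}
	return done;
}
```

Each entry is visited exactly once per pass, so the loop terminates even when some deregistrations fail.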
>  



Re: [PATCH v1 1/3] drm/i915/guc: Flush context destruction worker at suspend

2023-08-09 Thread Teres Alexis, Alan Previn
Thanks Rodrigo for reviewing this.

On Mon, 2023-08-07 at 13:52 -0400, Vivi, Rodrigo wrote:
> On Wed, Aug 02, 2023 at 04:34:59PM -0700, Alan Previn wrote:
> > Suspend is not like reset, it can unroll, so we have to properly
> > flush pending context-guc-id deregistrations to complete before
> > we return from suspend calls.
> 
> But if is 'unrolls' the execution should just continue, no?!
> In other words, why is this flush needed? What happens if we
> don't flush, but resume doesn't proceed? in in which case
> of resume you are thinking that it returns and not having flushed?

alan: I apologize, i realize I should have explained it better.
The flush is needed for when we DON'T unroll. I wanted to point
out that in the intel_gt_suspend_* flows we don't actually know
if the suspend is going to unroll or not, so we should always flush
properly in that case. I will re-rev with a better comment on this patch.

alan:snip
> 
> >  
> >  static void wait_for_suspend(struct intel_gt *gt)
> >  {
> > -   if (!intel_gt_pm_is_awake(gt))
> > +   if (!intel_gt_pm_is_awake(gt)) {
> > +   intel_uc_suspend_prepare(>uc);
> 
> why only on idle?

alan: actually no - i am flushing for both idle and non-idle cases (see farther
below) but for the non-idle case, i am skipping the flush if we decide to wedge
(since the wedge will clear up everything -> all guc-contexts that are inflight
are nuked and freed).
> 
> Well, I know, if we are in idle it is because all the requests had
> already ended and gt will be wedged, but why do we need to do anything
> if we are in idle?

Because the issue is a very late context-deregister that
is triggered by an orthogonal thread via the following callstack:
drm-fence signal triggers a FENCE_FREE via __i915_sw_fence_notify that
connects to engines_notify -> free_engines_rcu -> (thread) ->
intel_context_put -> kref_put(&ce->ref, ...) that queues the
context-destruction worker. That said, even though the gt is idle because
the last workload has been retired, a context-destruction worker
may have just gotten queued.
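The timing described here - the GT already looks idle, yet a final ref-put from an unrelated fence-signal thread still queues the destruction worker - reduces to a tiny refcount model (fake_* names are hypothetical, not the i915 symbols):

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal user-space model: the last put on a context's refcount
 * fires a release action (here: queueing the destruction worker),
 * and that last put can arrive at any time from another thread. */
struct fake_ce {
	int refcount;
	bool destroy_worker_queued;
};

static void fake_intel_context_put(struct fake_ce *ce)
{
	if (--ce->refcount == 0)
		ce->destroy_worker_queued = true; /* models queue_work() */
}
```

This is why an idle GT is not enough on its own: the worker flush has to happen after the point where no more late puts can queue new work.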

> 
> And why here and not some upper layer? like in prepare
alan: wait_for_suspend does both the checking for idle as well as the potential
wedging if guc or hw has hung at this late state. if i call from the upper
layer before this wait_for_suspend, it might be too early because the
wait-for-idle could experience workloads completing and new
context-deregistrations
being queued. If i call it from upper layer after wait_for_suspend, then it 
would
be unnecessary if wait_for_suspend decided to nuke everything. Hmmm.. but i 
guess
the latter could work too - since the nuke case would have emptied out the 
link-list
of that worker and so it would either run and do nothing or would not even be 
queued.
Would you rather i go that way? (i'll recheck with my team mates for 
i-dotting/t-crossing
discussion.
> 
> > return;
> > +   }
> >  
> > if (intel_gt_wait_for_idle(gt, I915_GT_SUSPEND_IDLE_TIMEOUT) == -ETIME) {
> > /*
> > @@ -299,6 +301,8 @@ static void wait_for_suspend(struct intel_gt *gt)
> >  */
> > intel_gt_set_wedged(gt);
> > intel_gt_retire_requests(gt);
> > +   } else {
> > +   intel_uc_suspend_prepare(>uc);
> > }
> >  
> > intel_gt_pm_wait_for_idle(gt);
alan:snip


Re: [Intel-gfx] [PATCH v1 3/3] drm/i915/gt: Timeout when waiting for idle in suspending

2023-08-09 Thread Teres Alexis, Alan Previn
Thanks Rodrigo for reviewing this.

On Mon, 2023-08-07 at 13:56 -0400, Vivi, Rodrigo wrote:
> On Wed, Aug 02, 2023 at 04:35:01PM -0700, Alan Previn wrote:
> > When suspending, add a timeout when calling
> > intel_gt_pm_wait_for_idle else if we have a lost
> > G2H event that holds a wakeref (which would be
> > indicating of a bug elsewhere in the driver), we
> > get to complete the suspend-resume cycle, albeit
> > without all the lower power hw counters hitting
> > its targets, instead of hanging in the kernel.
> > 
alan:snip

> > -int intel_wakeref_wait_for_idle(struct intel_wakeref *wf)
> > +int intel_wakeref_wait_for_idle(struct intel_wakeref *wf, int timeout_ms)
> >  {
> > -   int err;
> > +   int err = 0;
> >  
> > might_sleep();
> >  
> > -   err = wait_var_event_killable(&wf->wakeref,
> > - !intel_wakeref_is_active(wf));
> > +   if (!timeout_ms)
> > +   err = wait_var_event_killable(&wf->wakeref,
> > + !intel_wakeref_is_active(wf));
> > +   else if (wait_var_event_timeout(&wf->wakeref,
> > +   !intel_wakeref_is_active(wf),
> > +   msecs_to_jiffies(timeout_ms)) < 1)
> > +   err = -ETIME;
> 
> it looks to me that -ETIMEDOUT would be a better error.
alan: okay - will do, thanks.
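The return-value mapping in the hunk above can be sketched in a few lines: the kernel's wait_var_event_timeout() returns the remaining time (>= 1) when the condition became true and 0 on timeout, so "< 1" maps to an error code, and per the review that code becomes -ETIMEDOUT rather than -ETIME. This is a hypothetical user-space model, not the real intel_wakeref code:

```c
#include <assert.h>
#include <errno.h>

/*
 * timeout_ms == 0 models the untimed killable wait (assumed to succeed
 * here); otherwise timed_wait_result models the value returned by
 * wait_var_event_timeout(): remaining jiffies, or 0 on timeout.
 */
static int wakeref_wait_sketch(int timeout_ms, long timed_wait_result)
{
	int err = 0;

	if (!timeout_ms)
		err = 0;
	else if (timed_wait_result < 1)
		err = -ETIMEDOUT;
	return err;
}
```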



Re: [PATCH v4 1/1] drm/i915/pxp: Optimize GET_PARAM:PXP_STATUS

2023-08-03 Thread Teres Alexis, Alan Previn
On Wed, 2023-08-02 at 11:25 -0700, Teres Alexis, Alan Previn wrote:
> After recent discussions with Mesa folks, it was requested
> that we optimize i915's GET_PARAM for the PXP_STATUS without
> changing the UAPI spec.
> 
> Add these additional optimizations:
>- If any PXP initialization flow failed, then ensure that
>  we catch it so that we can change the returned PXP_STATUS
>  from "2" (i.e. 'PXP is supported but not yet ready')
>  to "-ENODEV". This typically should not happen and if it
>  does, we have a platform configuration issue.
>- If a PXP arbitration session creation event failed
>  due to incorrect firmware version or blocking SOC fusing
>  or blocking BIOS configuration (platform reasons that won't
>  change if we retry), then reflect that blockage by also
>  returning -ENODEV in the GET_PARAM:PXP_STATUS.
>- GET_PARAM:PXP_STATUS should not wait at all if PXP is
>  supported but non-i915 dependencies (component-driver /
>  firmware) we are still pending to complete the init flows.
>  In this case, just return "2" immediately (i.e. 'PXP is
>  supported but not yet ready').
> 
> Difference from prior revs:
>   v3: - Rebase with latest tip that has pulled in setting the
> gsc fw load to fail if proxy init fails.
>   v2: - Use a #define for the default readiness timeout (Vivaik).
>   - Improve comments around the failing of proxy-init.
>   v1: - Change the commit msg style to be imperative. (Jani)
>   - Rename timeout to timeout_ms. (Jani)
>   - Fix is_fw_err_platform_config to use higher order
> param (pxp) first. (Jani)
> 
> Signed-off-by: Alan Previn 

alan: Daniele pointed out that i removed Vivaik's RB from rev-3.
The difference between this rev and rev-3 is that a hunk of code got removed
and went into a different patch (that reused the same code, is
higher priority - CI related, and had to go first). So this rev
is identical to rev3 except a hunk that has been removed.
I've checked with Vivaik and he is good with keeping his R-b
since the rest of the patch is the same. Thus repasting his
R-b from rev3 (Thanks Daniele and Vivaik):

Reviewed-by: Balasubrawmanian, Vivaik 


Re: [PATCH v1 0/3] Resolve suspend-resume racing with GuC destroy-context-worker

2023-08-02 Thread Teres Alexis, Alan Previn
On Wed, 2023-08-02 at 16:34 -0700, Teres Alexis, Alan Previn wrote:
> This series is the result of debugging issues root caused to
> races between the GuC's destroyed_worker_func being triggered vs
> repeating suspend-resume cycles with concurrent delayed
> fence signals for engine-freeing.
> 
> The reproduction steps require that an app is created right before
> the start of the suspend cycle where it creates a new gem
> context and submits a tiny workload that would complete in the
> middle of the suspend cycle. However this app uses dma-buffer
> sharing or dma-fence with non-GPU objects or signals that
> eventually triggers a FENCE_FREE via __i915_sw_fence_notify that
> connects to engines_notify -> free_engines_rcu ->
> intel_context_put -> kref_put(&ce->ref, ...) that queues the
> worker after the GuCs CTB has been disabled (i.e. after
> i915-gem's suspend-late flows).
> 
As an FYI - in offline conversations with John and Daniele, we have agreed
that at least the first two patches in this series are necessary improvements
but the last patch may remain open as further offline debug is continuing
to pin down the src of the above fence-signal-flow. For now we are hoping
to proceed with reviewing the first two patches and only look into the 3rd
patch if there are system level fence signalling that truly can trigger
this anomaly or if it's just a straddling request somewhere within i915
that has appeared or hung at the wrong time which needs to be fixed.

alan:snip


Re: [PATCH] drm/i915/pxp/mtl: intel_pxp_init_hw needs runtime-pm inside pm-complete

2023-08-02 Thread Teres Alexis, Alan Previn
> > 
> > 
alan:snip

Thanks Vinay and Daniele - i'll respin with below fix.


> > @@ -48,7 +50,8 @@ void intel_pxp_resume_complete(struct intel_pxp *pxp)
> > if (!HAS_ENGINE(pxp->ctrl_gt, GSC0) && !pxp->pxp_component)
> > return;
> >   
> > -   intel_pxp_init_hw(pxp);
> > +   with_intel_runtime_pm(&pxp->ctrl_gt->i915->runtime_pm, wakeref)
> 
> This is called from within the rpm resume path, so you can't do an rpm 
> get or it will deadlock. Maybe have:
> 
> __pxp_resume_complete(struct intel_pxp *pxp, bool needs_rpm);
> 
> intel_pxp_resume_complete(..)
> {
>      return __pxp_resume_complete(pxp, true);
> }
> 
> intel_pxp_runtime_resume(..)
> {
>      return __pxp_resume_complete(pxp, false);
> }
> 
> 
> or something like that.
> Daniele



Re: [PATCH v6] drm/i915/selftest/gsc: Ensure GSC Proxy init completes before selftests

2023-07-20 Thread Teres Alexis, Alan Previn
On Thu, 2023-07-20 at 14:52 -0700, Ceraolo Spurio, Daniele wrote:
> 
> On 7/20/2023 2:40 PM, Alan Previn wrote:
> > On MTL, if the GSC Proxy init flows haven't completed, submissions to the
> > GSC engine will fail. Those init flows are dependent on the mei's
> > gsc_proxy component that is loaded in parallel with i915 and a
> > worker that could potentially start after i915 driver init is done.
> > 
> > That said, all subsystems that access the GSC engine today do check
> > for such init flow completion before using the GSC engine. However,
> > selftests currently don't wait on anything before starting.
> > 
> > To fix this, add a waiter function at the start of __run_selftests
> > that waits for gsc-proxy init flows to complete. Selftests shouldn't
> > care if the proxy-init failed as that should be flagged elsewhere.
> > 
> > Difference from prior versions:
> > v6: - Add a helper that returns something more than a boolean
> >   so we selftest can stop waiting if proxy-init hadn't
> >   completed but failed (Daniele).

alan:snip
> 

> > +int intel_gsc_uc_fw_proxy_get_status(struct intel_gsc_uc *gsc)
> > +{
> > +   if (!(IS_ENABLED(CONFIG_INTEL_MEI_GSC_PROXY)))
> > +   return -ENODEV;
> > +   if (!intel_uc_fw_is_loadable(&gsc->fw))
> > +   return -ENODEV;
> > +   if (__intel_uc_fw_status(&gsc->fw) == INTEL_UC_FIRMWARE_LOAD_FAIL)
> 
> You're missing the change to move the FW status to LOAD_FAIL if the 
> proxy fails to initialize. Or are you expecting 
> https://patchwork.freedesktop.org/series/118723/, which included that 
> change, to be merged first?
> 
> Daniele

alan: as per our offline sync, I'll respin this one and move it away from the
other patch (since this is more critical) and we can respin the other after
this is done so we get a smooth merge. Also, as i move that "change fw status
to fail" from that PXP patch to this patch, I'll fix that issue where i missed
the 2nd failure point in the proxy init flow.

Thanks for your help. :)



Re: [Intel-gfx] [PATCH v4] drm/i915/selftest/gsc: Ensure GSC Proxy init completes before selftests

2023-07-12 Thread Teres Alexis, Alan Previn
On Wed, 2023-07-12 at 10:19 +0100, Tvrtko Ursulin wrote:
> On 11/07/2023 23:02, Alan Previn wrote:
> > On MTL, if the GSC Proxy init flows haven't completed, submissions to the
> > GSC engine will fail. Those init flows are dependent on the mei's
> > gsc_proxy component that is loaded in parallel with i915 and a
> > worker that could potentially start after i915 driver init is done.
> > 
> > That said, all subsystems that access the GSC engine today do check
> > for such init flow completion before using the GSC engine. However,
> > selftests currently don't wait on anything before starting.
> > 
> > 
> > 
alan:snip

> > +   /*
> > +* The gsc proxy component depends on the kernel component driver load ordering
> > +* and in corner cases (the first time after an IFWI flash), init-completion
> > +* firmware flows take longer.
> > +*/
> > +   unsigned long timeout_ms = 8000;
> > +
> > +   if (need_to_wait &&
> > +   (wait_for(intel_gsc_uc_fw_proxy_init_done(&i915->media_gt->uc.gsc, true),
> > +   timeout_ms)))
> > +   pr_info(DRIVER_NAME "Timed out waiting for gsc_proxy_completion!\n");
> 
> Would it make sense to error out here? Or at least upgrade to pr_warn or 
> something?
alan: agree on pr_warn (especially since need_to_wait being true and we tried
for 8 secs - this should never happen).

> 
> I didn't quite understand the points Daniele raised about engine loops 
> and resets - in my mind GSC engine is this special thing exercised for 
> highly specialized operations and not touched in random for_each_engine 
> loop tests, but I also did not really look so might be totally wrong.

alan: after consulting with Daniele further, the concern in the case of
having gsc-proxy-init mid-execution while other selftests
are running (when thinking of tests that have nothing to do with GSC
but have indirect effects like memory-pressure, engine-resets,
etc) is that the proxy-init will bail out and print an error, and
that would result in CI reporting a false negative. We can't skip
that error since CONFIG_INTEL_MEI_GSC_PROXY tells us that the kernel
owner wants GSC-PROXY working.

> 
> In any case, v4 reads clear - no confusing comments and not 
> over-engineered so is acceptable to me.
> 
alan: thanks Tvrtko.


> P.S. Maybe the check *could* be moved to i915_live_selftests, where hw 
> dependencies conceptually fit better, and maybe i915_perf_selftests 
> would need it too then (?), but it is up to you.
alan: i can do this quickly as a rev5 (after a bit of manual check if perf 
needs it)

> 
> Maybe even in the array selftests/i915_live_selftests.h if we could add 
> a facility to make unskippable tests and add this one after the sanity 
> check. Which would then achieve the same generalized thing you had in 
> the previous version without needing to add a new array/loop.
alan: i rather not attempt this as part of the current patch but will
consider it as a separate patch if you are okay with that?



Re: [Intel-gfx] [PATCH v3] drm/i915/selftest/gsc: Ensure GSC Proxy init completes before selftests

2023-07-11 Thread Teres Alexis, Alan Previn
On Tue, 2023-07-11 at 11:49 -0700, Ceraolo Spurio, Daniele wrote:
> 
> > > > @@ -134,6 +193,8 @@ static int __run_selftests(const char *name,
> > > >{
> > > > int err = 0;
> > > >
> > > > +   __wait_on_all_system_dependencies(data);
> > > Why does this need to be top level selftests and not just a wait for
> > > intel_gsc_uc_fw_proxy_init_done in the tests where it is relevant, via
> > > some helper or something?
> > Alan: it was an offline decision because we didn't want to repeat
> > the same check for all permutations of selftests' subtests (i.e. considering
> > module params can dictate to skip some subtests but execute others).
> > 
> > Anyways, let me get back to you on how many selftests' subtests
> > actually exercise the
> > need for proxy-init to complete - if it's just 1-to-2 subtests I'll remove
> > the code
> > from here and move them into the individual subtests.
> 
> I don't think it is going to be easy to figure out which selftest are 
> impacted. All selftests looping on all engines of course, but also tests 
> triggering GT resets and/or messing with the system in other ways. Any 
> new tests added will also need to be evaluated.
> 
> IMO there is minimal impact of having this check on every test. When 
> running selftests we load i915 after the rest of the system has already 
> fully booted, so there are no delays in getting the mei component up and 
> therefore proxy init is sometimes completed even before the selftest 
> code starts; when we do have to wait, it's usually for a very short 
> time, because the expected total execution time for the GSC worker when 
> not having to wait for the mei component to load is ~750ms (~200ms for 
> GSC load + 20ms for HuC auth + ~500ms for proxy init). Having a few 
> seconds added to the total selftests runtime is IMO a better option that 
> having to maintain a list of impacted tests.
> 
> Daniele
> 

Thanks Daniele - I completely forgot about reset or other system disruptive 
tests.
For now I'll re-rev to address Tvrtko's other comments but will keep the waiter
as 'once-top-down' for now and wait for Tvrtko's thoughts on that next rev.
...alan


Re: [Intel-gfx] [PATCH v3] drm/i915/selftest/gsc: Ensure GSC Proxy init completes before selftests

2023-07-11 Thread Teres Alexis, Alan Previn
Thanks for reviewing, Tvrtko; below are my responses.
I'll rerev without generalized func ptr and only for the subtests that need it.
...alan

On Thu, 2023-06-29 at 22:44 +0100, Tvrtko Ursulin wrote:
> On 29/06/2023 21:42, Alan Previn wrote:
> > On MTL, if the GSC Proxy init flows haven't completed, submissions to the
> > GSC engine will fail. Those init flows are dependent on the mei's
> > gsc_proxy component that is loaded in parallel with i915 and a
> > worker that could potentially start after i915 driver init is done.
> > 
> > That said, all subsystems that access the GSC engine today do check
> > for such init flow completion before using the GSC engine. However,
> > selftests currently don't wait on anything before starting.

alan:snip
> > +static int
> > +__wait_gsc_proxy_completed(struct drm_i915_private *i915,
> > +  unsigned long timeout_ms)
> > +{
> > +   bool need_to_wait = (IS_ENABLED(CONFIG_INTEL_MEI_GSC_PROXY) &&
> > +i915->media_gt &&
> > +HAS_ENGINE(i915->media_gt, GSC0) &&
> > +intel_uc_fw_is_loadable(&i915->media_gt->uc.gsc.fw));
> > +
> > +   /*
> > +* For gsc proxy component loading + init, we need a much longer timeout
> > +* than what CI selftest infrastructure currently uses. This longer wait
> > +* period depends on the kernel config and component driver load ordering
> > +*/
> 
> How is a CI timeout value relevant?
> 
> Plus from the commit message it sounds like the point of wait is so 
> submission to gsc does not fail if loading is still in progress, not 
> that the CI times out. So what is the main problem?

Alan: The comment was meant to explain why we override the CI selftest timeout 
(an input param
to the generalized func ptr loop) to something much larger specially for 
gsc-proxy-waiting.
However, since your other review comment below is to remove the generalization, 
this comment
therefore will not make sense so I'll remove it accordingly. The point was that 
CI might
have a system level selftest timeout of something much smaller like 500 
milisecs (used to
have some control over the execution time), but for the gsc-proxy waiting, it's
not in i915's
control but depends on the kernel component driver loading flow (and in rare 
occasions, after
a fresh IFWI was flashed which causes a 1-time longer period for fw-proxy flows 
to complete).
In any case, I'll remove the comment as per your direction.

> 
> > +   if (timeout_ms < 8000)
> > +   timeout_ms = 8000;
> > +
> 

alan:snip
> > +static int __wait_on_all_system_dependencies(struct drm_i915_private *i915)
> > +{
> > +   struct __startup_waiter *waiter = all_startup_waiters;
> > +   int count = ARRAY_SIZE(all_startup_waiters);
> > +   int ret;
> > +
> > +   if (!waiter || !count || !i915)
> > +   return 0;
> 
> Ugh.
> 
> If it ever becomes an empty array just zap this whole code and not have 
> these checks.
Alan: Okay sure - will remove these check except the i915 - see below.
> 
> Also, no i915 is a possibility?
Alan: i915_mock_selftests passes in NULL for i915. This checking of the i915
aligns with the existing __run_selftests code - but in that function the
param is called "data" even though in all callers of __run_selftests, that
"data" is actually i915 when its not null.
> 
> But actually.. please don't add the function table generalization unless 
> it is already known something else is coming to be plugged into it.
Alan: Okay- I'll remove it.

alan:snip
> 

> > @@ -134,6 +193,8 @@ static int __run_selftests(const char *name,
> >   {
> > int err = 0;
> >   
> > +   __wait_on_all_system_dependencies(data);
> 
> Why does this need to be top level selftests and not just a wait for 
> intel_gsc_uc_fw_proxy_init_done in the tests where it is relevant, via 
> some helper or something?
Alan: it was an offline decision because we didn't want to repeat
the same check for all permutations of selftests' subtests (i.e. considering
module params can dictate to skip some subtests but execute others).

Anyways, let me get back to you on how many selftests' subtests actually
exercise the
need for proxy-init to complete - if it's just 1-to-2 subtests I'll remove
the code
from here and move them into the individual subtests.

alan:snip


Re: [Intel-gfx] [PATCH] drm/i915/pxp: Optimize GET_PARAM:PXP_STATUS

2023-06-29 Thread Teres Alexis, Alan Previn
On Tue, 2023-06-20 at 09:30 -0500, Balasubrawmanian, Vivaik wrote:
> On 6/1/2023 12:45 PM, Alan Previn wrote:
> > After recent discussions with Mesa folks, it was requested
> > that we optimize i915's GET_PARAM for the PXP_STATUS without
> > changing the UAPI spec.
> > 
> > This patch adds this additional optimizations:
> > - If any PXP initialization flow failed, then ensure that
> >   we catch it so that we can change the returned PXP_STATUS
> >   from "2" (i.e. 'PXP is supported but not yet ready')
> >   to "-ENODEV". This typically should not happen and if it
> >   does, we have a platform configuration issue.
> > - If a PXP arbitration session creation event failed
> >   due to incorrect firmware version or blocking SOC fusing
> >   or blocking BIOS configuration (platform reasons that won't
> >   change if we retry), then reflect that blockage by also
> >   returning -ENODEV in the GET_PARAM-PXP_STATUS.
> > - GET_PARAM:PXP_STATUS should not wait at all if PXP is
> >   supported but non-i915 dependencies (component-driver /
> >   firmware) we are still pending to complete the init flows.
> >   In this case, just return "2" immediately (i.e. 'PXP is
> >   supported but not yet ready').
> > 
> > Signed-off-by: Alan Previn 
> > ---
> >   drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c  | 11 +-
> >   drivers/gpu/drm/i915/i915_getparam.c   |  2 +-
> >   drivers/gpu/drm/i915/pxp/intel_pxp.c   | 25 ++
> >   drivers/gpu/drm/i915/pxp/intel_pxp.h   |  2 +-
> >   drivers/gpu/drm/i915/pxp/intel_pxp_gsccs.c |  7 +++---
> >   drivers/gpu/drm/i915/pxp/intel_pxp_tee.c   |  7 +++---
> >   drivers/gpu/drm/i915/pxp/intel_pxp_types.h |  9 
> >   7 files changed, 50 insertions(+), 13 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c
> > index fb0984f875f9..4dd744c96a37 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c
> > @@ -42,8 +42,17 @@ static void gsc_work(struct work_struct *work)
> > }
> >   
> > ret = intel_gsc_proxy_request_handler(gsc);
> > -   if (ret)
> > +   if (ret) {
> > +   if (actions & GSC_ACTION_FW_LOAD) {
> > +   /*
> > +* a proxy request failure that came together 
> > with the
> > +* firmware load action means the last part of 
> > init has
> > +* failed so GSC fw won't be usable after this
> > +*/
> > +   intel_uc_fw_change_status(&gsc->fw, 
> > INTEL_UC_FIRMWARE_LOAD_FAIL);
> > +   }
> > goto out_put;
> > +   }
> >   
> > /* mark the GSC FW init as done the first time we run this */
> > if (actions & GSC_ACTION_FW_LOAD) {
> 
> On the huc authentication comment block above this snippet, the last 
> statement: "Note that we can only do the GSC auth if the GuC auth was" 
> is confusing as the code below is only dealing with HuC Authentication.
alan: i believe what he meant was "can only do the GSC-based auth if
the GuC-based auth"... but I can't change that code as part
of this patch - I believe the rule for kernel patches is to ensure each
single patch is modular (not mixing unrelated changes) and focuses just
on what it's described to do. IIRC, we would need to create a separate
patch review for that change.

> 
> This function seems to have a section to deal with FW load action and 
> another to deal with SW Proxy requests, but we seem to be mixing both 
> actions in the SW proxy section. instead, can we move this call to 
> intel_gsc_proxy_request_handler to the FW load section itself instead of 
> handling it as an additional check in the SW_proxy section? In the same 
> vein, we should also move the intel_uc_fw_change_status() call into the 
> above FW Load action section. i think that way the code reads better.
alan: GSC_ACTION_FW_LOAD is used for loading the GSC firmware which is a
one-time thing per i915 load. However, GSC_ACTION_SW_PROXY events can happen
any time the GSC fw needs to communicate with CSE firmware (or vice versa)
due to platform events that may have not been triggered by i915 long after
init. However, the rule is that after the GSC FW is loaded, i915 is required
to do a 1-time proxy-init step to prime both the GSC and CSE fws so that proxy
comms are available. Without this step, we can't use the gsc-fw for other ops.

So to recap the rules:
1. we launch the worker to do the one-time the GSC firmware load.
2. after the GSC firmware load is successful, we have to do a one-time SW-proxy 
init.
-> this is why we add the GSC_ACTION_SW_PROXY flag upon successful load 
completion.
3. If we are doing proxy-handling for the very first time, we ensure
   -> FW status is only set to RUNNING if proxy init was successful.
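The recap above can be sketched as a tiny state model. The GSC_ACTION_* flag names mirror i915's, but the helper functions and the fw_state enum are hypothetical, for illustration only:

```c
#define GSC_ACTION_FW_LOAD  (1u << 0)
#define GSC_ACTION_SW_PROXY (1u << 1)

enum fw_state { FW_LOADED, FW_RUNNING, FW_FAILED };

/* rule 2: a successful one-time FW load chains the one-time proxy init */
unsigned int gsc_work_actions(unsigned int actions, int load_ok)
{
	if ((actions & GSC_ACTION_FW_LOAD) && load_ok)
		actions |= GSC_ACTION_SW_PROXY;
	return actions;
}

/* rule 3: only a successful proxy init marks the firmware usable */
enum fw_state proxy_result(unsigned int actions, int proxy_ok)
{
	if (!(actions & GSC_ACTION_SW_PROXY))
		return FW_LOADED;
	return proxy_ok ? FW_RUNNING : FW_FAILED;
}
```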

Re: [PATCH v3] drm/xe/guc: Fix h2g_write usage of GUC_CTB_MSG_MAX_LEN

2023-06-29 Thread Teres Alexis, Alan Previn
On Wed, 2023-06-28 at 21:44 +, Brost, Matthew wrote:
> On Wed, Jun 28, 2023 at 11:17:18AM -0700, Alan Previn wrote:
> > In the ABI header, GUC_CTB_MSG_MIN_LEN is '1' because
> > GUC_CTB_HDR_LEN is 1. This aligns with H2G/G2H CTB specification
> > where all command formats are defined in units of dwords so that '1'
> > is a dword. Accordingly, GUC_CTB_MSG_MAX_LEN is 256-1 (i.e. 255
> > dwords). However, h2g_write was incorrectly assuming that
> > GUC_CTB_MSG_MAX_LEN was in bytes. Fix this.
> 
alan:snip
> > diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
> > index 22bc9ce846db..aa04b5c4822f 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> > @@ -401,19 +401,21 @@ static int h2g_write(struct xe_guc_ct *ct, const u32 
> > *action, u32 len,
> >  {
> > struct xe_device *xe = ct_to_xe(ct);
> > struct guc_ctb *h2g = &ct->ctbs.h2g;
> > -   u32 cmd[GUC_CTB_MSG_MAX_LEN / sizeof(u32)];
> > -   u32 cmd_len = len + GUC_CTB_HDR_LEN;
> > -   u32 cmd_idx = 0, i;
> > +#define H2G_CT_HEADERS (GUC_CTB_HDR_LEN + 1) /* one DW CTB header and one 
> > DW HxG header */
> 
> Hate to nit pick but again this should be above the h2g_write per
> feedback from Oden on Xe in general.
> 
> Otherwise LGTM.
> 
> With the nit addressed:
> 
> Reviewed-by: Matthew Brost 

Thanks for reviewing. My bad on the #define - you mentioned that before.
Will fix that now.
...alan
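To make the dword-vs-byte distinction in the fix above concrete, here is a minimal sketch; the ABI constant names are the real ones from the thread, the helpers are hypothetical:

```c
#include <stdint.h>
#include <stddef.h>

/* The CTB ABI defines message lengths in dwords: the header is 1 dword and
 * the max message is 256 - 1 = 255 dwords. Treating GUC_CTB_MSG_MAX_LEN as
 * a byte count (as the old cmd[GUC_CTB_MSG_MAX_LEN / sizeof(u32)] buffer
 * did) yields a 63-dword buffer instead of the required 255. */
#define GUC_CTB_HDR_LEN     1u            /* dwords */
#define GUC_CTB_MSG_MIN_LEN GUC_CTB_HDR_LEN
#define GUC_CTB_MSG_MAX_LEN (256u - 1u)   /* dwords, not bytes */

size_t ctb_msg_max_bytes(void)
{
	return GUC_CTB_MSG_MAX_LEN * sizeof(uint32_t); /* correct: 255 * 4 */
}

size_t buggy_buf_dwords(void)
{
	return GUC_CTB_MSG_MAX_LEN / sizeof(uint32_t); /* 63: too small */
}
```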


Re: [PATCH v2 3/5] drm/i915/mtl/gsc: query the GSC FW for its compatibility version

2023-06-08 Thread Teres Alexis, Alan Previn
On Mon, 2023-06-05 at 19:24 -0700, Ceraolo Spurio, Daniele wrote:
> The compatibility version is queried via an MKHI command. Right now, the
> only existing interface is 1.0
> This is basically the interface version for the GSC FW, so the plan is
> to use it as the main tracked version, including for the binary naming
> in the fetch code.
> 
> v2: use define for the object size (Alan)
> 
> Cc: Alan Previn 
> Cc: John Harrison 
> Reviewed-by: Alan Previn  #v1


Diff'd v1 vs v2 of this patch - differences are good. So my R'B still holds:
Reviewed-by: Alan Previn 

> ---



Re: [PATCH v1] drm/i915/gsc: take a wakeref for the proxy-init-completion check

2023-06-08 Thread Teres Alexis, Alan Previn
On Thu, 2023-06-08 at 11:19 -0700, Ceraolo Spurio, Daniele wrote:
> On 6/8/2023 11:04 AM, Alan Previn wrote:
> > Ensure intel_gsc_uc_fw_init_done and intel_gsc_uc_fw_proxy_init
> > takes a wakeref before reading GSC Shim registers.
alan:snip
> 
> >   bool intel_gsc_uc_fw_proxy_init_done(struct intel_gsc_uc *gsc)
> >   {
> > struct intel_uncore *uncore = gsc_uc_to_gt(gsc)->uncore;
> > -   u32 fw_status = intel_uncore_read(uncore, GSC_FW_STATUS_REG);
> > +   intel_wakeref_t wakeref;
> > +   u32 fw_status;
> > +
> > +   with_intel_runtime_pm(uncore->rpm, wakeref)
> > +   fw_status = intel_uncore_read(uncore, GSC_FW_STATUS_REG);
> 
> I think this could be moved to an helper (gsc_uc_get_fw_status?), so we 
> don't have to re-do the wakeref in all the callers.
alan: thanks - will fix.



Re: [Intel-xe] [v1] drm/xe/guc: Fix h2g_write usage of GUC_CTB_MSG_MAX_LEN

2023-06-08 Thread Teres Alexis, Alan Previn
On Thu, 2023-06-08 at 00:12 +, Teres Alexis, Alan Previn wrote:
> On Fri, 2023-06-02 at 11:52 -0700, Alan Previn wrote:
> 
> alan: good point - i will go back and find if we have this value internally 
> spec'd before we continue.
alan: actually, even if the spec allowed very large cmds (as it's internally 
spec'd to allow that), we shouldn't consume the stack like that - will fix it 
that way.



Re: [PATCH v2 2/5] drm/i915/mtl/gsc: extract release and security versions from the gsc binary

2023-06-08 Thread Teres Alexis, Alan Previn
Everything looks good to me, so
Reviewed-by: Alan Previn 

On Mon, 2023-06-05 at 19:23 -0700, Ceraolo Spurio, Daniele wrote:
> The release and security versions of the GSC binary are not used at
> runtime to decide interface compatibility (there is a separate version
> for that), but they're still useful for debug, so it is still worth
> extracting them and printing them out in dmesg.
alan:snip

> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_binary_headers.h 
> b/drivers/gpu/drm/i915/gt/uc/intel_gsc_binary_headers.h
> index 714f0c256118..ad80afcafd23 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_binary_headers.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_binary_headers.h
alan:snip


> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c
alan:snip


> @@ -42,6 +43,157 @@ bool intel_gsc_uc_fw_init_done(struct intel_gsc_uc *gsc)
alan:snip

> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.h 
> b/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.h
alan:snip


> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h 
> b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h
alan:snip

> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_huc_fw.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_huc_fw.c
alan:snip

> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c
alan:snip

> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.h 
> b/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.h
alan:snip



>  /*
> @@ -289,6 +290,8 @@ static inline u32 intel_uc_fw_get_upload_size(struct 
> intel_uc_fw *uc_fw)
alan:snip


Re: [v2] drm/i915/selftest/gsc: Ensure GSC Proxy init completes before selftests

2023-06-08 Thread Teres Alexis, Alan Previn
On Thu, 2023-06-08 at 18:14 +, Dong, Zhanjun wrote:
> See my comments below.
> 
> > -Original Message-
> > From: Alan Previn 
alan:snip

> > +static int
> > +__wait_gsc_proxy_completed(struct drm_i915_private *i915,
> > +  unsigned long timeout_ms)
> > +{
> > +   bool need_to_wait = (IS_ENABLED(CONFIG_INTEL_MEI_GSC_PROXY)
> > &&
> > +i915->media_gt &&
> > +HAS_ENGINE(i915->media_gt, GSC0) &&
> > +intel_uc_fw_is_loadable(&i915->media_gt->uc.gsc.fw));
> > +
> > +   /*
> > +* For gsc proxy component loading + init, we need a much longer
> > timeout
> > +* than what CI selftest infrastructure currently uses. This longer wait
> > +* period depends on the kernel config and component driver load
> > ordering
> > +*/
> > +   if (timeout_ms < 8000)
> > +   timeout_ms = 8000;
> 
> 
> Lgtm, just an concern about the fixed number here, shall we set the minimal 
> here, or let i915_selftest.timeout_ms take control? Thus no longer need 
> coding change here in the future.
> 
> Reviewed-by: Zhanjun Dong 

Thanks Zhanjun, unfortunately, based on internal testing, the
i915_selftest.timeout_ms default is low enough that it does
occasionally time out in CI. From experience, with a lean
ubuntu config, it typically
takes about ~1 second for the mei-gsc-sw-proxy component driver to load after 
i915 loads.
Since CI regularly unloads and reloads i915, the timeout observed ends up being 
reported as an issue.

8 seconds was based on internal testing of the worst case scenario - which 
hardly ever happens.
We've only seen the 8 seconds happen when the kernel config has configs enabled 
for very many SOC IP
drivers and component drivers (seen in at least one customer config) or if the 
MTL board IFWI was only
just reflashed (this would be a one-off 8 seconds, we suspect due to the 
firmware doing additional steps)
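The timeout policy described above is trivial but worth stating precisely - honor the caller's (CI's) selftest timeout, never waiting less than the empirically observed worst case. A sketch, with a hypothetical helper name:

```c
/* Clamp the selftest wait to at least the ~8-second worst case observed
 * for the mei-gsc component driver to load and complete proxy init. */
unsigned long gsc_proxy_wait_ms(unsigned long timeout_ms)
{
	return timeout_ms < 8000 ? 8000 : timeout_ms;
}
```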




Re: [v1] drm/xe/guc: Fix h2g_write usage of GUC_CTB_MSG_MAX_LEN

2023-06-07 Thread Teres Alexis, Alan Previn
On Fri, 2023-06-02 at 11:52 -0700, Alan Previn wrote:
> In the ABI header, GUC_CTB_MSG_MIN_LEN is '1' because
> GUC_CTB_HDR_LEN is 1. This aligns with H2G/G2H CTB specification
> where all command formats are defined in units of dwords so that '1'
> is a dword. Accordingly, GUC_CTB_MSG_MAX_LEN is 256-1 (i.e. 255
> dwords). However, h2g_write was incorrectly assuming that
> GUC_CTB_MSG_MAX_LEN was in bytes. Fix this.
> 
> 
alan:snip

> The patch itself make sense but concerned about the value of
> GUC_CTB_MSG_MAX_LEN. Is the max size of CTB really 256 DWs? That seems
> way too large. Also I think some tools are going to complain with the
> stack being this large too.
> 
> Matt

alan: good point - i will go back and find if we have this value internally 
spec'd before we continue.


Re: [PATCH v3] drm/i915/mtl/gsc: Add a gsc_info debugfs

2023-06-07 Thread Teres Alexis, Alan Previn
On Mon, 2023-06-05 at 21:32 -0700, Ceraolo Spurio, Daniele wrote:
> Add a new debugfs to dump information about the GSC. This includes:
> 
> - the FW path and SW tracking status;
> - the release, security and compatibility versions;
> - the HECI1 status registers.
> 
> Note that those are the same registers that the mei driver dumps in
> their own status sysfs on DG2 (where mei owns the GSC).
> 
alan:snip


all looks good. (ofc we do have to follow up on those 2 actions from last rev's
conversation (that we agreed should be separate patch). For now, this patch is

Reviewed-by: Alan Previn 



Re: [PATCH v2 1/5] drm/i915/gsc: fixes and updates for GSC memory allocation

2023-06-07 Thread Teres Alexis, Alan Previn
On Mon, 2023-06-05 at 19:23 -0700, Ceraolo Spurio, Daniele wrote:
> A few fixes/updates are required around the GSC memory allocation and it
> is easier to do them all at the same time. The changes are as follows:
> 
> 1 - Switch the memory allocation to stolen memory. We need to avoid
> accesses from GSC FW to normal memory after the suspend function has
> completed and to do so we can either switch to using stolen or make sure
> the GSC is gone to sleep before the end of the suspend function. Given
> that the GSC waits for a bit before going idle even if there are no
> pending operations, it is easier and quicker to just use stolen memory.
> 
> 2 - Reduce the GSC allocation size to 4MBs, which is the POR requirement.
> The 8MBs were needed only for early FW and I had misunderstood that as
> being the expected POR size when I sent the original patch.
> 
> 3 - Perma-map the GSC allocation. This isn't required immediately, but it
> will be needed later to be able to quickly extract the GSC logs, which are
> inside the allocation. Since the mapping code needs to be rewritten due to
> switching to stolen, it makes sense to do the switch immediately to avoid
> having to change it again later.
> 
> Note that the explicit setting of CACHE_NONE for Wa_22016122933 has been
> dropped because that's the default setting for stolen memory on !LLC
> platforms.
> 
> v2: only memset the memory we're not overwriting (Alan)
> 
LGTM so ..
Reviewed-by: Alan Previn 


Re: [PATCH 2/6] drm/i915/uc/gsc: fixes and updates for GSC memory allocation

2023-06-05 Thread Teres Alexis, Alan Previn
On Tue, 2023-05-23 at 08:21 -0700, Ceraolo Spurio, Daniele wrote:
> 
> > 
> > > +static int gsc_allocate_and_map_vma(struct intel_gsc_uc *gsc, u32 size)
> > alan:snip
> > > + obj = i915_gem_object_create_stolen(gt->i915, size);
> > > + if (IS_ERR(obj))
> > > + return PTR_ERR(obj);
> > > +
> > > + vma = i915_gem_object_ggtt_pin(obj, NULL, 0, 0, 0);
> > alan: should we be passing in the PIN_MAPPABLE flag into the last param?
> 
> No, PIN_MAPPABLE is only for legacy platform that used the aperture BAR 
> for stolen mem access via GGTT. MTL doesn't have it and stolen is 
> directly accessible via the LMEM BAR (which is actually the same BAR 2, 
> but now behaves differently).

alan: thanks - sounds good  - i forgot about those differentiations

> 
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h
> > > @@ -18,6 +18,7 @@ struct intel_gsc_uc {
> > >   
> > >   /* GSC-specific additions */
> > >   struct i915_vma *local; /* private memory for GSC usage */
> > > + void __iomem *local_vaddr; /* pointer to access the private memory */
> > alan:nit: relooking at the these variable names that originate from
> > last year's patch you worked on introducing gsc_uc, i am wondering now
> > if we should rename "local" to "privmem" and local_vaddr becomes 
> > privmem_vaddr.
> > (no significant reason other than improving readability of the code)
> 
> IIRC I used local because one of the GSC docs referred to it that way. I 
> don't mind the renaming, but I don't think it should be done as part of 
> this patch.
alan: sure - sounds good.




Re: [PATCH 3/6] drm/i915/uc/gsc: extract release and security versions from the gsc binary

2023-06-05 Thread Teres Alexis, Alan Previn
On Fri, 2023-05-26 at 18:27 -0700, Ceraolo Spurio, Daniele wrote:
> 
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_meu_headers.h 
> > > b/drivers/gpu/drm/i915/gt/uc/intel_gsc_meu_headers.h
> > > index d55a66202576..8bce2b8aed84 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_meu_headers.h
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_meu_headers.h
alan:snip

> > alan: structure layout seems unnecessarily repetitive... why not ->
> > struct partition_info {
> > u32 offset;
> > u32 size;
> > };
> > struct intel_gsc_layout_pointers {
> > u8 rom_bypass_vector[16];
> > ...
> > struct partition_info datap;
> > struct partition_info bootregion[5];
> > struct partition_info trace;
> > }__packed;
> > 
> 
> I've just realized that I didn't reply to this comment. The specs have 
> the structure defined that way, so I tried to keep a 1:1 match like we 
> usually do. I think switching to a partition_info structure is ok, but 
> I'll avoid the array because the GSC partition are 1-based, which could 
> cause confusion (i.e. partition boot1 would be bootregion[0]).
alan: sure - that's fine - let's stick to align with the spec definitions



Re: [PATCH 6/6] drm/i915/uc/gsc: Add a gsc_info debugfs

2023-06-05 Thread Teres Alexis, Alan Previn
On Wed, 2023-05-31 at 17:25 -0700, Ceraolo Spurio, Daniele wrote:
> 
> On 5/26/2023 3:57 PM, Teres Alexis, Alan Previn wrote:
> > On Fri, 2023-05-05 at 09:04 -0700, Ceraolo Spurio, Daniele wrote:
> > > Add a new debugfs to dump information about the GSC. This includes:
> > alan:snip
> > Actually everything looks good except for a couple of questions + asks - 
> > hope we can close on this patch in next rev.
> > 
> > > - the FW path and SW tracking status;
> > > - the release, security and compatibility versions;
> > > - the HECI1 status registers.
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c 
> > > b/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c
> > > index 0b6dcd982b14..3014e982aab2 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c
> > > @@ -12,36 +12,31 @@
> > >   #include "intel_gsc_fw.h"
> > >   #include "intel_gsc_meu_headers.h"
> > >   #include "intel_gsc_uc_heci_cmd_submit.h"
> > > -
> > > -#define GSC_FW_STATUS_REG_MMIO(0x116C40)
> > > -#define GSC_FW_CURRENT_STATE REG_GENMASK(3, 0)
> > > -#define   GSC_FW_CURRENT_STATE_RESET 0
> > > -#define   GSC_FW_PROXY_STATE_NORMAL  5
> > > -#define GSC_FW_INIT_COMPLETE_BIT REG_BIT(9)
> > > +#include "i915_reg.h"
> > >   
> > alan:snip
> >   
> > alan: btw, just to be consistent with other top-level "intel_foo_is..." 
> > checking functions,
> > why don't we take the runtime wakeref inside the following functions and 
> > make it easier for any callers?
> > (just like what we do for "intel_huc_is_authenticated"):
> >  static bool gsc_is_in_reset(struct intel_uncore *uncore)
> >  bool intel_gsc_uc_fw_proxy_init_done(struct intel_gsc_uc *gsc)
> >  bool intel_gsc_uc_fw_init_done(struct intel_gsc_uc *gsc)
> 
> The idea was that we shouldn't check the FW status if we're not planning 
> to do something with it, in which case we should already have a wakeref. 
> HuC is a special case because userspace can query that when the HW is 
> idle. This said, I have nothing against adding an extra wakeref , but I 
> don't think it should be in this patch.
alan: i believe intel_pxp_gsccs_is_ready_for_sessions is being used in a
similar way, where one of its uses is to check huc-status and gsc-proxy
status without actually doing any operation. If you still want to keep it
different, then I'll have to update the PXP code. Or perhaps you could
help me fix that on the PXP side?

alan:snip
> > 

> > > +void intel_gsc_uc_load_status(struct intel_gsc_uc *gsc, struct 
> > > drm_printer *p)
> > > +{
> > > + struct intel_gt *gt = gsc_uc_to_gt(gsc);
> > > + struct intel_uncore *uncore = gt->uncore;
> > > + intel_wakeref_t wakeref;
> > > +
> > > + if (!intel_gsc_uc_is_supported(gsc)) {
> > alan: this was already checked in caller so we'll never get here. i think 
> > we should remove the check in the caller, let below msg appear.
> 
> I did the same as what we do for GuC and HuC. I'd prefer to be 
> consistent in behavior with those.
alan: okay - sounds good
> 
> > > + drm_printf(p, "GSC not supported\n");
> > > + return;
> > > + }
> > alan:snip

> > > + drm_printf(p, "HECI1 FWSTST%u = 0x%08x\n", i, status);
> > alan:nit: do you think we could add those additional shim regs? (seemed useful in 
> > recent offline debugs).
> 
> Agreed that it would be useful; I'll try to get a complete list from 
> arch and/or the GSC FW team. Are you ok if we go ahead with this in the 
> meantime?
alan: yes sure.
> 
> > 
> 



Re: [Intel-gfx] [PATCH] drm/i915/pxp: Optimize GET_PARAM:PXP_STATUS

2023-06-02 Thread Teres Alexis, Alan Previn
Thanks Jani - will rev this up and fix these.

On Fri, 2023-06-02 at 16:03 +0300, Jani Nikula wrote:
> On Thu, 01 Jun 2023, Alan Previn  wrote:
> > After recent discussions with Mesa folks, it was requested
> > that we optimize i915's GET_PARAM for the PXP_STATUS without
> > changing the UAPI spec.
> > 
> > This patch adds this additional optimizations:
> 
> Nitpick, please avoid "This patch". It's redundant, and after the patch
> gets applied it becomes a commit, not a patch.
> 
> Instead, use the imperative mood, e.g. "Add these additional
> optimizations".
> 
> See 
> https://docs.kernel.org/process/submitting-patches.html#describe-your-changes
alan:snip

> 
> > -int intel_pxp_get_readiness_status(struct intel_pxp *pxp)
> > +int intel_pxp_get_readiness_status(struct intel_pxp *pxp, int timeout)
> 
> It would help the reader if you named the parameter timeout_ms. Assuming
> the unit is ms.
alan:snip

> > -is_fw_err_platform_config(u32 type)
> > +is_fw_err_platform_config(u32 type, struct intel_pxp *pxp)
> 
> It's customary to have the parameters ordered from higher context to
> lower.
> 

alan:snip



Re: [PATCH v5 5/7] drm/i915/mtl/huc: auth HuC via GSC

2023-06-01 Thread Teres Alexis, Alan Previn
On Wed, 2023-05-31 at 16:54 -0700, Ceraolo Spurio, Daniele wrote:
> The full authentication via the GSC requires an heci packet submission
> to the GSC FW via the GSC CS. The GSC has new PXP command for this
> (literally called NEW_HUC_AUTH).
> The intel_huc_auth function is also updated to handle both authentication
> types.
> 
> 
alan:snip

> @@ -399,6 +416,9 @@ void intel_huc_fini(struct intel_huc *huc)
>*/
>   delayed_huc_load_fini(huc);
>  
> + if (huc->heci_pkt)
> + i915_vma_unpin_and_release(&huc->heci_pkt, 0);
alan: nit: i just realized that for consistency (init mirroring fini), we 
should be doing this heci release after the intel_uc_fw_fini below.
But since the below object isn't referencing the heci packet, this doesn't 
matter, so consider it a nit.

alan:snip
> @@ -470,31 +491,41 @@ int intel_huc_auth(struct intel_huc *huc)
>   if (!intel_uc_fw_is_loaded(>fw))
>   return -ENOEXEC;
>  
> - /* GSC will do the auth */
> + /* GSC will do the auth with the load */
>   if (intel_huc_is_loaded_by_gsc(huc))
>   return -ENODEV;
alan: nit: sorry for another late comment but merely a nit - wondering if we 
should add a warn in here (to catch
if we could end up here) but it's okay - this in theory shouldn't happen anyway.

alan:snip


> +++ b/drivers/gpu/drm/i915/gt/uc/intel_huc_fw.c
alan:snip



Only a couple of late-comer-nits that you can ignore, else LGTM:
Reviewed-by: Alan Previn 


Re: [Intel-gfx] [PATCH] drm/i915/pxp: use correct format string for size_t

2023-06-01 Thread Teres Alexis, Alan Previn
On Thu, 2023-06-01 at 23:36 +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> While 'unsigned long' needs the %ld format string, size_t needs the %z
> modifier:

alan:snip


> +++ b/drivers/gpu/drm/i915/pxp/intel_pxp_gsccs.c
> @@ -143,7 +143,7 @@ gsccs_send_message(struct intel_pxp *pxp,
>  
>   reply_size = header->message_size - sizeof(*header);
>   if (reply_size > msg_out_size_max) {
> - drm_warn(&i915->drm, "caller with insufficient PXP reply size 
> %u (%ld)\n",
> + drm_warn(&i915->drm, "caller with insufficient PXP reply size 
> %u (%zd)\n",
>reply_size, msg_out_size_max);
>   reply_size = msg_out_size_max;
>   }
Thanks Arnd for catching this, although i believe Nathan submitted a patch for 
the same fix yesterday and received an RB from Andi.


Re: [PATCH v3 5/7] drm/i915/mtl/huc: auth HuC via GSC

2023-05-30 Thread Teres Alexis, Alan Previn
On Fri, 2023-05-26 at 17:52 -0700, Ceraolo Spurio, Daniele wrote:
> The full authentication via the GSC requires an heci packet submission
> to the GSC FW via the GSC CS. The GSC has new PXP command for this
> (literally called NEW_HUC_AUTH).
> The intel_huc_auth function is also updated to handle both authentication
> types.
> 


alan:snip

> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c
> index b26f493f86fa..c659cc01f32f 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c
> @@ -29,13 +29,32 @@ static void gsc_work(struct work_struct *work)
>  
>   if (actions & GSC_ACTION_FW_LOAD) {
>   ret = intel_gsc_uc_fw_upload(gsc);
> - if (ret == -EEXIST) /* skip proxy if not a new load */
> - actions &= ~GSC_ACTION_FW_LOAD;
> - else if (ret)
> + if (!ret)
> + /* setup proxy on a new load */
> + actions |= GSC_ACTION_SW_PROXY;
> + else if (ret != -EEXIST)
>   goto out_put;
alan: I realize that this worker doesn't print error values that
come back from intel_gsc_uc_fw_upload - wonder if we should print
it before goto out_put.

> +
> + /*
> +  * The HuC auth can be done both before or after the proxy init;
> +  * if done after, a proxy request will be issued and must be
> +  * serviced before the authentication can complete.
> +  * Since this worker also handles proxy requests, we can't
> +  * perform an action that requires the proxy from within it and
> +  * then stall waiting for it, because we'd be blocking the
> +  * service path. Therefore, it is easier for us to load HuC
> +  * first and do proxy later. The GSC will ack the HuC auth and
> +  * then send the HuC proxy request as part of the proxy init
> +  * flow.
> +  * Note that we can only do the GSC auth if the GuC auth was
> +  * successful.
> +  */
> + if (intel_uc_uses_huc(&gt->uc) &&
> + intel_huc_is_authenticated(&gt->uc.huc, 
> INTEL_HUC_AUTH_BY_GUC))
> + intel_huc_auth(&gt->uc.huc, INTEL_HUC_AUTH_BY_GSC);
>   }
>  
> - if (actions & (GSC_ACTION_FW_LOAD | GSC_ACTION_SW_PROXY)) {
> + if (actions & GSC_ACTION_SW_PROXY) {
>   if (!intel_gsc_uc_fw_init_done(gsc)) {
>   gt_err(gt, "Proxy request received with GSC not 
> loaded!\n");
>   goto out_put;
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.c
> index 579c0f5a1438..0ad090304ca0 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.c
> 
alan:snip

> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_huc.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_huc.c
> index ab5246ae3979..5a4058d39550 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_huc.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_huc.c
> 

alan:snip

> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_huc_fw.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_huc_fw.c
> index d2b4176c81d6..8e538d639b05 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_huc_fw.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_huc_fw.c
> 
alan:snip


> +int intel_huc_fw_auth_via_gsccs(struct intel_huc *huc)
> +{
> + struct drm_i915_gem_object *obj;
> + void *pkt_vaddr;
> + u64 pkt_offset;
> + int retry = 5;
> + int err = 0;
> +
> + if (!huc->heci_pkt)
> + return -ENODEV;
> +
> + obj = huc->heci_pkt->obj;
> + pkt_offset = i915_ggtt_offset(huc->heci_pkt);
> +
> + pkt_vaddr = i915_gem_object_pin_map_unlocked(obj,
> +  
> i915_coherent_map_type(i915, obj, true));
> + if (IS_ERR(pkt_vaddr))
> + return PTR_ERR(pkt_vaddr);
> +
> + msg_in = pkt_vaddr;
> + msg_out = pkt_vaddr + SZ_4K;
this 4K*2 (4K for input and 4K for output) seems to be hardcoded in two 
different locations.
One is here in intel_huc_fw_auth_via_gsccs and the other is in 
intel_huc_init, which is
even in a different file. Perhaps it's better to use a #define next to the 
definition of
PXP43_CMDID_NEW_HUC_AUTH... maybe something like a "#define 
PXP43_NEW_HUC_AUTH_INOUT_PKT_SIZE (SZ_4K)"
> +
> + intel_gsc_uc_heci_cmd_emit_mtl_header(&msg_in->header,
> +   HECI_MEADDRESS_PXP,
> +   sizeof(*msg_in), 0);
> +
> + msg_in->huc_in.header.api_version = PXP_APIVER(4, 3);
> + msg_in->huc_in.header.command_id = PXP43_CMDID_NEW_HUC_AUTH;
> + msg_in->huc_in.header.status = 0;
> + msg_in->huc_in.header.buffer_len = sizeof(msg_in->huc_in) -
> +   

Re: [PATCH 6/6] drm/i915/uc/gsc: Add a gsc_info debugfs

2023-05-26 Thread Teres Alexis, Alan Previn
On Fri, 2023-05-05 at 09:04 -0700, Ceraolo Spurio, Daniele wrote:
> Add a new debugfs to dump information about the GSC. This includes:
alan:snip
Actually everything looks good except for a couple of questions + asks - hope 
we can close on this patch in next rev.

> 
> - the FW path and SW tracking status;
> - the release, security and compatibility versions;
> - the HECI1 status registers.
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c
> index 0b6dcd982b14..3014e982aab2 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c
> @@ -12,36 +12,31 @@
>  #include "intel_gsc_fw.h"
>  #include "intel_gsc_meu_headers.h"
>  #include "intel_gsc_uc_heci_cmd_submit.h"
> -
> -#define GSC_FW_STATUS_REG_MMIO(0x116C40)
> -#define GSC_FW_CURRENT_STATE REG_GENMASK(3, 0)
> -#define   GSC_FW_CURRENT_STATE_RESET 0
> -#define   GSC_FW_PROXY_STATE_NORMAL  5
> -#define GSC_FW_INIT_COMPLETE_BIT REG_BIT(9)
> +#include "i915_reg.h"
>  
alan:snip
 
alan: btw, just to be consistent with other top-level "intel_foo_is..." 
checking functions,
why don't we take the runtime wakeref inside the following functions and make 
it easier for any callers?
(just like what we do for "intel_huc_is_authenticated"):
static bool gsc_is_in_reset(struct intel_uncore *uncore)
bool intel_gsc_uc_fw_proxy_init_done(struct intel_gsc_uc *gsc)
bool intel_gsc_uc_fw_init_done(struct intel_gsc_uc *gsc)


> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c
> index 2ae693b01b49..5475e95d61c6 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c
> @@ -9,8 +9,9 @@
>  #include "gt/intel_gt_print.h"
>  #include "intel_gsc_uc.h"
alan: nit: not this patch's fault, alphabetically, intel_gsc_uc.h is after 
intel_gsc_proxy.h
>  #include "intel_gsc_fw.h"
> -#include "i915_drv.h"
>  #include "intel_gsc_proxy.h"
> +#include "i915_drv.h"
> +#include "i915_reg.h"
>  
>  static void gsc_work(struct work_struct *work)
>  {
> @@ -301,3 +302,46 @@ void intel_gsc_uc_load_start(struct intel_gsc_uc *gsc)
>   queue_work(gsc->wq, >work);
>  }
> +
alan: btw, why are we putting intel_gsc_uc_load_status in intel_gsc_uc.c if the 
only caller is gsc_uc's debugfs?
why not just make it a static in there? unless you plan to call it from 
"err_print_uc" - then can we add that in next rev?

> +void intel_gsc_uc_load_status(struct intel_gsc_uc *gsc, struct drm_printer 
> *p)
> +{
> + struct intel_gt *gt = gsc_uc_to_gt(gsc);
> + struct intel_uncore *uncore = gt->uncore;
> + intel_wakeref_t wakeref;
> +
> + if (!intel_gsc_uc_is_supported(gsc)) {
alan: this was already checked in caller so we'll never get here. i think we 
should remove the check in the caller, let below msg appear.
> + drm_printf(p, "GSC not supported\n");
> + return;
> + }
alan:snip



> + drm_printf(p, "HECI1 FWSTST%u = 0x%08x\n", i, status);
alan:nit: do you think we could add those additional shim regs? (seemed useful in 
recent offline debugs).
alan:snip

> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_debugfs.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_debugfs.c
> new file mode 100644
> index ..da9f96b72291
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_debugfs.c
> @@ -0,0 +1,38 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2020 Intel Corporation
alan:2023?

alan:snip

> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_debugfs.h 
> b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_debugfs.h
> new file mode 100644
> index ..c405e5574253
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_debugfs.h
> @@ -0,0 +1,14 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2020 Intel Corporation
alan:2023?
alan:snip

> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_huc.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_huc.c
alan:snip

 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc_debugfs.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_uc_debugfs.c
> index 2f93cc4e408a..6d541c866edb 100644
alan:snip





Re: [PATCH 5/6] drm/i915/uc/gsc: define gsc fw

2023-05-25 Thread Teres Alexis, Alan Previn
Considering the only request i have below is touching up of existing comments 
(as
far as this patch is concerned), and since the rest of the code looks good, 
here is
my R-b - but i hope you can answer my newbie question at the bottom:

Reviewed-by: Alan Previn 

On Fri, 2023-05-05 at 09:04 -0700, Ceraolo Spurio, Daniele wrote:
> Add FW definition and the matching override modparam.
> 
> The GSC FW has both a release version, based on platform and a rolling
> counter, and a compatibility version, which is the one tracking
> interface changes. Since what we care about is the interface, we use
> the compatibility version in the buinary names.
alan: s/buinary/binary

> 
> Same as with the GuC, a major version bump indicate a
> backward-incompatible change, while a minor version bump indicates a
> backward-compatible one, so we use only the former in the file name.
> 
> Signed-off-by: Daniele Ceraolo Spurio 
> Cc: Alan Previn 
> Cc: John Harrison 
> ---
>  drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c | 32 ++--
>  1 file changed, 24 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c
> index 36ee96c02d74..531cd172151d 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c
> @@ -124,6 +124,18 @@ void intel_uc_fw_change_status(struct intel_uc_fw *uc_fw,
>   fw_def(BROXTON,  0, huc_mmp(bxt,  2, 0, 0)) \
>   fw_def(SKYLAKE,  0, huc_mmp(skl,  2, 0, 0))
>  
> +/*
> + * The GSC FW has both a release version, based on platform and a rolling
> + * counter, and a compatibility version, which is the one tracking
> + * interface changes. Since what we care about is the interface, we use
> + * the compatibility version in the buinary names.
alan:s/buinary/binary
also, since we will (i hope) be adding several comments (alongside the new
version objects under intel_gsc_uc structure) in the patch #3 about what
their differences are and which one we care about and when they get populated,
perhaps we can minimize the information here and redirect to that other
comment... OR ... we can minimize the comments in patch #3 and redirect here
(it would be good to have a single location with a detailed explanation in the
comments and a redirect-ptr from the other location since a reader would
most likely stumble onto those questions from either of these locations).

> + * Same as with the GuC, a major version bump indicate a
> + * backward-incompatible change, while a minor version bump indicates a
> + * backward-compatible one, so we use only the former in the file name.
> + */
> +#define INTEL_GSC_FIRMWARE_DEFS(fw_def, gsc_def) \
> + fw_def(METEORLAKE,   0, gsc_def(mtl, 1, 0))
> +
>  /*
>  
> 
alan:snip

> @@ -257,14 +281,6 @@ __uc_fw_auto_select(struct drm_i915_private *i915, 
> struct intel_uc_fw *uc_fw)
>   int i;
>   bool found;
>  
> - /*
> -  * GSC FW support is still not fully in place, so we're not defining
> -  * the FW blob yet because we don't want the driver to attempt to load
> -  * it until we're ready for it.
> -  */
> - if (uc_fw->type == INTEL_UC_FW_TYPE_GSC)
> - return;
> -
alan: more of a newbie question from myself: considering this is a new firmware
we are adding, is there some kind of requirement to provide a link to the patch
targeting the linux firmware repo that is a dependency of this series?
or perhaps we should mention in the series that the merge will only happen after
that patch gets merged (with a final rev that includes the patch on
the fw-repo side?). Just trying to understand the process.


>   /*
>* The only difference between the ADL GuC FWs is the HWConfig support.
>* ADL-N does not support HWConfig, so we should use the same binary as



Re: [PATCH 4/6] drm/i915/uc/gsc: query the GSC FW for its compatibility version

2023-05-25 Thread Teres Alexis, Alan Previn
On Fri, 2023-05-05 at 09:04 -0700, Ceraolo Spurio, Daniele wrote:
> The compatibility version is queried via an MKHI command. Right now, the
> only existing interface is 1.0
> This is basically the interface version for the GSC FW, so the plan is
> to use it as the main tracked version, including for the binary naming
> in the fetch code.
> 
> Signed-off-by: Daniele Ceraolo Spurio 
> 

alan: just a couple of minor nits below.
One ask though, in line with the clarification we had over the offline
conversation:
I am wondering if we can document the fact that the file_selected.ver remains
as major-minor::zero-zero for the case of gsc until after the firmware is
loaded and we query via this function (which happens later at gt-late-init).
However, that comment might not belong here - perhaps it belongs in the prior
patch together with the other comment i requested (asking for additional
explanations about the different types of versions for gsc).

That said, for this patch, LGTM:
Reviewed-by: Alan Previn 

alan:snip
> +static int gsc_fw_query_compatibility_version(struct intel_gsc_uc *gsc)
> +{
> + struct intel_gt *gt = gsc_uc_to_gt(gsc);
> + struct mtl_gsc_ver_msg_in *msg_in;
> + struct mtl_gsc_ver_msg_out *msg_out;
> + struct i915_vma *vma;
> + u64 offset;
> + void *vaddr;
> + int err;
> +
> + err = intel_guc_allocate_and_map_vma(&gt->uc.guc, PAGE_SIZE * 2,
> +  &vma, &vaddr);
alan: nit: i'm assuming this code will be used for future discrete cards... if
so, perhaps we should also be using "SZ_4K * 2" above, since a different
host-cpu arch could have a different PAGE size - this way we'll be consistent
with exact-size allocations.
it's also more consistent within this function - maybe a
#define GSC_UC_GET_ABI_VER_PKT_SIZE SZ_4K
at the top of the function would be nice. either way, i consider this a nit.

> + if (err) {
> + gt_err(gt, "failed to allocate vma for GSC version query\n");
> + return err;
> + }

alan:snip

> +
>  int intel_gsc_uc_fw_upload(struct intel_gsc_uc *gsc)
>  {
>   struct intel_gt *gt = gsc_uc_to_gt(gsc);
> @@ -327,11 +406,21 @@ int intel_gsc_uc_fw_upload(struct intel_gsc_uc *gsc)
>   if (err)
>   goto fail;
>  
> + err = gsc_fw_query_compatibility_version(gsc);
> + if (err)
> + goto fail;
> +
> + /* we only support compatibility version 1.0 at the moment */
> + err = intel_uc_check_file_version(gsc_fw, NULL);
> + if (err)
> + goto fail;
> +
>   /* FW is not fully operational until we enable SW proxy */
>   intel_uc_fw_change_status(gsc_fw, INTEL_UC_FIRMWARE_TRANSFERRED);
>  
> - gt_info(gt, "Loaded GSC firmware %s (r%u.%u.%u.%u, svn%u)\n",
> + gt_info(gt, "Loaded GSC firmware %s (cv%u.%u, r%u.%u.%u.%u, svn %u)\n",
alan:nit "abi" instead of "cv"?
>   gsc_fw->file_selected.path,
> + gsc_fw->file_selected.ver.major, 
> gsc_fw->file_selected.ver.minor,
>   gsc->release.major, gsc->release.minor,
>   gsc->release.patch, gsc->release.build,
>   gsc->security_version);
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.h 
> b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.h
> index 8f199d5f963e..fb1453ed4ecf 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.h
alan:snip

> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c
> index cd8fc194f7fa..36ee96c02d74 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c
alan:snip

> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.h 
> b/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.h
> index 279244744d43..4406e7b48b27 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.h
alan:snip


Re: [PATCH 3/6] drm/i915/uc/gsc: extract release and security versions from the gsc binary

2023-05-25 Thread Teres Alexis, Alan Previn
On Thu, 2023-05-25 at 09:56 -0700, Ceraolo Spurio, Daniele wrote:
> On 5/24/2023 10:14 PM, Teres Alexis, Alan Previn wrote:
> > On Fri, 2023-05-05 at 09:04 -0700, Ceraolo Spurio, Daniele wrote:
alan:snip
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h
> > > @@ -17,6 +17,9 @@ struct intel_gsc_uc {
> > >   struct intel_uc_fw fw;
> > > 
> > >   /* GSC-specific additions */
> > > + struct intel_uc_fw_ver release;
> > > + u32 security_version;
> > alan: for consistency and less redundancy, can't we add "security_version"
> > into 'struct intel_uc_fw_ver' (which is zero for firmware that doesn't
> > have it). That way, intel_gsc_uc can re-use intel_uc_fw.file_selected
> > just like huc?
> 
> I'm not sure what you mean by re-using intel_uc_fw.file_selected. Is 
> that for the call from intel_uc_fw_version_from_meu_manifest? I'm 
> purposely not doing that. Note that the GSC has 3 versions:
> 
> Release version (incremented with each build and encoded in the header)
> Security version (also encoded in the header)
> Compatibility version (queried via message to the GSC)
> 
> The one we care about for communicating with the GSC is the last one, so 
> that's the one I stored in intel_uc_fw.file_selected (in the next 
> patch). The other 2  versions are not strictly required to use the GSC 
> and we only fetch them for debug purposes, so if something goes wrong we 
> know exactly what we've loaded.
> 
> Daniele
alan: okay thanks - seeing that now in the next patch... (and i also forgot that
the GSC release version doesn't reflect interface versioning in any way like GuC
does).
In that case, the above additional versions are fine. Would definitely love to
see additional comments under "GSC-specific additions" that explain those 3
versioning items and what we care about, as you have explained here.


Re: [PATCH 3/6] drm/i915/uc/gsc: extract release and security versions from the gsc binary

2023-05-24 Thread Teres Alexis, Alan Previn
On Fri, 2023-05-05 at 09:04 -0700, Ceraolo Spurio, Daniele wrote:


alan: snip


> +int intel_gsc_fw_get_binary_info(struct intel_uc_fw *gsc_fw, const void 
> *data, size_t size)
> +{
alan:snip
> + /*
> +  * The GSC binary starts with the pointer layout, which contains the
> +  * locations of the various partitions of the binary. The one we're
> +  * interested in to get the version is the boot1 partition, where we can
> +  * find a BPDT header followed by entries, one of which points to the
> +  * RBE sub-section of the partition. From here, we can parse the CPD
alan: nit: could we add the meaning of 'RBE', probably not here but in the 
header file where GSC_RBE is defined?
> +  * header and the following entries to find the manifest location
> +  * (entry identified by the "RBEP.man" name), from which we can finally
> +  * extract the version.
> +  *
> +  * --
> +  * [  intel_gsc_layout_pointers ]
> +  * [  ...   ]
> +  * [  boot1_offset  >---]--o
> +  * [  ...   ]  |
> +  * --  |
> +  * |
> +  * --  |
> +  * [  intel_gsc_bpdt_header ]<-o
> +  * --
alan: special thanks for the diagram - love these! :)
alan: snip

> + min_size = layout->boot1_offset + layout->boot1_offset > size;
alan: the latter term evaluates to a boolean, so this is boot1_offset + 0 or 1?
or is this a typo and it was supposed to be:
min_size = layout->boot1_offset;
actually, since we are accessing a bpdt_header hanging off that offset, it
should rather be:
min_size = layout->boot1_offset + sizeof(*bpdt_header);
> + if (size < min_size) {
> + gt_err(gt, "GSC FW too small for boot section! %zu < %zu\n",
> +size, min_size);
> + return -ENODATA;
> + }
> +
> + bpdt_header = data + layout->boot1_offset;
> + if (bpdt_header->signature != INTEL_GSC_BPDT_HEADER_SIGNATURE) {
> + gt_err(gt, "invalid signature for meu BPDT header: 0x%08x!\n",
> +bpdt_header->signature);
> + return -EINVAL;
> + }
> +
alan: IIUC, to be strict about the size-crawl-checking, we should check min_size
again - this time with the additional "bpdt_header->descriptor_count *
sizeof(*bpdt_entry)" (hope i got that right?) - adding that check before the
loop below that walks the (bpdt_entry++) list checking each ->type.
> + bpdt_entry = (void *)bpdt_header + sizeof(*bpdt_header);
> + for (i = 0; i < bpdt_header->descriptor_count; i++, bpdt_entry++) {
> + if ((bpdt_entry->type & INTEL_GSC_BPDT_ENTRY_TYPE_MASK) !=
> + INTEL_GSC_BPDT_ENTRY_TYPE_GSC_RBE)
> + continue;
> +
> + cpd_header = (void *)bpdt_header + 
> bpdt_entry->sub_partition_offset;
> + break;
> + }
> +
> + if (!cpd_header) {
> + gt_err(gt, "couldn't find CPD header in GSC binary!\n");
> + return -ENODATA;
> + }
alan: same as above - for size-crawl-checking, we should check min_size again
with the addition of sizeof(*cpd_header), no?
> +
> + if (cpd_header->header_marker != INTEL_GSC_CPD_HEADER_MARKER) {
> + gt_err(gt, "invalid marker for meu CPD header in GSC bin: 
> 0x%08x!\n",
> +cpd_header->header_marker);
> + return -EINVAL;
> + }
alan: and again here, the size-crawl check with cpd_header->num_of_entries *
sizeof(*cpd_entry)
> + cpd_entry = (void *)cpd_header + cpd_header->header_length;
> + for (i = 0; i < cpd_header->num_of_entries; i++, cpd_entry++) {
> + if (strcmp(cpd_entry->name, "RBEP.man") == 0) {
> + manifest = (void *)cpd_header + 
> cpd_entry_offset(cpd_entry);
> + intel_uc_fw_version_from_meu_manifest(>release,
> +   manifest);
> + gsc->security_version = manifest->security_version;
> + break;
> + }
> + }
> +
> + return 0;

alan:snip
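To make the "size-crawl" pattern from the review comments above concrete, here
is a self-contained sketch using simplified, hypothetical structs (not the
actual GSC BPDT/CPD layout): every time the parser follows a count or an offset
read out of the blob, it re-validates the running minimum size against the blob
size before dereferencing anything new.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* hypothetical simplified layout, for illustration only */
struct hdr { uint32_t magic; uint32_t count; };
struct entry { uint32_t type; uint32_t offset; };

static int parse_blob(const void *data, size_t size, uint32_t want_type)
{
	const struct hdr *hdr;
	const struct entry *e;
	size_t min_size;
	uint32_t i;

	/* step 1: the header itself must fit */
	min_size = sizeof(*hdr);
	if (size < min_size)
		return -1;
	hdr = data;

	/* step 2: re-check with the entry array the header advertises */
	min_size += (size_t)hdr->count * sizeof(*e);
	if (size < min_size)
		return -1;

	/* only now is it safe to walk the entries */
	e = (const struct entry *)(hdr + 1);
	for (i = 0; i < hdr->count; i++, e++)
		if (e->type == want_type)
			return 0;

	return -1; /* not found */
}
```

The same grow-then-check step would repeat for each nested structure the blob
points into (bpdt_header, its entries, cpd_header, its entries), which is what
the comments above are asking for.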

>  
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.h 
> b/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.h
> index fff8928218df..8d7b9e4f1ffc 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.h
alan:snip


> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_meu_headers.h 
> b/drivers/gpu/drm/i915/gt/uc/intel_gsc_meu_headers.h
> index d55a66202576..8bce2b8aed84 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_meu_headers.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_meu_headers.h
alan:snip



> +struct intel_gsc_layout_pointers {
> + u8 

Re: [PATCH 2/6] drm/i915/uc/gsc: fixes and updates for GSC memory allocation

2023-05-22 Thread Teres Alexis, Alan Previn
On Fri, 2023-05-05 at 09:04 -0700, Ceraolo Spurio, Daniele wrote:
> A few fixes/updates are required around the GSC memory allocation and it
> is easier to do them all at the same time. The changes are as follows:


alan:snip

> @@ -109,38 +110,21 @@ static int gsc_fw_load_prepare(struct intel_gsc_uc *gsc)
>  {
>   struct intel_gt *gt = gsc_uc_to_gt(gsc);
>   struct drm_i915_private *i915 = gt->i915;
> - struct drm_i915_gem_object *obj;
> - void *src, *dst;
> + void *src;


alan:snip

>  
> - memset(dst, 0, obj->base.size);
> - memcpy(dst, src, gsc->fw.size);
> + memset_io(gsc->local_vaddr, 0, gsc->local->size);
> + memcpy_toio(gsc->local_vaddr, src, gsc->fw.size);
alan: i wonder if there is benefit to doing the memcpy_toio first
and then the memset_io only for the balance of the area, from
offset 'gsc->fw.size' for (gsc->local->size - gsc->fw.size) bytes.

alan:snip

> --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c
> @@ -130,26 +130,85 @@ void intel_gsc_uc_init_early(struct intel_gsc_uc *gsc)
>   }
>  }
>  
> +static int gsc_allocate_and_map_vma(struct intel_gsc_uc *gsc, u32 size)
alan:snip

> + obj = i915_gem_object_create_stolen(gt->i915, size);
> + if (IS_ERR(obj))
> + return PTR_ERR(obj);
> +
> + vma = i915_gem_object_ggtt_pin(obj, NULL, 0, 0, 0);
alan: should we be passing the PIN_MAPPABLE flag as the last param?


alan:snip

> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h 
> b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h
> index a2a0813b8a76..c01286dddbdb 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h
> @@ -18,6 +18,7 @@ struct intel_gsc_uc {
>  
>   /* GSC-specific additions */
>   struct i915_vma *local; /* private memory for GSC usage */
> + void __iomem *local_vaddr; /* pointer to access the private memory */
alan: nit: relooking at these variable names that originate from
last year's patch you worked on introducing gsc_uc, i am wondering now
if we should rename "local" to "privmem" and "local_vaddr" to "privmem_vaddr".
(no significant reason other than improving the readability of the code)




Re: [PATCH v2 5/8] drm/i915/huc: differentiate the 2 steps of the MTL HuC auth flow

2023-05-12 Thread Teres Alexis, Alan Previn
On Fri, 2023-04-28 at 11:58 -0700, Ceraolo Spurio, Daniele wrote:
> Before we add the second step of the MTL HuC auth (via GSC), we need to
> have the ability to differentiate between them. To do so, the huc
> authentication check is duplicated for GuC and GSC auth, with meu
> binaries being considered fully authenticated only after the GSC auth
> step.
> 
> To report the difference between the 2 auth steps, a new case is added
> to the HuC getparam. This way, the clear media driver can start
> submitting before full auth, as partial auth is enough for those
> workloads.
> 
> v2: fix authentication status check for DG2
> 
> Signed-off-by: Daniele Ceraolo Spurio 
> Cc: Alan Previn 
> ---
>  drivers/gpu/drm/i915/gt/uc/intel_huc.c| 94 +--
>  drivers/gpu/drm/i915/gt/uc/intel_huc.h| 16 +++-
>  drivers/gpu/drm/i915/gt/uc/intel_huc_fw.c |  4 +-
>  drivers/gpu/drm/i915/i915_reg.h   |  3 +
>  include/uapi/drm/i915_drm.h   |  3 +-
>  5 files changed, 91 insertions(+), 29 deletions(-)
> 
I believe you need a rebase with the PXP single session merged (the
readiness code in gsccs backend). Other than that, all looks good:

Reviewed-by: Alan Previn 


Re: [PATCH v2] drm/i915/huc: Parse the GSC-enabled HuC binary

2023-05-12 Thread Teres Alexis, Alan Previn
On Tue, 2023-05-02 at 08:27 -0700, Ceraolo Spurio, Daniele wrote:
> The new binaries that support the 2-step authentication have contain the
> legacy-style binary, which we can use for loading the HuC via DMA. To
> find out where this is located in the image, we need to parse the meu
> manifest of the GSC binary. The manifest consist of a partition header
> followed by entries, one of which contains the offset we're looking for.
> Note that the DG2 GSC binary contains entries with the same names, but
> it doesn't contain a full legacy binary, so we need to skip assigning
> the dma offset in that case (which we can do by checking the ccs).
> Also, since we're now parsing the entries, we can extract the HuC
> version that way instead of using hardcoded offsets.
> 
> Note that the meu structure will be re-used for parsing the GSC binary,
> so they've been added in their own header.
> 
> v2: fix structure names to match meu defines (s/CPT/CPD/), update commit
> message, check ccs validity, drop old version location defines.
> 
> Signed-off-by: Daniele Ceraolo Spurio 
> Cc: Alan Previn 
> ---
>  .../drm/i915/gt/uc/intel_gsc_meu_headers.h|  74 ++

Compared line by line as per internal reviews and the spec.
All looks good to me - nice to see that additional ccs validity.
LGTM, Reviewed-by: Alan Previn 


Re: [Intel-gfx] [PATCH] drm/i915/guc: Fix confused register capture list creation

2023-05-11 Thread Teres Alexis, Alan Previn
On Thu, 2023-05-11 at 18:35 -0700, john.c.harri...@intel.com wrote:
> From: John Harrison 
> 
> The GuC has a completely separate engine class enum when referring to
> register capture lists, which combines render and compute. The driver
> was using the 'normal' GuC specific engine class enum instead. That
> meant that it thought it was defining a capture list for compute
> engines, the list was actually being applied to the GSC engine. And if
> a platform didn't have a render engine, then it would get no compute
> register captures at all.
> 
> Fix that.
> 
> Signed-off-by: John Harrison 
alan:snip.

LGTM, simple and straightforward patch - although i can only imagine the
pain of debugging this one. So for the benefit of others on the mailing
list: because the i915 (non-GuC-ABI) COMPUTE and RENDER enums were
different, but the GuC-ABI was using its own Render value for both
(coincidentally '0' == render for both the GuC-ABI and i915), ADS
population of the capture-register list would only cause problems for
devices that have no render engine or have a GSC (all have VD/VE/Blt).
MTL is not yet fully POR and none of the existing upstream supported
devices lack a render engine, so the "Fixes" tag isn't needed IIUC.

That said, pending CI results,
Reviewed-by: Alan Previn 


Re: [Intel-gfx] [PATCH v9 6/8] drm/i915/uapi/pxp: Add a GET_PARAM for PXP

2023-05-10 Thread Teres Alexis, Alan Previn

alan:snip

> This is why I asked if it was it was "basically certain that in a
> production environment, then it will eventually return 1 meaning it's
> ready". Alan's response was a little ambiguous on this point.
alan: if we get a '2' and never transition to '1' - thats a kernel bug or
firmware / system issue.

> Arguably a transition from 2 to -ENODEV could be considered a kernel
> bug, but it doesn't sound like Alan would agree. :) Maybe Alan would
> agree to saying it's either a kernel, or system integration bug.
alan: agreed - that would be a kernel bug or a system integration bug.

I also agreed on the init-flow vs app-usage thoughts Jordan had.
That said MESA has many ways it can use this UAPI and we can discuss
that on the MESA patch.


hmmm.. so... ack anyone? [insert big hopeful smiley here]
apologies if I am asking too often.


Re: [PATCH v9 6/8] drm/i915/uapi/pxp: Add a GET_PARAM for PXP

2023-05-04 Thread Teres Alexis, Alan Previn
On Thu, 2023-04-27 at 16:48 -0700, Teres Alexis, Alan Previn wrote:
> Because of the additional firmware, component-driver and
> initialization depedencies required on MTL platform before a
> PXP context can be created, UMD calling for PXP creation as a
> way to get-caps can take a long time. An actual real world
> customer stack has seen this happen in the 4-to-8 second range
> after the kernel starts (which sees MESA's init appear in the
> middle of this range as the compositor comes up). To avoid
> unncessary delays experienced by the UMD for get-caps purposes,
> add a GET_PARAM for I915_PARAM_PXP_SUPPORT.
> 
alan:snip.
Progress update on the UMD side - I'm working on a patch for the PR here:
https://gitlab.freedesktop.org/alan_previn_intel/mesa-alan-previn-features/-/commit/fb9d4fbfbef7dfd3f41df335cd31549fd39ddb57
but it is taking time to verify certain code paths.


Re: [PATCH v3 4/4] drm/i915/gsc: add support for GSC proxy interrupt

2023-05-04 Thread Teres Alexis, Alan Previn
On Tue, 2023-05-02 at 09:38 -0700, Ceraolo Spurio, Daniele wrote:
> The GSC notifies us of a proxy request via the HECI2 interrupt. The
> interrupt must be enabled both in the HECI layer and in our usual gt irq
> programming; for the latter, the interrupt is enabled via the same enable
> register as the GSC CS, but it does have its own mask register. When the
> interrupt is received, we also need to de-assert it in both layers.
> 
> The handling of the proxy request is deferred to the same worker that we
> use for GSC load. New flags have been added to distinguish between the
> init case and the proxy interrupt.
> 
> v2: Make sure not to set the reset bit when enabling/disabling the GSC
> interrupts, fix defines (Alan)
> 
> v3: rebase on proxy status register check
> 
alan:snip
Reviewed-by: Alan Previn 


Re: [PATCH v3 3/4] drm/i915/gsc: add initial support for GSC proxy

2023-05-04 Thread Teres Alexis, Alan Previn

On Tue, 2023-05-02 at 09:38 -0700, Ceraolo Spurio, Daniele wrote:
> The GSC uC needs to communicate with the CSME to perform certain
> operations. Since the GSC can't perform this communication directly
> on platforms where it is integrated in GT, i915 needs to transfer the
> messages from GSC to CSME and back.
> 
alan:snip
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c
> @@ -13,6 +13,7 @@
>  #define GSC_FW_STATUS_REG_MMIO(0x116C40)
>  #define GSC_FW_CURRENT_STATE REG_GENMASK(3, 0)
>  #define   GSC_FW_CURRENT_STATE_RESET 0
> +#define   GSC_FW_PROXY_STATE_NORMAL  5
>  #define GSC_FW_INIT_COMPLETE_BIT REG_BIT(9)
>  
>  static bool gsc_is_in_reset(struct intel_uncore *uncore)
> @@ -23,6 +24,15 @@ static bool gsc_is_in_reset(struct intel_uncore *uncore)
>  GSC_FW_CURRENT_STATE_RESET;
>  }
>  
> +bool intel_gsc_uc_fw_proxy_init_done(struct intel_gsc_uc *gsc)
> +{
> + struct intel_uncore *uncore = gsc_uc_to_gt(gsc)->uncore;
> + u32 fw_status = intel_uncore_read(uncore, GSC_FW_STATUS_REG);
> +
> + return REG_FIELD_GET(GSC_FW_CURRENT_STATE, fw_status) ==
> +GSC_FW_PROXY_STATE_NORMAL;
> +}
> +
Since this function was added here, repeating rb:
Reviewed-by: Alan Previn 


Re: [PATCH v3 2/4] mei: gsc_proxy: add gsc proxy driver

2023-05-03 Thread Teres Alexis, Alan Previn
We only had nits before and all sorted now, so..
Reviewed-by: Alan Previn 

On Tue, 2023-05-02 at 09:38 -0700, Ceraolo Spurio, Daniele wrote:
> From: Alexander Usyskin 
> 
> Add GSC proxy driver. It to allows messaging between GSC component
> on Intel graphics card and CSE device.
> 
> Cc: Alan Previn 
> Signed-off-by: Alexander Usyskin 
> Signed-off-by: Tomas Winkler 
> Signed-off-by: Daniele Ceraolo Spurio 
> Acked-by: Greg Kroah-Hartman 
> ---
> 
> v2: re-order includes, drop reference to "on board" card in commit
> message and comments.



Re: [PATCH v3 1/4] drm/i915/mtl: Define GSC Proxy component interface

2023-05-03 Thread Teres Alexis, Alan Previn
LGTM
Reviewed-by: Alan Previn 

On Tue, 2023-05-02 at 09:38 -0700, Ceraolo Spurio, Daniele wrote:
> From: Alexander Usyskin 
> 
> GSC Proxy component is used for communication between the
> Intel graphics driver and MEI driver.
> 
> Cc: Alan Previn 
> Signed-off-by: Alexander Usyskin 
> Signed-off-by: Tomas Winkler 
> Signed-off-by: Daniele Ceraolo Spurio 
> Acked-by: Greg Kroah-Hartman 
> ---
> 
> v2: Improve documentation, remove unneeded includes



Re: [Intel-gfx] [PATCH v2 3/4] drm/i915/guc: Capture list naming clean up

2023-05-03 Thread Teres Alexis, Alan Previn
LGTM:
Reviewed-by: Alan Previn 

On Fri, 2023-04-28 at 11:56 -0700, john.c.harri...@intel.com wrote:
> From: John Harrison 
> 
> Don't use 'xe_lp*' prefixes for register lists that are common with
> Gen8.
> 
> Don't add Xe only GSC registers to pre-Xe devices that don't
> even have a GSC engine.
> 
> Fix Xe_LP name.
> 
> Don't use GEN9 as a prefix for register lists that contain all GEN8
> registers.
> 
> Rename the 'default_' register list prefix to 'gen8_' as that is the
> more accurate name.
alan:snip


Re: [PATCH v9 3/8] drm/i915/pxp: Add MTL helpers to submit Heci-Cmd-Packet to GSC

2023-04-27 Thread Teres Alexis, Alan Previn
On Thu, 2023-04-27 at 16:48 -0700, Teres Alexis, Alan Previn wrote:
> Add helper functions into a new file for heci-packet-submission.
> The helpers will handle generating the MTL GSC-CS Memory-Header
> and submission of the Heci-Cmd-Packet instructions to the engine.
> 
> 
alan: I accidentally forgot to carry Daniele's RB forward from rev8.
Repasting as this patch did not change:

Reviewed-by: Daniele Ceraolo Spurio 


Re: [Intel-gfx] [PATCH v8 6/8] drm/i915/uapi/pxp: Add a GET_PARAM for PXP

2023-04-27 Thread Teres Alexis, Alan Previn
(fixed email addresses again - why is my Evolution client deteriorating??)

On Thu, 2023-04-27 at 17:18 +, Teres Alexis, Alan Previn wrote:
> On Wed, 2023-04-26 at 15:35 -0700, Justen, Jordan L wrote:
> > On 2023-04-26 11:17:16, Teres Alexis, Alan Previn wrote:
> alan:snip
> > Can you tell that pxp is in progress, but not ready yet, as a separate
> > state from 'it will never work on this platform'? If so, maybe the
> > status could return something like:
> > 
> > 0: It's never going to work
> > 1: It's ready to use
> > 2: It's starting and should work soon
> > 
> > I could see an argument for treating that as a case where we could
> > still advertise protected content support, but if we try to use it we
> > might be in for a nasty delay.
> > 
> alan: IIRC Lionel seemed okay with any permutation that would allow it to not
> get blocked. Daniele did ask for something similar to what you mentioned above
> but he said that is non-blocking. But since both you AND Daniele have
> mentioned the same thing, i shall re-rev this and send that change out today.
> I notice most GET_PARAMs use -ENODEV for "never gonna work" so I will stick
> with that, but 1 = ready to use and 2 = starting and should work sounds good.
> so '0' will never be returned - we just look for a positive value (from user
> space). I will also make a PR for the mesa side as soon as i get it tested.
> thanks for reviewing btw.

alan: I also realize that with these final touch-ups, we can go back to the
original pxp-context-creation timeout of 250 milliseconds like it was on ADL,
since the user-space component will have this new param to check on (so even
further down from the 1 sec of the last couple of revs).

Jordan, Lionel - i am thinking of creating the PR on the MESA side to take
advantage of GET_PARAM for both get-caps AND runtime creation (the latter will
be useful to ensure no unnecessary delay from Mesa being stuck in a kernel
call - which practically never happened on ADL AFAIK):

1. MESA PXP get caps:
- use GET_PARAM (any positive number shall mean it's supported).
2. MESA app-triggered PXP context creation (i.e. if caps was supported):
- use GET_PARAM to wait until the positive number switches from "2" to "1".
- now call context creation. So at this point if it fails, we know it's
  an actual failure.

you guys okay with the above? (i'll re-rev this kernel series first and wait on
your ack or feedback before i create/test/submit a PR for the Mesa side).



Re: [Intel-gfx] [PATCH] drm/i915/guc: Fix error capture for virtual engines

2023-04-27 Thread Teres Alexis, Alan Previn
On Fri, 2023-04-14 at 17:27 -0700, john.c.harri...@intel.com wrote:
> From: John Harrison 
> 
> GuC based register dumps in error capture logs were basically broken
> for virtual engines. This can be seen in igt@gem_exec_balancer@hang:
>   [IGT] gem_exec_balancer: starting subtest hang
>   [drm] GPU HANG: ecode 12:4:e1524110, in gem_exec_balanc [6388]
>   [drm] GT0: GUC: No register capture node found for 0x1005 / 0xFEDC311D
>   [drm] GPU HANG: ecode 12:4:, in gem_exec_balanc [6388]
>   [IGT] gem_exec_balancer: exiting, ret=0
> 
> The test causes a hang on both engines of a virtual engine context.
> The engine instance zero hang gets a valid error capture but the
> non-instance-zero hang does not.
> 
> Fix that by scanning through the list of pending register captures
> when a hang notification for a virtual engine is received. That way,
> the hang can be assigned to the correct physical engine prior to
> starting the error capture process. So later on, when the error capture
> handler tries to find the engine register list, it looks for one on
> the correct engine.
> 
> Also, sneak in a missing blank line before a comment in the node
> search code.
> 
> Signed-off-by: John Harrison 

LGTM - thanks for fixing this! :D
Reviewed-by: Alan Previn 

A side conversation - potentially requiring an unrelated future patch...
i notice that the array "error->reset_engine_count[]" (which is being
used for error state reporting and as some verification in selftests) seems
to be sized by engine-instance count, but the reading/writing of members
of this array keeps using engine->uabi_class as the index...
meaning it's being used to track engine-class reset counts? Maybe this is
an issue or i am misunderstanding it. Either way, that issue is unrelated
to the intent of this patch - i just wanted to get it highlighted for
future action if needed. I can take that onus if it's in fact an issue.


Re: [Intel-gfx] [PATCH 6/6] drm/i915/guc: Capture list clean up - 5

2023-04-26 Thread Teres Alexis, Alan Previn
On Wed, 2023-04-26 at 11:24 -0700, john.c.harri...@intel.com wrote:
> From: John Harrison 
> 
> Rename the 'default_' register list prefix to 'gen8_' as that is the
> more accurate name.
> 
> Signed-off-by: John Harrison 

Reviewed-by: Alan Previn 


Re: [PATCH v8 6/8] drm/i915/uapi/pxp: Add a GET_PARAM for PXP

2023-04-26 Thread Teres Alexis, Alan Previn
On Thu, 2023-04-20 at 22:34 -0700, Teres Alexis, Alan Previn wrote:
> Because of the additional firmware, component-driver and
> initialization depedencies required on MTL platform before a
> PXP context can be created, UMD calling for PXP creation as a
> way to get-caps can take a long time. An actual real world
> customer stack has seen this happen in the 4-to-8 second range
> after the kernel starts (which sees MESA's init appear in the
> middle of this range as the compositor comes up). To avoid
> unnecessary delays experienced by the UMD for get-caps purposes,
> add a GET_PARAM for I915_PARAM_PXP_SUPPORT.
> 
> However, some failures can still occur after all the dependencies
> are met (such as firmware init flow failure, bios configurations
> or SOC fusing not allowing PXP enablement). Those scenarios will
> only be known to user space when it attempts creating a PXP context.
> 
> With this change, large delays are only met by user-space processes
> that explicitly need to create a PXP context and boot very early.
> There is no way to avoid this today.
> 
> Signed-off-by: Alan Previn 
> Reviewed-by: Daniele Ceraolo Spurio 
> Acked-by: Lionel Landwerlin 
> ---
alan: Hi Jordan, Tvrtko, Daniel Vetter and Lionel,... 
how to proceed on this series (which have required Rb/Ack's) in light of
the recent discussion on the other series here:
https://patchwork.freedesktop.org/patch/532994/?series=115980=8
it sounds like:
- Jordan still wants the extension query
- Daniel recapped the overview of historical discussions and kernel UAPI guidelines,
  and in summary is okay with the GET_PARAM option. However, Daniel cites PXP taking
  8 seconds to create a context as broken.
- I corrected the above assumption -> the current timeout is 1 second. 8 seconds is
  not the time taken to init the context, it's the max-possible-time to ensure
  dependencies like the gsc-proxy-component are loaded so that protected context
  creation could be attempted successfully. Else, it would return an error that
  Mesa could interpret as not supported.

Should I:
(a) rerev this, drop this patch #6 and work towards merging the rest of the series,
to allow the get-param vs extensions-query discussion to continue independently?
(b) go ahead and merge this series as is with the GET_PARAM? (i need to get the Mesa
UMD change tested and a PR requested first)

alan:snip


Re: IOCTL feature detection (Was: Re: [Intel-gfx] [PATCH 8/8] drm/i915: Allow user to set cache at BO creation)

2023-04-26 Thread Teres Alexis, Alan Previn
On Wed, 2023-04-26 at 13:52 +0200, Daniel Vetter wrote:
> On Tue, Apr 25, 2023 at 04:41:54PM +0300, Joonas Lahtinen wrote:
> > (+ Faith and Daniel as they have been involved in previous discussions)
> > Quoting Jordan Justen (2023-04-24 20:13:00)
> > > On 2023-04-24 02:08:43, Tvrtko Ursulin wrote:
> > > > 
> > > > 
> 
alan:snip

> - the more a feature spans drivers/modules, the more it should be
>   discovered by trying it out, e.g. dma-buf fence import/export was a huge
>   discussion, luckily mesa devs figured out how to transparantly fall back
>   at runtime so we didn't end up merging the separate feature flag (I
>   think at least, can't find it). pxp being split across i915/me/fw/who
>   knows what else is kinda similar so I'd heavily lean towards discovery
>   by creating a context
> 
> - pxp taking 8s to init a ctx sounds very broken, irrespective of anything
>   else
> 

Alan: Please be aware that:
1. the wait-timeout was changed to 1 second sometime back.
2. I'm not deciding the time-out. I initially wanted to keep it at the same
timeout as ADL (250 millisecs) - and ask the UMD to retry if the user needs it
(as per the same ADL behavior). Daniele requested to move it to 8 seconds - but
through the review process, we reduced it to 1 second.
3. In any case, that's just the wait-timeout - and we know it won't succeed until
~6 seconds after i915 init (~9 secs after boot). The issue isn't our hardware or
i915 - it's the component driver load <-- this is what's broken.

Details: the PXP context is dependent on gsc-fw load, huc-firmware load,
mei-gsc-proxy component driver load + bind, huc-authentication and the
gsc-proxy-init-handshake. Most of the above steps begin rather quickly during
i915 driver load - the delay seems to come from a very late mei-gsc-proxy
component driver load. In fact the parent mei-me driver is only getting loaded
~6 seconds after i915 init is done. That blocks the gsc-proxy-init-handshake
and huc-authentication and lastly PXP.

That said, what is broken is why it takes so long for the component drivers
to come up. NOTE: PXP isn't really doing anything differently in the context
creation flow (in terms of time-consuming steps compared to ADL) besides
waiting on these extra dependencies.

We can actually go back to the original timeout of 250 millisecs like we have in
ADL, but it will fail if MESA calls in too early (though it will succeed later)...
or... we can create the GET_PARAMs.

A better idea would be to figure out how to control the driver load order and
force mei driver + components to get called right after i915. I was informed
there is no way to control this and changes here will likely not be accepted
upstream.

++ Daniele - can you chime in?

Take note that ADL has the same issue but for whatever reason, the dependent
mei component on ADL loaded much sooner - so it was never caught, but it still
existed when ADL was merged (if users customize the kernel + compositor for
fastboot it will happen). (i realize I haven't tested ADL with the new kernel
configs that we use to also boot PXP on MTL - wonder if the new mei configs are
causing the delay - i.e. an ADL customer could suddenly see this 6 sec delay
too - something i have to check now)


Re: [Intel-gfx] [PATCH 5/5] drm/i915/guc: Capture list clean up - 4

2023-04-25 Thread Teres Alexis, Alan Previn
On Thu, 2023-04-06 at 15:26 -0700, john.c.harri...@intel.com wrote:
> From: John Harrison 
> 
> Don't use GEN9 as a prefix for register lists that contain all GEN8
> registers.
alan:snip

alan: This patch as a stand-alone looks good, so I'll provide the RB, but take
note of the comment below that should be reflected by the decision on the review
comments of patch #1, so this patch might change from GEN9_foo-from-patch-1 to
GEN8_foo-from-patch-1.

Reviewed-by: Alan Previn 

> -/* GEN9 - Global */
> +/* GEN8 - Global */
>  static const struct __guc_mmio_reg_descr default_global_regs[] = {
>   COMMON_BASE_GLOBAL,
> - COMMON_GEN9BASE_GLOBAL,
> - GEN9_GLOBAL,
> + COMMON_GEN8BASE_GLOBAL,
> + GEN8_GLOBAL,
alan: see patch comment about "COMMON_GLOBAL" vs "GLOBAL" confusion.



Re: [Intel-gfx] [PATCH 4/5] drm/i915/guc: Capture list clean up - 3

2023-04-25 Thread Teres Alexis, Alan Previn
On Thu, 2023-04-06 at 15:26 -0700, john.c.harri...@intel.com wrote:
> From: John Harrison 
> 
> Fix Xe_LP name.
> 
> Signed-off-by: John Harrison 
alan:snip


> -/* GEN9/XE_LPD - Render / Compute Per-Engine-Instance */
> +/* GEN8+ Render / Compute Per-Engine-Instance */
alan: two comments on this:
1. shouldn't this go with the earlier patch?
2. i agree with renaming the registers to reflect when
   all of those registers first got introduced... however, considering
   we only support GuC on Gen12 and beyond (we do have select CI-tests
   that enable GuC on Gen9 but not on Gen8 and before), should we also
   change the comments? I think the comment should reflect the usage,
   not just follow the same name as the register #define - else why even
   add the comments. (please apply this same comment for gen8_vd_inst_regs,
   gen8_vec_inst_regs and gen8_blt_inst_regs).
alternatively, we could keep those GEN8+ comments above the list but maybe
add just one comment in the default list - see below.

alan: snip

> @@ -366,7 +364,7 @@ guc_capture_get_device_reglist(struct intel_guc *guc)
>   const struct __guc_mmio_reg_descr_group *lists;
>  
>   if (GRAPHICS_VER(i915) >= 12)
> - lists = xe_lpd_lists;
> + lists = xe_lp_lists;
>   else
>   lists = default_lists;
alan: perhaps add a comment that we really don't support any of this
on anything below Gen9?

>  



Re: [Intel-gfx] [PATCH 3/5] drm/i915/guc: Capture list clean up - 2

2023-04-25 Thread Teres Alexis, Alan Previn
On Thu, 2023-04-06 at 15:26 -0700, john.c.harri...@intel.com wrote:
> From: John Harrison 
> 
> Don't use 'xe_lp*' prefixes for register lists that are common with
> Gen8.

alan:snip


> @@ -177,32 +177,32 @@ static const struct __guc_mmio_reg_descr empty_regs_list[] = {
>  static const struct __guc_mmio_reg_descr_group default_lists[] = {
alan:snip
> - MAKE_REGLIST(xe_lpd_gsc_inst_regs, PF, ENGINE_INSTANCE, GUC_GSC_OTHER_CLASS),
> + MAKE_REGLIST(empty_regs_list, PF, ENGINE_INSTANCE, GUC_GSC_OTHER_CLASS),
alan: i missed this in the review of Daniele's GSC enabling patch - thanks for
catching this.
>   {}
>  };
> 
> 
simple patch - all looks good:
Reviewed-by: Alan Previn 


Re: [Intel-gfx] [PATCH 2/5] drm/i915/guc: Capture list clean up - 1

2023-04-25 Thread Teres Alexis, Alan Previn
On Thu, 2023-04-06 at 15:26 -0700, john.c.harri...@intel.com wrote:
> From: John Harrison 
> 
> Remove 99% duplicated steered register list code. Also, include the
> pre-Xe steered registers in the pre-Xe list generation.
> 
> Signed-off-by: John Harrison 

alan: Nice work - good cleanup. Thanks so much.
Reviewed-by: Alan Previn 



Re: [PATCH 1/5] drm/i915/guc: Don't capture Gen8 regs on Xe devices

2023-04-25 Thread Teres Alexis, Alan Previn
On Thu, 2023-04-06 at 15:26 -0700, Harrison, John C wrote:
> From: John Harrison 
> 
> A pair of pre-Xe registers were being included in the Xe capture list.
> GuC was rejecting those as being invalid and logging errors about
> them. So, stop doing it.
> 
alan:snip
>  #define COMMON_GEN9BASE_GLOBAL \
> - { GEN8_FAULT_TLB_DATA0, 0,  0, "GEN8_FAULT_TLB_DATA0" }, \
> - { GEN8_FAULT_TLB_DATA1, 0,  0, "GEN8_FAULT_TLB_DATA1" }, \
>   { ERROR_GEN6,   0,  0, "ERROR_GEN6" }, \
>   { DONE_REG, 0,  0, "DONE_REG" }, \
>   { HSW_GTT_CACHE_EN, 0,  0, "HSW_GTT_CACHE_EN" }
>  
> +#define GEN9_GLOBAL \
> + { GEN8_FAULT_TLB_DATA0, 0,  0, "GEN8_FAULT_TLB_DATA0" }, \
> + { GEN8_FAULT_TLB_DATA1, 0,  0, "GEN8_FAULT_TLB_DATA1" }
> +
>  #define COMMON_GEN12BASE_GLOBAL \
>   { GEN12_FAULT_TLB_DATA0,0,  0, "GEN12_FAULT_TLB_DATA0" }, \
>   { GEN12_FAULT_TLB_DATA1,0,  0, "GEN12_FAULT_TLB_DATA1" }, \
> @@ -142,6 +144,7 @@ static const struct __guc_mmio_reg_descr xe_lpd_gsc_inst_regs[] = {
>  static const struct __guc_mmio_reg_descr default_global_regs[] = {
>   COMMON_BASE_GLOBAL,
>   COMMON_GEN9BASE_GLOBAL,
> + GEN9_GLOBAL,
>  };
>  
alan: splitting out a couple of registers from COMMON_GEN9BASE_GLOBAL into
GEN9_GLOBAL doesn't seem to communicate the intent of the fix for this patch.
This is more of a naming thing and i am not sure what counter-proposal will
work well in terms of readability. One idea: perhaps we rename
"COMMON_GEN9BASE_GLOBAL" to "COMMON_GEN9PLUS_BASE_GLOBAL" and rename
GEN9_GLOBAL to COMMON_GEN9LEGACY_GLOBAL, so we would have two gen9-globals
with a clear distinction in naming where one is "GEN9PLUS" and the other is
"GEN9LEGACY".

But since this is a list-naming thing, i am okay with either of the above...
OR... keeping the names the same but with the condition of adding a comment
under the COMMON_GEN9BASE_GLOBAL and GEN9_GLOBAL names that explains the
difference, where one is gen9-legacy and the other is gen9-and-future that
carries over beyond Gen9. (side note: coding style wise, is it possible to
add the comment right under the #define line as opposed to under the entire
list?)

(conditional) Reviewed-by: Alan Previn 



Re: IOCTL feature detection (Was: Re: [Intel-gfx] [PATCH 8/8] drm/i915: Allow user to set cache at BO creation)

2023-04-25 Thread Teres Alexis, Alan Previn
On Tue, 2023-04-25 at 16:41 +0300, Joonas Lahtinen wrote:
> (+ Faith and Daniel as they have been involved in previous discussions)
An orthogonal (but loosely related) question: Is PXP the only subsystem that
has this unique problem: it uses a delayed worker to complete all dependencies
for init... but they take so long that by the time compositor/mesa init comes
up, creation of a PXP context blocks too long and may still likely fail,
incorrectly implying PXP is not supported (when we don't have a GET_PARAM for
it)? I believe HuC has a similar issue - but it doesn't reflect in the UAPI,
rather in the cmd buffers. Do we have other subsystems that have this problem?
(dependency on other startups?)



Re: [PATCH 4/4] drm/i915/gsc: add support for GSC proxy interrupt

2023-04-20 Thread Teres Alexis, Alan Previn
I think we are also bottom-ing on the opens fo this patch too:


On Thu, 2023-04-20 at 13:21 -0700, Ceraolo Spurio, Daniele wrote:
> On 4/20/2023 11:49 AM, Teres Alexis, Alan Previn wrote:
> > On Wed, 2023-03-29 at 09:56 -0700, Ceraolo Spurio, Daniele wrote:
> > > The GSC notifies us of a proxy request via the HECI2 interrupt. The
> > alan:snip
> > 
> > > @@ -256,6 +262,7 @@ void gen11_gt_irq_postinstall(struct intel_gt *gt)
> > >   u32 irqs = GT_RENDER_USER_INTERRUPT;
> > >   u32 guc_mask = intel_uc_wants_guc(&gt->uc) ? GUC_INTR_GUC2HOST : 0;
> > >   u32 gsc_mask = 0;
> > > + u32 heci_mask = 0;
> > alan: nit: should we call this heci2_mask - uniquely identifiable from 
> > legacy.
> 
> The mask is theoretically not just for HECI2, the bit we set in it is. 
> If future platforms add back the HECI1 interrupt then it'd go in the 
> same mask.
alan: yeah - im good with that - no change needed then - was a nit anyways.


> > alan:snip
> > 
> > > - else if (HAS_HECI_GSC(gt->i915))
> > > + heci_mask = GSC_IRQ_INTF(1); /* HECI2 IRQ for SW Proxy*/
> > alan: does this GSC_IRQ_INTF macro still make sense for this newer 
> > platforms with GSC-CS / HECI2?
> > i dont think i see other bit definitions for this mask register, so might 
> > else well just define it as BIT14?
> 
> GSC_IRQ_INTF(1) is the HECI2 interrupt on DG2 and the bit has remained 
> the same exactly to allow SW to re-use the same code/defines, so IMO we 
> should do so.
alan: okay sure - sounds good.


> > > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_regs.h b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
> > > index 4aecb5a7b631..da11ce5ca99e 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_gt_regs.h
> > > +++ b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
> > > @@ -1577,6 +1577,7 @@
> > >   
> > >   #define GEN11_GT_INTR_DW(x) _MMIO(0x190018 + ((x) * 4))
> > >   #define   GEN11_CSME(31)
> > > +#define   GEN12_HECI_2   (30)
> > alan: I dont see this being used anywhere - should remove this.
> 
> A few of the defines for this register here are not used. I've added 
> this one in as a way of documenting where the bit was, but I can remove 
> it if you think it's unneeded.
alan: Oh i actually didn't notice that earlier - in that case, please keep it.
It would be nice to add a comment for all such bits (consider this a nit).

> > 

> > > +void intel_gsc_proxy_irq_handler(struct intel_gsc_uc *gsc, u32 iir)
> > > +{
> > > + struct intel_gt *gt = gsc_uc_to_gt(gsc);
> > > +
> > > + if (unlikely(!iir))
> > > + return;
> > > +
> > > + lockdep_assert_held(gt->irq_lock);
> > > +
> > > + if (!gsc->proxy.component) {
> > > + gt_err(gt, "GSC proxy irq received without the component being 
> > > bound!\n");
> > alan: So although proxy init is only a one-time thing (even surviving 
> > suspend-resume), we
> > know that proxy runtime irqs could happen anytime (depending on other 
> > gpu-security-related
> > system interactions). However, would the component driver be bound/unbound 
> > through a
> > suspend-resume cycle? If yes, then would this check fail if an IRQ for a 
> > proxy request
> > came too early after a resume cycle. If yes, then should we not do 
> > somethign here to ensure that
> > when the component gets bound later, we know there is something waiting to 
> > be processed?
> > (in bind, if proxy-init was done before, but we have outstanding IRQs, then 
> > we can trigger
> > the worker from there, i.e. the bind func?)
> 
> A proxy request can only be triggered in response to a userspace ask, 
> which in turn can only happen once we've completed the resume flow and 
> the component is re-bound. Therefore, we should never have a situation 
> where we get a proxy interrupt without the component being bound.
alan: thanks -my bad.
> > 


> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h
> > > @@ -23,6 +23,9 @@ struct intel_gsc_uc {
> > >   /* for delayed load and proxy handling */
> > >   struct workqueue_struct *wq;
> > >   struct work_struct work;
> > > + u32 gsc_work_actions; /* protected by gt->irq_lock */
> > > +#define GSC_ACTION_FW_LOAD BIT(0)
> > > +#define GSC_ACTION_SW_PROXY BIT(1
> > > 
> > alan: perhaps its time to have a substruct for "proxy_worker" and include
> > 'wq' and 'work' in additional to actions?
> 
> The worker is not just for proxy, we use it for a GSC and HuC loading as 
> well. It's the main way we handle the gsc_uc, so IMO it's better off 
> staying in the main struct. However, if you still think a substructure 
> will make things cleaner I can add it in, but please suggest a name 
> because I have no idea what to call it (something like handler?).
alan: i was thinking "task_handler" - but since its not specific to proxy,
then let's not change it.



Re: [PATCH 2/4] mei: gsc_proxy: add gsc proxy driver

2023-04-20 Thread Teres Alexis, Alan Previn
i guess we are settled with this patch...

On Thu, 2023-04-20 at 15:04 -0700, Ceraolo Spurio, Daniele wrote:
> On 4/18/2023 11:57 PM, Teres Alexis, Alan Previn wrote:
> > On Wed, 2023-03-29 at 09:56 -0700, Ceraolo Spurio, Daniele wrote:
> > > From: Alexander Usyskin 
> > > 
> > > Add GSC proxy driver. It allows messaging between GSC component
> > > on Intel on board graphics card and CSE device.
> > alan:nit: isn't "Intel integrated GPU" clearer than "Intel on-board 
> > graphics card"?
> > Same thing for the Kconfig description later (or am i missing something 
> > else here).
> 
> Will change
> 
alan: saw your reply on better alternative for both 'i' and 'd'

alan:snip
> > > +static int mei_gsc_proxy_recv(struct device *dev, void *buf, size_t size)
> > > +{
> > > + ssize_t ret;
> > > +
> > > + if (!dev || !buf)
> > alan: nit: same as in the 'send' above,.. not sure if we should be checking 
> > for !size here...
> > or perhaps 0 sized recv is supported.
> 
> AFAICS the lower level of the mei code do allow for size 0 for both send 
> and recv. Also, this is the same check as what we do for the PXP component.
alan: agreed - thus the nit as per my earlier email.

> > 
alan:snip

> > > + if (subcomponent != I915_COMPONENT_GSC_PROXY)
> > > + return 0;
> > > +
> > > + return component_compare_dev(dev->parent, ((struct device *)data)->parent);
> > alan: do we care if both these parents are non-null? i notice in other mei 
> > component
> > drivers match functions we do check that.
> 
> Those should always both be non-NULL, since both the mei and the GFX 
> device have the PCI bus as parent (and the previous check on pdev 
> ensures those are the 2 devices we're handling at this point).
alan: sounds good.



> > > +#define MEI_UUID_GSC_PROXY UUID_LE(0xf73db04, 0x97ab, 0x4125, \
> > > +0xb8, 0x93, 0xe9, 0x4, 0xad, 0xd, 0x54, 0x64)
> > alan: apologies for the newbie question, but why are we using UUID for the 
> > gsc_proxy
> > as opposed to GUID like the other mei components? i am not sure if i read 
> > the right
> > archived patch review but it sounded like GUID is for internal to kernel 
> > only whereas
> > UUID is for external too? 
[snip]
> AFAICS all other mei components use UUID_LE as well. The code was 
> updated from GUID to UUID_LE in:
> https://lore.kernel.org/all/20221228160558.21311-1-andriy.shevche...@linux.intel.com/
alan: sounds good- thanks for the URL.





Re: [PATCH 4/4] drm/i915/gsc: add support for GSC proxy interrupt

2023-04-20 Thread Teres Alexis, Alan Previn
On Wed, 2023-03-29 at 09:56 -0700, Ceraolo Spurio, Daniele wrote:
> The GSC notifies us of a proxy request via the HECI2 interrupt. The

alan:snip

> @@ -256,6 +262,7 @@ void gen11_gt_irq_postinstall(struct intel_gt *gt)
>   u32 irqs = GT_RENDER_USER_INTERRUPT;
>   u32 guc_mask = intel_uc_wants_guc(&gt->uc) ? GUC_INTR_GUC2HOST : 0;
>   u32 gsc_mask = 0;
> + u32 heci_mask = 0;
alan: nit: should we call this heci2_mask - so it is uniquely identifiable from the legacy one?


alan:snip

> - else if (HAS_HECI_GSC(gt->i915))
> + heci_mask = GSC_IRQ_INTF(1); /* HECI2 IRQ for SW Proxy*/
alan: does this GSC_IRQ_INTF macro still make sense for these newer platforms
with GSC-CS / HECI2? i don't think i see other bit definitions for this mask
register, so might as well just define it as BIT(14)?

alan:snip



> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_regs.h b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
> index 4aecb5a7b631..da11ce5ca99e 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt_regs.h
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
> @@ -1577,6 +1577,7 @@
>  
>  #define GEN11_GT_INTR_DW(x)  _MMIO(0x190018 + ((x) * 4))
>  #define   GEN11_CSME (31)
> +#define   GEN12_HECI_2   (30)
alan: I don't see this being used anywhere - should remove this.

> +#define GEN11_HECI2_RSVD_INTR_MASK   _MMIO(0x1900e4)
alan: GEN11? don't you mean GEN12?





> +void intel_gsc_proxy_irq_handler(struct intel_gsc_uc *gsc, u32 iir)
> +{
> + struct intel_gt *gt = gsc_uc_to_gt(gsc);
> +
> + if (unlikely(!iir))
> + return;
> +
> + lockdep_assert_held(gt->irq_lock);
> +
> + if (!gsc->proxy.component) {
> + gt_err(gt, "GSC proxy irq received without the component being 
> bound!\n");
alan: So although proxy init is only a one-time thing (even surviving
suspend-resume), we know that proxy runtime irqs could happen anytime
(depending on other gpu-security-related system interactions). However, would
the component driver be bound/unbound through a suspend-resume cycle? If yes,
would this check fail if an IRQ for a proxy request came too early after a
resume cycle? If yes, should we not do something here to ensure that when the
component gets bound later, we know there is something waiting to be processed?
(in bind, if proxy-init was done before, but we have outstanding IRQs, then we
can trigger the worker from there, i.e. the bind func?)

alan:snip

>  static int i915_gsc_proxy_component_bind(struct device *i915_kdev,
>struct device *tee_kdev, void *data)
>  {
>   struct drm_i915_private *i915 = kdev_to_i915(i915_kdev);
> - struct intel_gsc_uc *gsc = &i915->media_gt->uc.gsc;
> + struct intel_gt *gt = i915->media_gt;
> + struct intel_gsc_uc *gsc = &gt->uc.gsc;
> + intel_wakeref_t wakeref;
> +
> + /* enable HECI2 IRQs */
> + with_intel_runtime_pm(&i915->runtime_pm, wakeref)
> + intel_uncore_rmw(gt->uncore, HECI_H_CSR(MTL_GSC_HECI2_BASE),
> +  0, HECI_H_CSR_IE);
alan: i notice we don't seem to care about potentially re-writing a '1' into
the reset bit if it was mid-reset when we read. Shouldn't we also protect
against that here?

alan:snip


> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h 
> b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h
> index 023bded10dde..a2a0813b8a76 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h
> @@ -23,6 +23,9 @@ struct intel_gsc_uc {
>   /* for delayed load and proxy handling */
>   struct workqueue_struct *wq;
>   struct work_struct work;
> + u32 gsc_work_actions; /* protected by gt->irq_lock */
> +#define GSC_ACTION_FW_LOAD BIT(0)
> +#define GSC_ACTION_SW_PROXY BIT(1
> 
alan: perhaps it's time to have a substruct for "proxy_worker" and include
'wq' and 'work' in addition to the actions?
> )
>  
>   struct {
>   struct i915_gsc_proxy_component *component;



Re: [PATCH 3/4] drm/i915/gsc: add initial support for GSC proxy

2023-04-19 Thread Teres Alexis, Alan Previn
I have a number of comments but most are personal preferences, so i labelled
them nits. I did catch a few minor coding style issues and am assuming those
need to be enforced as per i915/kernel rules? That said, since they are so
minor (or maybe they are not strictly enforced), I'm providing a conditional
RB on fixing those 4 issues (i.e. the header inclusion alphabetical ordering
and the struct '{' bracket position).

Reviewed-by: Alan Previn 

On Wed, 2023-03-29 at 09:56 -0700, Ceraolo Spurio, Daniele wrote:
> The GSC uC needs to communicate with the CSME to perform certain
> operations. Since the GSC can't perform this communication directly
> on platforms where it is integrated in GT, i915 needs to transfer the
> messages from GSC to CSME and back.
> The proxy flow is as follows:
> 1 - i915 submits a request to GSC asking for the message to CSME
> 2 - GSC replies with the proxy header + payload for CSME
> 3 - i915 sends the reply from GSC as-is to CSME via the mei proxy
> component
> 4 - CSME replies with the proxy header + payload for GSC
> 5 - i915 submits a request to GSC with the reply from CSME
> 6 - GSC replies either with a new header + payload (same as step 2,
> so we restart from there) or with an end message.
> 
> After GSC load, i915 is expected to start the first proxy message chain,
> while all subsequent ones will be triggered by the GSC via interrupt.
> 
> To communicate with the CSME, we use a dedicated mei component, which
> means that we need to wait for it to bind before we can initialize the
> proxies. This usually happens quite fast, but given that there is a
> chance that we'll have to wait a few seconds the GSC work has been moved
> to a dedicated WQ to not stall other processes.
> 
> Signed-off-by: Daniele Ceraolo Spurio 
> Cc: Alan Previn 
> ---
>  drivers/gpu/drm/i915/Makefile |   1 +
>  drivers/gpu/drm/i915/gt/uc/intel_gsc_proxy.c  | 384 ++
>  drivers/gpu/drm/i915/gt/uc/intel_gsc_proxy.h  |  17 +
>  drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c |  40 +-
>  drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h |  14 +-
>  .../i915/gt/uc/intel_gsc_uc_heci_cmd_submit.h |   1 +
>  6 files changed, 452 insertions(+), 5 deletions(-)
>  create mode 100644 drivers/gpu/drm/i915/gt/uc/intel_gsc_proxy.c
>  create mode 100644 drivers/gpu/drm/i915/gt/uc/intel_gsc_proxy.h
> 
alan:snip

> new file mode 100644
> index ..ed8f68e78c26
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_proxy.c
> @@ -0,0 +1,384 @@
> +#include "intel_gsc_proxy.h"
> +
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2022 Intel Corporation
alan: nit: 2022 - 2023?

> + */
> +
> +#include 
> +#include "drm/i915_gsc_proxy_mei_interface.h"
alan: alphabetical
> +#include "drm/i915_component.h"
alan: snip

> +/*
> + * GSC proxy:
> + * The GSC uC needs to communicate with the CSME to perform certain 
> operations.
> + * Since the GSC can't perform this communication directly on platforms 
> where it
> + * is integrated in GT, i915 needs to transfer the messages from GSC to CSME
> + * and back. i915 must manually start the proxy flow after the GSC is loaded 
> to
> + * signal to GSC that we're ready to handle its messages and allow it to 
> query
> + * its init data from CSME; GSC will then trigger an HECI2 interrupt if it 
> needs
> + * to send messages to CSME again.
> + * The proxy flow is as follows:
> + * 1 - i915 submits a request to GSC asking for the message to CSME
> + * 2 - GSC replies with the proxy header + payload for CSME
> + * 3 - i915 sends the reply from GSC as-is to CSME via the mei proxy 
> component
> + * 4 - CSME replies with the proxy header + payload for GSC
> + * 5 - i915 submits a request to GSC with the reply from CSME
> + * 6 - GSC replies either with a new header + payload (same as step 2, so we
> + * restart from there) or with an end message.
> + */
> +
> +/*
> + * The component should load quite quickly in most cases, but it could take
> + * a bit. Using a very big timeout just to cover the worst case scenario
> + */
> +#define GSC_PROXY_INIT_TIMEOUT_MS 2
> +
> +/* the protocol supports up to 32K in each direction */
> +#define GSC_PROXY_BUFFER_SIZE SZ_32K
> +#define GSC_PROXY_CHANNEL_SIZE (GSC_PROXY_BUFFER_SIZE * 2)
> +#define GSC_PROXY_MAX_MSG_SIZE (GSC_PROXY_BUFFER_SIZE - sizeof(struct intel_gsc_mtl_header))
> +
> +/* FW-defined proxy header */
> +struct intel_gsc_proxy_header
> +{
alan: i thought we typically put the '{' on the same line as the struct name 
> 
alan:snip

> +struct gsc_proxy_msg
> +{
alan: shouldn't the '{' be above?
> + struct intel_gsc_mtl_header header;
> + struct intel_gsc_proxy_header proxy_header;
> +} __packed;
> +
> +static int proxy_send_to_csme(struct intel_gsc_uc *gsc)
> +{
> + struct intel_gt *gt = gsc_uc_to_gt(gsc);
> + struct i915_gsc_proxy_component *comp = gsc->proxy.component;
> + struct intel_gsc_mtl_header *hdr;
> + void *in = gsc->proxy.to_csme;
> + void *out = 
