Re: [PATCH] drm/i915/guc: Correct capture of EIR register on hang
On Fri, 2024-02-23 at 12:32 -0800, john.c.harri...@intel.com wrote:
> From: John Harrison
alan:snip
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> @@ -51,6 +51,7 @@
> 	{ RING_ESR(0), 0, 0, "ESR" }, \
> 	{ RING_DMA_FADD(0), 0, 0, "RING_DMA_FADD_LDW" }, \
> 	{ RING_DMA_FADD_UDW(0), 0, 0, "RING_DMA_FADD_UDW" }, \
> +	{ RING_EIR(0), 0, 0, "EIR" }, \
> 	{ RING_IPEIR(0), 0, 0, "IPEIR" }, \
> 	{ RING_IPEHR(0), 0, 0, "IPEHR" }, \
> 	{ RING_INSTPS(0), 0, 0, "INSTPS" }, \
> @@ -80,9 +81,6 @@
> 	{ GEN8_RING_PDP_LDW(0, 3), 0, 0, "PDP3_LDW" }, \
> 	{ GEN8_RING_PDP_UDW(0, 3), 0, 0, "PDP3_UDW" }
>
> -#define COMMON_BASE_HAS_EU \
> -	{ EIR, 0, 0, "EIR" }
> -
alan:snip

alan: Thanks for catching this one.
Reviewed-by: Alan Previn
Re: [PATCH v9 0/2] Resolve suspend-resume racing with GuC destroy-context-worker
On Wed, 2023-12-27 at 20:55 -0800, Teres Alexis, Alan Previn wrote:
> This series is the result of debugging issues root caused to
> races between the GuC's destroyed_worker_func being triggered
> vs repeating suspend-resume cycles with concurrent delayed
> fence signals for engine-freeing.
>
alan:snip.

alan: I did not receive the CI-premerge email where the following was reported:

  IGT changes
  Possible regressions
  igt@i915_selftest@live@gt_pm: shard-rkl: PASS -> DMESG-FAIL

After going through the error in dmesg and the code, I am confident this failure is not related to the series. This selftest calls rdmsrl functions (which don't do any requests / GuC submissions) but gets a power reply of zero (the reported bug). So this is unrelated.

Hi @"Vivi, Rodrigo", just an FYI note that after the last requested rebase, BAT has now passed twice in a row, so I am confident the failures on rev7 and prior were unrelated and that this series is ready for merging. Thanks again for all your help and patience - this was a long one :)
Re: [PATCH v8 2/2] drm/i915/guc: Close deregister-context race against CT-loss
On Tue, 2023-12-26 at 10:11 -0500, Vivi, Rodrigo wrote:
> On Wed, Dec 20, 2023 at 11:08:59PM +0000, Teres Alexis, Alan Previn wrote:
> > On Wed, 2023-12-13 at 16:23 -0500, Vivi, Rodrigo wrote:
alan:snip
> >
> > alan: Thanks Rodrigo for the RB last week, just quick update:
> > I've cant reproduce the BAT failures that seem to be intermittent
> > on platform and test - however, a noticable number of failures
> > do keep occuring on i915_selftest @live @requests where the
> > last test leaked a wakeref and the failing test hangs waiting
> > for gt to idle before starting its test.
> >
> > i have to debug this further although from code inspection
> > is unrelated to the patches in this series.
> > Hopefully its a different issue.
>
> Yeap, likely not related. Anyway, I'm sorry for not merging
> this sooner. Could you please send a rebased version? This
> on is not applying cleanly anymore.

Hi Rodrigo, I will rebase it as soon as I do a bit more testing. I realize I was using a slightly older GuC; with a newer GuC I am seeing all kinds of failures, but they are trending as not being an issue with this series.
Re: [PATCH v8 2/2] drm/i915/guc: Close deregister-context race against CT-loss
On Wed, 2023-12-13 at 16:23 -0500, Vivi, Rodrigo wrote:
> On Tue, Dec 12, 2023 at 08:57:16AM -0800, Alan Previn wrote:
> > If we are at the end of suspend or very early in resume
> > its possible an async fence signal (via rcu_call) is triggered
> > to free_engines which could lead us to the execution of
> > the context destruction worker (after a prior worker flush).
alan:snip
> >
> > Thus, do an unroll in guc_lrc_desc_unpin and deregister_destroyed_-
> > contexts if guc_lrc_desc_unpin fails due to CT send falure.
> > When unrolling, keep the context in the GuC's destroy-list so
> > it can get picked up on the next destroy worker invocation
> > (if suspend aborted) or get fully purged as part of a GuC
> > sanitization (end of suspend) or a reset flow.
> >
> > Signed-off-by: Alan Previn
> > Signed-off-by: Anshuman Gupta
> > Tested-by: Mousumi Jana
> > Acked-by: Daniele Ceraolo Spurio
>
> Thanks for all the explanations, patience and great work!
>
> Reviewed-by: Rodrigo Vivi

alan: Thanks Rodrigo for the RB last week, just a quick update:
I can't reproduce the BAT failures, which seem to be intermittent across platform and test - however, a noticeable number of failures do keep occurring on i915_selftest @live @requests, where the last test leaked a wakeref and the failing test hangs waiting for the gt to idle before starting its own test.

I have to debug this further, although from code inspection it looks unrelated to the patches in this series. Hopefully it's a different issue.
Re: [PATCH v7 2/2] drm/i915/guc: Close deregister-context race against CT-loss
> As far as i can tell, its only if we started resetting / wedging right after
> this queued worker got started.

alan: hope Daniele can proofread my tracing and confirm if I got it right.
Re: [PATCH v7 2/2] drm/i915/guc: Close deregister-context race against CT-loss
On Thu, 2023-11-30 at 16:18 -0500, Vivi, Rodrigo wrote:
> On Wed, Nov 29, 2023 at 04:20:13PM -0800, Alan Previn wrote:
alan:snip
> > +
> > 	if (unlikely(disabled)) {
> > 		release_guc_id(guc, ce);
> > 		__guc_context_destroy(ce);
> > -		return;
> > +		return 0;
>
> is success the right return case here?

alan: Yes: we may discover "disabled == true" if submission_disabled found that the gt is wedged. I don't believe such a case will happen as part of flushing destroyed_worker_func during suspend, but it may occur as part of regular runtime context closing that just happens to get queued in the middle of a gt-reset/wedging process. In such a case, the reset-prepare code will sanitize everything, including cleaning up the pending-destruction-contexts link-list. So either we pick the context up from here and dump it ... or reset picks it up first and dumps it there (where both dumpings only occur if the GuC got disabled first).

Supplemental: how regular context cleanup leads to the same path:

  i915_sw_fence_notify -> engines_notify -> free_engines_rcu
    -> intel_context_put -> kref_put(&ce->ref..) -> ce->ops->destroy
  (where ce->ops = engine->cops, engine->cops = guc_context_ops,
   and guc_context_ops->destroy == guc_context_destroy)
  so -> ce->ops->destroy -> guc_context_destroy
    -> queue_work(..&submission_state.destroyed_worker)
    -> [eventually] -> the same guc_lrc_desc_unpin above

However, with the additional "if (!intel_guc_is_ready(guc))" in destroyed_worker_func as part of this patch, hitting this "disabled == true" case will be even less likely. As far as I can tell, it's only if we started resetting / wedging right after this queued worker got started. (I ran out of time to check whether a reset can ever occur from within the same system_unbound_wq, but then I recall we also have CI using debugfs to force wedging for select (/all?) igt tests, so I suspect it can still happen in parallel.)

NOTE: the original checking of "is disabled" is not new code - it's the original code.
...alan

P.S. - oh man, that took a lot of code tracing, as I can't remember these paths by heart.
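For readers tracing along: the "disabled == success" decision discussed above can be modeled in a few lines. This is a hypothetical user-space sketch, not the real i915 code (struct names and fields are invented stand-ins): when submission is already disabled (gt wedged), the context is freed locally and 0 is returned, because reset-prepare already owns the cleanup and there is nothing left to deregister with the GuC.

```c
#include <assert.h>
#include <stdbool.h>

/* Invented stand-ins for intel_context / intel_guc state */
struct model_ctx { bool registered; bool destroyed; };
struct model_guc { bool submission_disabled; };

/*
 * Models guc_lrc_desc_unpin's early-out: if submission is disabled
 * (wedged), free locally and report success -- reset already owns cleanup.
 */
static int model_lrc_desc_unpin(struct model_guc *guc, struct model_ctx *ce)
{
    if (guc->submission_disabled) {
        ce->destroyed = true;   /* release_guc_id + __guc_context_destroy */
        return 0;               /* success: nothing left to deregister */
    }
    ce->registered = false;     /* normal path: continue to H2G deregister */
    ce->destroyed = true;
    return 0;
}
```

Either way the caller sees success; the difference is only whether an H2G deregister still needs to happen afterwards.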
Re: [PATCH v2 1/1] drm/i915/pxp: Add missing tag for Wa_14019159160
On Wed, 2023-11-29 at 13:13 -0800, Teres Alexis, Alan Previn wrote:
> On Mon, 2023-11-27 at 15:24 -0500, Vivi, Rodrigo wrote:
> > On Wed, Nov 22, 2023 at 12:30:03PM -0800, Alan Previn wrote:
> alan:snip
> alan: thanks for reviewing and apologize for replying to this late.
> > > 	/*
> > > -	 * On MTL and newer platforms, protected contexts require setting
> > > -	 * the LRC run-alone bit or else the encryption will not happen.
> > > +	 * Wa_14019159160 - Case 2: mtl
> > > +	 * On some platforms, protected contexts require setting
> > > +	 * the LRC run-alone bit or else the encryption/decryption will not happen.
> > > +	 * NOTE: Case 2 only applies to PXP use-case of said workaround.
> > > 	 */
> >
> > hmm, interesting enough, on the wa description I read that it is incomplete for MTL/ARL
> > and something about a fuse bit. We should probably chase for some clarification in the
> > HSD?!
> alan: yes, i went through the HSD description with the architect and we both had agreed
> that from the KMD's perspective. At that time, the checking of the fuse from KMD's
> could be optimized out for Case-2-PXP: if the fuses was set a certain way, KMD can skip setting
> the bit in lrc because hw will enforce runalone in pxp mode irrespective of what KMD does but
> if fuse was not set that way KMD needed to set the runalone in lrc. So for case2, its simpler
> to just turn it on always when the context is protected. However, that said, i realized the
> wording of the HSDs have changed since the original patch was implemented so i will reclarify
> offline and reply back. NOTE: i believe John Harrison is working on case-3 through a
> separate series.

alan: Based on conversations with the architect, a different WA number, 14019725520, has the details of case-2 and explains the relationship to the new fuse. However, as before, it can be optimized as explained above, where PXP contexts need to always set this bit to ensure encryption/decryption engages.
Re: [PATCH v2 1/1] drm/i915/pxp: Add missing tag for Wa_14019159160
On Mon, 2023-11-27 at 15:24 -0500, Vivi, Rodrigo wrote:
> On Wed, Nov 22, 2023 at 12:30:03PM -0800, Alan Previn wrote:
alan:snip

alan: thanks for reviewing, and apologies for replying to this late.

> > 	/*
> > -	 * On MTL and newer platforms, protected contexts require setting
> > -	 * the LRC run-alone bit or else the encryption will not happen.
> > +	 * Wa_14019159160 - Case 2: mtl
> > +	 * On some platforms, protected contexts require setting
> > +	 * the LRC run-alone bit or else the encryption/decryption will not happen.
> > +	 * NOTE: Case 2 only applies to PXP use-case of said workaround.
> > 	 */
>
> hmm, interesting enough, on the wa description I read that it is incomplete for MTL/ARL
> and something about a fuse bit. We should probably chase for some clarification in the
> HSD?!

alan: Yes, I went through the HSD description with the architect, and we both agreed on the interpretation from the KMD's perspective. At that time, the checking of the fuse could be optimized out of the KMD for Case-2-PXP: if the fuses were set a certain way, the KMD could skip setting the bit in the LRC because hw will enforce runalone in pxp mode irrespective of what the KMD does, but if the fuse was not set that way, the KMD needed to set runalone in the LRC. So for case 2, it's simpler to just always turn it on when the context is protected. However, that said, I realized the wording of the HSDs has changed since the original patch was implemented, so I will reclarify offline and reply back. NOTE: I believe John Harrison is working on case-3 through a separate series.

> > -	if (GRAPHICS_VER_FULL(ce->engine->i915) >= IP_VER(12, 70) &&
> > -	    (ce->engine->class == COMPUTE_CLASS || ce->engine->class == RENDER_CLASS)) {
> > +	if (IS_METEORLAKE(ce->engine->i915) && (ce->engine->class == COMPUTE_CLASS ||
> > +						ce->engine->class == RENDER_CLASS)) {
>
> This check now excludes the ARL with the same IP, no?!

alan: yeah - re-revved this part just now.
> > 	rcu_read_lock();
> > 	gem_ctx = rcu_dereference(ce->gem_context);
> > 	if (gem_ctx)
> >
> > base-commit: 5429d55de723544dfc0630cf39d96392052b27a1
> > --
> > 2.39.0
Re: [Intel-gfx] [PATCH v5] drm/i915/pxp: Add drm_dbgs for critical PXP events.
On Fri, 2023-11-24 at 08:30 +0000, Tvrtko Ursulin wrote:
> On 22/11/2023 19:15, Alan Previn wrote:
alan:snip
alan: thanks for reviewing.

> > 	if (iir & GEN12_DISPLAY_STATE_RESET_COMPLETE_INTERRUPT)
> > -		pxp->session_events |= PXP_TERMINATION_COMPLETE;
> > +		pxp->session_events |= PXP_TERMINATION_COMPLETE | PXP_EVENT_TYPE_IRQ;
>
> This looked to be doing more than adding debug logs, but then no one is
> actually consuming this new flag?!

alan: See the hunk below: inside pxp_session_work, the drm_dbg prints the "events" that came from above. We need this debug message to uniquely recognize that contexts' pxp keys got invalidated in response to a KCR IRQ (as opposed to other known flows that are reported by other means, like banning hanging contexts, or as part of suspend, etc). NOTE: pxp->session_events does get set in at least one other path outside the IRQ (which triggers teardown but does not come from KCR hw). This flag is solely for the debug message. Hope this works.

> Regards,
>
> Tvrtko
>
> > 	if (pxp->session_events)
> > 		queue_work(system_unbound_wq, &pxp->session_work);
> >
> > diff --git a/drivers/gpu/drm/i915/pxp/intel_pxp_session.c b/drivers/gpu/drm/i915/pxp/intel_pxp_session.c
> > index 0a3e66b0265e..091c86e03d1a 100644
> > --- a/drivers/gpu/drm/i915/pxp/intel_pxp_session.c
> > +++ b/drivers/gpu/drm/i915/pxp/intel_pxp_session.c
> > @@ -137,8 +137,10 @@ void intel_pxp_terminate(struct intel_pxp *pxp, bool post_invalidation_needs_res
> >   static void pxp_terminate_complete(struct intel_pxp *pxp)
> >   {
> >   	/* Re-create the arb session after teardown handle complete */
> > -	if (fetch_and_zero(&pxp->hw_state_invalidated))
> > +	if (fetch_and_zero(&pxp->hw_state_invalidated)) {
> > +		drm_dbg(&pxp->ctrl_gt->i915->drm, "PXP: creating arb_session after invalidation");
> >   		pxp_create_arb_session(pxp);
> > +	}
> >
> >   	complete_all(&pxp->termination);
> >   }
> > @@ -157,6 +159,8 @@ static void pxp_session_work(struct work_struct *work)
> >   	if (!events)
> >   		return;
> > +	drm_dbg(&pxp->ctrl_gt->i915->drm, "PXP: processing event-flags 0x%08x", events);
> > +
> > 	if (events & PXP_INVAL_REQUIRED)
> > 		intel_pxp_invalidate(pxp);
> >
> > diff --git a/drivers/gpu/drm/i915/pxp/intel_pxp_types.h b/drivers/gpu/drm/i915/pxp/intel_pxp_types.h
> > index 7e11fa8034b2..07864b584cf4 100644
> > --- a/drivers/gpu/drm/i915/pxp/intel_pxp_types.h
> > +++ b/drivers/gpu/drm/i915/pxp/intel_pxp_types.h
> > @@ -124,6 +124,7 @@ struct intel_pxp {
> >   #define PXP_TERMINATION_REQUEST  BIT(0)
> >   #define PXP_TERMINATION_COMPLETE BIT(1)
> >   #define PXP_INVAL_REQUIRED       BIT(2)
> > +#define PXP_EVENT_TYPE_IRQ       BIT(3)
> >   };
> >
> >   #endif /* __INTEL_PXP_TYPES_H__ */
> >
> > base-commit: 5429d55de723544dfc0630cf39d96392052b27a1
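A side note for anyone following the flag plumbing in the thread above: the event word is just an accumulating bitmask, so the IRQ tag rides along with the termination bit and the worker (or anyone reading the logged value) can tell IRQ-driven teardown apart from the other teardown triggers. A tiny stand-alone C sketch of that idea - the bit constants mirror the quoted hunk, but the helper name is invented for illustration:

```c
#include <assert.h>

/* Bit layout mirroring the quoted intel_pxp_types.h hunk */
#define PXP_TERMINATION_REQUEST  (1u << 0)
#define PXP_TERMINATION_COMPLETE (1u << 1)
#define PXP_INVAL_REQUIRED       (1u << 2)
#define PXP_EVENT_TYPE_IRQ       (1u << 3)

/* Hypothetical helper: did this batch of events originate from the KCR IRQ? */
static int teardown_was_irq_driven(unsigned int events)
{
    return (events & PXP_TERMINATION_COMPLETE) && (events & PXP_EVENT_TYPE_IRQ);
}
```

Since the IRQ handler ORs both bits in at once, a single logged hex value like 0x0a is enough to distinguish the KCR-IRQ path from, say, a suspend-driven termination that never sets bit 3.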
Re: [PATCH v6 2/2] drm/i915/guc: Close deregister-context race against CT-loss
On Mon, 2023-11-27 at 16:51 -0500, Vivi, Rodrigo wrote:

alan: Firstly, thanks for taking the time to review this, knowing you have a lot on your plate right now.

> alan:snip
> > @@ -3301,19 +3315,38 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce)
> > 	/* Seal race with Reset */
> > 	spin_lock_irqsave(&ce->guc_state.lock, flags);
> > 	disabled = submission_disabled(guc);
> > -	if (likely(!disabled)) {
> > -		__intel_gt_pm_get(gt);
> > -		set_context_destroyed(ce);
> > -		clr_context_registered(ce);
> > -	}
> > -	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>
> you are now holding this spin lock for too long...

alan: My intention was to ensure that an async H2G request to change this guc-id context's state wouldn't come in while we are in the middle of attempting to send the context-destruction H2G, only to later realize it needed to be unrolled (the corner-case race we are solving here, as discovered on a customer platform). But after discussing with Daniele and John, they agreed that we should move the unlock back to before the deregister and then take the lock again inside the unrolling code (if deregister_context fails).

alan:snip

> > -	deregister_context(ce, ce->guc_id.id);
> > +	/* GuC is active, lets destroy this context,
>
> for multi-line comments you need to start with '/*' only
> and start the real comment below, like:

alan: my bad. will fix.

> 	/*
> 	 * GuC is active, ...
> > +	 * but at this point we can still be racing with
> > +	 * suspend, so we undo everything if the H2G fails
> > +	 */
> > +
> > +	/* Change context state to destroyed and get gt-pm */
> > +	__intel_gt_pm_get(gt);
> > +	set_context_destroyed(ce);
> > +	clr_context_registered(ce);
> > +
> > +	ret = deregister_context(ce, ce->guc_id.id);
> > +	if (ret) {
> > +		/* Undo the state change and put gt-pm if that failed */
> > +		set_context_registered(ce);
> > +		clr_context_destroyed(ce);
> > +		/*
> > +		 * Dont use might_sleep / ASYNC verion of put because
> > +		 * CT loss in deregister_context could mean an ongoing
> > +		 * reset or suspend flow. Immediately put before the unlock
> > +		 */
> > +		__intel_wakeref_put(&gt->wakeref, 0);
>
> interesting enough you use the '__' version to bypass the might_sleep(),
> but also sending 0 as argument what might lead you in the mutex_lock inside
> the spin lock area, what is not allowed.

alan: So one thing to note: the motivation for an alternative unlock was only driven by the fact that we were holding that spinlock. However, as per the review comment and response on the spin lock above, if we move the pm-put outside the spin lock, we can call intel_wakeref_put_async (which will not actually trigger a delayed task, because this function (guc_lrc_desc_unpin) starts with GEM_BUG_ON(!intel_gt_pm_is_awake(gt)), which means it would only decrement the ref count).

alan:snip

> > +		spin_lock_irqsave(&guc->submission_state.lock, flags);
> > +		list_add_tail(&ce->destroyed_link,
> > +			      &guc->submission_state.destroyed_contexts);
> > +		spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> > +		/* Bail now since the list might never be emptied if h2gs fail */
>
> For this GuC interaction part I'd like to get an ack from John Harrison please.

alan: okay - will do. I might alternatively tap on Daniele since he knows the guc just as well.
> alan:snip
> > @@ -3392,6 +3440,17 @@ static void destroyed_worker_func(struct work_struct *w)
> > 	struct intel_gt *gt = guc_to_gt(guc);
> > 	int tmp;
> >
> > +	/*
> > +	 * In rare cases we can get here via async context-free fence-signals that
> > +	 * come very late in suspend flow or very early in resume flows. In these
> > +	 * cases, GuC won't be ready but just skipping it here is fine as these
> > +	 * pending-destroy-contexts get destroyed totally at GuC reset time at the
> > +	 * end of suspend.. OR.. this worker can be picked up later on the next
> > +	 * context destruction trigger after resume-completes
> > +	 */
> > +	if (!intel_guc_is_ready(guc))
> > +		return;
>
> is this racy?

alan: This is to reduce raciness. This worker function eventually calls deregister_destroyed_context, which calls the guc_lrc_desc_unpin that does all the work we discussed above (taking locks, taking refs, sending the H2G, potentially unrolling). So this check is an optimization to skip all of that without going through the locking. So yes, it's a bit racy, but it's trying to reduce raciness.
NOTE1: Without this line of code, in theory everything would still work fine, with the driver potentially experiencing more unroll cases in those thousands of cycles.
NOTE2: This check using intel_guc_is_ready is done in many places in the driver knowing that it doesn't take any locks. Of
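To make the unroll discussion in this thread concrete, here is a small user-space C model of the pattern the patch implements - all names are hypothetical stand-ins, not the kernel code: take the pm ref and flip the context state first, attempt the H2G-style deregister, and if the send fails, undo everything in reverse order so the context can be retried from the destroy-list on a later worker pass.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct model_ctx { bool registered; bool destroyed; int pm_refs; };

/* Stand-in for deregister_context(); fails when the CT transport is down */
static int model_deregister(struct model_ctx *ce, bool ct_alive)
{
    (void)ce;
    return ct_alive ? 0 : -ENODEV;
}

static int model_destroy_with_unroll(struct model_ctx *ce, bool ct_alive)
{
    int ret;

    ce->pm_refs++;              /* __intel_gt_pm_get */
    ce->destroyed = true;       /* set_context_destroyed */
    ce->registered = false;     /* clr_context_registered */

    ret = model_deregister(ce, ct_alive);
    if (ret) {
        /* Unroll in reverse so a later worker pass can retry */
        ce->registered = true;
        ce->destroyed = false;
        ce->pm_refs--;          /* immediate (non-async) wakeref put */
    }
    return ret;
}
```

On success the model deliberately keeps the pm ref held (the real put happens later, when the deregister completes); on failure the context is back exactly where it started, which is what makes re-queueing it on the destroyed-contexts list safe.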
Re: [Intel-gfx] [PATCH v3 1/1] drm/i915/pxp: Add missing tag for Wa_14019159160
On Tue, 2023-11-28 at 10:03 -0800, Roper, Matthew D wrote:
> On Mon, Nov 27, 2023 at 12:11:50PM -0800, Alan Previn wrote:
> > Add missing tag for "Wa_14019159160 - Case 2" (for existing
> > PXP code that ensures run alone mode bit is set to allow
> > PxP-decryption.
alan:snip.
alan: thanks for reviewing.

> > -	if (GRAPHICS_VER_FULL(ce->engine->i915) >= IP_VER(12, 70) &&
> > +	if (GRAPHICS_VER_FULL(ce->engine->i915) == IP_VER(12, 70) &&
>
> The workaround database lists this as being needed on both 12.70 and
> 12.71.  Should this be a
>
>   IS_GFX_GT_IP_RANGE(gt, IP_VER(12, 70), IP_VER(12, 71))

alan: sure - will do this.

> check instead?
>
> The workaround is also listed in the database as applying to DG2; is
> this "case 2" subset of the workaround not relevant to that platform?

alan: I don't believe so - not the pxp case 2.
Re: [PATCH v1 1/1] drm/i915/pxp: Bail early in pxp tee backend on first teardown error
On Thu, 2023-11-16 at 15:20 -0800, Teres Alexis, Alan Previn wrote:
> For Gen12 when using mei-pxp tee backend tranport, if we are coming
> up from a cold boot or from a resume (not runtime resume), we can
> optionally quicken the very first session cleanup that would occur
> as part of starting up a default PXP session. Typically a cleanup
> from a cold-start is expected to be quick so we can use a shorter
> timeout and skip retries (when getting non-success on calling
> backend transport for intel_pxp_tee_end_arb_fw_session).
alan:snip
> @@ -387,10 +389,14 @@ void intel_pxp_tee_end_arb_fw_session(struct intel_pxp *pxp, u32 session_id)
> 	ret = intel_pxp_tee_io_message(pxp,
> 				       &msg_in, sizeof(msg_in),
> 				       &msg_out, sizeof(msg_out),
> -				       NULL);
> +				       NULL, pxp->hw_state_coldstart ?
> +				       PXP_TRANSPORT_TIMEOUT_FAST_MS : PXP_TRANSPORT_TIMEOUT_MS);
>
> -	/* Cleanup coherency between GT and Firmware is critical, so try again if it fails */
> -	if ((ret || msg_out.header.status != 0x0) && ++trials < 3)
> +	/*
> +	 * Cleanup coherency between GT and Firmware is critical, so try again if it
> +	 * fails, unless we are performing a cold-start reset
> +	 */
> +	if ((ret || msg_out.header.status != 0x0) && !pxp->hw_state_coldstart && ++trials < 3)

alan: Take note that I am working offline with sister teams to perform some end-to-end conformance testing with more comprehensive OS stacks before we can verify that this optimization doesn't break any existing use-cases.
Re: [PATCH v1 1/1] drm/i915/gt: Dont wait forever when idling in suspend
On Tue, 2023-11-14 at 08:22 -0800, Teres Alexis, Alan Previn wrote:
> When suspending, add a timeout when calling
> intel_gt_pm_wait_for_idle else if we have a leaked
> wakeref (which would be indicative of a bug elsewhere
> in the driver), driver will at exit the suspend-resume
> cycle, after the kernel detects the held reference and
> prints a message to abort suspending instead of hanging
> in the kernel forever which then requires serial connection
> or ramoops dump to debug further.

NOTE: this patch originates from Patch #3 of this other series https://patchwork.freedesktop.org/series/121916/ (rev 5 and prior) and was decided to be moved out as its own patch, since this patch is trying to improve general debuggability as opposed to resolving the bug being addressed in the above series.

alan:snip
> +int intel_wakeref_wait_for_idle(struct intel_wakeref *wf, int timeout_ms)
> {
> -	int err;
> +	int err = 0;
>
> 	might_sleep();
>
> -	err = wait_var_event_killable(&wf->wakeref,
> -				      !intel_wakeref_is_active(wf));
> +	if (!timeout_ms)
> +		err = wait_var_event_killable(&wf->wakeref,
> +					      !intel_wakeref_is_active(wf));
> +	else if (wait_var_event_timeout(&wf->wakeref,
> +					!intel_wakeref_is_active(wf),
> +					msecs_to_jiffies(timeout_ms)) < 1)
> +		err = -ETIMEDOUT;
> +

alan: Paraphrasing feedback from Tvrtko on the series this patch originates from: it would be a good idea to add error-injection into this timeout to ensure we don't have any other subsystem that could inadvertently leak an rpm wakeref (and to catch such bugs in future pre-merge).
Re: [Intel-gfx] [PATCH v4 3/3] drm/i915/gt: Timeout when waiting for idle in suspending
On Tue, 2023-11-14 at 17:52 +0000, Tvrtko Ursulin wrote:
> On 14/11/2023 17:37, Teres Alexis, Alan Previn wrote:
> > On Tue, 2023-11-14 at 17:27 +0000, Tvrtko Ursulin wrote:
> > > On 13/11/2023 17:57, Teres Alexis, Alan Previn wrote:
> > > > On Wed, 2023-10-25 at 13:58 +0100, Tvrtko Ursulin wrote:
> > > > > On 04/10/2023 18:59, Teres Alexis, Alan Previn wrote:
> > > > > > On Thu, 2023-09-28 at 13:46 +0100, Tvrtko Ursulin wrote:
> > > > > > > On 27/09/2023 17:36, Teres Alexis, Alan Previn wrote:
> alan:snip
> > alan: So i did trace back the gt->wakeref before i posted these patches and
> > see that within these runtime get/put calls, i believe the first 'get' leads
> > to __intel_wakeref_get_first which calls intel_runtime_pm_get via rpm_get
> > helper and eventually executes a pm_runtime_get_sync(rpm->kdev); (hanging off
> > i915_device). (naturally there is a corresponding mirror for the '_put_last').
> > So this means the first-get and last-put lets the kernel know. Thats why when
> > i tested this patch, it did actually cause the suspend to abort from kernel side
> > and the kernel would print a message indicating i915 was the one that didnt
> > release all refs.
>
> Ah that would be much better then.
>
> Do you know if everything gets resumed/restored correctly in that case
> or we would need some additional work to maybe early exit from callers
> of wait_for_suspend()?

alan: So assuming we are still discussing a "potentially new future leaked-wakeref bug" (i.e. putting aside the fact that Patch #1 + #2 resolve this specific series' bug), based on the previous testing we did, after this timeout-bail trigger, the suspend flow bails and gt/guc operation does actually continue as normal.

However, it's been a long time since we tested this, so I am not sure how accidental new future bugs might play into this assumption, especially if some other subsystem leaked the rpm wakeref but that subsystem did NOT get reset the way GuC is reset at the end of suspend.

> What I would also ask is to see if something like injecting a probing
> failure is feasible, so we can have this new timeout exit path
> constantly/regularly tested in CI.

alan: That's a good idea. In line with this, I would like to point out that rev6 of this series has been posted, but I removed this Patch #3. However, I did post this Patch #3 as a standalone patch here: https://patchwork.freedesktop.org/series/126414/ as I anticipate this patch will truly help with future issue debuggability. That said, I shall post a review on that patch with your suggestion to add an injected probe error for the suspend-resume flow and follow up on that one separately.

> Regards,
>
> Tvrtko
>
> > alan: Anyways, i have pulled this patch out of rev6 of this series and created a
> > separate standalone patch for this patch #3 that we review independently.
Re: [Intel-gfx] [PATCH v4 3/3] drm/i915/gt: Timeout when waiting for idle in suspending
On Tue, 2023-11-14 at 12:36 -0500, Vivi, Rodrigo wrote:
> On Tue, Nov 14, 2023 at 05:27:18PM +0000, Tvrtko Ursulin wrote:
> >
> > On 13/11/2023 17:57, Teres Alexis, Alan Previn wrote:
> > > On Wed, 2023-10-25 at 13:58 +0100, Tvrtko Ursulin wrote:
> > > > On 04/10/2023 18:59, Teres Alexis, Alan Previn wrote:
> > > > > On Thu, 2023-09-28 at 13:46 +0100, Tvrtko Ursulin wrote:
> > > > > > On 27/09/2023 17:36, Teres Alexis, Alan Previn wrote:
>
> That's a good point.
> Well, current code is already bad and buggy on suspend-resume. We could get
> suspend stuck forever without any clue of what happened.
> At least the proposed patch add some gt_warn. But, yes, the right thing
> to do should be entirely abort the suspend in case of timeout, besides the
> gt_warn.

alan: Yes - that was the whole idea for Patch #3. Only after putting such code in place did we have much better debuggability on real-world production platforms + configs that may not have a serial port or ramoops dump by default. Btw, as per previous comments by Tvrtko - which I agreed with, I have moved this single patch out into a separate patch on its own. See here: https://patchwork.freedesktop.org/series/126414/ (I also maintained the RB from you, Rodrigo, because the patch did not change). And yes, the gt_warn is still in place :)
Re: [Intel-gfx] [PATCH v4 3/3] drm/i915/gt: Timeout when waiting for idle in suspending
On Tue, 2023-11-14 at 17:27 +0000, Tvrtko Ursulin wrote:
> On 13/11/2023 17:57, Teres Alexis, Alan Previn wrote:
> > On Wed, 2023-10-25 at 13:58 +0100, Tvrtko Ursulin wrote:
> > > On 04/10/2023 18:59, Teres Alexis, Alan Previn wrote:
> > > > On Thu, 2023-09-28 at 13:46 +0100, Tvrtko Ursulin wrote:
> > > > > On 27/09/2023 17:36, Teres Alexis, Alan Previn wrote:
alan: snip
> > alan: I won't say its NOT fixing anything - i am saying it's not fixing
> > this specific bug where we have the outstanding G2H from a context destruction
> > operation that got dropped. I am okay with dropping this patch in the next rev
> > but shall post a separate stand alone version of Patch3 - because I believe
> > all other i915 subsystems that take runtime-pm's DO NOT wait forever when entering
> > suspend - but GT is doing this. This means if there ever was a bug introduced,
> > it would require serial port or ramoops collection to debug. So i think such a
> > patch, despite not fixing this specific bug will be very helpful for debuggability
> > of future issues. After all, its better to fail our suspend when we have a bug
> > rather than to hang the kernel forever.
>
> Issue I have is that I am not seeing how it fails the suspend when
> nothing is passed out from *void* wait_suspend(gt). To me it looks the
> suspend just carries on. And if so, it sounds dangerous to allow it to
> do that with any future/unknown bugs in the suspend sequence. Existing
> timeout wedges the GPU (and all that entails). New code just says "meh
> I'll just carry on regardless".

alan: So I did trace back the gt->wakeref before I posted these patches and saw that within these runtime get/put calls, the first 'get' leads to __intel_wakeref_get_first, which calls intel_runtime_pm_get via the rpm_get helper and eventually executes a pm_runtime_get_sync(rpm->kdev); (hanging off i915_device). (Naturally, there is a corresponding mirror for the '_put_last'.) So this means the first-get and last-put let the kernel know. That's why, when I tested this patch, it did actually cause the suspend to abort from the kernel side, and the kernel would print a message indicating i915 was the one that didn't release all refs.

alan: Anyways, I have pulled this patch out of rev6 of this series and created a separate standalone patch for this patch #3 that can be reviewed independently.
Re: [Intel-gfx] [PATCH v3] drm/i915: Skip pxp init if gt is wedged
On Mon, 2023-11-13 at 14:49 -0800, Zhanjun Dong wrote:
> The gt wedged could be triggered by missing guc firmware file, HW not
> working, etc. Once triggered, it means all gt usage is dead, therefore we
> can't enable pxp under this fatal error condition.
>
alan:skip

alan: This looks good (as per our offline review/discussion): we don't mess with the current driver startup sequence (i.e. pxp failures can never pull down the driver probing and will not generate warnings or errors). Also, if something does break for PXP, we only do a drm_debug if the failure returned is not -ENODEV (since -ENODEV can happen in the majority of cases with legacy products or with non-PXP kernel configs):

Reviewed-by: Alan Previn
Re: [PATCH] drm/i915: Initialize residency registers earlier
On Mon, 2023-10-30 at 16:45 -0700, Belgaumkar, Vinay wrote: alan:skip > +++ b/drivers/gpu/drm/i915/gt/intel_rc6.c > @@ -608,11 +608,13 @@ void intel_rc6_init(struct intel_rc6 *rc6) > /* Disable runtime-pm until we can save the GPU state with rc6 pctx */ > rpm_get(rc6); > > - if (!rc6_supported(rc6)) > - return; > - > rc6_res_reg_init(rc6); > > + if (!rc6_supported(rc6)) { > + rpm_put(rc6); > + return; > + } > + > if (IS_CHERRYVIEW(i915)) > err = chv_rc6_init(rc6); > else if (IS_VALLEYVIEW(i915)) alan: as far as the bug this patch is addressing (i.e. ensuring that intel_rc6_print_residency has valid rc6.res_reg values for correct dprc debugfs output when rc6 is disabled) and release the rpm, this looks good to me. However, when looking at the other code flows around the intel_rc6_init/fini and intel_rc6_enable/disable, i must point out that the calls to rpm_get and rpm_put from these functions don't seem to be designed with proper mirror-ing. For example during driver startup, intel_rc6_init (which is called by intel_gt_pm_init) calls rpm_get at the start but doesn't call rpm_put before it returns. But back up the callstack in intel_gt_init, after intel_gt_pm_init, a couple of subsystems get intialized before intel_gt_resume is called - which in turn calls intel_rc6_enable which does the rpm_put at its end. However before that get and put, i see several error paths that trigger cleanups (leading eventually to driver load failure), but i think some cases are completely missing the put_rpm that intel_rc6_init took. Additionally, the function names of rc6_init and __get_rc6 inside i915_pmu.c seems to be confusing although static. I wish those were named pmu_rc6_init and __pmu_rc6_init and etc. Anyways, as per offline conversation, we are not trying to solve every bug and design gap in this patch but just one specific bug fix. 
So as per the agreed condition that we create a separate internal issue to address this "lack of a clean mirrored-function design of rpm_get/put across the rc6 startup sequences", here is my rb: Reviewed-by: Alan Previn
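The mirrored rpm_get/rpm_put design being asked for above can be sketched in plain C. This is a hypothetical sketch, not the i915 code: fake_rpm stands in for the runtime-pm counter, and sketch_rc6_init mimics the intel_rc6_init shape where every exit path - including the unsupported early-return that this patch fixes and any error path - must drop the reference it took on entry, except the one path that intentionally hands the reference off to a later enable step.

```c
#include <assert.h>

/* Hypothetical stand-in for the runtime-pm reference counter. */
static int fake_rpm = 0;

static void fake_rpm_get(void) { fake_rpm++; }
static void fake_rpm_put(void) { fake_rpm--; }

/* rc6_init-like flow: 0 on success, negative on failure.  The ref is
 * dropped on every exit path except the one that deliberately hands
 * it off to a later enable step (which is expected to put it). */
static int sketch_rc6_init(int supported, int hw_init_ok, int *handed_off)
{
    fake_rpm_get();               /* mirrors rpm_get() at function entry */

    if (!supported) {
        fake_rpm_put();           /* the fix: drop the ref before bailing */
        return 0;
    }

    if (!hw_init_ok) {
        fake_rpm_put();           /* error paths must also unwind */
        return -1;
    }

    *handed_off = 1;              /* ref intentionally kept for enable */
    return 0;
}
```

The point of the sketch is that the counter is provably balanced on every path that does not explicitly hand the reference off.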
Re: [Intel-gfx] [PATCH v4 3/3] drm/i915/gt: Timeout when waiting for idle in suspending
On Wed, 2023-10-25 at 13:58 +0100, Tvrtko Ursulin wrote: > On 04/10/2023 18:59, Teres Alexis, Alan Previn wrote: > > On Thu, 2023-09-28 at 13:46 +0100, Tvrtko Ursulin wrote: > > > On 27/09/2023 17:36, Teres Alexis, Alan Previn wrote: alan:snip > > > It is not possible to wait for lost G2H in something like > > > intel_uc_suspend() and simply declare "bad things happened" if it times > > > out there, and forcibly clean it all up? (Which would include releasing > > > all the abandoned pm refs, so this patch wouldn't be needed.) > > > > > alan: I'm not sure if intel_uc_suspend should be held up by gt-level wakeref > > check unless huc/guc/gsc-uc are the only ones ever taking a gt wakeref. > > > > As we already know, what we do know from a uc-perspective: > > - ensure the outstanding guc related workers is flushed which we didnt > > before > > (addressed by patch #1). > > - any further late H2G-SchedDisable is not leaking wakerefs when calling H2G > > and not realizing it failed (addressed by patch #2). > > - (we already), "forcibly clean it all up" at the end of the > > intel_uc_suspend > > when we do the guc reset and cleanup all guc-ids. (pre-existing upstream > > code) > > - we previously didnt have a coherrent guarantee that "this is the end" > > i.e. no > > more new request after intel_uc_suspend. I mean by code logic, we thought > > we did > > (thats why intel_uc_suspend ends wth a guc reset), but we now know > > otherwise. > > So we that fix by adding the additional rcu_barrier (also part of patch #2). > > It is not clear to me from the above if that includes cleaning up the > outstanding CT replies or no. But anyway.. alan: Apologies, i should have made it clearer by citing the actual function name calling out the steps of interest: So if you look at the function "intel_gt_suspend_late", it calls "intel_uc_suspend" which in turn calls "intel_guc_suspend" which does a soft reset onto guc firmware - so after that we can assume all outstanding G2Hs are done. 
Going back up the stack, intel_gt_suspend_late finally calls gt_sanitize which calls "intel_uc_reset_prepare" which calls "intel_guc_submission_reset_prepare" which calls "scrub_guc_desc_for_outstanding_g2h" which does what we are looking for, for all types of outstanding G2H. As per prior review comments, besides closing the race condition, we were missing an rcu_barrier (which caused new requests to come in way late). So we have resolved both in Patch #2. > > That said, patch-3 is NOT fixing a bug in guc -its about "if we ever have > a future racy gt-wakeref late-leak somewhere - no matter which subsystem > took it (guc is not the only subsystem taking gt wakerefs), we at > least don't hang forever in this code. Ofc, based on that, even without > patch-3 i am confident the issue is resolved anyway. > So we could just drop patch-3 is you prefer? > > .. given this it does sound to me that if you are confident patch 3 > isn't fixing anything today that it should be dropped. alan: I won't say it's NOT fixing anything - i am saying it's not fixing this specific bug where we have the outstanding G2H from a context destruction operation that got dropped. I am okay with dropping this patch in the next rev but shall post a separate standalone version of Patch3 - because I believe all other i915 subsystems that take runtime-pm's DO NOT wait forever when entering suspend - but GT is doing this. This means if there ever was a bug introduced, it would require serial port or ramoops collection to debug. So i think such a patch, despite not fixing this specific bug, will be very helpful for debuggability of future issues. After all, it's better to fail our suspend when we have a bug rather than to hang the kernel forever.
Re: [Intel-gfx] [PATCH] drm/i915: Skip pxp init if gt is wedged
On Fri, 2023-10-27 at 10:13 +0300, Jani Nikula wrote: > On Thu, 26 Oct 2023, Zhanjun Dong wrote: > alan:snip > I'll note that nobody checks intel_pxp_init() return status, so this > silently skips PXP. > > BR, > Jani. alan:snip > > + if (intel_gt_is_wedged(gt)) > > + return -ENODEV; > > + alan: wondering if we can add a drm_dbg in the caller of intel_pxp_init and use a unique return value for the case of gt_is_wedged (for example: -ENXIO). As we know gt being wedged at startup basically means all gt usage is dead and therefore we can't enable PXP (along with everything else that needs submission/ guc/ etc). With a drm-debug in the caller that prints that return value, it helps us to differentiate between gt-is-wedged vs platform config doesn't support PXP. However, this would mean new drm-debug 'noise' for platforms that i915 just doesn't support PXP on at all, which would be okay if we don't use drm_warn or drm_err and use a 'softer' message like "PXP skipped with %d". Please treat above comment as a "nit" - i.e. existing patch is good enough for me... (after addressing Jani's request for more commit message info). ...alan
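The nit above - using a unique return code so one caller-side debug print can tell the cases apart - can be sketched as follows. This is a hypothetical sketch, not the i915 code: sketch_pxp_status and its strings are invented for illustration; only the -ENODEV/-ENXIO distinction mirrors the suggestion in the email.

```c
#include <errno.h>

/* Hypothetical caller-side decode: a single soft debug print can
 * distinguish "gt is wedged" (-ENXIO) from "platform has no PXP"
 * (-ENODEV) without any warn/err noise on unsupported platforms. */
static const char *sketch_pxp_status(int init_ret)
{
    switch (init_ret) {
    case 0:       return "pxp enabled";
    case -ENODEV: return "pxp unsupported on platform";
    case -ENXIO:  return "pxp skipped: gt wedged";
    default:      return "pxp init failed";
    }
}
```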
Re: [Intel-gfx] [PATCH v4 3/3] drm/i915/gt: Timeout when waiting for idle in suspending
On Thu, 2023-09-28 at 13:46 +0100, Tvrtko Ursulin wrote: > On 27/09/2023 17:36, Teres Alexis, Alan Previn wrote: > > Thanks for taking the time to review this Tvrtko, replies inline below. alan:snip > > > > > > Main concern is that we need to be sure there are no possible > > > ill-effects, like letting the GPU/GuC scribble on some memory we > > > unmapped (or will unmap), having let the suspend continue after timing > > > out, and not perhaps doing the forced wedge like wait_for_suspend() does > > > on the existing timeout path. > > alan: this will not happen because the held wakeref is never force-released > > after the timeout - so what happens is the kernel would bail the suspend. > > How does it know to fail the suspend when there is no error code > returned with this timeout? Maybe a stupid question.. my knowledge of > suspend-resume paths was not great even before I forgot it all. alan: Tvrtko, you and I both sir. (apologies for the tardy response - yet again a busy week). So i did trace back the gt->wakeref runtime get/put calls before i posted these patches: i believe the first 'get' leads to __intel_wakeref_get_first which calls intel_runtime_pm_get via the rpm_get helper and eventually executes a pm_runtime_get_sync(rpm->kdev); (hanging off i915). (ofc, there is a corresponding '_put_last') - so non-first, non-last gets just increase the counter for the gt... but this last missed put will mean the kernel knows i915 hasn't 'put' everything. alan:snip > > > > Recap: so in both cases (original vs this patch), if we had a buggy > > gt-wakeref leak, > > we dont get invalid guc-accesses, but without this patch, we wait forever, > > and with this patch, we get some messages and eventually bail the suspend. > > It is not possible to wait for lost G2H in something like > intel_uc_suspend() and simply declare "bad things happened" if it times > out there, and forcibly clean it all up? 
(Which would include releasing > all the abandoned pm refs, so this patch wouldn't be needed.) > alan: I'm not sure if intel_uc_suspend should be held up by a gt-level wakeref check unless huc/guc/gsc-uc are the only ones ever taking a gt wakeref. What we do know from a uc-perspective:
- ensure the outstanding guc related workers are flushed, which we didn't before (addressed by patch #1).
- any further late H2G-SchedDisable is not leaking wakerefs when calling H2G and not realizing it failed (addressed by patch #2).
- (we already) "forcibly clean it all up" at the end of the intel_uc_suspend when we do the guc reset and cleanup all guc-ids. (pre-existing upstream code)
- we previously didn't have a coherent guarantee that "this is the end", i.e. no more new requests after intel_uc_suspend. I mean by code logic, we thought we did (that's why intel_uc_suspend ends with a guc reset), but we now know otherwise. So we fixed that by adding the additional rcu_barrier (also part of patch #2).
That said, patch-3 is NOT fixing a bug in guc - it's about "if we ever have a future racy gt-wakeref late-leak somewhere - no matter which subsystem took it (guc is not the only subsystem taking gt wakerefs), we at least don't hang forever in this code". Ofc, based on that, even without patch-3 i am confident the issue is resolved anyway. So we could just drop patch-3 if you prefer? ...alan
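The "get_first / put_last" shape described in the reply above can be sketched in a few lines. This is a hypothetical sketch, not the i915 code: only the 0-to-1 transition acquires the underlying device-level reference (pm_runtime_get_sync in the real driver) and only the 1-to-0 transition releases it, so a single missing final put is exactly what leaves the device-level ref held and lets the kernel know i915 has not put everything.

```c
/* Hypothetical wakeref sketch: wakeref_count is the gt-level counter,
 * underlying_pm stands in for the kdev runtime-pm reference. */
static int wakeref_count = 0;
static int underlying_pm = 0;

static void sketch_wakeref_get(void)
{
    if (wakeref_count++ == 0)
        underlying_pm++;        /* get_first: take the device-level ref */
}

static void sketch_wakeref_put(void)
{
    if (--wakeref_count == 0)
        underlying_pm--;        /* put_last: release the device-level ref */
}
```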
Re: [PATCH v4 2/3] drm/i915/guc: Close deregister-context race against CT-loss
On Wed, 2023-10-04 at 06:34 +, Gupta, Anshuman wrote: > > > -Original Message- > > From: Teres Alexis, Alan Previn > @@ -289,6 +289,13 @@ int intel_gt_resume(struct intel_gt *gt) > > > > static void wait_for_suspend(struct intel_gt *gt) { > > + /* > > +* On rare occasions, we've observed the fence completion trigger > > +* free_engines asynchronously via rcu_call. Ensure those are done. > > +* This path is only called on suspend, so it's an acceptable cost. > > +*/ > > + rcu_barrier(); > Let's add the barrier after the end of prepare suspend and at start of late > suspend. > To make sure we don't have any async destroy from any user request or any > internal kmd request during i915 suspend? > Br, > Anshuman Gupta. alan: some thoughts: actually wait_for_suspend is being called from both intel_gt_suspend_prepare and intel_gt_suspend_late, so putting the barrier in the above location would get hit for both steps. However, because wait_for_suspend may optionally wedge the system if outstanding requests were stuck for more than I915_GT_SUSPEND_IDLE_TIMEOUT, wouldn't it be better to add the barrier before that check (i.e. in the above location) as opposed to after the return from wait_for_suspend at the end of suspend_prepare - which would defeat the purpose of even checking that intel_gt_wait_for_idle from within wait_for_suspend? (which is existing code). Perhaps we need to get one last test on this?
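The ordering argument above - drain the deferred frees *before* the idle check, or the check is meaningless - can be sketched with a toy deferred-callback queue. This is a hypothetical sketch, not the i915 code: flush_deferred stands in for rcu_barrier(), fake_free_engines for the free_engines callback queued via call_rcu, and busy_count for the outstanding-work state the idle check inspects.

```c
/* Hypothetical deferred-work model. */
#define MAX_DEFERRED 8
static void (*deferred[MAX_DEFERRED])(void);
static int n_deferred = 0;
static int busy_count = 0;

static void queue_deferred(void (*fn)(void)) { deferred[n_deferred++] = fn; }

static void flush_deferred(void)        /* stands in for rcu_barrier() */
{
    for (int i = 0; i < n_deferred; i++)
        deferred[i]();
    n_deferred = 0;
}

static void fake_free_engines(void) { busy_count--; }

/* wait_for_suspend-like ordering: drain deferred frees first, so the
 * idle check that follows sees the true state instead of racing it. */
static int sketch_wait_for_suspend(void)
{
    flush_deferred();
    return busy_count == 0;     /* idle check is now meaningful */
}
```

With the barrier placed after the idle check instead, the check would report busy (or wedge) for work that was already scheduled to be freed.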
Re: [Intel-gfx] [PATCH v4 3/3] drm/i915/gt: Timeout when waiting for idle in suspending
Thanks for taking the time to review this Tvrtko, replies inline below. On Wed, 2023-09-27 at 10:02 +0100, Tvrtko Ursulin wrote: > On 26/09/2023 20:05, Alan Previn wrote: > > When suspending, add a timeout when calling > > intel_gt_pm_wait_for_idle else if we have a lost > > G2H event that holds a wakeref (which would be > > indicative of a bug elsewhere in the driver), > > driver will at least complete the suspend-resume > > cycle, (albeit not hitting all the targets for > > low power hw counters), instead of hanging in the kernel. alan:snip > > { > > + int timeout_ms = CONFIG_DRM_I915_MAX_REQUEST_BUSYWAIT ? : 1; > > CONFIG_DRM_I915_MAX_REQUEST_BUSYWAIT is in ns so assigning it to _ms is > a bit to arbitrary. > > Why not the existing I915_GT_SUSPEND_IDLE_TIMEOUT for instance? alan: good catch, my bad. However I915_GT_SUSPEND_IDLE_TIMEOUT is only half a sec which i feel is too quick. I guess i could create a new #define or a multiple of I915_GT_SUSPEND_IDLE_TIMEOUT? > > /* > > * On rare occasions, we've observed the fence completion trigger > > * free_engines asynchronously via rcu_call. Ensure those are done. > > @@ -308,7 +309,10 @@ static void wait_for_suspend(struct intel_gt *gt) > > intel_gt_retire_requests(gt); > > } > > > > - intel_gt_pm_wait_for_idle(gt); > > + /* we are suspending, so we shouldn't be waiting forever */ > > + if (intel_gt_pm_wait_timeout_for_idle(gt, timeout_ms) == -ETIMEDOUT) > > + gt_warn(gt, "bailing from %s after %d milisec timeout\n", > > + __func__, timeout_ms); > > Does the timeout in intel_gt_pm_wait_timeout_for_idle always comes in > pair with the timeout first in intel_gt_wait_for_idle? alan: Not necessarily, ... IIRC in nearly all cases, the first wait call complete in time (i.e. no more ongoing work) and the second wait does wait only if the last bit of work 'just-finished', then that second wait may take a touch bit longer because of possible async gt-pm-put calls. 
(which appear in several places in the driver as part of regular runtime operation including request completion). NOTE: this not high end workloads so the > > Also, is the timeout here hit from the intel_gt_suspend_prepare, > intel_gt_suspend_late, or can be both? > > Main concern is that we need to be sure there are no possible > ill-effects, like letting the GPU/GuC scribble on some memory we > unmapped (or will unmap), having let the suspend continue after timing > out, and not perhaps doing the forced wedge like wait_for_suspend() does > on the existing timeout path. alan: this will not happen because the held wakeref is never force-released after the timeout - so what happens is the kernel would bail the suspend. > > Would it be possible to handle the lost G2H events directly in the > respective component instead of here? Like apply the timeout during the > step which explicitly idles the CT for suspend (presumably that > exists?), and so cleanup from there once declared a lost event. alan: So yes, we don't solve the problem with this patch - Patch#2 is addressing the root cause - and verification is still ongoing - because its hard to thoroughly test / reproduce. (i.e. Patch 2 could get re-rev'd). Intent of this patch, is to be informed that gt-pm wait failed in prep for suspending and kernel will eventually bail the suspend, that's all. Because without this timed-out version of gt-pm-wait, if we did have a leaky gt-wakeref, kernel will hang here forever and we will need to get serial console or ramoops to eventually discover the kernel's cpu hung error with call-stack pointing to this location. So the goal here is to help speed up future debug process (if let's say some future code change with another async gt-pm-put came along and caused new problems after Patch #2 resolved this time). 
Recap: so in both cases (original vs this patch), if we had a buggy gt-wakeref leak, we don't get invalid guc-accesses, but without this patch, we wait forever, and with this patch, we get some messages and eventually bail the suspend. alan:snip
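The "bail instead of hanging" behaviour recapped above can be sketched as a bounded wait. This is a hypothetical sketch, not the i915 code: sketch_wait_timeout_for_idle is invented for illustration, loop iterations stand in for elapsed milliseconds, and refcount stands in for the gt wakeref count that a leak would keep pinned above zero.

```c
#include <errno.h>

/* Hypothetical bounded idle-wait: returns 0 if the count drops to
 * zero in time, -ETIMEDOUT if a leaked reference never drops - so
 * suspend can fail loudly instead of hanging the kernel forever. */
static int sketch_wait_timeout_for_idle(const int *refcount, int timeout_ms)
{
    for (int elapsed = 0; elapsed < timeout_ms; elapsed++) {
        if (*refcount == 0)
            return 0;           /* went idle in time */
        /* a real implementation would sleep/schedule here */
    }
    return *refcount == 0 ? 0 : -ETIMEDOUT;
}
```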
Re: [PATCH v6 1/3] drm/i915/pxp/mtl: Update pxp-firmware response timeout
On Sat, 2023-09-16 at 10:25 +0800, lkp wrote: > Hi Alan, > > kernel test robot noticed the following build errors: > > [auto build test ERROR on cf1e91e884bb1113c653e654e9de1754fc1d4488] > > All errors (new ones prefixed by >>): > > alan:snip alan: missed building with PXP config after that last change below - will re-rev this. >drivers/gpu/drm/i915/pxp/intel_pxp_gsccs.c: In function 'gsccs_send_message': > > > drivers/gpu/drm/i915/pxp/intel_pxp_gsccs.c:115:52: error: > > > 'GSC_REPLY_LATENCY_MS' undeclared (first use in this function); did you > > > mean 'GSC_HECI_REPLY_LATENCY_MS'? > 115 | > GSC_REPLY_LATENCY_MS); > | > ^~~~ > | > GSC_HECI_REPLY_LATENCY_MS >drivers/gpu/drm/i915/pxp/intel_pxp_gsccs.c:115:52: note: each undeclared identifier is reported only once for each function it appears in >
Re: [PATCH v3] drm/i915/pxp: Add drm_dbgs for critical PXP events.
On Fri, 2023-09-15 at 13:15 -0700, Teres Alexis, Alan Previn wrote: > Debugging PXP issues can't even begin without understanding the preceding > sequence of important events. Add drm_dbg into the most important PXP events. > > v3 : - move gt_dbg to after mutex block in function > i915_gsc_proxy_component_bind. (Vivaik) > v2 : - remove __func__ since drm_dbg covers that (Jani). > - add timeout dbg of the restart from front-end (Alan). > alan:snip > +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_proxy.c > @@ -322,6 +322,7 @@ static int i915_gsc_proxy_component_bind(struct device > *i915_kdev, > gsc->proxy.component = data; > gsc->proxy.component->mei_dev = mei_kdev; > mutex_unlock(&gsc->proxy.mutex); > + gt_dbg(gt, "GSC proxy mei component bound\n"); forgot to include RB from Vivaik, per condition of fixing above hunk, from rev2 dri-devel https://lists.freedesktop.org/archives/dri-devel/2023-September/422858.html : Reviewed-by: Balasubrawmanian, Vivaik
Re: [PATCH v5 3/3] drm/i915/lrc: User PXP contexts requires runalone bit in lrc
On Sat, 2023-09-09 at 15:38 -0700, Teres Alexis, Alan Previn wrote: > On Meteorlake onwards, HW specs require that all user contexts that > run on render or compute engines and require PXP must enforce > run-alone bit in lrc. Add this enforcement for protected contexts. alan:snip > > Signed-off-by: Alan Previn > @@ -860,6 +881,8 @@ static void init_common_regs(u32 * const regs, > if (GRAPHICS_VER(engine->i915) < 11) > ctl |= _MASKED_BIT_DISABLE(CTX_CTRL_ENGINE_CTX_SAVE_INHIBIT | > CTX_CTRL_RS_CTX_ENABLE); > + if (ctx_needs_runalone(ce)) > + ctl |= _MASKED_BIT_ENABLE(BIT(7)); > regs[CTX_CONTEXT_CONTROL] = ctl; > > regs[CTX_TIMESTAMP] = ce->stats.runtime.last; alan: to align intel-gfx ml, Vivaik reviewed this on dri-devel at https://lists.freedesktop.org/archives/dri-devel/2023-September/422875.html - snippet: thus, will rerev this (with the others fixes in this series). >> Can we please get the bit defined in intel_engine_regs.h with a define >> instead of a number identification? >> >> Review completed conditional to the above fix. >> >> Reviewed-by: Balasubrawmanian, Vivaik >> >> <mailto:vivaik.balasubrawmanian at intel.com>
Re: [PATCH v5 2/3] drm/i915/pxp/mtl: Update pxp-firmware packet size
On Sat, 2023-09-09 at 15:38 -0700, Teres Alexis, Alan Previn wrote: > Update the GSC-fw input/output HECI packet size to match > updated internal fw specs. > > Signed-off-by: Alan Previn > --- > drivers/gpu/drm/i915/pxp/intel_pxp_cmd_interface_43.h | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/drivers/gpu/drm/i915/pxp/intel_pxp_cmd_interface_43.h > b/drivers/gpu/drm/i915/pxp/intel_pxp_cmd_interface_43.h > index 0165d38fbead..e017a7d952e9 100644 > --- a/drivers/gpu/drm/i915/pxp/intel_pxp_cmd_interface_43.h > +++ b/drivers/gpu/drm/i915/pxp/intel_pxp_cmd_interface_43.h > @@ -14,8 +14,8 @@ > #define PXP43_CMDID_NEW_HUC_AUTH 0x003F /* MTL+ */ > #define PXP43_CMDID_INIT_SESSION 0x0036 > > -/* PXP-Packet sizes for MTL's GSCCS-HECI instruction */ > -#define PXP43_MAX_HECI_INOUT_SIZE (SZ_32K) > +/* PXP-Packet sizes for MTL's GSCCS-HECI instruction is spec'd at 65K before > page alignment*/ > +#define PXP43_MAX_HECI_INOUT_SIZE (PAGE_ALIGNED(SZ_64K + SZ_1K)) > > /* PXP-Packet size for MTL's NEW_HUC_AUTH instruction */ > #define PXP43_HUC_AUTH_INOUT_SIZE (SZ_4K) Vivaik replied with RB on dri-devel: https://lists.freedesktop.org/archives/dri-devel/2023-September/422862.htmll we connected offline and agreed that his RB can remain standing on condition i fix the PAGE_ALIGNED -> PAGE_ALIGN fix. Thanks Vivaik for reviewing.
Re: [PATCH v5 1/3] drm/i915/pxp/mtl: Update pxp-firmware response timeout
On Sat, 2023-09-09 at 15:38 -0700, Teres Alexis, Alan Previn wrote: > Update the max GSC-fw response time to match updated internal > fw specs. Because this response time is an SLA on the firmware, > not inclusive of i915->GuC->HW handoff latency, when submitting > requests to the GSC fw via intel_gsc_uc_heci_cmd_submit helpers, alan:snip Vivaik replied with RB on dri-devel: https://lists.freedesktop.org/archives/dri-devel/2023-September/422861.html we connected offline and agreed that his RB can remain standing on condition i fix the PAGE_ALIGNED -> PAGE_ALIGN fix. Thanks Vivaik for reviewing.
Re: [PATCH v5 1/3] drm/i915/pxp/mtl: Update pxp-firmware response timeout
On Sat, 2023-09-09 at 15:38 -0700, Teres Alexis, Alan Previn wrote: > Update the max GSC-fw response time to match updated internal > fw specs. Because this response time is an SLA on the firmware, > not inclusive of i915->GuC->HW handoff latency, when submitting > requests to the GSC fw via intel_gsc_uc_heci_cmd_submit helpers, > start the count after the request hits the GSC command streamer. > Also, move GSC_REPLY_LATENCY_MS definition from pxp header to > intel_gsc_uc_heci_cmd_submit.h since its for any GSC HECI packet. > > Signed-off-by: Alan Previn > --- > .../i915/gt/uc/intel_gsc_uc_heci_cmd_submit.c | 20 +-- > .../i915/gt/uc/intel_gsc_uc_heci_cmd_submit.h | 6 ++ > drivers/gpu/drm/i915/pxp/intel_pxp_gsccs.h| 11 ++ > 3 files changed, 31 insertions(+), 6 deletions(-) alan: snip > index 09d3fbdad05a..5ae5c5d9608b 100644 > --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.h > +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.h > @@ -12,6 +12,12 @@ struct i915_vma; > struct intel_context; > struct intel_gsc_uc; > > +#define GSC_HECI_REPLY_LATENCY_MS 350 > +/* > + * Max FW response time is 350ms, but this should be counted from the time > the > + * command has hit the GSC-CS hardware, not the preceding handoff to GuC CTB. > + */ alan: continue to face timeout issues - so increasing this to ~500 to absorb other hw/sw system latencies. this also matches what the gsc-proxy code was doing - so i could use the same macro for that other code path.
Re: [PATCH v5 2/3] drm/i915/pxp/mtl: Update pxp-firmware packet size
On Sat, 2023-09-09 at 15:38 -0700, Teres Alexis, Alan Previn wrote: > Update the GSC-fw input/output HECI packet size to match > updated internal fw specs. > > Signed-off-by: Alan Previn > alan:snip > -/* PXP-Packet sizes for MTL's GSCCS-HECI instruction */ > -#define PXP43_MAX_HECI_INOUT_SIZE (SZ_32K) > +/* PXP-Packet sizes for MTL's GSCCS-HECI instruction is spec'd at 65K before > page alignment*/ > +#define PXP43_MAX_HECI_INOUT_SIZE (PAGE_ALIGNED(SZ_64K + SZ_1K)) alan: silly ctrl-c/v bug on my part - should be PAGE_ALIGN, not ALIGNED > > /* PXP-Packet size for MTL's NEW_HUC_AUTH instruction */ > #define PXP43_HUC_AUTH_INOUT_SIZE (SZ_4K)
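The PAGE_ALIGN vs PAGE_ALIGNED mix-up above is worth spelling out: in the kernel, PAGE_ALIGN(x) rounds a size *up* to the next page boundary (a value), while PAGE_ALIGNED(x) is a boolean test. A minimal sketch of the arithmetic, with the macro re-defined locally for a 4K page (SKETCH_ names are invented here, not the kernel's):

```c
/* Local re-definition mirroring the kernel's PAGE_ALIGN for a 4K page. */
#define SKETCH_PAGE_SIZE 4096UL
#define SKETCH_PAGE_ALIGN(x) \
    (((x) + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1))

/* 64K + 1K = 65K = 66560 bytes; rounded up to 4K pages that is 68K,
 * which is what PXP43_MAX_HECI_INOUT_SIZE should evaluate to. */
```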
Re: [PATCH v1 1/1] drm/i915/pxp: Add drm_dbgs for critical PXP events.
On Mon, 2023-09-11 at 12:26 +0300, Jani Nikula wrote: > On Wed, 06 Sep 2023, Alan Previn wrote: > > Debugging PXP issues can't even begin without understanding precedding > > sequence of events. Add drm_dbg into the most important PXP events. > > > > Signed-off-by: Alan Previn alan:snip > > > + > > + drm_dbg(>ctrl_gt->i915->drm, "PXP: %s invoked", __func__); > > drm_dbg already covers __func__ (via __builtin_return_address(0) in > __drm_dev_dbg), it's redundant. > > Ditto for all added debugs below. My bad - yup - will fix them. Thanks for taking time to review this patch. ...alan >
Re: [PATCH v4 2/3] drm/i915/pxp/mtl: Update pxp-firmware packet size
On Wed, 2023-09-06 at 17:15 -0700, Teres Alexis, Alan Previn wrote: > Update the GSC-fw input/output HECI packet size to match > updated internal fw specs. alan:snip > +++ b/drivers/gpu/drm/i915/pxp/intel_pxp_cmd_interface_43.h > @@ -14,8 +14,8 @@ > > +/* PXP-Packet sizes for MTL's GSCCS-HECI instruction is 65K*/ > +#define PXP43_MAX_HECI_INOUT_SIZE (SZ_64K + SZ_1K) alan: my mistake - didn't fix this before posting - should have been PAGE_ALIGN(65k)
Re: [PATCH v2 2/3] drm/i915/guc: Close deregister-context race against CT-loss
Additional update from the most recent testing. When relying solely on guc_lrc_desc_unpin getting a failure from deregister_context as a means for identifying that we are in the "deregister-context-vs-suspend-late" race, it is too late a location to handle this safely. This is because one of the first things destroyed_worker_func does is to take a gt pm wakeref - which triggers the gt_unpark function that does a whole bunch of other flows including triggering more workers and taking additional refs. That said, it's best to not even call deregister_destroyed_contexts from the worker when !intel_guc_is_ready (ct-is-disabled). ...alan On Fri, 2023-08-25 at 11:54 -0700, Teres Alexis, Alan Previn wrote: > just a follow up note-to-self: > > On Tue, 2023-08-15 at 12:08 -0700, Teres Alexis, Alan Previn wrote: > > On Tue, 2023-08-15 at 09:56 -0400, Vivi, Rodrigo wrote: > > > On Mon, Aug 14, 2023 at 06:12:09PM -0700, Alan Previn wrote: > > > > > [snip] > > in guc_submission_send_busy_loop, we are incrementing the following > that needs to be decremented if the function fails. > > atomic_inc(&guc->outstanding_submission_g2h); > > also, it seems that even with this unroll design - we are still > leaking a wakeref elsewhere. this is despite a cleaner redesign of > flows in function "guc_lrc_desc_unpin" > (discussed earlier that wasn't very readable). > > will re-rev today but will probably need more follow ups > tracking that one more leaking gt-wakeref (one in thousands-cycles) > but at least now we are not hanging mid-suspend.. we bail from suspend > with useful kernel messages. > > > >
Re: [PATCH v2 2/3] drm/i915/guc: Close deregister-context race against CT-loss
just a follow up note-to-self: On Tue, 2023-08-15 at 12:08 -0700, Teres Alexis, Alan Previn wrote: > On Tue, 2023-08-15 at 09:56 -0400, Vivi, Rodrigo wrote: > > On Mon, Aug 14, 2023 at 06:12:09PM -0700, Alan Previn wrote: > > > [snip] in guc_submission_send_busy_loop, we are incrementing the following that needs to be decremented if the function fails. atomic_inc(&guc->outstanding_submission_g2h); also, it seems that even with this unroll design - we are still leaking a wakeref elsewhere. this is despite a cleaner redesign of flows in function "guc_lrc_desc_unpin" (discussed earlier that wasn't very readable). will re-rev today but will probably need more follow ups tracking that one more leaking gt-wakeref (one in thousands-cycles) but at least now we are not hanging mid-suspend.. we bail from suspend with useful kernel messages.
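The note-to-self above is the classic increment-then-unwind pattern: a counter bumped before a send attempt must be decremented again if the send fails, or the "outstanding G2H" accounting leaks and a later wait-for-idle never completes. A hypothetical sketch (names invented, not the i915 code):

```c
/* Hypothetical stand-in for guc->outstanding_submission_g2h. */
static int outstanding_g2h = 0;

/* busy-loop-send-like flow: expect a G2H reply for every successful
 * send; if the send itself fails, no reply will ever come, so the
 * optimistic increment must be unwound. */
static int sketch_send_busy_loop(int send_ok)
{
    outstanding_g2h++;          /* optimistic: expect a G2H reply */

    if (!send_ok) {
        outstanding_g2h--;      /* the missing unwind from the note */
        return -1;
    }
    return 0;
}
```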
Re: [PATCH v2 1/3] drm/i915/guc: Flush context destruction worker at suspend
Thanks again Rodrigo for reviewing and apologies for my tardy replies. We are still testing on shipping platforms and these latest patches seemed to have reduced the frequency and solved the "system hangs" while suspending, but it's still causing issues so we continue to debug. (issue is that it's running thousands of cycles overnight on multiple systems and takes time). [snip] > > @@ -693,6 +693,8 @@ void intel_uc_suspend(struct intel_uc *uc) > > return; > > } > > > > + intel_guc_submission_flush_work(guc); > > + > > what happens if a new job comes exactly here? > This still sounds a bit racy, although this already looks > much cleaner than the previous version. alan: intel_uc_suspend is called from suspend-late or init-failure. and the very next function -> intel_guc_suspend will soft reset the guc and eventually nuke it. this is the most appropriate "latest as possible" place to put this. > > > with_intel_runtime_pm(&uc_to_gt(uc)->i915->runtime_pm, wakeref) { > > err = intel_guc_suspend(guc); > > if (err)
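The ordering argued in the reply above - flush the destroy worker at the latest possible point, immediately before the GuC soft reset, so anything queued after the flush is wiped out by the reset anyway - can be sketched as follows. This is a hypothetical model, not the i915 code: the two flags merely stand in for pending worker items and GuC liveness.

```c
/* Hypothetical suspend-ordering model. */
static int worker_pending = 0;
static int guc_alive = 1;

static void sketch_flush_destroy_worker(void) { worker_pending = 0; }

/* intel_uc_suspend-like shape: drain queued context destructions,
 * then soft-reset the GuC; a job racing in between is closed out by
 * the reset that follows immediately. */
static void sketch_uc_suspend(void)
{
    sketch_flush_destroy_worker();  /* flush at the latest-possible point */
    guc_alive = 0;                  /* intel_guc_suspend: soft reset */
}
```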
Re: [PATCH v3 1/3] drm/i915/pxp/mtl: Update pxp-firmware response timeout
On Tue, 2023-08-15 at 13:29 -0700, Teres Alexis, Alan Previn wrote: > Update the max GSC-fw response time to match updated internal > fw specs. Because this response time is an SLA on the firmware, > not inclusive of i915->GuC->HW handoff latency, when submitting > requests to the GSC fw via intel_gsc_uc_heci_cmd_submit_nonpriv, > start the count after the request hits the GSC command streamer. > > Signed-off-by: Alan Previn > --- > drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.c | 3 +++ > drivers/gpu/drm/i915/pxp/intel_pxp_gsccs.h| 6 +++--- > 2 files changed, 6 insertions(+), 3 deletions(-) > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.c > b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.c > index 89ed5ee9cded..ae45855594ac 100644 > --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.c > +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.c > @@ -186,6 +186,9 @@ intel_gsc_uc_heci_cmd_submit_nonpriv(struct intel_gsc_uc > *gsc, > i915_request_add(rq); > > if (!err) { > + if (wait_for(i915_request_started(rq), 200)) > + drm_dbg(&gsc_uc_to_gt(gsc)->i915->drm, > + "Delay in gsc-heci-non-priv submission to > gsccs-hw"); alan: offline discussion with Daniele, Daniele provided the following review comments: We should add this wait-check for both priv and non-priv but we should increase the timeout to be more than the guaranteed fw response time of 1 other message (since we have a total of 2 contexts that could be sending messages concurrently at any time including this one)... so maybe a timeout of the new GSC_REPLY_LATENCY_MS + 150. More importantly, he highlighted the fact that we should wait for the request-started AND ensure there was no error in the request status. [snip]
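Daniele's feedback above has two parts: a bounded wait for the request to actually start, and then a check of the request's error status rather than only whether it started. A hypothetical sketch (struct and function names invented; loop iterations stand in for the millisecond budget):

```c
#include <errno.h>

/* Hypothetical request model: started flag plus a fence error code. */
struct sketch_request {
    int started;
    int fence_err;      /* 0 = ok, negative = request errored out */
};

/* Poll (bounded) for the request to start, then report its status:
 * "started" alone is not success - the request may still have failed. */
static int sketch_wait_gsc_started(const struct sketch_request *rq,
                                   int budget_ms)
{
    for (int t = 0; t < budget_ms && !rq->started; t++)
        ;               /* a real implementation would sleep here */

    if (!rq->started)
        return -ETIMEDOUT;
    return rq->fence_err;       /* started, but check for errors too */
}
```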
Re: [PATCH v2 2/3] drm/i915/guc: Close deregister-context race against CT-loss
On Tue, 2023-08-15 at 09:56 -0400, Vivi, Rodrigo wrote: > On Mon, Aug 14, 2023 at 06:12:09PM -0700, Alan Previn wrote: > > If we are at the end of suspend or very early in resume > > its possible an async fence signal could lead us to the > > execution of the context destruction worker (after the > > prior worker flush). [snip] > > > @@ -3199,10 +3206,20 @@ static inline void guc_lrc_desc_unpin(struct > > intel_context *ce) > > if (unlikely(disabled)) { > > release_guc_id(guc, ce); > > __guc_context_destroy(ce); > > - return; > > + return 0; > > + } > > + > > + if (deregister_context(ce, ce->guc_id.id)) { > > + /* Seal race with concurrent suspend by unrolling */ > > + spin_lock_irqsave(&ce->guc_state.lock, flags); > > + set_context_registered(ce); > > + clr_context_destroyed(ce); > > + intel_gt_pm_put(gt); > > how sure we are this is not calling unbalanced puts? alan: in this function (guc_lrc_desc_unpin), the summarized flow is: check-status-stuff; if guc-is-not-disabled, take-pm, change ctx-state-bits; [implied else] if guc-is-disabled, scrub-ctx and return. thus deregister_context is only called if we didn't exit, i.e. when guc-is-not-disabled, i.e. after the pm was taken. > could we wrap this in some existent function to make this clear? alan: yeah - not so readable as it is now - let me tweak this function and make it cleaner > > > + spin_unlock_irqrestore(&ce->guc_state.lock, flags); > > + return -ENODEV; > > } > > > > - deregister_context(ce, ce->guc_id.id); > > + return 0; > > }
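The flow summarized in the reply above - pm is taken only on the guc-enabled path, and the failure path unrolls both the state bits and that same pm ref, so no put is ever unbalanced - can be sketched as follows. This is a hypothetical model, not the i915 code: the flags and counters are invented stand-ins for the wakeref and context state bits.

```c
#include <errno.h>

/* Hypothetical stand-ins for the gt wakeref and context state bits. */
static int gt_pm = 0;
static int ctx_registered = 1;
static int ctx_destroyed = 0;

/* unpin-like flow: if guc is disabled, scrub locally (no pm taken);
 * otherwise take pm and flip state, and on a failed deregister H2G
 * (CT disabled by a racing suspend) unroll everything taken above. */
static int sketch_lrc_desc_unpin(int guc_enabled, int send_ok)
{
    if (!guc_enabled)
        return 0;               /* scrub locally; no pm ref was taken */

    gt_pm++;                    /* pm held for the in-flight deregister */
    ctx_registered = 0;
    ctx_destroyed = 1;

    if (!send_ok) {
        /* seal the race with suspend by unrolling */
        ctx_registered = 1;
        ctx_destroyed = 0;
        gt_pm--;                /* put only what this path took */
        return -ENODEV;
    }
    return 0;                   /* pm dropped later, on deregister-done */
}
```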
Re: [PATCH v2 3/3] drm/i915/gt: Timeout when waiting for idle in suspending
Thanks Rodrigo - agreed on everything below - will re-rev. On Tue, 2023-08-15 at 09:51 -0400, Vivi, Rodrigo wrote: > On Mon, Aug 14, 2023 at 06:12:10PM -0700, Alan Previn wrote: > > When suspending, add a timeout when calling > > intel_gt_pm_wait_for_idle else if we have a lost > > G2H event that holds a wakeref (which would be > > [snip] > > @@ -301,7 +303,10 @@ static void wait_for_suspend(struct intel_gt *gt) > > intel_gt_retire_requests(gt); > > } > > > > - intel_gt_pm_wait_for_idle(gt); > > + /* we are suspending, so we shouldn't be waiting forever */ > > + if (intel_gt_pm_wait_timeout_for_idle(gt, timeout_ms) == -ETIME) > > you forgot to change the error code here..^ > > but maybe we don't even need this here and a simple > if (intel_gt_pm_wait_timeout_for_idle(gt, timeout_ms)) should be enough > since the error from the killable one is unlikely and the only place > I error I could check on that path would be a catastrophic -ERESTARTSYS. > > but up to you. alan: my bad - I'll fix it - but i agree with not needing to check the failure type. and I'll keep the error the same ("bailing from...") [snip] > > +static inline int intel_gt_pm_wait_timeout_for_idle(struct intel_gt *gt, > > int timeout_ms) > > +{ > > + return intel_wakeref_wait_for_idle(&gt->wakeref, timeout_ms); > > } > > I was going to ask why you created a single use function here, but then I noticed the above one. So it makes sense. > Then I was going to ask why in here you didn't use the same change of > timeout = 0 in the existent function like you used below, but then I > noticed that the above function is called in multiple places and the > patch with this change is much cleaner and the function is static inline > so your approach was good here. alan: yes that was my exact reason - thus no impact across other callers. [snip] > Please add a documentation for this function making sure you have the > following > mentions: alan: good catch - will do. 
> > /** > [snip] > * @timeout_ms: Timeout in ms, 0 means never timeout. > * > * Returns 0 on success, -ETIMEDOUT upon a timeout, or the unlikely > * error propagation from wait_var_event_killable if timeout_ms is 0. > */ > > with the return error fixed above and the documentation in place: > > Reviewed-by: Rodrigo Vivi > > > -int intel_wakeref_wait_for_idle(struct intel_wakeref *wf) > > +int intel_wakeref_wait_for_idle(struct intel_wakeref *wf, int timeout_ms) > > { > > - int err; > > [snip]
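The timeout semantics settled on in this thread (timeout_ms == 0 keeps the original killable wait forever, a non-zero timeout bails out with -ETIMEDOUT) can be sketched as a self-contained userspace model. This is illustrative only - the mock state and simplified wait primitive below are invented for the example and are not the kernel implementation:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Invented stand-in for the wakeref state the kernel waits on. */
static bool wakeref_active;

/*
 * Models wait_var_event_timeout(): returns >0 (time remaining) when the
 * condition is already true, 0 when the timeout expires first. In this
 * model the condition never changes while waiting.
 */
static long mock_wait_timeout(long timeout)
{
	return wakeref_active ? 0 : timeout;
}

/*
 * Sketch of the agreed intel_wakeref_wait_for_idle() semantics:
 * timeout_ms == 0 keeps the killable wait (modeled here as success),
 * while a non-zero timeout maps an expired wait to -ETIMEDOUT.
 */
static int wait_for_idle(int timeout_ms)
{
	if (!timeout_ms)
		return 0; /* models wait_var_event_killable() completing */
	if (mock_wait_timeout(timeout_ms) < 1)
		return -ETIMEDOUT;
	return 0;
}
```

A caller in the suspend path would then treat any non-zero return as "bail from the wait", matching the suggestion above to not distinguish the failure type.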
Re: [PATCH v2 0/3] Resolve suspend-resume racing with GuC destroy-context-worker
On Mon, 2023-08-14 at 18:12 -0700, Teres Alexis, Alan Previn wrote: > This series is the result of debugging issues root caused to > races between the GuC's destroyed_worker_func being triggered > vs repeating suspend-resume cycles with concurrent delayed > fence signals for engine-freeing. alan: forgot credit: Tested-by: Mousumi Jana alan:snip. > > > Alan Previn (3): > drm/i915/guc: Flush context destruction worker at suspend > drm/i915/guc: Close deregister-context race against CT-loss > drm/i915/gt: Timeout when waiting for idle in suspending > > drivers/gpu/drm/i915/gt/intel_engine_cs.c | 2 +- > drivers/gpu/drm/i915/gt/intel_gt_pm.c | 7 ++- > drivers/gpu/drm/i915/gt/intel_gt_pm.h | 7 ++- > .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 45 +-- > .../gpu/drm/i915/gt/uc/intel_guc_submission.h | 2 + > drivers/gpu/drm/i915/gt/uc/intel_uc.c | 2 + > drivers/gpu/drm/i915/intel_wakeref.c | 14 -- > drivers/gpu/drm/i915/intel_wakeref.h | 5 ++- > 8 files changed, 71 insertions(+), 13 deletions(-) > > > base-commit: 85f20fb339f05ec4221bb295c13e46061c5c566f
Re: [Intel-gfx] [PATCH v1 1/3] drm/i915/guc: Flush context destruction worker at suspend
> > > Rodrigo: And why here and not some upper layer? like in prepare > alan: wait_for_suspend does both the checking for idle as well as the > potential > wedging if guc or hw has hung at this late stage. if i call from the upper > layer before this wait_for_suspend, it might be too early because the > wait-for-idle could experience workloads completing and new > context-deregistrations > being queued. If i call it from upper layer after wait_for_suspend, then it > would > be unnecessary if wait_for_suspend decided to nuke everything. Hmmm.. but i > guess > the latter could work too - since the nuke case would have emptied out the > link-list > of that worker and so it would either run and do nothing or would not even be > queued. > Would you rather i go that way? (i'll recheck with my team mates for > i-dotting/t-crossing > discussion.) actually, after going up a layer, i realize the right place might be to insert late stage worker-flushing into intel_uc_suspend (called from intel_gt_suspend_late) which is also where the gsc worker is flushed. This will also mean we don't need to create intel_uc_suspend_prepare for new plumbing.
Re: [PATCH v1] drm/i915/pxp/mtl: Update gsc-heci cmd size and timeout
On Fri, 2023-07-07 at 11:34 -0700, Teres Alexis, Alan Previn wrote: > Update the max GSC-HECI packet size and the max firmware > response timeout to match internal fw specs. > > Signed-off-by: Alan Previn I'm going to re-rev this and change the subject slightly to "Update gsc-heci cmd submission to align with spec". The two changes in this patch will be included but the cmd-packet size is now to be increased to 64K. Also, the counting of the timeout needs to start from when the request has hit the command streamer hardware (not from when the PXP subsystem has thrown it over the wall to the GuC). Although this latter change would imply a longer timeout period, the change to observe this longer timeout should be applicable to code that is actually triggering a PXP session creation/teardown. In addition to that, we also need to update the LRC common defaults to include forcing the runalone bit for PXP contexts (as that is the updated hardware spec'd expectation). ...alan
Re: [PATCH v1 2/3] drm/i915/guc: Close deregister-context race against CT-loss
On Wed, 2023-08-02 at 16:35 -0700, Teres Alexis, Alan Previn wrote: > If we are at the end of suspend or very early in resume > it's possible an async fence signal could lead us to the > execution of the context destruction worker (after the > prior worker flush). > alan:snip > > static void __guc_context_destroy(struct intel_context *ce) > @@ -3270,7 +3287,20 @@ static void deregister_destroyed_contexts(struct > intel_guc *guc) > if (!ce) > break; > > - guc_lrc_desc_unpin(ce); > + if (guc_lrc_desc_unpin(ce)) { > + /* > + * This means GuC's CT link severed mid-way which only > happens > + * in suspend-resume corner cases. In this case, put the > + * context back into the destroyed_contexts list which > will > + * get picked up on the next context deregistration > event or > + * purged in a GuC sanitization event > (reset/unload/wedged/...). > + */ > + spin_lock_irqsave(&guc->submission_state.lock, flags); > + list_add_tail(&ce->destroyed_link, > + > &guc->submission_state.destroyed_contexts); alan: i completely missed the fact this new code is sitting within a while (!list_empty(&guc->submission_state.destroyed_contexts)) block so putting it back will cause the while loop to spin forever. will fix and rerev. > + spin_unlock_irqrestore(&guc->submission_state.lock, > flags); > + } > + > } > } >
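The bug Alan spots above - re-adding a failed entry onto the very list a while(!list_empty(...)) loop is draining - can be modeled in a few lines of self-contained userspace C. All names here are invented for the illustration (the real code operates on guc->submission_state.destroyed_contexts under a spinlock); the point is that the re-queue must be paired with a break:

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_CTX 8

static int destroyed[MAX_CTX]; /* models the destroyed_contexts list */
static int n_destroyed;
static bool ct_link_up;        /* models whether the GuC CT link is alive */

/* Models guc_lrc_desc_unpin(): fails when the CT link is severed. */
static int unpin(int ce)
{
	(void)ce;
	return ct_link_up ? 0 : -1;
}

/* Returns iterations executed; the watchdog bound stands in for "hangs". */
static int drain_destroyed_contexts(void)
{
	int iters = 0;

	while (n_destroyed > 0 && iters < 1000) {
		int ce = destroyed[--n_destroyed]; /* pop one context */

		iters++;
		if (unpin(ce)) {
			/* CT severed mid-way: park the context for the next
			 * deregistration event or GuC sanitization... */
			destroyed[n_destroyed++] = ce;
			break; /* ...and stop, or this loop spins forever */
		}
	}
	return iters;
}
```

Without the break, the failed pop/re-push pair repeats indefinitely, which is exactly the "while loop forever" hazard called out in the reply.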
Re: [PATCH v1 1/3] drm/i915/guc: Flush context destruction worker at suspend
Thanks Rodrigo for reviewing this. On Mon, 2023-08-07 at 13:52 -0400, Vivi, Rodrigo wrote: > On Wed, Aug 02, 2023 at 04:34:59PM -0700, Alan Previn wrote: > > Suspend is not like reset, it can unroll, so we have to properly > > flush pending context-guc-id deregistrations to complete before > > we return from suspend calls. > > But if it 'unrolls' the execution should just continue, no?! > In other words, why is this flush needed? What happens if we > don't flush, but resume doesn't proceed? in which case > of resume you are thinking that it returns and not having flushed? alan: I apologize, i realize I should have explained it better. The flush is needed for when we DON'T unroll. I wanted to point out that in intel_gt_suspend_foo we don't actually know if the suspend is going to unroll or not so we should always flush properly in either case. I will re-rev with better comment on this patch. alan:snip > > > > > static void wait_for_suspend(struct intel_gt *gt) > > { > > - if (!intel_gt_pm_is_awake(gt)) > > + if (!intel_gt_pm_is_awake(gt)) { > > + intel_uc_suspend_prepare(&gt->uc); > > why only on idle? alan: actually no - i am flushing for both idle and non-idle cases (see farther below) but for the non-idle case, i am skipping the flush if we decide to wedge (since the wedge will clear up everything -> all guc-contexts that are inflight are nuked and freed). > > Well, I know, if we are in idle it is because all the requests had > already ended and gt will be wedged, but why do we need to do anything > if we are in idle? Because the issue seen is a very late context-deregister triggered by an orthogonal thread via the following callstack: drm-fence signal triggers a FENCE_FREE via __i915_sw_fence_notify that connects to engines_notify -> free_engines_rcu -> (thread) -> intel_context_put -> kref_put(&ce->ref..) that queues the context-destruction worker. 
That said, even though the gt is idle because the last workload has been retired, a context-destruction worker may have just been queued. > > And why here and not some upper layer? like in prepare alan: wait_for_suspend does both the checking for idle as well as the potential wedging if guc or hw has hung at this late stage. if i call from the upper layer before this wait_for_suspend, it might be too early because the wait-for-idle could experience workloads completing and new context-deregistrations being queued. If i call it from upper layer after wait_for_suspend, then it would be unnecessary if wait_for_suspend decided to nuke everything. Hmmm.. but i guess the latter could work too - since the nuke case would have emptied out the link-list of that worker and so it would either run and do nothing or would not even be queued. Would you rather i go that way? (i'll recheck with my team mates for i-dotting/t-crossing discussion.) > > > return; > > + } > > > > if (intel_gt_wait_for_idle(gt, I915_GT_SUSPEND_IDLE_TIMEOUT) == -ETIME) > > { > > /* > > @@ -299,6 +301,8 @@ static void wait_for_suspend(struct intel_gt *gt) > > */ > > intel_gt_set_wedged(gt); > > intel_gt_retire_requests(gt); > > + } else { > > + intel_uc_suspend_prepare(&gt->uc); > > } > > > > intel_gt_pm_wait_for_idle(gt); alan:snip
Re: [Intel-gfx] [PATCH v1 3/3] drm/i915/gt: Timeout when waiting for idle in suspending
Thanks Rodrigo for reviewing this. On Mon, 2023-08-07 at 13:56 -0400, Vivi, Rodrigo wrote: > On Wed, Aug 02, 2023 at 04:35:01PM -0700, Alan Previn wrote: > > When suspending, add a timeout when calling > > intel_gt_pm_wait_for_idle else if we have a lost > > G2H event that holds a wakeref (which would be > > indicative of a bug elsewhere in the driver), we > > get to complete the suspend-resume cycle, albeit > > without all the lower power hw counters hitting > > their targets, instead of hanging in the kernel. > > alan:snip > > -int intel_wakeref_wait_for_idle(struct intel_wakeref *wf) > > +int intel_wakeref_wait_for_idle(struct intel_wakeref *wf, int timeout_ms) > > { > > - int err; > > + int err = 0; > > > > might_sleep(); > > > > - err = wait_var_event_killable(&wf->wakeref, > > - !intel_wakeref_is_active(wf)); > > + if (!timeout_ms) > > + err = wait_var_event_killable(&wf->wakeref, > > + !intel_wakeref_is_active(wf)); > > + else if (wait_var_event_timeout(&wf->wakeref, > > + !intel_wakeref_is_active(wf), > > + msecs_to_jiffies(timeout_ms)) < 1) > > + err = -ETIME; > > it looks to me that -ETIMEDOUT would be a better error. alan: okay - will do, thanks.
Re: [PATCH v4 1/1] drm/i915/pxp: Optimize GET_PARAM:PXP_STATUS
On Wed, 2023-08-02 at 11:25 -0700, Teres Alexis, Alan Previn wrote: > After recent discussions with Mesa folks, it was requested > that we optimize i915's GET_PARAM for the PXP_STATUS without > changing the UAPI spec. > > Add these additional optimizations: >- If any PXP initialization flow failed, then ensure that > we catch it so that we can change the returned PXP_STATUS > from "2" (i.e. 'PXP is supported but not yet ready') > to "-ENODEV". This typically should not happen and if it > does, we have a platform configuration issue. >- If a PXP arbitration session creation event failed > due to incorrect firmware version or blocking SOC fusing > or blocking BIOS configuration (platform reasons that won't > change if we retry), then reflect that blockage by also > returning -ENODEV in the GET_PARAM:PXP_STATUS. >- GET_PARAM:PXP_STATUS should not wait at all if PXP is > supported but non-i915 dependencies (component-driver / > firmware) are still pending completion of the init flows. > In this case, just return "2" immediately (i.e. 'PXP is > supported but not yet ready'). > > Difference from prior revs: v3: - Rebase with latest tip that has pulled in setting the > gsc fw load to fail if proxy init fails. > v2: - Use a #define for the default readiness timeout (Vivaik). > - Improve comments around the failing of proxy-init. > v1: - Change the commit msg style to be imperative. (Jani) > - Rename timeout to timeout_ms. (Jani) > - Fix is_fw_err_platform_config to use higher order > param (pxp) first. (Jani) > > Signed-off-by: Alan Previn alan: Daniele pointed out that i removed Vivaik's RB from rev-3. The difference from this rev vs rev3 is a hunk of code got removed and went into a different patch (that reused the same code, is higher priority - CI related, and had to go first). So this rev is identical to rev3 except a hunk that has been removed. I've checked with Vivaik and he is good with keeping his R-b since the rest of the patch is the same. 
Thus repasting his R-b from rev3 (Thanks Daniele and Vivaik): Reviewed-by: Balasubrawmanian, Vivaik
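The three outcomes described in the commit message above (permanent -ENODEV, immediate "2" without waiting, and ready) can be captured as a small decision function. This is a userspace model of the behavior as described in the text, not the driver code - the enum and function names are invented for the illustration:

```c
#include <assert.h>
#include <errno.h>

/*
 * Model of the GET_PARAM:PXP_STATUS decision tree from the commit
 * message: -ENODEV when a retry cannot help (init failure or platform
 * blockage), 2 for "supported but not yet ready" (returned immediately,
 * with no waiting on non-i915 dependencies), 1 once ready.
 */
enum pxp_model_state {
	PXP_MODEL_INIT_FAILED,      /* a PXP init flow failed: config issue  */
	PXP_MODEL_PLATFORM_BLOCKED, /* fw version / SOC fusing / BIOS blocks */
	PXP_MODEL_PENDING_DEPS,     /* component driver / fw still pending   */
	PXP_MODEL_READY,
};

static int pxp_status_param(enum pxp_model_state state)
{
	switch (state) {
	case PXP_MODEL_INIT_FAILED:
	case PXP_MODEL_PLATFORM_BLOCKED:
		return -ENODEV; /* permanent: retrying will not change it */
	case PXP_MODEL_PENDING_DEPS:
		return 2;       /* supported but not yet ready; do not wait */
	case PXP_MODEL_READY:
		return 1;
	}
	return -ENODEV;
}
```

The key design point argued in the thread is that only genuinely permanent conditions map to -ENODEV; a merely in-progress init must not block or error, so userspace (Mesa) can poll cheaply.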
Re: [PATCH v1 0/3] Resolve suspend-resume racing with GuC destroy-context-worker
On Wed, 2023-08-02 at 16:34 -0700, Teres Alexis, Alan Previn wrote: > This series is the result of debugging issues root caused to > races between the GuC's destroyed_worker_func being triggered vs > repeating suspend-resume cycles with concurrent delayed > fence signals for engine-freeing. > > The reproduction steps require that an app is created right before > the start of the suspend cycle where it creates a new gem > context and submits a tiny workload that would complete in the > middle of the suspend cycle. However this app uses dma-buffer > sharing or dma-fence with non-GPU objects or signals that > eventually triggers a FENCE_FREE via __i915_sw_fence_notify that > connects to engines_notify -> free_engines_rcu -> > intel_context_put -> kref_put(&ce->ref..) that queues the > worker after the GuC's CTB has been disabled (i.e. after > i915-gem's suspend-late flows). > As an FYI - in offline conversations with John and Daniele, we have agreed that at least the first two of the patches in this series are necessary improvements but the last patch may remain open as further offline debug is continuing to pin down the src of the above fence-signal-flow. For now we are hoping to proceed with reviewing the first two patches and only look into the 3rd patch if there is system-level fence signalling that truly can trigger this anomaly or if it's just a straddling request somewhere within i915 that has appeared or hung at the wrong time which needs to be fixed. alan:snip
Re: [PATCH] drm/i915/pxp/mtl: intel_pxp_init_hw needs runtime-pm inside pm-complete
> > > > alan:snip Thanks Vinay and Daniele - i'll respin with below fix. > > @@ -48,7 +50,8 @@ void intel_pxp_resume_complete() > > if (!HAS_ENGINE(pxp->ctrl_gt, GSC0) && !pxp->pxp_component) > > return; > > > > - intel_pxp_init_hw(pxp); > > + with_intel_runtime_pm(&pxp->ctrl_gt->i915->runtime_pm, wakeref) > > This is called from within the rpm resume path, so you can't do an rpm > get or it will deadlock. Maybe have: > > __pxp_resume_complete(struct intel_pxp *pxp, bool needs_rpm); > > intel_pxp_resume_complete(..) > { > return __pxp_resume_complete(pxp, true); > } > > intel_pxp_runtime_resume(..) > { > return __pxp_resume_complete(pxp, false); > } > > > or something like that. > Daniele
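The pattern Daniele suggests above - one shared helper with a needs_rpm flag, so the system-resume path takes a runtime-pm wakeref while the runtime-resume path (already inside rpm resume, where another rpm get would deadlock) does not - can be modeled in userspace with simple counters. The names and counters below are invented for the illustration:

```c
#include <assert.h>
#include <stdbool.h>

static int rpm_refs;  /* models the runtime-pm wakeref count */
static int hw_inits;  /* counts intel_pxp_init_hw()-style calls */

static void pxp_init_hw(void)
{
	hw_inits++;
}

static void __pxp_resume_complete(bool needs_rpm)
{
	if (needs_rpm) {
		rpm_refs++;    /* models with_intel_runtime_pm() entry */
		pxp_init_hw();
		rpm_refs--;    /* ...and scope exit */
	} else {
		pxp_init_hw(); /* caller already holds the rpm reference */
	}
}

/* System resume: safe to grab a wakeref around the hw init. */
static void pxp_resume_complete(void)
{
	__pxp_resume_complete(true);
}

/* Runtime resume: invoked from the rpm path itself, so no rpm get. */
static void pxp_runtime_resume(void)
{
	__pxp_resume_complete(false);
}
```

Both entry points reach the same hw-init work; only the wakeref handling differs, which is exactly what keeps the runtime path deadlock-free.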
Re: [PATCH v6] drm/i915/selftest/gsc: Ensure GSC Proxy init completes before selftests
On Thu, 2023-07-20 at 14:52 -0700, Ceraolo Spurio, Daniele wrote: > > On 7/20/2023 2:40 PM, Alan Previn wrote: > > On MTL, if the GSC Proxy init flows haven't completed, submissions to the > > GSC engine will fail. Those init flows are dependent on the mei's > > gsc_proxy component that is loaded in parallel with i915 and a > > worker that could potentially start after i915 driver init is done. > > > > That said, all subsystems that access the GSC engine today do check > > for such init flow completion before using the GSC engine. However, > > selftests currently don't wait on anything before starting. > > > > To fix this, add a waiter function at the start of __run_selftests > > that waits for gsc-proxy init flows to complete. Selftests shouldn't > > care if the proxy-init failed as that should be flagged elsewhere. > > > > Difference from prior versions: > > v6: - Add a helper that returns something more than a boolean > > so the selftest can stop waiting if proxy-init hadn't > > completed but failed (Daniele). alan:snip > > > +int intel_gsc_uc_fw_proxy_get_status(struct intel_gsc_uc *gsc) > > +{ > > + if (!(IS_ENABLED(CONFIG_INTEL_MEI_GSC_PROXY))) > > + return -ENODEV; > > + if (!intel_uc_fw_is_loadable(&gsc->fw)) > > + return -ENODEV; > > + if (__intel_uc_fw_status(&gsc->fw) == INTEL_UC_FIRMWARE_LOAD_FAIL) > > You're missing the change to move the FW status to LOAD_FAIL if the > proxy fails to initialize. Or are you expecting > https://patchwork.freedesktop.org/series/118723/, which included that > change, to be merged first? > > Daniele alan: as per our offline sync, I'll respin this one and move it away from the other patch (since this is more critical) and we can respin the other after this is done so we get a smooth merge. Also, as i move that "change fw status to fail" from that PXP patch to this patch, I'll fix that issue where i missed the 2nd failure point in the proxy init flow. Thanks for your help. :)
Re: [Intel-gfx] [PATCH v4] drm/i915/selftest/gsc: Ensure GSC Proxy init completes before selftests
On Wed, 2023-07-12 at 10:19 +0100, Tvrtko Ursulin wrote: > On 11/07/2023 23:02, Alan Previn wrote: > > On MTL, if the GSC Proxy init flows haven't completed, submissions to the > > GSC engine will fail. Those init flows are dependent on the mei's > > gsc_proxy component that is loaded in parallel with i915 and a > > worker that could potentially start after i915 driver init is done. > > > > That said, all subsystems that access the GSC engine today do check > > for such init flow completion before using the GSC engine. However, > > selftests currently don't wait on anything before starting. > > > > > > alan:snip > > + /* > > +* The gsc proxy component depends on the kernel component driver load > > ordering > > +* and in corner cases (the first time after an IFWI flash), > > init-completion > > +* firmware flows take longer. > > +*/ > > + unsigned long timeout_ms = 8000; > > + > > + if (need_to_wait && > > + (wait_for(intel_gsc_uc_fw_proxy_init_done(&i915->media_gt->uc.gsc, > > true), > > + timeout_ms))) > > + pr_info(DRIVER_NAME "Timed out waiting for > > gsc_proxy_completion!\n"); > > Would it make sense to error out here? Or at least upgrade to pr_warn or > something? alan: agree on pr_warn (especially since need_to_wait being true and we tried for 8 secs - this should never happen). > > I didn't quite understand the points Daniele raised about engine loops > and resets - in my mind GSC engine is this special thing exercised for > highly specialized operations and not touched in random for_each_engine > loop tests, but I also did not really look so might be totally wrong. alan: after consulting with Daniele further, the concern in the case of having gsc-proxy-init mid-execution while other selftests are running (when thinking of tests that have nothing to do with GSC but have indirect effects like memory-pressure, engine-resets, etc) is that the proxy-init will bail out and print an error but that would result in CI reporting a false-negative. 
We can't skip that error since CONFIG_INTEL_MEI_GSC_PROXY tells us that the kernel owner wants GSC-PROXY working. > > In any case, v4 reads clear - no confusing comments and not > over-engineered so is acceptable to me. > alan: thanks Tvrtko. > P.S. Maybe the check *could* be moved to i915_live_selftests, where hw > dependencies conceptually fit better, and maybe i915_perf_selftests > would need it too then (?), but it is up to you. alan: i can do this quickly as a rev5 (after a bit of manual checking to see if perf needs it) > > Maybe even in the array selftests/i915_live_selftests.h if we could add > a facility to make unskippable tests and add this one after the sanity > check. Which would then achieve the same generalized thing you had in > the previous version without needing to add a new array/loop. alan: i'd rather not attempt this as part of the current patch but will consider it as a separate patch if you are okay with that?
Re: [Intel-gfx] [PATCH v3] drm/i915/selftest/gsc: Ensure GSC Proxy init completes before selftests
On Tue, 2023-07-11 at 11:49 -0700, Ceraolo Spurio, Daniele wrote: > > > > > @@ -134,6 +193,8 @@ static int __run_selftests(const char *name, > > > >{ > > > > int err = 0; > > > > > > > > + __wait_on_all_system_dependencies(data); > > > Why does this need to be top level selftests and not just a wait for > > > intel_gsc_uc_fw_proxy_init_done in the tests where it is relevant, via > > > some helper or something? > > Alan: it was an offline decision because we didn't want to repeat > > the same check for all permutations of selftests' subtests (i.e. considering > > module params can dictate to skip some subtests but execute others). > > > > Anyways, let me get back to you on how many selftests' subtests > > actually exercise the > > need for proxy-init to complete - if it's just 1-to-2 subtests I'll remove the code > > from here and move it into the individual subtests. > > I don't think it is going to be easy to figure out which selftests are > impacted. All selftests looping on all engines of course, but also tests > triggering GT resets and/or messing with the system in other ways. Any > new tests added will also need to be evaluated. > > IMO there is minimal impact of having this check on every test. When > running selftests we load i915 after the rest of the system has already > fully booted, so there are no delays in getting the mei component up and > therefore proxy init is sometimes completed even before the selftest > code starts; when we do have to wait, it's usually for a very short > time, because the expected total execution time for the GSC worker when > not having to wait for the mei component to load is ~750ms (~200ms for > GSC load + 20ms for HuC auth + ~500ms for proxy init). Having a few > seconds added to the total selftests runtime is IMO a better option than > having to maintain a list of impacted tests. > > Daniele > Thanks Daniele - I completely forgot about reset or other system disruptive tests. 
For now I'll re-rev to address Tvrtko's other comments but will keep the waiter as 'once-top-down' for now and wait for Tvrtko's thoughts on that next rev. ...alan
Re: [Intel-gfx] [PATCH v3] drm/i915/selftest/gsc: Ensure GSC Proxy init completes before selftests
Thanks for reviewing Tvrtko, below are my responses. I'll rerev without generalized func ptr and only for the subtests that need it. ...alan On Thu, 2023-06-29 at 22:44 +0100, Tvrtko Ursulin wrote: > On 29/06/2023 21:42, Alan Previn wrote: > > On MTL, if the GSC Proxy init flows haven't completed, submissions to the > > GSC engine will fail. Those init flows are dependent on the mei's > > gsc_proxy component that is loaded in parallel with i915 and a > > worker that could potentially start after i915 driver init is done. > > > > That said, all subsystems that access the GSC engine today do check > > for such init flow completion before using the GSC engine. However, > > selftests currently don't wait on anything before starting. alan:snip > > +static int > > +__wait_gsc_proxy_completed(struct drm_i915_private *i915, > > + unsigned long timeout_ms) > > +{ > > + bool need_to_wait = (IS_ENABLED(CONFIG_INTEL_MEI_GSC_PROXY) && > > +i915->media_gt && > > +HAS_ENGINE(i915->media_gt, GSC0) && > > + > > intel_uc_fw_is_loadable(&i915->media_gt->uc.gsc.fw)); > > + > > + /* > > +* For gsc proxy component loading + init, we need a much longer timeout > > +* than what CI selftest infrastructure currently uses. This longer wait > > +* period depends on the kernel config and component driver load > > ordering > > +*/ > > How is a CI timeout value relevant? > > Plus from the commit message it sounds like the point of wait is so > submission to gsc does not fail if loading is still in progress, not > that the CI times out. So what is the main problem? Alan: The comment was meant to explain why we override the CI selftest timeout (an input param to the generalized func ptr loop) to something much larger specially for gsc-proxy-waiting. However, since your other review comment below is to remove the generalization, this comment therefore will not make sense so I'll remove it accordingly. 
The point was that CI might have a system level selftest timeout of something much smaller like 500 milliseconds (used to have some control over the execution time), but for the gsc-proxy waiting, it's not in i915's control but depends on the kernel component driver loading flow (and on rare occasions, after a fresh IFWI was flashed which causes a 1-time longer period for fw-proxy flows to complete). In any case, I'll remove the comment as per your direction. > > > + if (timeout_ms < 8000) > > + timeout_ms = 8000; > > + > alan:snip > > +static int __wait_on_all_system_dependencies(struct drm_i915_private *i915) > > +{ > > + struct __startup_waiter *waiter = all_startup_waiters; > > + int count = ARRAY_SIZE(all_startup_waiters); > > + int ret; > > + > > + if (!waiter || !count || !i915) > > + return 0; > > Ugh. > > If it ever becomes an empty array just zap this whole code and not have > these checks. Alan: Okay sure - will remove these checks except the i915 - see below. > > Also, no i915 is a possibility? Alan: i915_mock_selftests passes in NULL for i915. This checking of the i915 aligns with the existing __run_selftests code - but in that function the param is called "data" even though in all callers of __run_selftests, that "data" is actually i915 when it's not null. > > But actually.. please don't add the function table generalization unless > it is already known something else is coming to be plugged into it. Alan: Okay- I'll remove it. alan:snip
Anyways, let me get back to you on how many selftests' subtests actually exercise the need for proxy-init to complete - if it's just 1-to-2 subtests I'll remove the code from here and move it into the individual subtests. alan:snip
Re: [Intel-gfx] [PATCH] drm/i915/pxp: Optimize GET_PARAM:PXP_STATUS
On Tue, 2023-06-20 at 09:30 -0500, Balasubrawmanian, Vivaik wrote: > On 6/1/2023 12:45 PM, Alan Previn wrote: > > After recent discussions with Mesa folks, it was requested > > that we optimize i915's GET_PARAM for the PXP_STATUS without > > changing the UAPI spec. > > > > This patch adds these additional optimizations: > > - If any PXP initialization flow failed, then ensure that > > we catch it so that we can change the returned PXP_STATUS > > from "2" (i.e. 'PXP is supported but not yet ready') > > to "-ENODEV". This typically should not happen and if it > > does, we have a platform configuration issue. > > - If a PXP arbitration session creation event failed > > due to incorrect firmware version or blocking SOC fusing > > or blocking BIOS configuration (platform reasons that won't > > change if we retry), then reflect that blockage by also > > returning -ENODEV in the GET_PARAM-PXP_STATUS. > > - GET_PARAM:PXP_STATUS should not wait at all if PXP is > > supported but non-i915 dependencies (component-driver / > > firmware) are still pending completion of the init flows. > > In this case, just return "2" immediately (i.e. 'PXP is > > supported but not yet ready'). 
> > > > Signed-off-by: Alan Previn > > --- > > drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c | 11 +- > > drivers/gpu/drm/i915/i915_getparam.c | 2 +- > > drivers/gpu/drm/i915/pxp/intel_pxp.c | 25 ++ > > drivers/gpu/drm/i915/pxp/intel_pxp.h | 2 +- > > drivers/gpu/drm/i915/pxp/intel_pxp_gsccs.c | 7 +++--- > > drivers/gpu/drm/i915/pxp/intel_pxp_tee.c | 7 +++--- > > drivers/gpu/drm/i915/pxp/intel_pxp_types.h | 9 > > 7 files changed, 50 insertions(+), 13 deletions(-) > > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c > > b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c > > index fb0984f875f9..4dd744c96a37 100644 > > --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c > > @@ -42,8 +42,17 @@ static void gsc_work(struct work_struct *work) > > } > > > > ret = intel_gsc_proxy_request_handler(gsc); > > - if (ret) > > + if (ret) { > > + if (actions & GSC_ACTION_FW_LOAD) { > > + /* > > +* a proxy request failure that came together > > with the > > +* firmware load action means the last part of > > init has > > +* failed so GSC fw won't be usable after this > > +*/ > > + intel_uc_fw_change_status(>fw, > > INTEL_UC_FIRMWARE_LOAD_FAIL); > > + } > > goto out_put; > > + } > > > > /* mark the GSC FW init as done the first time we run this */ > > if (actions & GSC_ACTION_FW_LOAD) { > > On the huc authentication comment block above this snippet, the last > statement: "Note that we can only do the GSC auth if the GuC auth was" > is confusing as the code below is only dealing with HuC Authentication. alan: i believe what he meant was "can only do the GSC-based auth if the GuC-based auth"... but I can't change that code as part of this patch - I believe the rules for kernel patch is to ensure each single patch is modular (not mixing unrelated changes) and focuses just on what its described to do. IIRC, we would need to create a separate patch review for that change. 
> > This function seems to have a section to deal with FW load action and > another to deal with SW Proxy requests, but we seem to be mixing both > actions in the SW proxy section. instead, can we move this call to > intel_gsc_proxy_request_handler to the FW load section itself instead of > handling it as an additional check in the SW_proxy section? In the same > vein, we should also move the intel_uc_fw_change_status() call into the > above FW Load action section. i think that way the code reads better. alan: GSC_ACTION_FW_LOAD is used for loading the GSC firmware which is a one-time thing per i915 load. However, GSC_ACTION_SW_PROXY events can happen any time the GSC fw needs to communicate with CSE firmware (or vice versa) due to platform events (possibly not triggered by i915) long after init. However, the rule is after GSC FW is loaded, i915 is required to do a 1-time proxy-init step to prime both GSC and CSE fws that proxy comms are available. without this step, we can't use the gsc-fw for other ops. So to recap the rules: 1. we launch the worker to do the one-time GSC firmware load. 2. after the GSC firmware load is successful, we have to do a one-time SW-proxy init. -> this is why we add the GSC_ACTION_SW_PROXY flag on successful load completion. 3. If we are doing proxy-handling for the very first time, we ensure -> FW status is only set to RUNNING if proxy init was
Re: [PATCH v3] drm/xe/guc: Fix h2g_write usage of GUC_CTB_MSG_MAX_LEN
On Wed, 2023-06-28 at 21:44 +, Brost, Matthew wrote: > On Wed, Jun 28, 2023 at 11:17:18AM -0700, Alan Previn wrote: > > In the ABI header, GUC_CTB_MSG_MIN_LEN is '1' because > > GUC_CTB_HDR_LEN is 1. This aligns with H2G/G2H CTB specification > > where all command formats are defined in units of dwords so that '1' > > is a dword. Accordingly, GUC_CTB_MSG_MAX_LEN is 256-1 (i.e. 255 > > dwords). However, h2g_write was incorrectly assuming that > > GUC_CTB_MSG_MAX_LEN was in bytes. Fix this. > alan:snip > > diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c > > index 22bc9ce846db..aa04b5c4822f 100644 > > --- a/drivers/gpu/drm/xe/xe_guc_ct.c > > +++ b/drivers/gpu/drm/xe/xe_guc_ct.c > > @@ -401,19 +401,21 @@ static int h2g_write(struct xe_guc_ct *ct, const u32 > > *action, u32 len, > > { > > struct xe_device *xe = ct_to_xe(ct); > > struct guc_ctb *h2g = >ctbs.h2g; > > - u32 cmd[GUC_CTB_MSG_MAX_LEN / sizeof(u32)]; > > - u32 cmd_len = len + GUC_CTB_HDR_LEN; > > - u32 cmd_idx = 0, i; > > +#define H2G_CT_HEADERS (GUC_CTB_HDR_LEN + 1) /* one DW CTB header and one > > DW HxG header */ > > Hate to nit pick but again this should be above the h2g_write per > feedback from Oden on Xe in general. > > Otherwise LGTM. > > With the nit addressed: > > Reviewed-by: Matthew Brost Thanks for reviewing. My bad on the #define - you mentioned that before. Will fix that now. ...alan
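The unit confusion this patch fixes is easy to demonstrate with plain arithmetic. The sketch below is a self-contained illustration of the constants as described in the commit message (lengths defined in dwords, with GUC_CTB_HDR_LEN = 1 meaning one dword), not the actual Xe driver code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * The CTB ABI defines lengths in dwords: GUC_CTB_HDR_LEN is 1 (one
 * dword) and GUC_CTB_MSG_MAX_LEN is 256 - 1 = 255 dwords. A u32 message
 * buffer should therefore be sized directly in those units; dividing by
 * sizeof(u32) as if the constant were a byte count shrinks it by 4x.
 */
#define GUC_CTB_HDR_LEN     1u          /* one dword */
#define GUC_CTB_MSG_MIN_LEN GUC_CTB_HDR_LEN
#define GUC_CTB_MSG_MAX_LEN (256u - 1u) /* 255 dwords */

/* Correct: capacity expressed in dwords, as the ABI intends. */
static const size_t correct_dwords = GUC_CTB_MSG_MAX_LEN;

/* Buggy: treats the dword count as bytes, leaving room for only 63. */
static const size_t buggy_dwords = GUC_CTB_MSG_MAX_LEN / sizeof(uint32_t);
```

With the buggy sizing, any H2G action longer than 63 dwords (minus the header dwords) would overflow or be rejected even though the ABI permits up to 255.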
Re: [PATCH v2 3/5] drm/i915/mtl/gsc: query the GSC FW for its compatibility version
On Mon, 2023-06-05 at 19:24 -0700, Ceraolo Spurio, Daniele wrote: > The compatibility version is queried via an MKHI command. Right now, the > only existing interface is 1.0 > This is basically the interface version for the GSC FW, so the plan is > to use it as the main tracked version, including for the binary naming > in the fetch code. > > v2: use define for the object size (Alan) > > Cc: Alan Previn > Cc: John Harrison > Reviewed-by: Alan Previn #v1 Diff'd v1 vs v2 of this patch - differences are good. So my R-b still holds: Reviewed-by: Alan Previn > ---
Re: [PATCH v1] drm/i915/gsc: take a wakeref for the proxy-init-completion check
On Thu, 2023-06-08 at 11:19 -0700, Ceraolo Spurio, Daniele wrote: > On 6/8/2023 11:04 AM, Alan Previn wrote: > > Ensure intel_gsc_uc_fw_init_done and intel_gsc_uc_fw_proxy_init > > takes a wakeref before reading GSC Shim registers. alan:snip > > > bool intel_gsc_uc_fw_proxy_init_done(struct intel_gsc_uc *gsc) > > { > > struct intel_uncore *uncore = gsc_uc_to_gt(gsc)->uncore; > > - u32 fw_status = intel_uncore_read(uncore, GSC_FW_STATUS_REG); > > + intel_wakeref_t wakeref; > > + u32 fw_status; > > + > > + with_intel_runtime_pm(uncore->rpm, wakeref) > > + fw_status = intel_uncore_read(uncore, GSC_FW_STATUS_REG); > > I think this could be moved to an helper (gsc_uc_get_fw_status?), so we > don't have to re-do the wakeref in all the callers. alan: thanks - will fix.
Re: [Intel-xe] [v1] drm/xe/guc: Fix h2g_write usage of GUC_CTB_MSG_MAX_LEN
On Thu, 2023-06-08 at 00:12 +, Teres Alexis, Alan Previn wrote: > On Fri, 2023-06-02 at 11:52 -0700, Alan Previn wrote: > > alan: good point - i will go back and find if we have this value internally > spec'd before we continue. alan: actually, even if the spec allowed very large cmds (as it's internally spec'd to allow that), we shouldn't consume the stack like that - will fix it that way.
Re: [PATCH v2 2/5] drm/i915/mtl/gsc: extract release and security versions from the gsc binary
Everything looks good to me, so Reviewed-by: Alan Previn On Mon, 2023-06-05 at 19:23 -0700, Ceraolo Spurio, Daniele wrote: > The release and security versions of the GSC binary are not used at > runtime to decide interface compatibility (there is a separate version > for that), but they're still useful for debug, so it is still worth > extracting them and printing them out in dmesg. alan:snip > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_binary_headers.h > b/drivers/gpu/drm/i915/gt/uc/intel_gsc_binary_headers.h > index 714f0c256118..ad80afcafd23 100644 > --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_binary_headers.h > +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_binary_headers.h alan:snip > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c > b/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c alan:snip > @@ -42,6 +43,157 @@ bool intel_gsc_uc_fw_init_done(struct intel_gsc_uc *gsc) alan:snip > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.h > b/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.h alan:snip > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h > b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h alan:snip > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_huc_fw.c > b/drivers/gpu/drm/i915/gt/uc/intel_huc_fw.c alan:snip > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c > b/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c alan:snip > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.h > b/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.h alan:snip > /* > @@ -289,6 +290,8 @@ static inline u32 intel_uc_fw_get_upload_size(struct > intel_uc_fw *uc_fw) alan:snip
Re: [v2] drm/i915/selftest/gsc: Ensure GSC Proxy init completes before selftests
On Thu, 2023-06-08 at 18:14 +, Dong, Zhanjun wrote: > See my comments below. > > > -Original Message- > > From: Alan Previn alan:snip > > +static int > > +__wait_gsc_proxy_completed(struct drm_i915_private *i915, > > + unsigned long timeout_ms) > > +{ > > + bool need_to_wait = (IS_ENABLED(CONFIG_INTEL_MEI_GSC_PROXY) > > && > > +i915->media_gt && > > +HAS_ENGINE(i915->media_gt, GSC0) && > > +intel_uc_fw_is_loadable(>media_gt- > > > uc.gsc.fw)); > > + > > + /* > > +* For gsc proxy component loading + init, we need a much longer > > timeout > > +* than what CI selftest infrastrucutre currently uses. This longer wait > > +* period depends on the kernel config and component driver load > > ordering > > +*/ > > + if (timeout_ms < 8000) > > + timeout_ms = 8000; > > > Lgtm, just an concern about the fixed number here, shall we set the minimal > here, or let i915_selftest.timeout_ms take control? Thus no longer need > coding change here in the future. > > Reviewed-by: Zhanjun Dong Thanks Zhanjun. Unfortunately, based on internal testing, the i915_selftest.timeout_ms default is low enough that it occasionally times out in CI. From experience, with a lean Ubuntu config, it typically takes ~1 second for the mei-gsc-sw-proxy component driver to load after i915 loads. Since CI regularly unloads and reloads i915, the timeout observed ends up being reported as an issue. 8 seconds was based on internal testing of the worst-case scenario - which hardly ever happens. We've only seen the full 8 seconds when the kernel config enables very many SoC IP drivers and component drivers (seen in at least one customer config) or if the MTL board IFWI was only just reflashed (a one-off 8 seconds, we suspect due to the firmware doing additional steps).
Re: [v1] drm/xe/guc: Fix h2g_write usage of GUC_CTB_MSG_MAX_LEN
On Fri, 2023-06-02 at 11:52 -0700, Alan Previn wrote: > In the ABI header, GUC_CTB_MSG_MIN_LEN is '1' because > GUC_CTB_HDR_LEN is 1. This aligns with H2G/G2H CTB specification > where all command formats are defined in units of dwords so that '1' > is a dword. Accordingly, GUC_CTB_MSG_MAX_LEN is 256-1 (i.e. 255 > dwords). However, h2g_write was incorrectly assuming that > GUC_CTB_MSG_MAX_LEN was in bytes. Fix this. > > alan:snip > The patch itself make sense but concerned about the value of > GUC_CTB_MSG_MAX_LEN. Is the max size of CTB really 256 DWs? That seems > way to large. Also I think some tools are going to complain with the > stack being this large too. > > Matt alan: good point - i will go back and find if we have this value internally spec'd before we continue.
Re: [PATCH v3] drm/i915/mtl/gsc: Add a gsc_info debugfs
On Mon, 2023-06-05 at 21:32 -0700, Ceraolo Spurio, Daniele wrote: > Add a new debugfs to dump information about the GSC. This includes: > > - the FW path and SW tracking status; > - the release, security and compatibility versions; > - the HECI1 status registers. > > Note that those are the same registers that the mei driver dumps in > their own status sysfs on DG2 (where mei owns the GSC). > alan:snip all looks good. (ofc we do have to follow up on those 2 actions from last rev's conversation that we agreed should be a separate patch). For now, this patch is Reviewed-by: Alan Previn
Re: [PATCH v2 1/5] drm/i915/gsc: fixes and updates for GSC memory allocation
On Mon, 2023-06-05 at 19:23 -0700, Ceraolo Spurio, Daniele wrote: > A few fixes/updates are required around the GSC memory allocation and it > is easier to do them all at the same time. The changes are as follows: > > 1 - Switch the memory allocation to stolen memory. We need to avoid > accesses from GSC FW to normal memory after the suspend function has > completed and to do so we can either switch to using stolen or make sure > the GSC is gone to sleep before the end of the suspend function. Given > that the GSC waits for a bit before going idle even if there are no > pending operations, it is easier and quicker to just use stolen memory. > > 2 - Reduce the GSC allocation size to 4MBs, which is the POR requirement. > The 8MBs were needed only for early FW and I had misunderstood that as > being the expected POR size when I sent the original patch. > > 3 - Perma-map the GSC allocation. This isn't required immediately, but it > will be needed later to be able to quickly extract the GSC logs, which are > inside the allocation. Since the mapping code needs to be rewritten due to > switching to stolen, it makes sense to do the switch immediately to avoid > having to change it again later. > > Note that the explicit setting of CACHE_NONE for Wa_22016122933 has been > dropped because that's the default setting for stolen memory on !LLC > platforms. > > v2: only memset the memory we're not overwriting (Alan) > LGTM so .. Reviewed-by: Alan Previn
Re: [PATCH 2/6] drm/i915/uc/gsc: fixes and updates for GSC memory allocation
On Tue, 2023-05-23 at 08:21 -0700, Ceraolo Spurio, Daniele wrote: > > > > > > +static int gsc_allocate_and_map_vma(struct intel_gsc_uc *gsc, u32 size) > > alan:snip > > > + obj = i915_gem_object_create_stolen(gt->i915, size); > > > + if (IS_ERR(obj)) > > > + return PTR_ERR(obj); > > > + > > > + vma = i915_gem_object_ggtt_pin(obj, NULL, 0, 0, 0); > > alan: should we be passing in the PIN_MAPPABLE flag into the last param? > > No, PIN_MAPPABLE is only for legacy platform that used the aperture BAR > for stolen mem access via GGTT. MTL doesn't have it and stolen is > directly accessible via the LMEM BAR (which is actually the same BAR 2, > but now behaves differently). alan: thanks - sounds good - i forgot about those differentiations > > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h > > > @@ -18,6 +18,7 @@ struct intel_gsc_uc { > > > > > > /* GSC-specific additions */ > > > struct i915_vma *local; /* private memory for GSC usage */ > > > + void __iomem *local_vaddr; /* pointer to access the private memory */ > > alan:nit: relooking at the these variable names that originate from > > last year's patch you worked on introducing gsc_uc, i am wondering now > > if we should rename "local" to "privmem" and local_vaddr becomes > > privmem_vaddr. > > (no significant reason other than improving readibility of the code) > > IIRC I used local because one of the GSC docs referred to it that way. I > don't mind the renaming, but I don't think it should be done as part of > this patch. alan: sure - sounds good.
Re: [PATCH 3/6] drm/i915/uc/gsc: extract release and security versions from the gsc binary
On Fri, 2023-05-26 at 18:27 -0700, Ceraolo Spurio, Daniele wrote: > > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_meu_headers.h > > > b/drivers/gpu/drm/i915/gt/uc/intel_gsc_meu_headers.h > > > index d55a66202576..8bce2b8aed84 100644 > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_meu_headers.h > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_meu_headers.h alan:snip > > alan: structure layout seems unnecessarily repetitive... why not -> > > struct partition_info { > > u32 offset; > > u32 size; > > }; > > struct intel_gsc_layout_pointers { > > u8 rom_bypass_vector[16]; > > ... > > struct partition_info datap; > > struct partition_info bootregion[5]; > > struct partition_info trace; > > }__packed; > > > > I've just realized that I didn't reply to this comment. The specs have > the structure defined that way, so I tried to keep a 1:1 match like we > usually do. I think switching to a partition_info structure is ok, but > I'll avoid the array because the GSC partition are 1-based, which could > cause confusion (i.e. partition boot1 would be bootregion[0]). alan: sure - that's fine - let's stick to aligning with the spec definitions
Re: [PATCH 6/6] drm/i915/uc/gsc: Add a gsc_info debugfs
On Wed, 2023-05-31 at 17:25 -0700, Ceraolo Spurio, Daniele wrote: > > On 5/26/2023 3:57 PM, Teres Alexis, Alan Previn wrote: > > On Fri, 2023-05-05 at 09:04 -0700, Ceraolo Spurio, Daniele wrote: > > > Add a new debugfs to dump information about the GSC. This includes: > > alan:snip > > Actually everything looks good except for a couple of questions + asks - > > hope we can close on this patch in next rev. > > > > > - the FW path and SW tracking status; > > > - the release, security and compatibility versions; > > > - the HECI1 status registers. > > > > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c > > > b/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c > > > index 0b6dcd982b14..3014e982aab2 100644 > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c > > > @@ -12,36 +12,31 @@ > > > #include "intel_gsc_fw.h" > > > #include "intel_gsc_meu_headers.h" > > > #include "intel_gsc_uc_heci_cmd_submit.h" > > > - > > > -#define GSC_FW_STATUS_REG_MMIO(0x116C40) > > > -#define GSC_FW_CURRENT_STATE REG_GENMASK(3, 0) > > > -#define GSC_FW_CURRENT_STATE_RESET 0 > > > -#define GSC_FW_PROXY_STATE_NORMAL 5 > > > -#define GSC_FW_INIT_COMPLETE_BIT REG_BIT(9) > > > +#include "i915_reg.h" > > > > > alan:snip > > > > alan: btw, just to be consistent with other top-level "intel_foo_is..." > > checking functions, > > why don't we take the runtime wakeref inside the following functions and > > make it easier for any callers? > > (just like what we do for "intel_huc_is_authenticated"): > > static bool gsc_is_in_reset(struct intel_uncore *uncore) > > bool intel_gsc_uc_fw_proxy_init_done(struct intel_gsc_uc *gsc) > > bool intel_gsc_uc_fw_init_done(struct intel_gsc_uc *gsc) > > The idea was that we shouldn't check the FW status if we're not planning > to do something with it, in which case we should already have a wakeref. > HuC is a special case because userspace can query that when the HW is > idle. 
This said, I have nothing against adding an extra wakeref , but I > don't think it should be in this patch. alan: i believe intel_pxp_gsccs_is_ready_for_sessions is being used in a similar way, where one of its uses is to check huc-status and gsc-proxy status without actually doing any operation. If you still want to keep it this way, then I'll have to update the PXP code. Or perhaps you could help me fix that on the PXP side? alan:snip > > > > > +void intel_gsc_uc_load_status(struct intel_gsc_uc *gsc, struct > > > drm_printer *p) > > > +{ > > > + struct intel_gt *gt = gsc_uc_to_gt(gsc); > > > + struct intel_uncore *uncore = gt->uncore; > > > + intel_wakeref_t wakeref; > > > + > > > + if (!intel_gsc_uc_is_supported(gsc)) { > > alan: this was already checked in caller so we'll never get here. i think > > we should remove the check in the caller, let below msg appear. > > I did the same as what we do for GuC and HuC. I'd prefer to be > consistent in behavior with those. alan: okay - sounds good > > > > + drm_printf(p, "GSC not supported\n"); > > > + return; > > > + } > > alan:snip > > > + drm_printf(p, "HECI1 FWSTST%u = 0x%08x\n", i, status); > > alan:nit: do you we could add those additional shim regs? (seemed useful in > > recent offline debugs). > > Agreed that it would be useful; I'll try to get a complete list from > arch and/or the GSC FW team. Are you ok if we go ahead with this in the > meantime? alan: yes sure. > > > >
Re: [Intel-gfx] [PATCH] drm/i915/pxp: Optimize GET_PARAM:PXP_STATUS
Thanks Jani - will rev this up and fix these. On Fri, 2023-06-02 at 16:03 +0300, Jani Nikula wrote: > On Thu, 01 Jun 2023, Alan Previn wrote: > > After recent discussions with Mesa folks, it was requested > > that we optimize i915's GET_PARAM for the PXP_STATUS without > > changing the UAPI spec. > > > > This patch adds this additional optimizations: > > Nitpick, please avoid "This patch". It's redundant, and after the patch > gets applied it becomes a commit, not a patch. > > Instead, use the imperative mood, e.g. "Add these additional > optimizations". > > See > https://docs.kernel.org/process/submitting-patches.html#describe-your-changes alan:snip > > > -int intel_pxp_get_readiness_status(struct intel_pxp *pxp) > > +int intel_pxp_get_readiness_status(struct intel_pxp *pxp, int timeout) > > It would help the reader if you named the parameter timeout_ms. Assuming > the unit is ms. alan:snip > > -is_fw_err_platform_config(u32 type) > > +is_fw_err_platform_config(u32 type, struct intel_pxp *pxp) > > It's customary to have the parameters ordered from higher context to > lower. > alan:snip
Re: [PATCH v5 5/7] drm/i915/mtl/huc: auth HuC via GSC
On Wed, 2023-05-31 at 16:54 -0700, Ceraolo Spurio, Daniele wrote: > The full authentication via the GSC requires an heci packet submission > to the GSC FW via the GSC CS. The GSC has new PXP command for this > (literally called NEW_HUC_AUTH). > The intel_huc_auth function is also updated to handle both authentication > types. > > alan:snip > @@ -399,6 +416,9 @@ void intel_huc_fini(struct intel_huc *huc) >*/ > delayed_huc_load_fini(huc); > > + if (huc->heci_pkt) > + i915_vma_unpin_and_release(>heci_pkt, 0); alan: nit: i just realized that for consistency (init mirroring fini), we should be doing this heci releasing after the intel_uc_fw_fini below. But since the below object isn't referencing the heci packet, this doesn't matter, so consider it a nit. alan:snip > @@ -470,31 +491,41 @@ int intel_huc_auth(struct intel_huc *huc) > if (!intel_uc_fw_is_loaded(>fw)) > return -ENOEXEC; > > - /* GSC will do the auth */ > + /* GSC will do the auth with the load */ > if (intel_huc_is_loaded_by_gsc(huc)) > return -ENODEV; alan: nit: sorry for another late comment but merely a nit - wondering if we should add a warn in here (to catch if we could end up here) but it's okay - this in theory shouldn't happen anyway. alan:snip > +++ b/drivers/gpu/drm/i915/gt/uc/intel_huc_fw.c alan:snip Only a couple of late-comer-nits that you can ignore, else LGTM: Reviewed-by: Alan Previn
Re: [Intel-gfx] [PATCH] drm/i915/pxp: use correct format string for size_t
On Thu, 2023-06-01 at 23:36 +0200, Arnd Bergmann wrote: > From: Arnd Bergmann > > While 'unsigned long' needs the %ld format string, size_t needs the %z > modifier: alan:snip > +++ b/drivers/gpu/drm/i915/pxp/intel_pxp_gsccs.c > @@ -143,7 +143,7 @@ gsccs_send_message(struct intel_pxp *pxp, > > reply_size = header->message_size - sizeof(*header); > if (reply_size > msg_out_size_max) { > - drm_warn(>drm, "caller with insufficient PXP reply size > %u (%ld)\n", > + drm_warn(>drm, "caller with insufficient PXP reply size > %u (%zd)\n", >reply_size, msg_out_size_max); > reply_size = msg_out_size_max; > } Thanks Arnd for catching this, although i believe Nathan submitted a patch for the same fix yesterday and received an RB from Andi.
Re: [PATCH v3 5/7] drm/i915/mtl/huc: auth HuC via GSC
On Fri, 2023-05-26 at 17:52 -0700, Ceraolo Spurio, Daniele wrote: > The full authentication via the GSC requires an heci packet submission > to the GSC FW via the GSC CS. The GSC has new PXP command for this > (literally called NEW_HUC_AUTH). > The intel_huc_auth fuction is also updated to handle both authentication > types. > alan:snip > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c > b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c > index b26f493f86fa..c659cc01f32f 100644 > --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c > +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c > @@ -29,13 +29,32 @@ static void gsc_work(struct work_struct *work) > > if (actions & GSC_ACTION_FW_LOAD) { > ret = intel_gsc_uc_fw_upload(gsc); > - if (ret == -EEXIST) /* skip proxy if not a new load */ > - actions &= ~GSC_ACTION_FW_LOAD; > - else if (ret) > + if (!ret) > + /* setup proxy on a new load */ > + actions |= GSC_ACTION_SW_PROXY; > + else if (ret != -EEXIST) > goto out_put; alan: I realize that this worker doesn't print error values that come back from intel_gsc_uc_fw_upload - wonder if we should print it before goto out_put. > + > + /* > + * The HuC auth can be done both before or after the proxy init; > + * if done after, a proxy request will be issued and must be > + * serviced before the authentication can complete. > + * Since this worker also handles proxy requests, we can't > + * perform an action that requires the proxy from within it and > + * then stall waiting for it, because we'd be blocking the > + * service path. Therefore, it is easier for us to load HuC > + * first and do proxy later. The GSC will ack the HuC auth and > + * then send the HuC proxy request as part of the proxy init > + * flow. > + * Note that we can only do the GSC auth if the GuC auth was > + * successful. 
> + */ > + if (intel_uc_uses_huc(>uc) && > + intel_huc_is_authenticated(>uc.huc, > INTEL_HUC_AUTH_BY_GUC)) > + intel_huc_auth(>uc.huc, INTEL_HUC_AUTH_BY_GSC); > } > > - if (actions & (GSC_ACTION_FW_LOAD | GSC_ACTION_SW_PROXY)) { > + if (actions & GSC_ACTION_SW_PROXY) { > if (!intel_gsc_uc_fw_init_done(gsc)) { > gt_err(gt, "Proxy request received with GSC not > loaded!\n"); > goto out_put; > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.c > b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.c > index 579c0f5a1438..0ad090304ca0 100644 > --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.c > +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.c > alan:snip > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_huc.c > b/drivers/gpu/drm/i915/gt/uc/intel_huc.c > index ab5246ae3979..5a4058d39550 100644 > --- a/drivers/gpu/drm/i915/gt/uc/intel_huc.c > +++ b/drivers/gpu/drm/i915/gt/uc/intel_huc.c > alan:snip > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_huc_fw.c > b/drivers/gpu/drm/i915/gt/uc/intel_huc_fw.c > index d2b4176c81d6..8e538d639b05 100644 > --- a/drivers/gpu/drm/i915/gt/uc/intel_huc_fw.c > +++ b/drivers/gpu/drm/i915/gt/uc/intel_huc_fw.c > alan:snip > +int intel_huc_fw_auth_via_gsccs(struct intel_huc *huc) > +{ > + struct drm_i915_gem_object *obj; > + void *pkt_vaddr; > + u64 pkt_offset; > + int retry = 5; > + int err = 0; > + > + if (!huc->heci_pkt) > + return -ENODEV; > + > + obj = huc->heci_pkt->obj; > + pkt_offset = i915_ggtt_offset(huc->heci_pkt); > + > + pkt_vaddr = i915_gem_object_pin_map_unlocked(obj, > + > i915_coherent_map_type(i915, obj, true)); > + if (IS_ERR(pkt_vaddr)) > + return PTR_ERR(pkt_vaddr); > + > + msg_in = pkt_vaddr; > + msg_out = pkt_vaddr + SZ_4K; this 4K*2 (4k for input and 4K for output) seems to be hardcoded in two different locations. One is here in intel_huc_fw_auth_via_gsccs and the other location in intel_huc_init which was even in a different file. 
Perhaps it's better to use a #define next to the definition of PXP43_CMDID_NEW_HUC_AUTH... maybe something like a "#define PXP43_NEW_HUC_AUTH_INOUT_PKT_SIZE (SZ_4K)" > + > + intel_gsc_uc_heci_cmd_emit_mtl_header(_in->header, > + HECI_MEADDRESS_PXP, > + sizeof(*msg_in), 0); > + > + msg_in->huc_in.header.api_version = PXP_APIVER(4, 3); > + msg_in->huc_in.header.command_id = PXP43_CMDID_NEW_HUC_AUTH; > + msg_in->huc_in.header.status = 0; > + msg_in->huc_in.header.buffer_len = sizeof(msg_in->huc_in) - > +
Re: [PATCH 6/6] drm/i915/uc/gsc: Add a gsc_info debugfs
On Fri, 2023-05-05 at 09:04 -0700, Ceraolo Spurio, Daniele wrote: > Add a new debugfs to dump information about the GSC. This includes: alan:snip Actually everything looks good except for a couple of questions + asks - hope we can close on this patch in next rev. > > - the FW path and SW tracking status; > - the release, security and compatibility versions; > - the HECI1 status registers. > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c > b/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c > index 0b6dcd982b14..3014e982aab2 100644 > --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c > +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c > @@ -12,36 +12,31 @@ > #include "intel_gsc_fw.h" > #include "intel_gsc_meu_headers.h" > #include "intel_gsc_uc_heci_cmd_submit.h" > - > -#define GSC_FW_STATUS_REG_MMIO(0x116C40) > -#define GSC_FW_CURRENT_STATE REG_GENMASK(3, 0) > -#define GSC_FW_CURRENT_STATE_RESET 0 > -#define GSC_FW_PROXY_STATE_NORMAL 5 > -#define GSC_FW_INIT_COMPLETE_BIT REG_BIT(9) > +#include "i915_reg.h" > alan:snip alan: btw, just to be consistent with other top-level "intel_foo_is..." checking functions, why don't we take the runtime wakeref inside the following functions and make it easier for any callers? 
(just like what we do for "intel_huc_is_authenticated"): static bool gsc_is_in_reset(struct intel_uncore *uncore) bool intel_gsc_uc_fw_proxy_init_done(struct intel_gsc_uc *gsc) bool intel_gsc_uc_fw_init_done(struct intel_gsc_uc *gsc) > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c > b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c > index 2ae693b01b49..5475e95d61c6 100644 > --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c > +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c > @@ -9,8 +9,9 @@ > #include "gt/intel_gt_print.h" > #include "intel_gsc_uc.h" alan: nit: not this patch's fault, alphabetically, intel_gsc_uc.h is after intel_gsc_proxy.h > #include "intel_gsc_fw.h" > -#include "i915_drv.h" > #include "intel_gsc_proxy.h" > +#include "i915_drv.h" > +#include "i915_reg.h" > > static void gsc_work(struct work_struct *work) > { > @@ -301,3 +302,46 @@ void intel_gsc_uc_load_start(struct intel_gsc_uc *gsc) > queue_work(gsc->wq, >work); > } > + alan: btw, why are we putting intel_gsc_uc_load_status in intel_gsc_uc.c if the only caller is gsc_uc's debugfs? why not just make it static in there? unless you plan to call it from "err_print_uc" - then can we add that in the next rev? > +void intel_gsc_uc_load_status(struct intel_gsc_uc *gsc, struct drm_printer > *p) > +{ > + struct intel_gt *gt = gsc_uc_to_gt(gsc); > + struct intel_uncore *uncore = gt->uncore; > + intel_wakeref_t wakeref; > + > + if (!intel_gsc_uc_is_supported(gsc)) { alan: this was already checked in caller so we'll never get here. i think we should remove the check in the caller, let below msg appear. > + drm_printf(p, "GSC not supported\n"); > + return; > + } alan:snip > + drm_printf(p, "HECI1 FWSTST%u = 0x%08x\n", i, status); alan:nit: do you think we could add those additional shim regs? (seemed useful in recent offline debugs). 
alan:snip > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_debugfs.c > b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_debugfs.c > new file mode 100644 > index ..da9f96b72291 > --- /dev/null > +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_debugfs.c > @@ -0,0 +1,38 @@ > +// SPDX-License-Identifier: MIT > +/* > + * Copyright © 2020 Intel Corporation alan:2023? alan:snip > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_debugfs.h > b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_debugfs.h > new file mode 100644 > index ..c405e5574253 > --- /dev/null > +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_debugfs.h > @@ -0,0 +1,14 @@ > +/* SPDX-License-Identifier: MIT */ > +/* > + * Copyright © 2020 Intel Corporation alan:2023? alan:snip > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_huc.c > b/drivers/gpu/drm/i915/gt/uc/intel_huc.c alan:snip > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc_debugfs.c > b/drivers/gpu/drm/i915/gt/uc/intel_uc_debugfs.c > index 2f93cc4e408a..6d541c866edb 100644 alan:snip
Re: [PATCH 5/6] drm/i915/uc/gsc: define gsc fw
Considering the only request i have below is touching up of existing comments (as far as this patch is concerned), and since the rest of the code looks good, here is my R-b - but i hope you can answer my newbie question at the bottom: Reviewed-by: Alan Previn On Fri, 2023-05-05 at 09:04 -0700, Ceraolo Spurio, Daniele wrote: > Add FW definition and the matching override modparam. > > The GSC FW has both a release version, based on platform and a rolling > counter, and a compatibility version, which is the one tracking > interface changes. Since what we care about is the interface, we use > the compatibility version in the buinary names. alan: s/buinary/binary > > Same as with the GuC, a major version bump indicate a > backward-incompatible change, while a minor version bump indicates a > backward-compatible one, so we use only the former in the file name. > > Signed-off-by: Daniele Ceraolo Spurio > Cc: Alan Previn > Cc: John Harrison > --- > drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c | 32 ++-- > 1 file changed, 24 insertions(+), 8 deletions(-) > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c > b/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c > index 36ee96c02d74..531cd172151d 100644 > --- a/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c > +++ b/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c > @@ -124,6 +124,18 @@ void intel_uc_fw_change_status(struct intel_uc_fw *uc_fw, > fw_def(BROXTON, 0, huc_mmp(bxt, 2, 0, 0)) \ > fw_def(SKYLAKE, 0, huc_mmp(skl, 2, 0, 0)) > > +/* > + * The GSC FW has both a release version, based on platform and a rolling > + * counter, and a compatibility version, which is the one tracking > + * interface changes. Since what we care about is the interface, we use > + * the compatibility version in the buinary names. 
alan:s/buinary/binary also, since we will (i hope) be adding several comments (alongside the new version objects under intel_gsc_uc structure) in patch #3 about what their differences are, which one we care about, and when they get populated, perhaps we can minimize the information here and redirect to that other comment... OR ... we can minimize the comments in patch #3 and redirect here (will be good to have a single location with a detailed explanation in the comments and a redirect-ptr from the other location, since a reader would most likely stumble onto those questions from either of these locations). > + * Same as with the GuC, a major version bump indicate a > + * backward-incompatible change, while a minor version bump indicates a > + * backward-compatible one, so we use only the former in the file name. > + */ > +#define INTEL_GSC_FIRMWARE_DEFS(fw_def, gsc_def) \ > + fw_def(METEORLAKE, 0, gsc_def(mtl, 1, 0)) > + > /* > > alan:snip > @@ -257,14 +281,6 @@ __uc_fw_auto_select(struct drm_i915_private *i915, > struct intel_uc_fw *uc_fw) > int i; > bool found; > > - /* > - * GSC FW support is still not fully in place, so we're not defining > - * the FW blob yet because we don't want the driver to attempt to load > - * it until we're ready for it. > - */ > - if (uc_fw->type == INTEL_UC_FW_TYPE_GSC) > - return; > - alan: more of a newbie question from myself: considering this is a new firmware we are adding, is there some kind of requirement to provide a link to the patch targeting the linux firmware repo that is a dependency of this series? or perhaps we should mention in the series that merge will only happen after that patch gets merged (with a final rev that includes the patch on the fw-repo side?). Just trying to understand the process. > /* >* The only difference between the ADL GuC FWs is the HWConfig support. >* ADL-N does not support HWConfig, so we should use the same binary as
Re: [PATCH 4/6] drm/i915/uc/gsc: query the GSC FW for its compatibility version
On Fri, 2023-05-05 at 09:04 -0700, Ceraolo Spurio, Daniele wrote: > The compatibility version is queried via an MKHI command. Right now, the > only existing interface is 1.0 > This is basically the interface version for the GSC FW, so the plan is > to use it as the main tracked version, including for the binary naming > in the fetch code. > > Signed-off-by: Daniele Ceraolo Spurio > alan: just a couple of minor nits below. One ask though, in line with the clarification we had over the offline conversation, I am wondering if we can document the fact that the file_selected.ver remains as major-minor::zero-zero for the case of gsc until after the firmware is loaded and we query via this function (which happens later at gt-late-init). However, that comment might not belong here - perhaps it belongs in the prior patch together with the other comment i requested (asking for additional explanations about the different types of versions for gsc). That said, for this patch, LGTM: Reviewed-by: Alan Previn alan:snip > +static int gsc_fw_query_compatibility_version(struct intel_gsc_uc *gsc) > +{ > + struct intel_gt *gt = gsc_uc_to_gt(gsc); > + struct mtl_gsc_ver_msg_in *msg_in; > + struct mtl_gsc_ver_msg_out *msg_out; > + struct i915_vma *vma; > + u64 offset; > + void *vaddr; > + int err; > + > + err = intel_guc_allocate_and_map_vma(>uc.guc, PAGE_SIZE * 2, > + , ); alan: nit: i'm assuming this code will be used for future discrete cards... if so, perhaps we should also be using "SZ_4K * 2" above, since different host-cpu archs could have different PAGE sizes - this way we'll be consistent with exact size allocations. it's also more consistent in this function - maybe a #define GSC_UC_GET_ABI_VER_PKT_SIZE SZ_4K at the top of the function is nice. either way, i consider this a nit. 
> + if (err) { > + gt_err(gt, "failed to allocate vma for GSC version query\n"); > + return err; > + } alan:snip > + > int intel_gsc_uc_fw_upload(struct intel_gsc_uc *gsc) > { > struct intel_gt *gt = gsc_uc_to_gt(gsc); > @@ -327,11 +406,21 @@ int intel_gsc_uc_fw_upload(struct intel_gsc_uc *gsc) > if (err) > goto fail; > > + err = gsc_fw_query_compatibility_version(gsc); > + if (err) > + goto fail; > + > + /* we only support compatibility version 1.0 at the moment */ > + err = intel_uc_check_file_version(gsc_fw, NULL); > + if (err) > + goto fail; > + > /* FW is not fully operational until we enable SW proxy */ > intel_uc_fw_change_status(gsc_fw, INTEL_UC_FIRMWARE_TRANSFERRED); > > - gt_info(gt, "Loaded GSC firmware %s (r%u.%u.%u.%u, svn%u)\n", > + gt_info(gt, "Loaded GSC firmware %s (cv%u.%u, r%u.%u.%u.%u, svn %u)\n", alan:nit "abi" instead of "cv"? > gsc_fw->file_selected.path, > + gsc_fw->file_selected.ver.major, > gsc_fw->file_selected.ver.minor, > gsc->release.major, gsc->release.minor, > gsc->release.patch, gsc->release.build, > gsc->security_version); > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.h > b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.h > index 8f199d5f963e..fb1453ed4ecf 100644 > --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.h > +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.h alan:snip > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c > b/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c > index cd8fc194f7fa..36ee96c02d74 100644 > --- a/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c > +++ b/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c alan:snip > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.h > b/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.h > index 279244744d43..4406e7b48b27 100644 > --- a/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.h > +++ b/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.h alan:snip
Re: [PATCH 3/6] drm/i915/uc/gsc: extract release and security versions from the gsc binary
On Thu, 2023-05-25 at 09:56 -0700, Ceraolo Spurio, Daniele wrote: > On 5/24/2023 10:14 PM, Teres Alexis, Alan Previn wrote: > > On Fri, 2023-05-05 at 09:04 -0700, Ceraolo Spurio, Daniele wrote: alan:snip > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h > > > @@ -17,6 +17,9 @@ struct intel_gsc_uc { > > > struct intel_uc_fw fw; > > > > > > /* GSC-specific additions */ > > > + struct intel_uc_fw_ver release; > > > + u32 security_version; > > alan: for consistency and less redundancy, can't we add "security_version" > > into 'struct intel_uc_fw_ver' (which is zero for firmware that doesn't > > have it). That way, intel_gsc_uc can re-use intel_uc_fw.file_selected > > just like huc? > > I'm not sure what you mean by re-using intel_uc_fw.file_selected. Is > that for the call from intel_uc_fw_version_from_meu_manifest? I'm > purposely not doing that. Note that the GSC has 3 versions: > > Release version (incremented with each build and encoded in the header) > Security version (also encoded in the header) > Compatibility version (queried via message to the GSC) > > The one we care about for communicating with the GSC is the last one, so > that's the one I stored in intel_uc_fw.file_selected (in the next > patch). The other 2 versions are not strictly required to use the GSC > and we only fetch them for debug purposes, so if something goes wrong we > know exactly what we've loaded. > > Daniele alan: okay, thanks - seeing that now in the next patch... (and I also forgot that the GSC release version doesn't reflect interface versioning in any way like GuC's does). In that case, the above additional versions are fine. I would definitely love to see additional comments under "GSC-specific additions" that explain those 3 versioning items and which ones we care about, as you have explained here.
Re: [PATCH 3/6] drm/i915/uc/gsc: extract release and security versions from the gsc binary
On Fri, 2023-05-05 at 09:04 -0700, Ceraolo Spurio, Daniele wrote: alan: snip > +int intel_gsc_fw_get_binary_info(struct intel_uc_fw *gsc_fw, const void > *data, size_t size) > +{ alan:snip > + /* > + * The GSC binary starts with the pointer layout, which contains the > + * locations of the various partitions of the binary. The one we're > + * interested in to get the version is the boot1 partition, where we can > + * find a BPDT header followed by entries, one of which points to the > + * RBE sub-section of the partition. From here, we can parse the CPD alan: nit: could we add the meaning of 'RBE'? Probably not here but in the header file where GSC_RBE is defined? > + * header and the following entries to find the manifest location > + * (entry identified by the "RBEP.man" name), from which we can finally > + * extract the version. > + * > + * -- > + * [ intel_gsc_layout_pointers ] > + * [ ... ] > + * [ boot1_offset >---]--o > + * [ ... ] | > + * -- | > + * | > + * -- | > + * [ intel_gsc_bpdt_header ]<-o > + * -- alan: special thanks for the diagram - love these! :) alan: snip > + min_size = layout->boot1_offset + layout->boot1_offset > size; alan: the second term is a comparison, so this RHS collapses to a boolean? Looks like a typo - and is it supposed to be: min_size = layout->boot1_offset; Actually, since we are accessing a bpdt_header hanging off that offset, it should rather be: min_size = layout->boot1_offset + sizeof(*bpdt_header); > + if (size < min_size) { > + gt_err(gt, "GSC FW too small for boot section! %zu < %zu\n", > +size, min_size); > + return -ENODATA; > + } > + > + bpdt_header = data + layout->boot1_offset; > + if (bpdt_header->signature != INTEL_GSC_BPDT_HEADER_SIGNATURE) { > + gt_err(gt, "invalid signature for meu BPDT header: 0x%08x!\n", > +bpdt_header->signature); > + return -EINVAL; > + } > + alan: IIUC, to be strict about the size-crawl checking, we should check min_size again - this time with the additional "bpdt_header->descriptor_count * sizeof(*bpdt_entry)". (hope I got that right?) 
- adding that check before parsing through the (bpdt_entry++)'s ->type > + bpdt_entry = (void *)bpdt_header + sizeof(*bpdt_header); > + for (i = 0; i < bpdt_header->descriptor_count; i++, bpdt_entry++) { > + if ((bpdt_entry->type & INTEL_GSC_BPDT_ENTRY_TYPE_MASK) != > + INTEL_GSC_BPDT_ENTRY_TYPE_GSC_RBE) > + continue; > + > + cpd_header = (void *)bpdt_header + > bpdt_entry->sub_partition_offset; > + break; > + } > + > + if (!cpd_header) { > + gt_err(gt, "couldn't find CPD header in GSC binary!\n"); > + return -ENODATA; > + } alan: same as above, so for size-crawl-checking, we should check minsize again with the addition of cpd_header, no? > + > + if (cpd_header->header_marker != INTEL_GSC_CPD_HEADER_MARKER) { > + gt_err(gt, "invalid marker for meu CPD header in GSC bin: > 0x%08x!\n", > +cpd_header->header_marker); > + return -EINVAL; > + } alan: and again here, the size crawl checking with cpd_header->num_of_entries * *cpd_entry > + cpd_entry = (void *)cpd_header + cpd_header->header_length; > + for (i = 0; i < cpd_header->num_of_entries; i++, cpd_entry++) { > + if (strcmp(cpd_entry->name, "RBEP.man") == 0) { > + manifest = (void *)cpd_header + > cpd_entry_offset(cpd_entry); > + intel_uc_fw_version_from_meu_manifest(>release, > + manifest); > + gsc->security_version = manifest->security_version; > + break; > + } > + } > + > + return 0; alan:snip > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.h > b/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.h > index fff8928218df..8d7b9e4f1ffc 100644 > --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.h > +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.h alan:snip > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_meu_headers.h > b/drivers/gpu/drm/i915/gt/uc/intel_gsc_meu_headers.h > index d55a66202576..8bce2b8aed84 100644 > --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_meu_headers.h > +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_meu_headers.h alan:snip > +struct intel_gsc_layout_pointers { > + u8
Re: [PATCH 2/6] drm/i915/uc/gsc: fixes and updates for GSC memory allocation
On Fri, 2023-05-05 at 09:04 -0700, Ceraolo Spurio, Daniele wrote: > A few fixes/updates are required around the GSC memory allocation and it > is easier to do them all at the same time. The changes are as follows: alan:snip > @@ -109,38 +110,21 @@ static int gsc_fw_load_prepare(struct intel_gsc_uc *gsc) > { > struct intel_gt *gt = gsc_uc_to_gt(gsc); > struct drm_i915_private *i915 = gt->i915; > - struct drm_i915_gem_object *obj; > - void *src, *dst; > + void *src; alan:snip > > - memset(dst, 0, obj->base.size); > - memcpy(dst, src, gsc->fw.size); > + memset_io(gsc->local_vaddr, 0, gsc->local->size); > + memcpy_toio(gsc->local_vaddr, src, gsc->fw.size); alan: I wonder if there is benefit to doing the memcpy_toio first and then doing the memset_io only for the balance of the area, from offset 'gsc->fw.size' for (gsc->local->size - gsc->fw.size) bytes. alan:snip > --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c > +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c > @@ -130,26 +130,85 @@ void intel_gsc_uc_init_early(struct intel_gsc_uc *gsc) > } > } > > +static int gsc_allocate_and_map_vma(struct intel_gsc_uc *gsc, u32 size) alan:snip > + obj = i915_gem_object_create_stolen(gt->i915, size); > + if (IS_ERR(obj)) > + return PTR_ERR(obj); > + > + vma = i915_gem_object_ggtt_pin(obj, NULL, 0, 0, 0); alan: should we be passing the PIN_MAPPABLE flag in the last param? 
alan:snip > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h > b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h > index a2a0813b8a76..c01286dddbdb 100644 > --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h > +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h > @@ -18,6 +18,7 @@ struct intel_gsc_uc { > > /* GSC-specific additions */ > struct i915_vma *local; /* private memory for GSC usage */ > + void __iomem *local_vaddr; /* pointer to access the private memory */ alan:nit: relooking at these variable names that originate from last year's patch you worked on introducing gsc_uc, I am wondering now if we should rename "local" to "privmem" and "local_vaddr" to "privmem_vaddr". (no significant reason other than improving readability of the code)
Re: [PATCH v2 5/8] drm/i915/huc: differentiate the 2 steps of the MTL HuC auth flow
On Fri, 2023-04-28 at 11:58 -0700, Ceraolo Spurio, Daniele wrote: > Before we add the second step of the MTL HuC auth (via GSC), we need to > have the ability to differentiate between them. To do so, the huc > authentication check is duplicated for GuC and GSC auth, with meu > binaries being considered fully authenticated only after the GSC auth > step. > > To report the difference between the 2 auth steps, a new case is added > to the HuC getparam. This way, the clear media driver can start > submitting before full auth, as partial auth is enough for those > workloads. > > v2: fix authentication status check for DG2 > > Signed-off-by: Daniele Ceraolo Spurio > Cc: Alan Previn > --- > drivers/gpu/drm/i915/gt/uc/intel_huc.c| 94 +-- > drivers/gpu/drm/i915/gt/uc/intel_huc.h| 16 +++- > drivers/gpu/drm/i915/gt/uc/intel_huc_fw.c | 4 +- > drivers/gpu/drm/i915/i915_reg.h | 3 + > include/uapi/drm/i915_drm.h | 3 +- > 5 files changed, 91 insertions(+), 29 deletions(-) > I believe you need a rebase with the PXP single session merged (the readiness code in gsccs backend). Other than that, all looks good: Reviewed-by: Alan Previn
Re: [PATCH v2] drm/i915/huc: Parse the GSC-enabled HuC binary
On Tue, 2023-05-02 at 08:27 -0700, Ceraolo Spurio, Daniele wrote: > The new binaries that support the 2-step authentication have contain the > legacy-style binary, which we can use for loading the HuC via DMA. To > find out where this is located in the image, we need to parse the meu > manifest of the GSC binary. The manifest consist of a partition header > followed by entries, one of which contains the offset we're looking for. > Note that the DG2 GSC binary contains entries with the same names, but > it doesn't contain a full legacy binary, so we need to skip assigning > the dma offset in that case (which we can do by checking the ccs). > Also, since we're now parsing the entries, we can extract the HuC > version that way instead of using hardcoded offsets. > > Note that the meu structure will be re-used for parsing the GSC binary, > so they've been added in their own header. > > v2: fix structure names to match meu defines (s/CPT/CPD/), update commit > message, check ccs validity, drop old version location defines. > > Signed-off-by: Daniele Ceraolo Spurio > Cc: Alan Previn > --- > .../drm/i915/gt/uc/intel_gsc_meu_headers.h| 74 ++ Compared line by line as per internal reviews and the spec. All looks good to me - nice to see that additional ccs validity. LGTM, Reviewed-by: Alan Previn
Re: [Intel-gfx] [PATCH] drm/i915/guc: Fix confused register capture list creation
On Thu, 2023-05-11 at 18:35 -0700, john.c.harri...@intel.com wrote: > From: John Harrison > > The GuC has a completely separate engine class enum when referring to > register capture lists, which combines render and compute. The driver > was using the 'normal' GuC specific engine class enum instead. That > meant that it thought it was defining a capture list for compute > engines, the list was actually being applied to the GSC engine. And if > a platform didn't have a render engine, then it would get no compute > register captures at all. > > Fix that. > > Signed-off-by: John Harrison alan:snip. LGTM, a simple and straightforward patch - although I can only imagine the pain of debugging this one. So for the benefit of others on the mailing list: because the COMPUTE and RENDER enums of i915 (not GuC-ABI) were different, but the GuC-ABI was using its own Render value for both (coincidentally '0' == render for both GuC-ABI and i915), it means that ADS population of the capture-register list would only cause problems for devices that have no render engine or that have a GSC (all have VD/VE/Blt). MTL is not yet fully POR, and none of the existing upstream-supported devices lacked a render engine, so the "Fixes" tag isn't needed IIUC. That said, pending CI results, Reviewed-by: Alan Previn
Re: [Intel-gfx] [PATCH v9 6/8] drm/i915/uapi/pxp: Add a GET_PARAM for PXP
alan:snip > This is why I asked if it was it was "basically certain that in a > production environment, then it will eventually return 1 meaning it's > ready". Alan's response was a little ambiguous on this point. alan: if we get a '2' and never transition to '1' - that's a kernel bug or a firmware / system issue. > Arguably a transition from 2 to -ENODEV could be considered a kernel > bug, but it doesn't sound like Alan would agree. :) Maybe Alan would > agree to saying it's either a kernel, or system integration bug. alan: agreed - that would be a kernel bug or a system-integration bug. I also agree with the init-flow vs app-usage thoughts Jordan had. That said, MESA has many ways it can use this UAPI and we can discuss that on the MESA patch. hmmm.. so... ack anyone? [insert big hopeful smiley here] apologies if I am asking too often.
Re: [PATCH v9 6/8] drm/i915/uapi/pxp: Add a GET_PARAM for PXP
On Thu, 2023-04-27 at 16:48 -0700, Teres Alexis, Alan Previn wrote: > Because of the additional firmware, component-driver and > initialization depedencies required on MTL platform before a > PXP context can be created, UMD calling for PXP creation as a > way to get-caps can take a long time. An actual real world > customer stack has seen this happen in the 4-to-8 second range > after the kernel starts (which sees MESA's init appear in the > middle of this range as the compositor comes up). To avoid > unncessary delays experienced by the UMD for get-caps purposes, > add a GET_PARAM for I915_PARAM_PXP_SUPPORT. > alan:snip. Progress update on the UMD side - I'm working on patch for PR here: https://gitlab.freedesktop.org/alan_previn_intel/mesa-alan-previn-features/-/commit/fb9d4fbfbef7dfd3f41df335cd31549fd39ddb57 but taking time to verify certain code paths
Re: [PATCH v3 4/4] drm/i915/gsc: add support for GSC proxy interrupt
On Tue, 2023-05-02 at 09:38 -0700, Ceraolo Spurio, Daniele wrote: > The GSC notifies us of a proxy request via the HECI2 interrupt. The > interrupt must be enabled both in the HECI layer and in our usual gt irq > programming; for the latter, the interrupt is enabled via the same enable > register as the GSC CS, but it does have its own mask register. When the > interrupt is received, we also need to de-assert it in both layers. > > The handling of the proxy request is deferred to the same worker that we > use for GSC load. New flags have been added to distinguish between the > init case and the proxy interrupt. > > v2: Make sure not to set the reset bit when enabling/disabling the GSC > interrupts, fix defines (Alan) > > v3: rebase on proxy status register check > alan:snip Reviewed-by: Alan Previn
Re: [PATCH v3 3/4] drm/i915/gsc: add initial support for GSC proxy
On Tue, 2023-05-02 at 09:38 -0700, Ceraolo Spurio, Daniele wrote: > The GSC uC needs to communicate with the CSME to perform certain > operations. Since the GSC can't perform this communication directly > on platforms where it is integrated in GT, i915 needs to transfer the > messages from GSC to CSME and back. > alan:snip > +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c > @@ -13,6 +13,7 @@ > #define GSC_FW_STATUS_REG_MMIO(0x116C40) > #define GSC_FW_CURRENT_STATE REG_GENMASK(3, 0) > #define GSC_FW_CURRENT_STATE_RESET 0 > +#define GSC_FW_PROXY_STATE_NORMAL 5 > #define GSC_FW_INIT_COMPLETE_BIT REG_BIT(9) > > static bool gsc_is_in_reset(struct intel_uncore *uncore) > @@ -23,6 +24,15 @@ static bool gsc_is_in_reset(struct intel_uncore *uncore) > GSC_FW_CURRENT_STATE_RESET; > } > > +bool intel_gsc_uc_fw_proxy_init_done(struct intel_gsc_uc *gsc) > +{ > + struct intel_uncore *uncore = gsc_uc_to_gt(gsc)->uncore; > + u32 fw_status = intel_uncore_read(uncore, GSC_FW_STATUS_REG); > + > + return REG_FIELD_GET(GSC_FW_CURRENT_STATE, fw_status) == > +GSC_FW_PROXY_STATE_NORMAL; > +} > + Since this function was added here, repeating rb: Reviewed-by: Alan Previn
Re: [PATCH v3 2/4] mei: gsc_proxy: add gsc proxy driver
We only had nits before and all sorted now, so.. Reviewed-by: Alan Previn On Tue, 2023-05-02 at 09:38 -0700, Ceraolo Spurio, Daniele wrote: > From: Alexander Usyskin > > Add GSC proxy driver. It to allows messaging between GSC component > on Intel graphics card and CSE device. > > Cc: Alan Previn > Signed-off-by: Alexander Usyskin > Signed-off-by: Tomas Winkler > Signed-off-by: Daniele Ceraolo Spurio > Acked-by: Greg Kroah-Hartman > --- > > v2: re-order includes, drop reference to "on board" card in commit > message and comments.
Re: [PATCH v3 1/4] drm/i915/mtl: Define GSC Proxy component interface
LGTM Reviewed-by: Alan Previn On Tue, 2023-05-02 at 09:38 -0700, Ceraolo Spurio, Daniele wrote: > From: Alexander Usyskin > > GSC Proxy component is used for communication between the > Intel graphics driver and MEI driver. > > Cc: Alan Previn > Signed-off-by: Alexander Usyskin > Signed-off-by: Tomas Winkler > Signed-off-by: Daniele Ceraolo Spurio > Acked-by: Greg Kroah-Hartman > --- > > v2: Improve documentation, remove unneeded includes
Re: [Intel-gfx] [PATCH v2 3/4] drm/i915/guc: Capture list naming clean up
LGTM: Reviewed-by: Alan Previn On Fri, 2023-04-28 at 11:56 -0700, john.c.harri...@intel.com wrote: > From: John Harrison > > Don't use 'xe_lp*' prefixes for register lists that are common with > Gen8. > > Don't add Xe only GSC registers to pre-Xe devices that don't > even have a GSC engine. > > Fix Xe_LP name. > > Don't use GEN9 as a prefix for register lists that contain all GEN8 > registers. > > Rename the 'default_' register list prefix to 'gen8_' as that is the > more accurate name. alan:snip
Re: [PATCH v9 3/8] drm/i915/pxp: Add MTL helpers to submit Heci-Cmd-Packet to GSC
On Thu, 2023-04-27 at 16:48 -0700, Teres Alexis, Alan Previn wrote: > Add helper functions into a new file for heci-packet-submission. > The helpers will handle generating the MTL GSC-CS Memory-Header > and submission of the Heci-Cmd-Packet instructions to the engine. > > alan: I accidentally forgot to carry Daniele's RB forward from rev8. Repasting as this patch did not change: Reviewed-by: Daniele Ceraolo Spurio
Re: [Intel-gfx] [PATCH v8 6/8] drm/i915/uapi/pxp: Add a GET_PARAM for PXP
(fixed email addresses again - why is my Evolution client deteriorating??) On Thu, 2023-04-27 at 17:18 +, Teres Alexis, Alan Previn wrote: > On Wed, 2023-04-26 at 15:35 -0700, Justen, Jordan L wrote: > > On 2023-04-26 11:17:16, Teres Alexis, Alan Previn wrote: > alan:snip > > Can you tell that pxp is in progress, but not ready yet, as a separate > > state from 'it will never work on this platform'? If so, maybe the > > status could return something like: > > > > 0: It's never going to work > > 1: It's ready to use > > 2: It's starting and should work soon > > > > I could see an argument for treating that as a case where we could > > still advertise protected content support, but if we try to use it we > > might be in for a nasty delay. > > > alan: IIRC Lionel seemed okay with any permutation that would allow it to not > get blocked. Daniele did ask for something similiar to what u mentioned above > but he said that is non-blocking. But since both you AND Daniele have > mentioned > the same thing, i shall re-rev this and send that change out today. > I notice most GET_PARAMS use -ENODEV for "never gonna work" so I will stick > with that. > but 1 = ready to use and 2 = starting and should work sounds good. so '0' > will never > be returned - we just look for a positive value (from user space). I will also > make a PR for mesa side as soon as i get it tested. thanks for reviewing btw. alan: I also realize that with these final touch-ups we can go back to the original pxp-context-creation timeout of 250 milliseconds like it was on ADL, since the user-space component will have this new param to check (so even farther down from 1 sec on the last couple of revs). Jordan, Lionel - I am thinking of creating the PR on the MESA side to take advantage of GET_PARAM for both get-caps AND runtime creation (the latter will be useful to ensure no unnecessary delay is experienced by Mesa stuck in a kernel call - which practically never happened on ADL AFAIK): 1. 
MESA PXP get-caps: - use GET_PARAM (any positive number shall mean it's supported). 2. MESA app-triggered PXP context creation (i.e. if caps was supported): - use GET_PARAM to wait until the positive number switches from "2" to "1". - now call context creation. So at this point, if it fails, we know it's an actual failure. You guys okay with the above? (I'll re-rev this kernel series first and wait on your ack or feedback before I create / test / submit a PR for the Mesa side).
Re: [Intel-gfx] [PATCH] drm/i915/guc: Fix error capture for virtual engines
On Fri, 2023-04-14 at 17:27 -0700, john.c.harri...@intel.com wrote: > From: John Harrison > > GuC based register dumps in error capture logs were basically broken > for virtual engines. This can be seen in igt@gem_exec_balancer@hang: > [IGT] gem_exec_balancer: starting subtest hang > [drm] GPU HANG: ecode 12:4:e1524110, in gem_exec_balanc [6388] > [drm] GT0: GUC: No register capture node found for 0x1005 / 0xFEDC311D > [drm] GPU HANG: ecode 12:4:, in gem_exec_balanc [6388] > [IGT] gem_exec_balancer: exiting, ret=0 > > The test causes a hang on both engines of a virtual engine context. > The engine instance zero hang gets a valid error capture but the > non-instance-zero hang does not. > > Fix that by scanning through the list of pending register captures > when a hang notification for a virtual engine is received. That way, > the hang can be assigned to the correct physical engine prior to > starting the error capture process. So later on, when the error capture > handler tries to find the engine register list, it looks for one on > the correct engine. > > Also, sneak in a missing blank line before a comment in the node > search code. > > Signed-off-by: John Harrison LGTM - thanks for fixing this! :D Reviewed-by: Alan Previn A side conversation - potentially requiring an unrelated future patch: I notice that the array "error->reset_engine_count[]" (which is being used for error-state reporting and as some verification in selftests) seems to be sized by engine-instance count, but the reading/writing of this array's members keeps using engine->uabi_class as the index... meaning it's being used to track engine-class reset counts? Maybe this is an issue, or I am misunderstanding it. Either way, that issue is unrelated to the intent of this patch - I just wanted to get it highlighted for future action if needed. I can take that onus if it's in fact an issue.
Re: [Intel-gfx] [PATCH 6/6] drm/i915/guc: Capture list clean up - 5
On Wed, 2023-04-26 at 11:24 -0700, john.c.harri...@intel.com wrote: > From: John Harrison > > Rename the 'default_' register list prefix to 'gen8_' as that is the > more accurate name. > > Signed-off-by: John Harrison Reviewed-by: Alan Previn
Re: [PATCH v8 6/8] drm/i915/uapi/pxp: Add a GET_PARAM for PXP
On Thu, 2023-04-20 at 22:34 -0700, Teres Alexis, Alan Previn wrote: > Because of the additional firmware, component-driver and > initialization depedencies required on MTL platform before a > PXP context can be created, UMD calling for PXP creation as a > way to get-caps can take a long time. An actual real world > customer stack has seen this happen in the 4-to-8 second range > after the kernel starts (which sees MESA's init appear in the > middle of this range as the compositor comes up). To avoid > unncessary delays experienced by the UMD for get-caps purposes, > add a GET_PARAM for I915_PARAM_PXP_SUPPORT. > > However, some failures can still occur after all the depedencies > are met (such as firmware init flow failure, bios configurations > or SOC fusing not allowing PXP enablement). Those scenarios will > only be known to user space when it attempts creating a PXP context. > > With this change, large delays are only met by user-space processes > that explicitly need to create a PXP context and boot very early. > There is no way to avoid this today. > > Signed-off-by: Alan Previn > Reviewed-by: Daniele Ceraolo Spurio > Acked-by: Lionel Landwerlin > --- alan: Hi Jordan, Tvrtko, Daniel Vetter and Lionel,... how to proceed on this series (which have required Rb/Ack's) in light of the recent discussion on the other series here: https://patchwork.freedesktop.org/patch/532994/?series=115980=8 it sounds like: - Jordan still wants the extension query - Daniel recapped the overview of historical discussions and kernel UAPI guidelines and in summary is okay with the GET_PARAM option. However Daniel cites PXP taking 8 seconds to create a context is broken. - I corrected above assumption -> current timeout is 1 second. 8 seconds is not the time taken to init the context, its the max-possible-time to ensure dependencies like the gsc-proxy-component is loaded so that protected context could be attempted successfully. 
Else, it would return an error that Mesa could interpret as not supported. Should I: (a) rerev this, drop this patch #6 and work towards merging the rest of the series to allow the get-param vs extensions-query to continue independently? (b) go ahead and merge this series as is with the GET_PARAM? (i need to get the Mesa UMD change tested and PR requested first) alan:snip
Re: IOCTL feature detection (Was: Re: [Intel-gfx] [PATCH 8/8] drm/i915: Allow user to set cache at BO creation)
On Wed, 2023-04-26 at 13:52 +0200, Daniel Vetter wrote: > On Tue, Apr 25, 2023 at 04:41:54PM +0300, Joonas Lahtinen wrote: > > (+ Faith and Daniel as they have been involved in previous discussions) > > Quoting Jordan Justen (2023-04-24 20:13:00) > > > On 2023-04-24 02:08:43, Tvrtko Ursulin wrote: > > > > > > > > > alan:snip > - the more a feature spans drivers/modules, the more it should be > discovered by trying it out, e.g. dma-buf fence import/export was a huge > discussion, luckily mesa devs figured out how to transparantly fall back > at runtime so we didn't end up merging the separate feature flag (I > think at least, can't find it). pxp being split across i915/me/fw/who > knows what else is kinda similar so I'd heavily lean towards discovery > by creating a context > > - pxp taking 8s to init a ctx sounds very broken, irrespective of anything > else > Alan: Please be aware that: 1. the wait-timeout was changed to 1 second some time back. 2. I'm not deciding the time-out. I initially wanted to keep it at the same timeout as ADL (250 millisec) - and ask the UMD to retry if the user needs it (as per the same ADL behavior). Daniele requested moving it to 8 seconds - but through the review process we reduced it to 1 second. 3. In any case, that's just the wait-timeout - and we know it won't succeed until ~6 seconds after i915 (~9 secs after boot). The issue isn't our hardware or i915 - it's the component driver load <-- this is what's broken. Details: PXP context is dependent on gsc-fw load, huc-firmware load, mei-gsc-proxy component driver load + bind, huc-authentication and the gsc-proxy-init handshake. Most of the above steps begin rather quickly during i915 driver load - the delay seems to come from a very late mei-gsc-proxy component driver load. In fact, the parent mei-me driver is only getting loaded ~6 seconds after i915 init is done. That blocks the gsc-proxy-init handshake, huc-authentication and lastly PXP. 
That said, what is broken is why it takes so long for the component drivers to come up. NOTE: PXP isn't really doing anything differently in the context-creation flow (in terms of time-consuming steps compared to ADL) besides these extra dependency waits. We can actually go back to the original timeout of 250 millisecs like we have on ADL, but it will fail if MESA calls in too early (though it will succeed later) ... or ... we can create the GET_PARAMs. A better idea would be to figure out how to control the driver load order and force the mei driver + components to get called right after i915. I was informed there is no way to control this, and changes here will likely not be accepted upstream. ++ Daniele - can you chime in? Take note that ADL has the same issue, but for whatever reason the dependent mei component on ADL loaded much sooner - so it was never an issue that was caught, but it still existed at ADL merge time (if users customize the kernel + compositor for fast boot it will happen). (I realize I haven't tested ADL with the new kernel configs that we use to also boot PXP on MTL - I wonder if the new mei configs are causing the delay - i.e. an ADL customer could suddenly see this 6 sec delay too - something I have to check now)
Re: [Intel-gfx] [PATCH 5/5] drm/i915/guc: Capture list clean up - 4
On Thu, 2023-04-06 at 15:26 -0700, john.c.harri...@intel.com wrote: > From: John Harrison > > Don't use GEN9 as a prefix for register lists that contain all GEN8 > registers. alan:snip alan: This patch as a stand-alone looks good, so I'll provide the RB, but take note that the comment below should reflect the decision on the review comments of patch #1 - so this patch might change from GEN9_foo-from-patch-1 to GEN8_foo-from-patch-1. Reviewed-by: Alan Previn > -/* GEN9 - Global */ > +/* GEN8 - Global */ > static const struct __guc_mmio_reg_descr default_global_regs[] = { > COMMON_BASE_GLOBAL, > - COMMON_GEN9BASE_GLOBAL, > - GEN9_GLOBAL, > + COMMON_GEN8BASE_GLOBAL, > + GEN8_GLOBAL, alan: see patch comment about "COMMON_GLOBAL" vs "GLOBAL" confusion.
Re: [Intel-gfx] [PATCH 4/5] drm/i915/guc: Capture list clean up - 3
On Thu, 2023-04-06 at 15:26 -0700, john.c.harri...@intel.com wrote: > From: John Harrison > > Fix Xe_LP name. > > Signed-off-by: John Harrison alan:snip > -/* GEN9/XE_LPD - Render / Compute Per-Engine-Instance */ > +/* GEN8+ Render / Compute Per-Engine-Instance */ alan: two comments on this: 1. shouldn't this go with the earlier patch? 2. I agree with renaming the registers to reflect when they were first introduced... however, considering we only support GuC on Gen12 and beyond (we do have select CI tests that enable GuC on Gen9, but not on Gen8 and before), should we also change the comments? I think the comment should reflect the usage, not just follow the same name as the register #define - else why even add the comments? (please apply this same comment to gen8_vd_inst_regs, gen8_vec_inst_regs and gen8_blt_inst_regs). Alternatively, we could keep those GEN8+ comments above the lists but maybe add just one comment in the default list - see below. alan: snip > @@ -366,7 +364,7 @@ guc_capture_get_device_reglist(struct intel_guc *guc) > const struct __guc_mmio_reg_descr_group *lists; > > if (GRAPHICS_VER(i915) >= 12) > - lists = xe_lpd_lists; > + lists = xe_lp_lists; > else > lists = default_lists; alan: perhaps add a comment that we really don't support any of this on anything below Gen9? >
Re: [Intel-gfx] [PATCH 3/5] drm/i915/guc: Capture list clean up - 2
On Thu, 2023-04-06 at 15:26 -0700, john.c.harri...@intel.com wrote: > From: John Harrison > > Don't use 'xe_lp*' prefixes for register lists that are common with > Gen8. alan:snip > @@ -177,32 +177,32 @@ static const struct __guc_mmio_reg_descr > empty_regs_list[] = { > static const struct __guc_mmio_reg_descr_group default_lists[] = { alan:snip > - MAKE_REGLIST(xe_lpd_gsc_inst_regs, PF, ENGINE_INSTANCE, > GUC_GSC_OTHER_CLASS), > + MAKE_REGLIST(empty_regs_list, PF, ENGINE_INSTANCE, GUC_GSC_OTHER_CLASS), alan: I missed this in the review of Daniele's GSC enabling patch - thanks for catching it. > {} > }; > > simple patch - all looks good: Reviewed-by: Alan Previn
Re: [Intel-gfx] [PATCH 2/5] drm/i915/guc: Capture list clean up - 1
On Thu, 2023-04-06 at 15:26 -0700, john.c.harri...@intel.com wrote: > From: John Harrison > > Remove 99% duplicated steered register list code. Also, include the > pre-Xe steered registers in the pre-Xe list generation. > > Signed-off-by: John Harrison alan: Nice work - good cleanup. Thanks so much. Reviewed-by: Alan Previn
Re: [PATCH 1/5] drm/i915/guc: Don't capture Gen8 regs on Xe devices
On Thu, 2023-04-06 at 15:26 -0700, Harrison, John C wrote: > From: John Harrison > > A pair of pre-Xe registers were being included in the Xe capture list. > GuC was rejecting those as being invalid and logging errors about > them. So, stop doing it. > alan:snip > #define COMMON_GEN9BASE_GLOBAL \ > - { GEN8_FAULT_TLB_DATA0, 0, 0, "GEN8_FAULT_TLB_DATA0" }, \ > - { GEN8_FAULT_TLB_DATA1, 0, 0, "GEN8_FAULT_TLB_DATA1" }, \ > { ERROR_GEN6, 0, 0, "ERROR_GEN6" }, \ > { DONE_REG, 0, 0, "DONE_REG" }, \ > { HSW_GTT_CACHE_EN, 0, 0, "HSW_GTT_CACHE_EN" } > > +#define GEN9_GLOBAL \ > + { GEN8_FAULT_TLB_DATA0, 0, 0, "GEN8_FAULT_TLB_DATA0" }, \ > + { GEN8_FAULT_TLB_DATA1, 0, 0, "GEN8_FAULT_TLB_DATA1" } > + > #define COMMON_GEN12BASE_GLOBAL \ > { GEN12_FAULT_TLB_DATA0,0, 0, "GEN12_FAULT_TLB_DATA0" }, \ > { GEN12_FAULT_TLB_DATA1,0, 0, "GEN12_FAULT_TLB_DATA1" }, \ > @@ -142,6 +144,7 @@ static const struct __guc_mmio_reg_descr > xe_lpd_gsc_inst_regs[] = { > static const struct __guc_mmio_reg_descr default_global_regs[] = { > COMMON_BASE_GLOBAL, > COMMON_GEN9BASE_GLOBAL, > + GEN9_GLOBAL, > }; > alan: splitting out a couple registers from COMMON_GEN9BASE_GLOBAL into GEN9_GLOBAL doesn't seem to communicate the intent of the fix for this patch. This is more of a naming thing, and i am not sure what counter-proposal will work well in terms of readability. One idea: perhaps we rename "COMMON_GEN9BASE_GLOBAL" to "COMMON_GEN9PLUS_BASE_GLOBAL" and rename GEN9_GLOBAL to COMMON_GEN9LEGACY_GLOBAL. so we would have two gen9-global lists with a clear distinction in naming where one is "GEN9PLUS" and the other is "GEN9LEGACY". But since this is a list-naming thing, i am okay with either of the above changes... OR... keeping the names the same but with the condition of adding a comment under the COMMON_GEN9BASE_GLOBAL and GEN9_GLOBAL names that explains the difference, where one is gen9-legacy and the other is gen9-and-future that carries over beyond Gen9.
(side note: coding style wise, is it possible to add the comment right under the #define line as opposed to under the entire list?) (conditional) Reviewed-by: Alan Previn
Re: IOCTL feature detection (Was: Re: [Intel-gfx] [PATCH 8/8] drm/i915: Allow user to set cache at BO creation)
On Tue, 2023-04-25 at 16:41 +0300, Joonas Lahtinen wrote: > (+ Faith and Daniel as they have been involved in previous discussions) An orthogonal (but loosely related) question: Is PXP the only subsystem that has the unique problem of using a delayed worker to complete all of its init dependencies, where those dependencies take so long that by the time compositor/mesa init comes up, creation of a PXP context blocks too long and may still fail, with userspace incorrectly assuming PXP is not supported (when we don't have a GET_PARAM for it)? I believe HuC has a similar issue - but it doesn't show up in the UAPI, rather in the cmd buffers. Do we have other subsystems that have this problem? (dependency on other startups?)
Re: [PATCH 4/4] drm/i915/gsc: add support for GSC proxy interrupt
I think we are also bottoming out on the opens for this patch too: On Thu, 2023-04-20 at 13:21 -0700, Ceraolo Spurio, Daniele wrote: > On 4/20/2023 11:49 AM, Teres Alexis, Alan Previn wrote: > > On Wed, 2023-03-29 at 09:56 -0700, Ceraolo Spurio, Daniele wrote: > > > The GSC notifies us of a proxy request via the HECI2 interrupt. The > > alan:snip > > > > > @@ -256,6 +262,7 @@ void gen11_gt_irq_postinstall(struct intel_gt *gt) > > > u32 irqs = GT_RENDER_USER_INTERRUPT; > > > u32 guc_mask = intel_uc_wants_guc(>uc) ? GUC_INTR_GUC2HOST > > > : 0; > > > u32 gsc_mask = 0; > > > + u32 heci_mask = 0; > > alan: nit: should we call this heci2_mask - uniquely identifiable from > > legacy. > > The mask is theoretically not just for HECI2, the bit we set in it is. > If future platforms add back the HECI1 interrupt then it'd go in the > same mask. alan: yeah - i'm good with that - no change needed then - was a nit anyways. > > alan:snip > > > > > - else if (HAS_HECI_GSC(gt->i915)) > > > + heci_mask = GSC_IRQ_INTF(1); /* HECI2 IRQ for SW Proxy*/ > > alan: does this GSC_IRQ_INTF macro still make sense for these newer > > platforms with GSC-CS / HECI2? > > i don't think i see other bit definitions for this mask register, so might > > as well just define it as BIT14? > > GSC_IRQ_INTF(1) is the HECI2 interrupt on DG2 and the bit has remained > the same exactly to allow SW to re-use the same code/defines, so IMO we > should do so. alan: okay sure - sounds good. > > > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_regs.h > > > b/drivers/gpu/drm/i915/gt/intel_gt_regs.h > > > index 4aecb5a7b631..da11ce5ca99e 100644 > > > --- a/drivers/gpu/drm/i915/gt/intel_gt_regs.h > > > +++ b/drivers/gpu/drm/i915/gt/intel_gt_regs.h > > > @@ -1577,6 +1577,7 @@ > > > > > > #define GEN11_GT_INTR_DW(x) _MMIO(0x190018 + ((x) * > > > 4)) > > > #define GEN11_CSME(31) > > > +#define GEN12_HECI_2 (30) > > alan: I don't see this being used anywhere - should remove this.
> > A few of the defines for this register here are not used. I've added > this one in as a way of documenting where the bit was, but I can remove > it if you think it's unneeded. alan: Oh i actually didn't notice that earlier - in that case, please keep it. It would be nice to add a comment for all such bits (consider this a nit). > > > > > +void intel_gsc_proxy_irq_handler(struct intel_gsc_uc *gsc, u32 iir) > > > +{ > > > + struct intel_gt *gt = gsc_uc_to_gt(gsc); > > > + > > > + if (unlikely(!iir)) > > > + return; > > > + > > > + lockdep_assert_held(gt->irq_lock); > > > + > > > + if (!gsc->proxy.component) { > > > + gt_err(gt, "GSC proxy irq received without the component being > > > bound!\n"); > > alan: So although proxy init is only a one-time thing (even surviving > > suspend-resume), we > > know that proxy runtime irqs could happen anytime (depending on other > > gpu-security-related > > system interactions). However, would the component driver be bound/unbound > > through a > > suspend-resume cycle? If yes, then would this check fail if an IRQ for a > > proxy request > > came too early after a resume cycle? If yes, then should we not do > > something here to ensure that > > when the component gets bound later, we know there is something waiting to > > be processed? > > (in bind, if proxy-init was done before, but we have outstanding IRQs, then > > we can trigger > > the worker from there, i.e. the bind func?) > > A proxy request can only be triggered in response to a userspace ask, > which in turn can only happen once we've completed the resume flow and > the component is re-bound. Therefore, we should never have a situation > where we get a proxy interrupt without the component being bound. alan: thanks - my bad.
> > > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h > > > @@ -23,6 +23,9 @@ struct intel_gsc_uc { > > > /* for delayed load and proxy handling */ > > > struct workqueue_struct *wq; > > > struct work_struct work; > > > + u32 gsc_work_actions; /* protected by gt->irq_lock */ > > > +#define GSC_ACTION_FW_LOAD BIT(0) > > > +#define GSC_ACTION_SW_PROXY BIT(1 > > > > > alan: perhaps it's time to have a substruct for "proxy_worker" and include > > 'wq' and 'work' in addition to actions? > > The worker is not just for proxy, we use it for GSC and HuC loading as > well. It's the main way we handle the gsc_uc, so IMO it's better off > staying in the main struct. However, if you still think a substructure > will make things cleaner I can add it in, but please suggest a name > because I have no idea what to call it (something like handler?). alan: i was thinking "task_handler" - but since it's not specific to proxy, then let's not change it.
Re: [PATCH 2/4] mei: gsc_proxy: add gsc proxy driver
i guess we are settled with this patch... On Thu, 2023-04-20 at 15:04 -0700, Ceraolo Spurio, Daniele wrote: > On 4/18/2023 11:57 PM, Teres Alexis, Alan Previn wrote: > > On Wed, 2023-03-29 at 09:56 -0700, Ceraolo Spurio, Daniele wrote: > > > From: Alexander Usyskin > > > > > > Add GSC proxy driver. It to allows messaging between GSC component > > > on Intel on board graphics card and CSE device. > > alan:nit: isn't "Intel integrated GPU" clearer than "Intel on-board > > graphics card"? > > Same thing for the Kconfig description later (or am i missing something > > else here). > > Will change > alan: saw your reply on better alternative for both 'i' and 'd' alan:snip > > > +static int mei_gsc_proxy_recv(struct device *dev, void *buf, size_t size) > > > +{ > > > + ssize_t ret; > > > + > > > + if (!dev || !buf) > > alan: nit: same as in the 'send' above,.. not sure if we should be checking > > for !size here... > > or perhaps 0 sized recv is supported. > > AFAICS the lower level of the mei code do allow for size 0 for both send > and recv. Also, this is the same check as what we do for the PXP component. alan: agreed - thus the nit as per my earlier email. > > alan:snip > > > + if (subcomponent != I915_COMPONENT_GSC_PROXY) > > > + return 0; > > > + > > > + return component_compare_dev(dev->parent, ((struct device > > > *)data)->parent); > > alan: do we care if both these parents are non-null? i notice in other mei > > component > > drivers match functions we do check that. > > Those should always both be non-NULL, since both the mei and the GFX > device have the PCI bus as parent (and the previous check on pdev > ensures those are the 2 devices we're handling at this point). alan: sounds good. > > > +#define MEI_UUID_GSC_PROXY UUID_LE(0xf73db04, 0x97ab, 0x4125, \ > > > +0xb8, 0x93, 0xe9, 0x4, 0xad, 0xd, 0x54, 0x64) > > alan: apologies for the newbie question, but why are we using UUID for the > > gsc_proxy > > as opposed to GUID like the other mei components? 
i am not sure if i read > > the right > > archived patch review but it sounded like GUID is for internal to kernel > > only whereas > > UUID is for external too? [snip] > AFAICS all other mei components use UUID_LE as well. The code was > updated from GUID to UUID_LE in: > https://lore.kernel.org/all/20221228160558.21311-1-andriy.shevche...@linux.intel.com/ alan: sounds good- thanks for the URL.
Re: [PATCH 4/4] drm/i915/gsc: add support for GSC proxy interrupt
On Wed, 2023-03-29 at 09:56 -0700, Ceraolo Spurio, Daniele wrote: > The GSC notifies us of a proxy request via the HECI2 interrupt. The alan:snip > @@ -256,6 +262,7 @@ void gen11_gt_irq_postinstall(struct intel_gt *gt) > u32 irqs = GT_RENDER_USER_INTERRUPT; > u32 guc_mask = intel_uc_wants_guc(>uc) ? GUC_INTR_GUC2HOST : 0; > u32 gsc_mask = 0; > + u32 heci_mask = 0; alan: nit: should we call this heci2_mask - uniquely identifiable from legacy. alan:snip > - else if (HAS_HECI_GSC(gt->i915)) > + heci_mask = GSC_IRQ_INTF(1); /* HECI2 IRQ for SW Proxy*/ alan: does this GSC_IRQ_INTF macro still make sense for these newer platforms with GSC-CS / HECI2? i don't think i see other bit definitions for this mask register, so might as well just define it as BIT14? alan:snip > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_regs.h > b/drivers/gpu/drm/i915/gt/intel_gt_regs.h > index 4aecb5a7b631..da11ce5ca99e 100644 > --- a/drivers/gpu/drm/i915/gt/intel_gt_regs.h > +++ b/drivers/gpu/drm/i915/gt/intel_gt_regs.h > @@ -1577,6 +1577,7 @@ > > #define GEN11_GT_INTR_DW(x) _MMIO(0x190018 + ((x) * 4)) > #define GEN11_CSME (31) > +#define GEN12_HECI_2 (30) alan: I don't see this being used anywhere - should remove this. > +#define GEN11_HECI2_RSVD_INTR_MASK _MMIO(0x1900e4) alan: GEN11? don't you mean GEN12? > +void intel_gsc_proxy_irq_handler(struct intel_gsc_uc *gsc, u32 iir) > +{ > + struct intel_gt *gt = gsc_uc_to_gt(gsc); > + > + if (unlikely(!iir)) > + return; > + > + lockdep_assert_held(gt->irq_lock); > + > + if (!gsc->proxy.component) { > + gt_err(gt, "GSC proxy irq received without the component being > bound!\n"); alan: So although proxy init is only a one-time thing (even surviving suspend-resume), we know that proxy runtime irqs could happen anytime (depending on other gpu-security-related system interactions). However, would the component driver be bound/unbound through a suspend-resume cycle?
If yes, then would this check fail if an IRQ for a proxy request came too early after a resume cycle? If yes, then should we not do something here to ensure that when the component gets bound later, we know there is something waiting to be processed? (in bind, if proxy-init was done before, but we have outstanding IRQs, then we can trigger the worker from there, i.e. the bind func?) alan:snip > static int i915_gsc_proxy_component_bind(struct device *i915_kdev, >struct device *tee_kdev, void *data) > { > struct drm_i915_private *i915 = kdev_to_i915(i915_kdev); > - struct intel_gsc_uc *gsc = >media_gt->uc.gsc; > + struct intel_gt *gt = i915->media_gt; > + struct intel_gsc_uc *gsc = >uc.gsc; > + intel_wakeref_t wakeref; > + > + /* enable HECI2 IRQs */ > + with_intel_runtime_pm(>runtime_pm, wakeref) > + intel_uncore_rmw(gt->uncore, HECI_H_CSR(MTL_GSC_HECI2_BASE), > + 0, HECI_H_CSR_IE); alan: i notice we don't seem to care about potentially re-writing a '1' into reset if it was mid-reset when we read. Shouldn't we also protect against that here? alan:snip > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h > b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h > index 023bded10dde..a2a0813b8a76 100644 > --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h > +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h > @@ -23,6 +23,9 @@ struct intel_gsc_uc { > /* for delayed load and proxy handling */ > struct workqueue_struct *wq; > struct work_struct work; > + u32 gsc_work_actions; /* protected by gt->irq_lock */ > +#define GSC_ACTION_FW_LOAD BIT(0) > +#define GSC_ACTION_SW_PROXY BIT(1 > alan: perhaps it's time to have a substruct for "proxy_worker" and include 'wq' and 'work' in addition to actions? > ) > > struct { > struct i915_gsc_proxy_component *component;
Re: [PATCH 3/4] drm/i915/gsc: add initial support for GSC proxy
I have a number of comments but most are personal preferences and so i labelled them nits. I did catch a few minor coding styling issues and am assuming those need to be enforced as per i915/kernel rules? That said, since they are so minor (or maybe they are not strict), I'm providing a conditional RB to fix those 4 issues (i.e. the header inclusion alphabetical ordering and struct '{' bracket position) Reviewed-by: Alan Previn On Wed, 2023-03-29 at 09:56 -0700, Ceraolo Spurio, Daniele wrote: > The GSC uC needs to communicate with the CSME to perform certain > operations. Since the GSC can't perform this communication directly > on platforms where it is integrated in GT, i915 needs to transfer the > messages from GSC to CSME and back. > The proxy flow is as follow: > 1 - i915 submits a request to GSC asking for the message to CSME > 2 - GSC replies with the proxy header + payload for CSME > 3 - i915 sends the reply from GSC as-is to CSME via the mei proxy > component > 4 - CSME replies with the proxy header + payload for GSC > 5 - i915 submits a request to GSC with the reply from CSME > 6 - GSC replies either with a new header + payload (same as step 2, > so we restart from there) or with an end message. > > After GSC load, i915 is expected to start the first proxy message chain, > while all subsequent ones will be triggered by the GSC via interrupt. > > To communicate with the CSME, we use a dedicated mei component, which > means that we need to wait for it to bind before we can initialize the > proxies. This usually happens quite fast, but given that there is a > chance that we'll have to wait a few seconds the GSC work has been moved > to a dedicated WQ to not stall other processes. 
> > Signed-off-by: Daniele Ceraolo Spurio > Cc: Alan Previn > --- > drivers/gpu/drm/i915/Makefile | 1 + > drivers/gpu/drm/i915/gt/uc/intel_gsc_proxy.c | 384 ++ > drivers/gpu/drm/i915/gt/uc/intel_gsc_proxy.h | 17 + > drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c | 40 +- > drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h | 14 +- > .../i915/gt/uc/intel_gsc_uc_heci_cmd_submit.h | 1 + > 6 files changed, 452 insertions(+), 5 deletions(-) > create mode 100644 drivers/gpu/drm/i915/gt/uc/intel_gsc_proxy.c > create mode 100644 drivers/gpu/drm/i915/gt/uc/intel_gsc_proxy.h > alan:snip > new file mode 100644 > index ..ed8f68e78c26 > --- /dev/null > +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_proxy.c > @@ -0,0 +1,384 @@ > +#include "intel_gsc_proxy.h" > + > +// SPDX-License-Identifier: MIT > +/* > + * Copyright © 2022 Intel Corporation alan: nit: 2022 - 2023? > + */ > + > +#include > +#include "drm/i915_gsc_proxy_mei_interface.h" alan: alphabetical > +#include "drm/i915_component.h" alan: snip > +/* > + * GSC proxy: > + * The GSC uC needs to communicate with the CSME to perform certain > operations. > + * Since the GSC can't perform this communication directly on platforms > where it > + * is integrated in GT, i915 needs to transfer the messages from GSC to CSME > + * and back. i915 must manually start the proxy flow after the GSC is loaded > to > + * signal to GSC that we're ready to handle its messages and allow it to > query > + * its init data from CSME; GSC will then trigger an HECI2 interrupt if it > needs > + * to send messages to CSME again. 
> + * The proxy flow is as follow: > + * 1 - i915 submits a request to GSC asking for the message to CSME > + * 2 - GSC replies with the proxy header + payload for CSME > + * 3 - i915 sends the reply from GSC as-is to CSME via the mei proxy > component > + * 4 - CSME replies with the proxy header + payload for GSC > + * 5 - i915 submits a request to GSC with the reply from CSME > + * 6 - GSC replies either with a new header + payload (same as step 2, so we > + * restart from there) or with an end message. > + */ > + > +/* > + * The component should load quite quickly in most cases, but it could take > + * a bit. Using a very big timeout just to cover the worst case scenario > + */ > +#define GSC_PROXY_INIT_TIMEOUT_MS 2 > + > +/* the protocol supports up to 32K in each direction */ > +#define GSC_PROXY_BUFFER_SIZE SZ_32K > +#define GSC_PROXY_CHANNEL_SIZE (GSC_PROXY_BUFFER_SIZE * 2) > +#define GSC_PROXY_MAX_MSG_SIZE (GSC_PROXY_BUFFER_SIZE - sizeof(struct > intel_gsc_mtl_header)) > + > +/* FW-defined proxy header */ > +struct intel_gsc_proxy_header > +{ alan: i thought we typically put the '{' on the same line as the struct name > alan:snip > +struct gsc_proxy_msg > +{ alan: shouldnt the '{' be above? > + struct intel_gsc_mtl_header header; > + struct intel_gsc_proxy_header proxy_header; > +} __packed; > + > +static int proxy_send_to_csme(struct intel_gsc_uc *gsc) > +{ > + struct intel_gt *gt = gsc_uc_to_gt(gsc); > + struct i915_gsc_proxy_component *comp = gsc->proxy.component; > + struct intel_gsc_mtl_header *hdr; > + void *in = gsc->proxy.to_csme; > + void *out =