On Fri, 2022-01-07 at 09:03 +0000, Tvrtko Ursulin wrote:
> On 06/01/2022 18:33, Teres Alexis, Alan Previn wrote:
> > On Thu, 2022-01-06 at 09:38 +0000, Tvrtko Ursulin wrote:
> > > On 05/01/2022 17:30, Teres Alexis, Alan Previn wrote:
> > > > On Tue, 2022-01-04 at 13:56 +0000, Tvrtko Ursulin wrote:
> > > > > > The flow of events is as below:
> > > > > > 
> > > > > > 1. GuC sends notification that an error capture was done and is
> > > > > >    ready to take.
> > > > > >    - at this point we copy the GuC error-capture dump into an
> > > > > >      interim store (a larger buffer that can hold multiple captures).
> > > > > > 2. GuC sends notification that a context was reset (after the prior).
> > > > > >    - this triggers a call to i915_gpu_coredump with the corresponding
> > > > > >      engine-mask from the context that was reset.
> > > > > >    - i915_gpu_coredump proceeds to gather the entire GPU state,
> > > > > >      including driver state, global GPU state, engine state, context
> > > > > >      vmas and also engine registers. For the engine registers we now
> > > > > >      call into the guc_capture code, which merely needs to verify
> > > > > >      that GuC had already done step 1 and that we have data ready to
> > > > > >      be parsed.
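(Aside for readers of the archive: below is a quick standalone sketch, not the actual i915/GuC code, of the two-step flow described above. The function names, the interim-store size and the payload are made up purely for illustration.)

#include <stdio.h>
#include <string.h>

#define INTERIM_STORE_SIZE 4096	/* hypothetical: big enough for several captures */

static char interim_store[INTERIM_STORE_SIZE];
static size_t interim_used;

/* step 1: GuC -> host "an error capture is ready to take" -> copy it out now */
static void on_guc_capture_notify(const char *capture, size_t len)
{
	if (interim_used + len > INTERIM_STORE_SIZE)
		return;	/* real code would age out or wrap old captures */
	memcpy(interim_store + interim_used, capture, len);
	interim_used += len;
}

/* step 2: GuC -> host "a context was reset" -> coredump only needs to verify
 * step 1 already happened and parse the data that was stored back then */
static void on_guc_context_reset_notify(unsigned int engine_mask)
{
	printf("coredump for engines 0x%x: %zu bytes of capture ready to parse\n",
	       engine_mask, interim_used);
}

int main(void)
{
	on_guc_capture_notify("regs...", 7);	/* notification 1 arrives first */
	on_guc_context_reset_notify(0x1);	/* notification 2 arrives later */
	return 0;
}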
> > > > > 
> > > > > What about the time between the actual reset and receiving the
> > > > > context reset notification? The latter will contain
> > > > > intel_context->guc_id - can that be re-assigned or "retired" in
> > > > > between the two and so cause problems for matching the correct
> > > > > (or any) vmas?
> > > > > 
> > > > No, it cannot, because it is only after the context reset notification
> > > > that i915 starts taking action against that context - and even that
> > > > happens after the i915_gpu_coredump (engine-mask-of-context) call.
> > > > That's what I've observed in the code flow.
> > > 
> > > The fact it is "only after" is exactly why I asked.
> > > 
> > > The reset notification is in a CT queue with other stuff, right? So it
> > > can arrive some unrelated time after the actual reset. The question is
> > > whether the context could be retired in the meantime and the guc_id
> > > released.
> > > 
> > > Because i915 has no idea there was a reset until this delayed message
> > > comes over, but it could see a user interrupt signalling the end of the
> > > batch after the reset has happened, unbeknownst to i915, right?
> > > 
> > > Perhaps the answer is guc_id cannot be released via the request retire
> > > flows. Or GuC signaling release of guc_id is a thing, which is then
> > > ordered via the same CT buffer.
> > > 
> > > I don't know, just asking.
> > > 
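(Aside: a purely illustrative sketch of the second hypothesis in the quoted paragraph above - that ordering the guc_id release behind the reset notification in the same CT FIFO would close the race. This is not the real CT code; the message names are invented.)

#include <stdio.h>

enum g2h_msg { G2H_CONTEXT_RESET, G2H_GUC_ID_RELEASE };

/* if both messages travel through the one FIFO, the host is guaranteed to
 * process the reset notification while the guc_id is still resolvable */
static const enum g2h_msg ct_fifo[] = {
	G2H_CONTEXT_RESET,	/* queued by GuC at reset time */
	G2H_GUC_ID_RELEASE,	/* can only be queued afterwards */
};

int main(void)
{
	unsigned int i;

	for (i = 0; i < sizeof(ct_fifo) / sizeof(ct_fifo[0]); i++)
		printf("host processes: %s\n",
		       ct_fifo[i] == G2H_CONTEXT_RESET ?
		       "context reset (guc_id still resolvable)" :
		       "guc_id release");
	return 0;
}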
> > As long as the context is pinned, the guc-id won't be re-assigned. After
> > a bit of offline brain-dump from John Harrison, there are many factors
> > that can keep the context pinned (refcounts), including new or
> > outstanding requests. So a guc-id can't get re-assigned between a
> > capture-notify and a context-reset even if that outstanding request is
> > the only refcount left, since it would still be considered outstanding
> > by the driver. I also think we may be talking past each other in the
> > sense that the guc-id is something the driver assigns to a context being
> > pinned and only the driver can un-assign it (both assigning and
> > unassigning are via H2G interactions). I get the sense you are assuming
> > the GuC can un-assign the guc-ids on its own - which isn't the case.
> > Apologies if I mis-assumed.
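(Aside: a toy, self-contained model - not the real i915 code, all names invented - of the lifetime rule described above: the driver assigns the guc-id when the context gets pinned and only the driver releases it, so an outstanding reference keeps the id stable across both notifications.)

#include <assert.h>
#include <stdio.h>

struct toy_context {
	int refcount;	/* requests, pins, etc. keeping the context alive */
	int guc_id;	/* -1 == not assigned */
};

static void toy_context_get(struct toy_context *ce)
{
	if (ce->refcount++ == 0)
		ce->guc_id = 42;	/* "H2G register": driver picks the id */
}

static void toy_context_put(struct toy_context *ce)
{
	assert(ce->refcount > 0);
	if (--ce->refcount == 0)
		ce->guc_id = -1;	/* "H2G deregister": only here is it freed */
}

int main(void)
{
	struct toy_context ce = { .refcount = 0, .guc_id = -1 };

	toy_context_get(&ce);			/* outstanding request */
	printf("capture notify: guc_id=%d\n", ce.guc_id);
	printf("reset notify:   guc_id=%d\n", ce.guc_id);
	toy_context_put(&ce);			/* request retires after both */
	return 0;
}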
> 
> I did not think GuC could re-assign ce->guc_id. I asked about
> request/context complete/retire happening before the reset/capture
> notification is received.
> 
> That would be the time window between the last intel_context_put (so the
> last i915_request_put from retire, at which point AFAICT the GuC code
> releases the guc_id) and the notification. Execution timeline like:
> 
> ------ rq1 ------|------ rq2 ------|
>     ^ engine reset                 ^ rq2, rq1 retire, guc_id released
>                                        ^ GuC reset notify received -
>                                          guc_id not known any more?
> 
> You are saying something is guaranteed to be holding onto the guc_id at the 
> point of receiving the notification? "There are many factors that can keep 
> the context pinned" - what is it in this case? Or the case cannot happen?
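(Aside: the failure mode being asked about, as a tiny standalone sketch - not taken from i915, the table and names are hypothetical: if the guc_id had already been released when the delayed notification is processed, the lookup finds nothing and the capture can no longer be matched to its vmas.)

#include <stdio.h>

#define MAX_GUC_IDS 8

/* guc_id -> context pointer; NULL means the id is free / recycled */
static void *guc_id_table[MAX_GUC_IDS];

static void handle_reset_notify(int guc_id)
{
	if (guc_id < 0 || guc_id >= MAX_GUC_IDS || !guc_id_table[guc_id]) {
		printf("reset notify for guc_id %d: context already gone\n",
		       guc_id);
		return;
	}
	printf("reset notify for guc_id %d: vmas can still be matched\n",
	       guc_id);
}

int main(void)
{
	int dummy;

	guc_id_table[3] = &dummy;	/* context pinned, id assigned */
	handle_reset_notify(3);		/* notification processed in time */

	guc_id_table[3] = NULL;		/* retired and released before notify? */
	handle_reset_notify(3);		/* the window being asked about */
	return 0;
}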
> 
> Regards,
> 
> Tvrtko

The above chart is incorrect: the GuC reset notification is sent from GuC to
the host before it sends the engine reset notification.
...alan
