On 24/12/2021 13:34, Teres Alexis, Alan Previn wrote:
On Fri, 2021-12-24 at 12:09 +0000, Tvrtko Ursulin wrote:
Hi,
Somehow I stumbled on this while browsing through the mailing list.
On 23/12/2021 18:54, Teres Alexis, Alan Previn wrote:
Revisiting below hunk of patch-7 comment, as per offline discussion with Matt,
there is little benefit to even making that guc-id lookup because:
1. the delay between the context reset notification (when the vmas are copied
and when we verify we had received a guc err capture dump) may be
subjectively
large enough and not tethered that the guc-id may have already been
re-assigned.
2. I was really looking for some kind of unique context handle to print out
that could
be correlated (by user inspecting the dump) back to a unique app or
process or
context-id but cant find such a param in struct intel_context.
As part of further reviewing the end to end flows and possible error scenarios,
there
also may potentially be a mismatch between "which context was reset by guc at
time-n"
vs "which context's vma buffers is being printed out at time-n+x" if
we are experiencing back-to-back resets and the user dumped the debugfs x-time
later.
What does this all actually mean, because it sounds rather alarming,
that it just won't be possible to know which context, belonging to which
process, was reset? And because of guc_id potentially re-assigned even
the captured VMAs may not be the correct ones?
The flow of events are as below:
1. guc sends notification that an error capture was done and ready to take.
- at this point we copy the guc error captured dump into an interim
store
(larger buffer that can hold multiple captures).
2. guc sends notification that a context was reset (after the prior)
- this triggers a call to i915_gpu_coredump with the corresponding
engine-mask
from the context that was reset
- i915_gpu_coredump proceeds to gather entire gpu state including
driver state,
global gpu state, engine state, context vmas and also engine
registers. For the
engine registers now call into the guc_capture code which merely
needs to verify
that GuC had already done a step 1 and we have data ready to be
parsed.
What about the time between the actual reset and receiving the context
reset notification? Latter will contain intel_context->guc_id - can that
be re-assigned or "retired" in between the two and so cause problems for
matching the correct (or any) vmas?
Regards,
Tvrtko
(time elapses)
3. end user triggers the sysfs to dump the error state and all prior
information is
printed out in proper format.
Between 2 and 3:
- Looking at existing framework (established by execlist-capture codes),
I believe we only hold on to the first error state capture and drop any
subsequent context reset captures occurring before #3 (i.e. before the
end user
triggers the debugfs)
- However, in that same space, guc can send us more and more error-capture
logs
long as we have space for it in the buffer.
So the issue was that in my original patch, for every next capture-snaphot we
find in
guc-error-capture output buffer, i would find the matching engine and print out
all
the VMA data (that was successfully captured in #2). However, i should only do
that
for the first dump only since that would correlate exactly with the existing
execlist
code behavior. So this fix is actually pretty straight forward to get the right
matching
VMA.
WRT to my statement about "getting the context-to->process" lookup, i was
initially hoping that
I could "on my own" (within the guc-err-capture module) get that information,
but it would be
a stretch (in terms of inter-component information access). More importantly,
its totally
unnecessary since existing execlist code already did that in Step 2. That code
remains intact
with guc-error-capture.
One open i plan to test before final rev is with shared engines like CCS and
RCS where i want to
trigger cascading hangs + resets in quick succession just to see how the
overall flow behavior works.
I will attach an output guc error capture based gpu error dump as per the
review comment from Matthew
on last rev.
..alan
Regards,
Tvrtko
(Recap: First, guc notifies capture event, second, guc notifies context reset
during
which we trigger i915_gpu_coredump. In this second step, the vma's are dumped
and we
verify that the guc capture happened but don't parse the guc-err-capture-logs
yet.
Third step is when user triggers the debugfs to dump which is when we parse the
error
capture logs.)
As a fix, what we can do in the guc_error_capture report out is to ensure that
we dont re-print the previously dumped vmas if we end up finding multiple
guc-error-capture dumps since the i915_gpu_coredump would have only captured
the vma's
for the very first context that was reset. And with guc-submission, that would
always
correlate to the "next-yet-to-be-parsed" guc-err-capture dump (since the
guc-error-capture
logs are large enough to hold data for multiple dumps).
The changes (removal of below-hunk and adding of only-print-the-first-vma") is
trivial
but i felt it warranted a good explanation. Apologies for the inbox noise.
...alan
On Tue, 2021-12-07 at 22:32 -0800, Alan Previn Teres Alexis wrote:
Thanks again for the detailed review here.
Will fix all the rest on next rev.
One special response for this one:
On Tue, 2021-12-07 at 16:22 -0800, Matthew Brost wrote:
On Mon, Nov 22, 2021 at 03:04:02PM -0800, Alan Previn wrote:
+ if (datatype == GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE) {
+ GCAP_PRINT_GUC_INST_INFO(i915, ebuf, data);
+ eng_inst =
FIELD_GET(GUC_CAPTURE_DATAHDR_SRC_INSTANCE, data.info);
+ eng = guc_lookup_engine(guc, engineclass,
eng_inst);
+ if (eng) {
+ GCAP_PRINT_INTEL_ENG_INFO(i915, ebuf,
eng);
+ } else {
+ PRINT(&i915->drm, ebuf, " i915-Eng-Lookup
Fail!\n");
+ }
+ ce = guc_context_lookup(guc, data.guc_ctx_id);
You are going to need to reference count the 'ce' here. See
intel_guc_context_reset_process_msg for an example.
Oh crap - i missed this one - which you had explicitly mentioned offline when i
was doing the
development. Sorry about that i just totally missed it from my todo-notes.
...alan