Re: [PATCH v13 00/13] Add GuC Error Capture Support

2022-04-05 Thread Tvrtko Ursulin



On 23/03/2022 13:19, Tvrtko Ursulin wrote:


Hi,

On 21/03/2022 16:45, Alan Previn wrote:

This series:
   1. Enables support of GuC to report error-state-capture
  using a list of MMIO registers the driver registers
  and GuC will dump, log and notify right before a GuC
  triggered engine-reset event.
   2. Updates the ADS blob creation to register said lists
  of global, engine class and engine instance registers
  with GuC.
   3. Defines tables of register lists that are global or
  engine class or engine instance in scope.
   4. Updates usage and buffer-state data for the regions
  of the shared GuC log-buffer to accomdate both
  the existing relay logging of general debug logs
  along with the new error state capture usage.
   5. Using a pool of preallocated memory, provide ability
  to extract and format the GuC reported register-capture
  data into chunks consistent with existing i915 error-
  state collection flows and structures.
   6. Connects the i915_gpu_coredump reporting function
  to the GuC error capture module to print all GuC
  error state capture dumps that is reported.


Story is finished with this series and we have everything on feature 
parity? Intel_error_decode handles it fine?


Would you have an example error capture at hand, I am curious how it looks?


Ping? I need to write a meaningful pull request text in ~2 weeks time 
when submitting the feature to drm-next so looking for a high level 
statement as described - does it complete the work and is error capture 
now at feature parity and all usual userspace tools to access it handle it?


Regards,

Tvrtko





Regards,

Tvrtko



This is the 13th rev of this series where the first 3 revs
are RFC

Prior receipts of rvb's:
   - Patch #2, #3, #4, #5, #10, #11, #12, #13 have received
 R-v-b's from Umesh Nerlige Ramappa 
   - Patch #1, #6, #7, #8, #9 has received an R-v-b from Matthew Brost
 . NOTE: some of these came in on the
 trybot series. https://patchwork.freedesktop.org/series/100831/

Changes from prior revs:
   v13:- Fixing register list definition styling as per Jani's request.
   v12:- Re-sending it because previous revs only got to intel-gfx,
 and only cover letter was in dri-devel. Also rebased again.
   v11:- Rebase again on latest drm-tip to fix merge error.
   v10:- Rebase on latest drm-tip again. Fix a number of checkpatch
 warnings and an error Reported-by: kernel test robot 
.

   v9: - Rebase on latest drm-tip to solve CI merge-build error.
   v8: - Fix a bug found by CI in rev7: Create a cached ADS
 capture list for null-header like the other lists.
   - Fixed a bug on the ggtt offset calculation in the
 ADS population loop. Thanks to Matt Brost.
   - Change the storage uses for initial allocation and
 caching of the ADS register lists so we only store
 a regular pointer instead of file handle.
   - Multiple improvements on code styling, variable names,
 comments and code reduction from Umesh suggestions
 across multiple patches.

   v7: - Rebased on lastest drm_tip that has the ADS now using
 shmem based ads_blob_write utilities. Stress test
 was performed with this patch included to fix a
 legacy bug:
 https://patchwork.freedesktop.org/series/100768/

   v6: - In patch #1, ADS reg-list population, we now alloc
 regular memory to create the lists and cache them for
 simpler and faster use by GuC ADS module at init,
 suspend-resume and reset cycles. This was in response
 to review comments from Lucas De Marchi that also
 wanted to ensure the GuC ADS module owns the final
 copying into the ADS phyical memory.
   - Thanks to Jani Nikula for pointing out that patch #2
 and #3 should ensure static tables as constant and
 dynamic lists should be allocated and cached but
 attached to the GT level for the case of multiple
 cards with different fusings for steered registers.
 These are addressed now along with multiple code
 style fixups (thanks to review comment from Umesh)
 and splitting the steered register list generation
 as a seperate patch.
   - The extraction functionality, Patch #10 and #11 (was
 patch #7), has fixed all of Umesh's review comments
 related to the code styling. Additionally, it was
 discovered during stress tests that the extraction
 function could be called by the ct processing thread
 at the same time as the start of a GT reset event.
 Thus, a redesign was done whereby the linked list of
 processed capture-output-nodes are allocated up
 front and reused throughout the driver's life to
 ensure no memory locks are taken during extraction.
   - For patch #6 (now 7, 8 and 9), updates to
 intel_guc_log was split into 

Re: [PATCH v13 00/13] Add GuC Error Capture Support

2022-03-23 Thread Tvrtko Ursulin



Hi,

On 21/03/2022 16:45, Alan Previn wrote:

This series:
   1. Enables support of GuC to report error-state-capture
  using a list of MMIO registers the driver registers
  and GuC will dump, log and notify right before a GuC
  triggered engine-reset event.
   2. Updates the ADS blob creation to register said lists
  of global, engine class and engine instance registers
  with GuC.
   3. Defines tables of register lists that are global or
  engine class or engine instance in scope.
   4. Updates usage and buffer-state data for the regions
  of the shared GuC log-buffer to accomdate both
  the existing relay logging of general debug logs
  along with the new error state capture usage.
   5. Using a pool of preallocated memory, provide ability
  to extract and format the GuC reported register-capture
  data into chunks consistent with existing i915 error-
  state collection flows and structures.
   6. Connects the i915_gpu_coredump reporting function
  to the GuC error capture module to print all GuC
  error state capture dumps that is reported.


Story is finished with this series and we have everything on feature 
parity? Intel_error_decode handles it fine?


Would you have an example error capture at hand, I am curious how it looks?

Regards,

Tvrtko



This is the 13th rev of this series where the first 3 revs
are RFC

Prior receipts of rvb's:
   - Patch #2, #3, #4, #5, #10, #11, #12, #13 have received
 R-v-b's from Umesh Nerlige Ramappa 
   - Patch #1, #6, #7, #8, #9 has received an R-v-b from Matthew Brost
 . NOTE: some of these came in on the
 trybot series. https://patchwork.freedesktop.org/series/100831/

Changes from prior revs:
   v13:- Fixing register list definition styling as per Jani's request.
   v12:- Re-sending it because previous revs only got to intel-gfx,
 and only cover letter was in dri-devel. Also rebased again.
   v11:- Rebase again on latest drm-tip to fix merge error.
   v10:- Rebase on latest drm-tip again. Fix a number of checkpatch
 warnings and an error Reported-by: kernel test robot .
   v9: - Rebase on latest drm-tip to solve CI merge-build error.
   v8: - Fix a bug found by CI in rev7: Create a cached ADS
 capture list for null-header like the other lists.
   - Fixed a bug on the ggtt offset calculation in the
 ADS population loop. Thanks to Matt Brost.
   - Change the storage uses for initial allocation and
 caching of the ADS register lists so we only store
 a regular pointer instead of file handle.
   - Multiple improvements on code styling, variable names,
 comments and code reduction from Umesh suggestions
 across multiple patches.

   v7: - Rebased on lastest drm_tip that has the ADS now using
 shmem based ads_blob_write utilities. Stress test
 was performed with this patch included to fix a
 legacy bug:
 https://patchwork.freedesktop.org/series/100768/

   v6: - In patch #1, ADS reg-list population, we now alloc
 regular memory to create the lists and cache them for
 simpler and faster use by GuC ADS module at init,
 suspend-resume and reset cycles. This was in response
 to review comments from Lucas De Marchi that also
 wanted to ensure the GuC ADS module owns the final
 copying into the ADS phyical memory.
   - Thanks to Jani Nikula for pointing out that patch #2
 and #3 should ensure static tables as constant and
 dynamic lists should be allocated and cached but
 attached to the GT level for the case of multiple
 cards with different fusings for steered registers.
 These are addressed now along with multiple code
 style fixups (thanks to review comment from Umesh)
 and splitting the steered register list generation
 as a seperate patch.
   - The extraction functionality, Patch #10 and #11 (was
 patch #7), has fixed all of Umesh's review comments
 related to the code styling. Additionally, it was
 discovered during stress tests that the extraction
 function could be called by the ct processing thread
 at the same time as the start of a GT reset event.
 Thus, a redesign was done whereby the linked list of
 processed capture-output-nodes are allocated up
 front and reused throughout the driver's life to
 ensure no memory locks are taken during extraction.
   - For patch #6 (now 7, 8 and 9), updates to
 intel_guc_log was split into smaller chunks and the
 log_state structure was returned back to inside of
 the intel_guc_log struct as opposed to the
 intel_guc struct in prior rev. This is in response
 to review comments by Matt Brost.
   - #Patch 13 (previously #10) is mostly identical but
 addresses all of the code styling comments reviews
 

Re: [PATCH v13 00/13] Add GuC Error Capture Support

2022-03-22 Thread Lucas De Marchi

On Mon, Mar 21, 2022 at 09:45:14AM -0700, Alan Previn wrote:

This series:
 1. Enables support of GuC to report error-state-capture
using a list of MMIO registers the driver registers
and GuC will dump, log and notify right before a GuC
triggered engine-reset event.
 2. Updates the ADS blob creation to register said lists
of global, engine class and engine instance registers
with GuC.
 3. Defines tables of register lists that are global or
engine class or engine instance in scope.
 4. Updates usage and buffer-state data for the regions
of the shared GuC log-buffer to accomdate both
the existing relay logging of general debug logs
along with the new error state capture usage.
 5. Using a pool of preallocated memory, provide ability
to extract and format the GuC reported register-capture
data into chunks consistent with existing i915 error-
state collection flows and structures.
 6. Connects the i915_gpu_coredump reporting function
to the GuC error capture module to print all GuC
error state capture dumps that is reported.

This is the 13th rev of this series where the first 3 revs
are RFC

Prior receipts of rvb's:
 - Patch #2, #3, #4, #5, #10, #11, #12, #13 have received
   R-v-b's from Umesh Nerlige Ramappa 
 - Patch #1, #6, #7, #8, #9 has received an R-v-b from Matthew Brost
   . NOTE: some of these came in on the
   trybot series. https://patchwork.freedesktop.org/series/100831/

Changes from prior revs:
 v13:- Fixing register list definition styling as per Jani's request.


As in v12, the CI failure in display is unrelated to these changes.

Applied to drm-intel-gt-next.

Thanks
Lucas De Marchi


[PATCH v13 00/13] Add GuC Error Capture Support

2022-03-21 Thread Alan Previn
This series:
  1. Enables support of GuC to report error-state-capture
 using a list of MMIO registers the driver registers
 and GuC will dump, log and notify right before a GuC
 triggered engine-reset event.
  2. Updates the ADS blob creation to register said lists
 of global, engine class and engine instance registers
 with GuC.
  3. Defines tables of register lists that are global or
 engine class or engine instance in scope.
  4. Updates usage and buffer-state data for the regions
 of the shared GuC log-buffer to accomdate both
 the existing relay logging of general debug logs
 along with the new error state capture usage.
  5. Using a pool of preallocated memory, provide ability
 to extract and format the GuC reported register-capture
 data into chunks consistent with existing i915 error-
 state collection flows and structures.
  6. Connects the i915_gpu_coredump reporting function
 to the GuC error capture module to print all GuC
 error state capture dumps that is reported.

This is the 13th rev of this series where the first 3 revs
are RFC

Prior receipts of rvb's:
  - Patch #2, #3, #4, #5, #10, #11, #12, #13 have received
R-v-b's from Umesh Nerlige Ramappa 
  - Patch #1, #6, #7, #8, #9 has received an R-v-b from Matthew Brost
. NOTE: some of these came in on the
trybot series. https://patchwork.freedesktop.org/series/100831/

Changes from prior revs:
  v13:- Fixing register list definition styling as per Jani's request.
  v12:- Re-sending it because previous revs only got to intel-gfx,
and only cover letter was in dri-devel. Also rebased again.
  v11:- Rebase again on latest drm-tip to fix merge error.
  v10:- Rebase on latest drm-tip again. Fix a number of checkpatch
warnings and an error Reported-by: kernel test robot .
  v9: - Rebase on latest drm-tip to solve CI merge-build error.
  v8: - Fix a bug found by CI in rev7: Create a cached ADS
capture list for null-header like the other lists.
  - Fixed a bug on the ggtt offset calculation in the
ADS population loop. Thanks to Matt Brost.
  - Change the storage uses for initial allocation and
caching of the ADS register lists so we only store
a regular pointer instead of file handle.
  - Multiple improvements on code styling, variable names,
comments and code reduction from Umesh suggestions
across multiple patches.

  v7: - Rebased on lastest drm_tip that has the ADS now using
shmem based ads_blob_write utilities. Stress test
was performed with this patch included to fix a
legacy bug:
https://patchwork.freedesktop.org/series/100768/

  v6: - In patch #1, ADS reg-list population, we now alloc
regular memory to create the lists and cache them for
simpler and faster use by GuC ADS module at init, 
suspend-resume and reset cycles. This was in response
to review comments from Lucas De Marchi that also
wanted to ensure the GuC ADS module owns the final
copying into the ADS phyical memory.
  - Thanks to Jani Nikula for pointing out that patch #2
and #3 should ensure static tables as constant and
dynamic lists should be allocated and cached but
attached to the GT level for the case of multiple
cards with different fusings for steered registers.
These are addressed now along with multiple code
style fixups (thanks to review comment from Umesh)
and splitting the steered register list generation
as a seperate patch.
  - The extraction functionality, Patch #10 and #11 (was
patch #7), has fixed all of Umesh's review comments
related to the code styling. Additionally, it was
discovered during stress tests that the extraction
function could be called by the ct processing thread
at the same time as the start of a GT reset event.
Thus, a redesign was done whereby the linked list of
processed capture-output-nodes are allocated up
front and reused throughout the driver's life to
ensure no memory locks are taken during extraction.
  - For patch #6 (now 7, 8 and 9), updates to
intel_guc_log was split into smaller chunks and the
log_state structure was returned back to inside of
the intel_guc_log struct as opposed to the
intel_guc struct in prior rev. This is in response
to review comments by Matt Brost.
  - #Patch 13 (previously #10) is mostly identical but
addresses all of the code styling comments reviews
from Umesh.

  v5: - Added Gen9->Gen11 register list for CI coverage that
included Gen9 with GuC submission.
  - Redesigned the extraction of the GuC error-capture
dumps by grouping them into complete per-engine-reset
nodes. Complete here means each node includes the
global, engine-class and