Re: [PATCH v13 00/13] Add GuC Error Capture Support
On 23/03/2022 13:19, Tvrtko Ursulin wrote: Hi, On 21/03/2022 16:45, Alan Previn wrote: This series: 1. Enables support of GuC to report error-state-capture using a list of MMIO registers the driver registers and GuC will dump, log and notify right before a GuC triggered engine-reset event. 2. Updates the ADS blob creation to register said lists of global, engine class and engine instance registers with GuC. 3. Defines tables of register lists that are global or engine class or engine instance in scope. 4. Updates usage and buffer-state data for the regions of the shared GuC log-buffer to accomdate both the existing relay logging of general debug logs along with the new error state capture usage. 5. Using a pool of preallocated memory, provide ability to extract and format the GuC reported register-capture data into chunks consistent with existing i915 error- state collection flows and structures. 6. Connects the i915_gpu_coredump reporting function to the GuC error capture module to print all GuC error state capture dumps that is reported. Story is finished with this series and we have everything on feature parity? Intel_error_decode handles it fine? Would you have an example error capture at hand, I am curious how it looks? Ping? I need to write a meaningful pull request text in ~2 weeks time when submitting the feature to drm-next so looking for a high level statement as described - does it complete the work and is error capture now at feature parity and all usual userspace tools to access it handle it? Regards, Tvrtko Regards, Tvrtko This is the 13th rev of this series where the first 3 revs are RFC Prior receipts of rvb's: - Patch #2, #3, #4, #5, #10, #11, #12, #13 have received R-v-b's from Umesh Nerlige Ramappa - Patch #1, #6, #7, #8, #9 has received an R-v-b from Matthew Brost . NOTE: some of these came in on the trybot series. https://patchwork.freedesktop.org/series/100831/ Changes from prior revs: v13:- Fixing register list definition styling as per Jani's request. v12:- Re-sending it because previous revs only got to intel-gfx, and only cover letter was in dri-devel. Also rebased again. v11:- Rebase again on latest drm-tip to fix merge error. v10:- Rebase on latest drm-tip again. Fix a number of checkpatch warnings and an error Reported-by: kernel test robot . v9: - Rebase on latest drm-tip to solve CI merge-build error. v8: - Fix a bug found by CI in rev7: Create a cached ADS capture list for null-header like the other lists. - Fixed a bug on the ggtt offset calculation in the ADS population loop. Thanks to Matt Brost. - Change the storage uses for initial allocation and caching of the ADS register lists so we only store a regular pointer instead of file handle. - Multiple improvements on code styling, variable names, comments and code reduction from Umesh suggestions across multiple patches. v7: - Rebased on lastest drm_tip that has the ADS now using shmem based ads_blob_write utilities. Stress test was performed with this patch included to fix a legacy bug: https://patchwork.freedesktop.org/series/100768/ v6: - In patch #1, ADS reg-list population, we now alloc regular memory to create the lists and cache them for simpler and faster use by GuC ADS module at init, suspend-resume and reset cycles. This was in response to review comments from Lucas De Marchi that also wanted to ensure the GuC ADS module owns the final copying into the ADS phyical memory. - Thanks to Jani Nikula for pointing out that patch #2 and #3 should ensure static tables as constant and dynamic lists should be allocated and cached but attached to the GT level for the case of multiple cards with different fusings for steered registers. These are addressed now along with multiple code style fixups (thanks to review comment from Umesh) and splitting the steered register list generation as a seperate patch. - The extraction functionality, Patch #10 and #11 (was patch #7), has fixed all of Umesh's review comments related to the code styling. Additionally, it was discovered during stress tests that the extraction function could be called by the ct processing thread at the same time as the start of a GT reset event. Thus, a redesign was done whereby the linked list of processed capture-output-nodes are allocated up front and reused throughout the driver's life to ensure no memory locks are taken during extraction. - For patch #6 (now 7, 8 and 9), updates to intel_guc_log was split into
Re: [PATCH v13 00/13] Add GuC Error Capture Support
Hi, On 21/03/2022 16:45, Alan Previn wrote: This series: 1. Enables support of GuC to report error-state-capture using a list of MMIO registers the driver registers and GuC will dump, log and notify right before a GuC triggered engine-reset event. 2. Updates the ADS blob creation to register said lists of global, engine class and engine instance registers with GuC. 3. Defines tables of register lists that are global or engine class or engine instance in scope. 4. Updates usage and buffer-state data for the regions of the shared GuC log-buffer to accomdate both the existing relay logging of general debug logs along with the new error state capture usage. 5. Using a pool of preallocated memory, provide ability to extract and format the GuC reported register-capture data into chunks consistent with existing i915 error- state collection flows and structures. 6. Connects the i915_gpu_coredump reporting function to the GuC error capture module to print all GuC error state capture dumps that is reported. Story is finished with this series and we have everything on feature parity? Intel_error_decode handles it fine? Would you have an example error capture at hand, I am curious how it looks? Regards, Tvrtko This is the 13th rev of this series where the first 3 revs are RFC Prior receipts of rvb's: - Patch #2, #3, #4, #5, #10, #11, #12, #13 have received R-v-b's from Umesh Nerlige Ramappa - Patch #1, #6, #7, #8, #9 has received an R-v-b from Matthew Brost . NOTE: some of these came in on the trybot series. https://patchwork.freedesktop.org/series/100831/ Changes from prior revs: v13:- Fixing register list definition styling as per Jani's request. v12:- Re-sending it because previous revs only got to intel-gfx, and only cover letter was in dri-devel. Also rebased again. v11:- Rebase again on latest drm-tip to fix merge error. v10:- Rebase on latest drm-tip again. Fix a number of checkpatch warnings and an error Reported-by: kernel test robot . v9: - Rebase on latest drm-tip to solve CI merge-build error. v8: - Fix a bug found by CI in rev7: Create a cached ADS capture list for null-header like the other lists. - Fixed a bug on the ggtt offset calculation in the ADS population loop. Thanks to Matt Brost. - Change the storage uses for initial allocation and caching of the ADS register lists so we only store a regular pointer instead of file handle. - Multiple improvements on code styling, variable names, comments and code reduction from Umesh suggestions across multiple patches. v7: - Rebased on lastest drm_tip that has the ADS now using shmem based ads_blob_write utilities. Stress test was performed with this patch included to fix a legacy bug: https://patchwork.freedesktop.org/series/100768/ v6: - In patch #1, ADS reg-list population, we now alloc regular memory to create the lists and cache them for simpler and faster use by GuC ADS module at init, suspend-resume and reset cycles. This was in response to review comments from Lucas De Marchi that also wanted to ensure the GuC ADS module owns the final copying into the ADS phyical memory. - Thanks to Jani Nikula for pointing out that patch #2 and #3 should ensure static tables as constant and dynamic lists should be allocated and cached but attached to the GT level for the case of multiple cards with different fusings for steered registers. These are addressed now along with multiple code style fixups (thanks to review comment from Umesh) and splitting the steered register list generation as a seperate patch. - The extraction functionality, Patch #10 and #11 (was patch #7), has fixed all of Umesh's review comments related to the code styling. Additionally, it was discovered during stress tests that the extraction function could be called by the ct processing thread at the same time as the start of a GT reset event. Thus, a redesign was done whereby the linked list of processed capture-output-nodes are allocated up front and reused throughout the driver's life to ensure no memory locks are taken during extraction. - For patch #6 (now 7, 8 and 9), updates to intel_guc_log was split into smaller chunks and the log_state structure was returned back to inside of the intel_guc_log struct as opposed to the intel_guc struct in prior rev. This is in response to review comments by Matt Brost. - #Patch 13 (previously #10) is mostly identical but addresses all of the code styling comments reviews
Re: [PATCH v13 00/13] Add GuC Error Capture Support
On Mon, Mar 21, 2022 at 09:45:14AM -0700, Alan Previn wrote: This series: 1. Enables support of GuC to report error-state-capture using a list of MMIO registers the driver registers and GuC will dump, log and notify right before a GuC triggered engine-reset event. 2. Updates the ADS blob creation to register said lists of global, engine class and engine instance registers with GuC. 3. Defines tables of register lists that are global or engine class or engine instance in scope. 4. Updates usage and buffer-state data for the regions of the shared GuC log-buffer to accomdate both the existing relay logging of general debug logs along with the new error state capture usage. 5. Using a pool of preallocated memory, provide ability to extract and format the GuC reported register-capture data into chunks consistent with existing i915 error- state collection flows and structures. 6. Connects the i915_gpu_coredump reporting function to the GuC error capture module to print all GuC error state capture dumps that is reported. This is the 13th rev of this series where the first 3 revs are RFC Prior receipts of rvb's: - Patch #2, #3, #4, #5, #10, #11, #12, #13 have received R-v-b's from Umesh Nerlige Ramappa - Patch #1, #6, #7, #8, #9 has received an R-v-b from Matthew Brost . NOTE: some of these came in on the trybot series. https://patchwork.freedesktop.org/series/100831/ Changes from prior revs: v13:- Fixing register list definition styling as per Jani's request. As in v12, the CI failure in display is unrelated to these changes. Applied to drm-intel-gt-next. Thanks Lucas De Marchi
[PATCH v13 00/13] Add GuC Error Capture Support
This series: 1. Enables support of GuC to report error-state-capture using a list of MMIO registers the driver registers and GuC will dump, log and notify right before a GuC triggered engine-reset event. 2. Updates the ADS blob creation to register said lists of global, engine class and engine instance registers with GuC. 3. Defines tables of register lists that are global or engine class or engine instance in scope. 4. Updates usage and buffer-state data for the regions of the shared GuC log-buffer to accomdate both the existing relay logging of general debug logs along with the new error state capture usage. 5. Using a pool of preallocated memory, provide ability to extract and format the GuC reported register-capture data into chunks consistent with existing i915 error- state collection flows and structures. 6. Connects the i915_gpu_coredump reporting function to the GuC error capture module to print all GuC error state capture dumps that is reported. This is the 13th rev of this series where the first 3 revs are RFC Prior receipts of rvb's: - Patch #2, #3, #4, #5, #10, #11, #12, #13 have received R-v-b's from Umesh Nerlige Ramappa - Patch #1, #6, #7, #8, #9 has received an R-v-b from Matthew Brost . NOTE: some of these came in on the trybot series. https://patchwork.freedesktop.org/series/100831/ Changes from prior revs: v13:- Fixing register list definition styling as per Jani's request. v12:- Re-sending it because previous revs only got to intel-gfx, and only cover letter was in dri-devel. Also rebased again. v11:- Rebase again on latest drm-tip to fix merge error. v10:- Rebase on latest drm-tip again. Fix a number of checkpatch warnings and an error Reported-by: kernel test robot . v9: - Rebase on latest drm-tip to solve CI merge-build error. v8: - Fix a bug found by CI in rev7: Create a cached ADS capture list for null-header like the other lists. - Fixed a bug on the ggtt offset calculation in the ADS population loop. Thanks to Matt Brost. - Change the storage uses for initial allocation and caching of the ADS register lists so we only store a regular pointer instead of file handle. - Multiple improvements on code styling, variable names, comments and code reduction from Umesh suggestions across multiple patches. v7: - Rebased on lastest drm_tip that has the ADS now using shmem based ads_blob_write utilities. Stress test was performed with this patch included to fix a legacy bug: https://patchwork.freedesktop.org/series/100768/ v6: - In patch #1, ADS reg-list population, we now alloc regular memory to create the lists and cache them for simpler and faster use by GuC ADS module at init, suspend-resume and reset cycles. This was in response to review comments from Lucas De Marchi that also wanted to ensure the GuC ADS module owns the final copying into the ADS phyical memory. - Thanks to Jani Nikula for pointing out that patch #2 and #3 should ensure static tables as constant and dynamic lists should be allocated and cached but attached to the GT level for the case of multiple cards with different fusings for steered registers. These are addressed now along with multiple code style fixups (thanks to review comment from Umesh) and splitting the steered register list generation as a seperate patch. - The extraction functionality, Patch #10 and #11 (was patch #7), has fixed all of Umesh's review comments related to the code styling. Additionally, it was discovered during stress tests that the extraction function could be called by the ct processing thread at the same time as the start of a GT reset event. Thus, a redesign was done whereby the linked list of processed capture-output-nodes are allocated up front and reused throughout the driver's life to ensure no memory locks are taken during extraction. - For patch #6 (now 7, 8 and 9), updates to intel_guc_log was split into smaller chunks and the log_state structure was returned back to inside of the intel_guc_log struct as opposed to the intel_guc struct in prior rev. This is in response to review comments by Matt Brost. - #Patch 13 (previously #10) is mostly identical but addresses all of the code styling comments reviews from Umesh. v5: - Added Gen9->Gen11 register list for CI coverage that included Gen9 with GuC submission. - Redesigned the extraction of the GuC error-capture dumps by grouping them into complete per-engine-reset nodes. Complete here means each node includes the global, engine-class and