This set improves per-queue reset support for a number of IPs. When we reset the queue, the queue is lost, so we need to re-emit the unprocessed state from subsequent submissions. This is handled in gfx/compute queues via switch buffer and pipeline sync packets. However, you can still end up with parallel execution across queues. For correctness in that case, enforce isolation needs to be enabled. That can impact certain use cases, however, and in most cases the guilty job is correctly identified even without enforce isolation.
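
As a rough illustration of the re-emit idea in this series, here is a toy userspace model. It is not the amdgpu code: struct toy_job, enum toy_action and toy_plan_reemit() are hypothetical names made up for this sketch, and sequence-number wraparound is ignored. After a queue reset, only the fence of the guilty job is signaled, the other jobs from the guilty context are skipped, and everything else that had not completed is re-emitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy model of a job on a ring; not an amdgpu structure. */
struct toy_job {
	uint32_t seq;	/* fence sequence number, assumed not to wrap */
	uint32_t ctx;	/* id of the submitting context */
};

enum toy_action {
	TOY_DONE,		/* completed before the hang, nothing to do */
	TOY_SIGNAL_ONLY,	/* guilty job: only its fence seq is signaled */
	TOY_SKIP,		/* same context as the guilty job: not re-emitted */
	TOY_REEMIT,		/* non-guilty, unprocessed: re-emitted after reset */
};

/*
 * Fill out[i] with the action for jobs[i] and return how many jobs
 * would be re-emitted. last_signaled is the last seq that completed
 * before the hang; guilty_seq is the seq of the job that hung.
 */
static size_t toy_plan_reemit(const struct toy_job *jobs, size_t n,
			      uint32_t last_signaled, uint32_t guilty_seq,
			      enum toy_action *out)
{
	uint32_t guilty_ctx = 0;
	bool have_guilty = false;
	size_t i, reemitted = 0;

	assert(jobs && out);

	/* Find the context that submitted the guilty job, if it is listed. */
	for (i = 0; i < n; i++) {
		if (jobs[i].seq == guilty_seq) {
			guilty_ctx = jobs[i].ctx;
			have_guilty = true;
			break;
		}
	}

	for (i = 0; i < n; i++) {
		if (jobs[i].seq <= last_signaled) {
			out[i] = TOY_DONE;
		} else if (jobs[i].seq == guilty_seq) {
			out[i] = TOY_SIGNAL_ONLY;
		} else if (have_guilty && jobs[i].ctx == guilty_ctx) {
			out[i] = TOY_SKIP;
		} else {
			out[i] = TOY_REEMIT;
			reemitted++;
		}
	}

	return reemitted;
}
```

In this model, if one of the re-emitted jobs also hangs, the same planning step simply runs again on the next queue reset.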

Tested on GC 10 and 11 chips with a game running and then running hang tests. The game pauses when the hang happens, then continues after the queue reset.

The same approach is extended to SDMA and VCN. They don't need enforce isolation because those engines are single threaded, so they always operate serially.

Rework re-emit to signal the seq number of the bad job and check that to verify that the reset worked, then re-emit the rest of the non-guilty state. This way we are not waiting on the rest of the state to complete, and if the subsequent state also contains a bad job, we'll end up in queue reset again rather than adapter reset.

Git tree:
https://gitlab.freedesktop.org/agd5f/linux/-/commits/kq_resets?ref_type=heads

v4: Drop explicit padding patches
    Drop new timeout macro
    Rework re-emit sequence
v5: Add a helper for reemit
    Convert VCN, JPEG, SDMA to use new helpers
v6: Update SDMA 4.4.2 to use new helpers
    Move ptr tracking to amdgpu_fence
    Skip all jobs from the bad context on the ring
v7: Rework the backup logic
    Move and clean up the guilty logic for engine resets
    Integrate suggestions from Christian
    Add JPEG 4.0.5 support
v8: Add non-guilty ring backup handling
    Clean up new function signatures
    Reorder some bug fixes to the start of the series
v9: Clean up fence_emit
    SDMA 5.x fixes
    Add new reset helpers
    sched wqueue stop/start cleanup
    Add support for VCNs without unified queues
v10: Drop enforce isolation default change
     Add more documentation
     Clean up ring backup logic

Alex Deucher (30):
  drm/amdgpu: remove job parameter from amdgpu_fence_emit()
  drm/amdgpu/sdma5.x: suspend KFD queues in ring reset
  drm/amdgpu: update ring reset function signature
  drm/amdgpu: move force completion into ring resets
  drm/amdgpu: move guilty handling into ring resets
  drm/amdgpu: move scheduler wqueue handling into callbacks
  drm/amdgpu: track ring state associated with a fence
  drm/amdgpu/gfx9: re-emit unprocessed state on kcq reset
  drm/amdgpu/gfx9.4.3: re-emit unprocessed state on kcq reset
  drm/amdgpu/gfx10: re-emit unprocessed state on ring reset
  drm/amdgpu/gfx11: re-emit unprocessed state on ring reset
  drm/amdgpu/gfx12: re-emit unprocessed state on ring reset
  drm/amdgpu/sdma6: re-emit unprocessed state on ring reset
  drm/amdgpu/sdma7: re-emit unprocessed state on ring reset
  drm/amdgpu/jpeg2: re-emit unprocessed state on ring reset
  drm/amdgpu/jpeg2.5: re-emit unprocessed state on ring reset
  drm/amdgpu/jpeg3: re-emit unprocessed state on ring reset
  drm/amdgpu/jpeg4: re-emit unprocessed state on ring reset
  drm/amdgpu/jpeg4.0.3: re-emit unprocessed state on ring reset
  drm/amdgpu/jpeg4.0.5: add queue reset
  drm/amdgpu/jpeg5: add queue reset
  drm/amdgpu/jpeg5.0.1: re-emit unprocessed state on ring reset
  drm/amdgpu/vcn4: re-emit unprocessed state on ring reset
  drm/amdgpu/vcn4.0.3: re-emit unprocessed state on ring reset
  drm/amdgpu/vcn4.0.5: re-emit unprocessed state on ring reset
  drm/amdgpu/vcn5: re-emit unprocessed state on ring reset
  drm/amdgpu/vcn: add a helper framework for engine resets
  drm/amdgpu/vcn2: implement ring reset
  drm/amdgpu/vcn2.5: implement ring reset
  drm/amdgpu/vcn3: implement ring reset

Christian König (1):
  drm/amdgpu: rework queue reset scheduler interaction

 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 133 ++++++++++++++++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c    |  20 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c   |  48 ++------
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c  |  59 ++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h  |  27 ++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c  |  17 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c   |  76 +++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h   |   6 +-
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c    |  42 +++----
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c    |  33 ++----
 drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c    |  33 ++----
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c     |   9 +-
 drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c   |  11 +-
 drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c    |   7 +-
 drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c    |   7 +-
 drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c    |   7 +-
 drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c    |   7 +-
 drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c  |   7 +-
 drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_5.c  |  11 ++
 drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_0.c  |  14 +++
 drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c  |   7 +-
 drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c  |  49 ++++----
 drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c    |  16 ++-
 drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c    |  16 ++-
 drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c    |  25 +++-
 drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c    |  25 +++-
 drivers/gpu/drm/amd/amdgpu/vcn_v2_0.c     |  12 ++
 drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c     |  11 ++
 drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c     |  13 +++
 drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c     |   8 +-
 drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c   |   9 +-
 drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c   |   8 +-
 drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c   |   8 +-
 33 files changed, 566 insertions(+), 215 deletions(-)

-- 
2.49.0