On Sun, Jul 6, 2025 at 11:05 AM Rodrigo Siqueira <sique...@igalia.com> wrote:
>
> On 07/01, Alex Deucher wrote:
> > This set improves per queue reset support for a number of IPs.
> > When we reset the queue, the queue is lost so we need
> > to re-emit the unprocessed state from subsequent submissions.
> > This is handled in gfx/compute queues via switch buffer and
> > pipeline sync packets. However, you can still end up with
> > parallel execution across queues. For correctness in that
> > case, enforce isolation needs to be enabled. That can
> > impact certain use cases however and in most cases, the
> > guilty job is correctly identified even without enforce isolation.
> >
> > Tested on GC 10 and 11 chips with a game running and
> > then running hang tests. The game pauses when the
>
> Hi Alex,
>
> Which hang test did you run?

The hang tests in HangTestSuite and IGT.

Alex

>
> Thanks
>
> > hang happens, then continues after the queue reset.
> >
> > The same approach is extended to SDMA and VCN.
> > They don't need enforce isolation because those engines
> > are single threaded so they always operate serially.
> >
> > Rework re-emit to signal the seq number of the bad job and
> > use that to verify that the reset worked, then re-emit the
> > rest of the non-guilty state. This way we are not waiting on
> > the rest of the state to complete, and if the subsequent state
> > also contains a bad job, we'll end up in queue reset again rather
> > than adapter reset.
> >
> > Patches apply to the amd-staging-drm-next or drm-next branches in my
> > git tree.
> >
> > Git tree:
> > https://gitlab.freedesktop.org/agd5f/linux/-/commits/kq_resets?ref_type=heads
> >
> > The IGT deadlock tests need the following fixes to properly handle -ETIME
> > fences:
> > https://patchwork.freedesktop.org/series/150724/
> >
> > v4: Drop explicit padding patches
> >     Drop new timeout macro
> >     Rework re-emit sequence
> > v5: Add a helper for reemit
> >     Convert VCN, JPEG, SDMA to use new helpers
> > v6: Update SDMA 4.4.2 to use new helpers
> >     Move ptr tracking to amdgpu_fence
> >     Skip all jobs from the bad context on the ring
> > v7: Rework the backup logic
> >     Move and clean up the guilty logic for engine resets
> >     Integrate suggestions from Christian
> >     Add JPEG 4.0.5 support
> > v8: Add non-guilty ring backup handling
> >     Clean up new function signatures
> >     Reorder some bug fixes to the start of the series
> > v9: Clean up fence_emit
> >     SDMA 5.x fixes
> >     Add new reset helpers
> >     sched wqueue stop/start cleanup
> >     Add support for VCNs without unified queues
> > v10: Drop enforce isolation default change
> >      Add more documentation
> >      Clean up ring backup logic
> > v11: SDMA6/7 fixes
> > v12: Ring backup and reemit fixes
> >      SDMA cleanups
> >      SDMA5.x reemit support
> >      GFX10 KGQ reset fix
> > v13: drop SDMA cleanups, they caused regressions in some IGT tests
> >
> > Alex Deucher (28):
> >   drm/amdgpu/sdma: consolidate engine reset handling
> >   drm/amdgpu/sdma: allow caller to handle kernel rings in engine reset
> >   drm/amdgpu: track ring state associated with a fence
> >   drm/amdgpu/gfx9: re-emit unprocessed state on kcq reset
> >   drm/amdgpu/gfx9.4.3: re-emit unprocessed state on kcq reset
> >   drm/amdgpu/gfx10: re-emit unprocessed state on ring reset
> >   drm/amdgpu/gfx11: re-emit unprocessed state on ring reset
> >   drm/amdgpu/gfx12: re-emit unprocessed state on ring reset
> >   drm/amdgpu/sdma5: re-emit unprocessed state on ring reset
> >   drm/amdgpu/sdma5.2: re-emit unprocessed state on ring reset
> >   drm/amdgpu/sdma6: re-emit unprocessed state on ring reset
> >   drm/amdgpu/sdma7: re-emit unprocessed state on ring reset
> >   drm/amdgpu/jpeg2: re-emit unprocessed state on ring reset
> >   drm/amdgpu/jpeg2.5: re-emit unprocessed state on ring reset
> >   drm/amdgpu/jpeg3: re-emit unprocessed state on ring reset
> >   drm/amdgpu/jpeg4: re-emit unprocessed state on ring reset
> >   drm/amdgpu/jpeg4.0.3: re-emit unprocessed state on ring reset
> >   drm/amdgpu/jpeg4.0.5: add queue reset
> >   drm/amdgpu/jpeg5: add queue reset
> >   drm/amdgpu/jpeg5.0.1: re-emit unprocessed state on ring reset
> >   drm/amdgpu/vcn4: re-emit unprocessed state on ring reset
> >   drm/amdgpu/vcn4.0.3: re-emit unprocessed state on ring reset
> >   drm/amdgpu/vcn4.0.5: re-emit unprocessed state on ring reset
> >   drm/amdgpu/vcn5: re-emit unprocessed state on ring reset
> >   drm/amdgpu/vcn: add a helper framework for engine resets
> >   drm/amdgpu/vcn2: implement ring reset
> >   drm/amdgpu/vcn2.5: implement ring reset
> >   drm/amdgpu/vcn3: implement ring reset
> >
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c     | 90 +++++++++++++++++++
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c        | 15 +++-
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       |  4 +-
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c      | 67 ++++++++++++++
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h      | 18 ++++
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c      | 43 +++++----
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h      |  3 +-
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c       | 76 ++++++++++++++++
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h       |  6 +-
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |  4 +
> >  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c        | 41 ++-------
> >  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c        | 35 +-------
> >  drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c        | 35 +-------
> >  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c         | 12 +--
> >  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c       | 12 +--
> >  drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c        | 11 +--
> >  drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c        | 11 +--
> >  drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c        | 11 +--
> >  drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c        | 11 +--
> >  drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c      | 11 +--
> >  drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_5.c      | 11 +++
> >  drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_0.c      | 14 +++
> >  drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c      | 11 +--
> >  drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c      | 19 +---
> >  drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c        | 23 +++--
> >  drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c        | 23 +++--
> >  drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c        | 18 ++--
> >  drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c        | 18 ++--
> >  drivers/gpu/drm/amd/amdgpu/vcn_v2_0.c         | 12 +++
> >  drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c         | 11 +++
> >  drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c         | 13 +++
> >  drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c         | 11 +--
> >  drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c       | 10 +--
> >  drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c       | 11 +--
> >  drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c       | 11 +--
> >  .../drm/amd/amdkfd/kfd_device_queue_manager.c |  2 +-
> >  36 files changed, 454 insertions(+), 280 deletions(-)
> >
> > --
> > 2.50.0
> >
>
> --
> Rodrigo Siqueira