On Tue, Jan 13, 2026 at 9:48 AM Christian König <[email protected]> wrote:
>
> On 1/13/26 15:10, Alex Deucher wrote:
> > On Tue, Jan 13, 2026 at 8:57 AM Christian König
> > <[email protected]> wrote:
> >>
> >> Patches #1-#3: Reviewed-by: Christian König <[email protected]>
> >>
> >> Comment on patch #4 which also affects patches #5-#26.
> >>
> >> Comment on patch #27 and #28. When #28 comes before #27 then that would
> >> potentially solve the issue with #27.
> >>
> >> Patches #31: Reviewed-by: Christian König <[email protected]>
> >>
> >> Patches #32-#40 that looks extremely questionable to me. I've
> >> intentionally removed that state from the job because it isn't job
> >> dependent and sometimes has inter-job meaning.
> >>
> >> Patch #41: Absolutely clear NAK! We have exercised that nonsense to the
> >> max and I'm clearly against doing that over and over again. Saving the
> >> ring content clearly seems to be the saver approach.
> >>
> >
> > I disagree. If the ring emit functions are purely just emitting
> > packets to the ring, it's a much cleaner approach than trying to save
> > and restore packet sequences repeatedly.
>
> Exactly that's the problem, this is not what they do.
>
> See gfx_v11_0_ring_emit_gfx_shadow() for an example:
>
> ...
>         /*
>          * We start with skipping the prefix SET_Q_MODE and always executing
>          * the postfix SET_Q_MODE packet. This is changed below with a
>          * WRITE_DATA command when the postfix executed.
>          */
>         amdgpu_ring_write(ring, shadow_va ? 1 : 0);
>         amdgpu_ring_write(ring, 0);
>
>         if (ring->set_q_mode_offs) {
>                 uint64_t addr;
>
>                 addr = amdgpu_bo_gpu_offset(ring->ring_obj);
>                 addr += ring->set_q_mode_offs << 2;
>                 end = gfx_v11_0_ring_emit_init_cond_exec(ring, addr);
>         }
> ...
>         if (shadow_va) {
>                 uint64_t token = shadow_va ^ csa_va ^ gds_va ^ vmid;
>
>                 /*
>                  * If the tokens match try to skip the last postfix SET_Q_MODE
>                  * packet to avoid saving/restoring the state all the time.
>                  */
>                 if (ring->set_q_mode_ptr && ring->set_q_mode_token == token)
>                         *ring->set_q_mode_ptr = 0;
>
>                 ring->set_q_mode_token = token;
>         } else {
>                 ring->set_q_mode_ptr = &ring->ring[ring->set_q_mode_offs];
>         }
>
>         ring->set_q_mode_offs = offs;
> }
>
> Executing this multiple times is simply not possible without saving
> set_q_mode_offs, the token and the CPU pointer (and restoring the CPU pointer
> content).
>
> And that is just the tip of the iceberg, we have tons of state like this.

There is not much more than that.  I looked when I wrote these
patches.  Even this state should be handled correctly: it is saved in
the job at the original submission time and is explicitly passed to
the ring emit functions, so the original state is reproduced.
Specifically, ring->set_q_mode_offs and ring->set_q_mode_ptr get reset
in gfx_v11_0_ring_emit_vm_flush().  They then get set as appropriate,
based on the saved state in the job, in
gfx_v11_0_ring_emit_gfx_shadow().  The result is that the same ring
state is emitted again.

>
> > If the relevant state is
> > stored in the job, you can re-emit it and get the same ring state each
> > time.
>
> No, you can't. Background is that the relevant state is not job dependent,
> but inter job dependent.
>
> In other words it doesn't depend on what job is executing now but rather
> which one was executed right before that one.
>
> Or even worse in the case of the set_q_mode packet on the job dependent after
> the one you want to execute.
>
> I can absolutely not see how stuff like that should work with re-submission.

All you need to do is save the state that was used to emit the packets
in the original submission.

>
> > If you end up with multiple queue resets in a row, it gets
> > really complex to try and save and restore opaque ring contents. By
> > the time you fix up the state tracking to handle that, you end up
> > pretty close to this solution.
>
> Not even remotely, you have tons of state we would need to save and restore
> and a lot of that is outside of the job.
>
> Updating a few fence pointers on re-submission is absolutely trivial compared
> to that.

It's not that easy.  If you want to emit just the fences for bad
contexts rather than the whole IB stream, you can also potentially
mess up the ring state.  You'd end up needing a pile of pointers that
have to be recalculated on every reset to try and re-emit the
appropriate state again.
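
To make the argument above concrete, here is a rough stand-alone toy
sketch.  It is not the amdgpu code: toy_ring, toy_job and every
function below are hypothetical names made up for illustration, and it
assumes that every decision affecting the packets (including the
inter-job ones such as whether to skip the postfix packet) was recorded
in the job at original submission time.  Under that assumption,
emission only reads the job, so re-emitting after one or several
resets yields the same ring contents.  State that really lives outside
the job is not modelled here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_RING_DWORDS 64

/* Hypothetical toy ring: a dword buffer plus a write pointer. */
struct toy_ring {
	uint32_t buf[TOY_RING_DWORDS];
	unsigned int wptr;
};

/* Hypothetical per-job state, captured once at original submission. */
struct toy_job {
	uint64_t shadow_va;
	uint32_t vmid;
	int need_ctx_switch;
	int skip_postfix;	/* decided at submit time, not at emit time */
	uint32_t fence_seq;
};

static void toy_ring_write(struct toy_ring *ring, uint32_t v)
{
	assert(ring->wptr < TOY_RING_DWORDS);
	ring->buf[ring->wptr++] = v;
}

/* Emission reads only the job; no hidden state is kept on the ring. */
static void toy_emit_job(struct toy_ring *ring, const struct toy_job *job)
{
	toy_ring_write(ring, job->need_ctx_switch ? 1 : 0);
	toy_ring_write(ring, job->vmid);
	toy_ring_write(ring, (uint32_t)(job->shadow_va & 0xffffffffu));
	toy_ring_write(ring, (uint32_t)(job->shadow_va >> 32));
	toy_ring_write(ring, job->skip_postfix ? 0 : 1);
	toy_ring_write(ring, job->fence_seq);
}

/* Model a ring reset: clear the ring, then re-emit the surviving jobs. */
static void toy_reemit(struct toy_ring *ring, const struct toy_job *jobs,
		       size_t n)
{
	size_t i;

	memset(ring, 0, sizeof(*ring));
	for (i = 0; i < n; i++)
		toy_emit_job(ring, &jobs[i]);
}

/* Returns 1 if one reset and two resets in a row give identical rings. */
static int toy_reemit_is_stable(void)
{
	const struct toy_job jobs[2] = {
		{ .shadow_va = 0x100002000ULL, .vmid = 3,
		  .need_ctx_switch = 1, .skip_postfix = 0, .fence_seq = 10 },
		{ .shadow_va = 0x100002000ULL, .vmid = 3,
		  .need_ctx_switch = 0, .skip_postfix = 1, .fence_seq = 11 },
	};
	struct toy_ring first, second;

	toy_reemit(&first, jobs, 2);
	toy_reemit(&second, jobs, 2);
	toy_reemit(&second, jobs, 2);	/* second reset in a row */

	return first.wptr == second.wptr &&
	       memcmp(first.buf, second.buf, sizeof(first.buf)) == 0;
}
```

Calling toy_reemit_is_stable() compares the ring after a single
re-emission with the ring after two back-to-back re-emissions.  The
point of contention in this thread is whether all of the relevant
state can really be captured in the job like this.
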
This approach also paves the way for re-emitting state for all queues
after adapter reset when VRAM is not lost.

Alex

>
> Regards,
> Christian.
>
> >
> > Alex
> >
> >> Regards,
> >> Christian.
> >>
> >> On 1/8/26 15:48, Alex Deucher wrote:
> >>> This set contains a number of bug fixes and cleanups for
> >>> IB handling that I worked on over the holidays.
> >>>
> >>> Patches 1-2:
> >>> Simple bug fixes.
> >>>
> >>> Patches 3-26:
> >>> Removes the direct submit path for IBs and requires
> >>> that all IB submissions use a job structure. This
> >>> greatly simplifies the IB submission code.
> >>>
> >>> Patches 27-42:
> >>> Split IB state setup and ring emission. This keeps all
> >>> of the IB state in the job. This greatly simplifies
> >>> re-emission of non-timed-out jobs after a ring reset and
> >>> allows for re-emission multiple times if multiple resets
> >>> happen in a row. It also properly handles the dma fence
> >>> error handling for timedout jobs with adapter resets.
> >>>
> >>> Alex Deucher (42):
> >>>   drm/amdgpu/jpeg4.0.3: remove redundant sr-iov check
> >>>   drm/amdgpu: fix error handling in ib_schedule()
> >>>   drm/amdgpu: add new job ids
> >>>   drm/amdgpu/vpe: switch to using job for IBs
> >>>   drm/amdgpu/gfx6: switch to using job for IBs
> >>>   drm/amdgpu/gfx7: switch to using job for IBs
> >>>   drm/amdgpu/gfx8: switch to using job for IBs
> >>>   drm/amdgpu/gfx9: switch to using job for IBs
> >>>   drm/amdgpu/gfx9.4.2: switch to using job for IBs
> >>>   drm/amdgpu/gfx9.4.3: switch to using job for IBs
> >>>   drm/amdgpu/gfx10: switch to using job for IBs
> >>>   drm/amdgpu/gfx11: switch to using job for IBs
> >>>   drm/amdgpu/gfx12: switch to using job for IBs
> >>>   drm/amdgpu/gfx12.1: switch to using job for IBs
> >>>   drm/amdgpu/si_dma: switch to using job for IBs
> >>>   drm/amdgpu/cik_sdma: switch to using job for IBs
> >>>   drm/amdgpu/sdma2.4: switch to using job for IBs
> >>>   drm/amdgpu/sdma3: switch to using job for IBs
> >>>   drm/amdgpu/sdma4: switch to using job for IBs
> >>>   drm/amdgpu/sdma4.4.2: switch to using job for IBs
> >>>   drm/amdgpu/sdma5: switch to using job for IBs
> >>>   drm/amdgpu/sdma5.2: switch to using job for IBs
> >>>   drm/amdgpu/sdma6: switch to using job for IBs
> >>>   drm/amdgpu/sdma7: switch to using job for IBs
> >>>   drm/amdgpu/sdma7.1: switch to using job for IBs
> >>>   drm/amdgpu: require a job to schedule an IB
> >>>   drm/amdgpu: mark fences with errors before ring reset
> >>>   drm/amdgpu: rename amdgpu_fence_driver_guilty_force_completion()
> >>>   drm/amdgpu: don't call drm_sched_stop/start() in asic reset
> >>>   drm/amdgpu: drop drm_sched_increase_karma()
> >>>   drm/amdgpu: plumb timedout fence through to force completion
> >>>   drm/amdgpu: change function signature for emit_pipeline_sync()
> >>>   drm/amdgpu: drop extra parameter for vm_flush
> >>>   drm/amdgpu: move need_ctx_switch into amdgpu_job
> >>>   drm/amdgpu: store vm flush state in amdgpu_job
> >>>   drm/amdgpu: split fence init and emit logic
> >>>   drm/amdgpu: split vm flush and vm flush emit logic
> >>>   drm/amdgpu: split ib schedule and ib emit logic
> >>>   drm/amdgpu: move drm sched stop/start into amdgpu_job_timedout()
> >>>   drm/amdgpu: add an all_instance_rings_reset ring flag
> >>>   drm/amdgpu: rework reset reemit handling
> >>>   drm/amdgpu: simplify per queue reset code
> >>>
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c  |   2 +-
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |   2 +-
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  13 +-
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c   | 136 +++------
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c      | 289 ++++++++++----------
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c     |  40 ++-
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.h     |  13 +
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c    |  67 -----
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h    |  37 +--
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c    |   4 +-
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c     |   2 +-
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c     |  21 +-
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c      | 141 +++++-----
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h      |   3 +-
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c     |  45 +--
> >>>  drivers/gpu/drm/amd/amdgpu/cik_sdma.c       |  36 ++-
> >>>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c      |  41 ++-
> >>>  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c      |  41 ++-
> >>>  drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c      |  41 ++-
> >>>  drivers/gpu/drm/amd/amdgpu/gfx_v12_1.c      |  33 ++-
> >>>  drivers/gpu/drm/amd/amdgpu/gfx_v6_0.c       |  28 +-
> >>>  drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c       |  30 +-
> >>>  drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c       | 143 +++++-----
> >>>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c       | 149 +++++-----
> >>>  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c     |  26 +-
> >>>  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c     |  38 +--
> >>>  drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c      |   3 +-
> >>>  drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c      |   3 +-
> >>>  drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c      |   3 +-
> >>>  drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c      |   3 +-
> >>>  drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c    |   6 +-
> >>>  drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_5.c    |   3 +-
> >>>  drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_0.c    |   3 +-
> >>>  drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c    |   3 +-
> >>>  drivers/gpu/drm/amd/amdgpu/jpeg_v5_3_0.c    |   3 +-
> >>>  drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c      |  43 +--
> >>>  drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c      |  43 +--
> >>>  drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c      |  43 +--
> >>>  drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c    |  45 +--
> >>>  drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c      |  46 ++--
> >>>  drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c      |  45 +--
> >>>  drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c      |  45 +--
> >>>  drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c      |  45 +--
> >>>  drivers/gpu/drm/amd/amdgpu/sdma_v7_1.c      |  45 +--
> >>>  drivers/gpu/drm/amd/amdgpu/si_dma.c         |  34 ++-
> >>>  drivers/gpu/drm/amd/amdgpu/uvd_v6_0.c       |   8 +-
> >>>  drivers/gpu/drm/amd/amdgpu/vce_v3_0.c       |   4 +-
> >>>  drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c       |   2 +
> >>>  drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c       |   2 +
> >>>  drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c       |   3 +-
> >>>  drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c     |   4 +-
> >>>  drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c     |   3 +-
> >>>  drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c     |   3 +-
> >>>  drivers/gpu/drm/amd/amdgpu/vcn_v5_0_1.c     |   4 +-
> >>>  54 files changed, 952 insertions(+), 966 deletions(-)
> >>>
> >>
>
