On Fri, Jul 11, 2025 at 12:07 PM Tvrtko Ursulin <tvrtko.ursu...@igalia.com> wrote: > > > On 11/07/2025 16:51, Alex Deucher wrote: > > On Fri, Jul 11, 2025 at 9:58 AM Tvrtko Ursulin > > <tvrtko.ursu...@igalia.com> wrote: > >> > >> > >> On 11/07/2025 14:39, Alex Deucher wrote: > >>> On Fri, Jul 11, 2025 at 9:22 AM Tvrtko Ursulin > >>> <tvrtko.ursu...@igalia.com> wrote: > >>>> > >>>> > >>>> On 11/07/2025 13:45, Christian König wrote: > >>>>> On 11.07.25 14:23, Tvrtko Ursulin wrote: > >>>>>> Commit > >>>>>> 94b1e028e15c ("drm/amdgpu/sdma5.2: add begin/end_use ring callbacks") > >>>>>> added a workaround which disables GFXOFF for the duration of the job > >>>>>> submit stage (with a 100ms trailing hysteresis). > >>>>>> > >>>>>> Empirically the GFXOFF disable/enable request can suffer from > >>>>>> significant > >>>>>> latencies (2ms is easily seen) which are then inserted onto the > >>>>>> amdgpu_job_run() path, which slows down the CPU submission of ready > >>>>>> jobs. > >>>>>> > >>>>>> 1) > >>>>>> If the premise of the GFXOFF workaround is to keep it disabled while > >>>>>> the > >>>>>> SDMA engine is active, the current workaround achieves that only > >>>>>> partially, for submissions and jobs which take less than 100ms (the > >>>>>> GFXOFF > >>>>>> re-enable hysteresis), counting from the ring write phase, up to > >>>>>> completion. > >>>>>> > >>>>>> 2) > >>>>>> If disabling GFXOFF affects the GFX engine too, basing the workaround > >>>>>> solely on the SDMA activity creates, at minimum, a needless "chatter" > >>>>>> on > >>>>>> the SMU communication channel. > >>>>> > >>>>> IIRC that is intentional. This "needless" chatter is what the > >>>>> workaround was all about. > >>>> > >>>> I tried to gather knowledge to how the hardware works from the comment > >>>> in sdma_v5_2_ring_begin_use(). Maybe I got it wrong so bear with me > >>>> please. > >>>> > >>>> To try and explain my questions better. If the GFX ring/engine is busy > >>>> is there a point for SDMA to be requesting GFXOFF enable/disable? Or > >>>> maybe with diagrams... > >>>> > >>>> 1) > >>>> > >>>> SDMA: > >>>> > >>>> ring-write ring-commit job-execute job-done > >>>> gfxoff-off-req gfxoff-on-req >100ms -> gfxoff-on > >>>> > >>>> Was the workaround prematurely dropped in this case (aka is > >>>> ring->funcs->end_use() the right place to drop it from)? Probably > >>>> theoretical that a SDMA job takes more than 100ms but I am trying to > >>>> understand it all. > >>>> > >>> > >>> The firmware controls the power to a subset of the chip which contains > >>> both gfx and sdma. Normally the firmware dynamically powers up and > >>> down gfx transparently when doorbells come in or the engines go idle > >>> for either engine. amdgpu_gfx_off_ctrl() tells the firmware to allow > >>> or disallow gfxoff entry. So what this workaround does is disallow > >>> gfxoff (which results in gfx being powered up) before we touch SDMA. > >>> Once SDMA is active, we can allow gfxoff again as it will dynamically > >> > >> Hmm so it is "once" and not "while", as the comment says: > >> > >> /* SDMA 5.2.3 (RMB) FW doesn't seem to properly > >> * disallow GFXOFF in some cases leading to > >> * hangs in SDMA. Disallow GFXOFF while SDMA is active. > >> > >> ? > >> > >> And for "once active" amdgpu_ring_commit() is what it counts? > > > > Yes, amdgpu_ring_commit() rings the doorbell and at that point the > > engine starts running and SDMA is active. > > Should I submit a patch to change this comment to say "Disallow GFXOFF > until SDMA is active (doorbell was rang)"?
Sure. > > >> > >>> be disabled once GFX/SDMA is no longer active. In this particular > >>> case there was a race condition somewhere in the internal handshaking > >>> with SDMA which led to SDMA missing doorbells sometimes and not > >>> executing the job even if there was work in the ring. > >> > >> Thank you, more or less than what I assumed. > >> > >> But in this case there should be no harm in holding GFXOFF disabled > >> until the job completes (like this patch)? Only a win to avoid the SMU > >> communication latencies while unit is powered on anyway. > > > > The extra latency is only on the CPU side, once the > > amdgpu_ring_commit() is called the SDMA engine is already working. > > It is on the CPU side but can create bubbles in the pipeline, no? Is > there no scope with AMD to have GFX and SDMA jobs depend on each other? > Because, as said, I've seen some high latencies from the GFXOFF disable > calls. The SDMA job is already executing at that point. The allow gfxoff message to the firmware shouldn't come until later because it's handled by a delayed work thread from end_use(). If you have multiple submissions to SDMA within the delay window, the begin_use() and end_use() will just be ref count handling and won't actually talk to the firmware. > > > Plus, the sooner you allow gfxoff again, the sooner it can start > > kicking in again. > > From this I read that even while GFX or SDMA rings are executing jobs > GFXOFF can still power down stuff? Only if all of the blocks are idle which would likely not be the case if you have outstanding jobs. Alex > > Regards, > > Tvrtko > > > > >> > >> Regards, > >> > >> Tvrtko > >> > >>>> 2) > >>>> > >>>> > >>>> GFX: > >>>> > >>>> +----- job executing --------------------------------------+ > >>>> > >>>> SDMA: > >>>> > >>>> ring-write ring-commit job-execute job-done > >>>> gfxoff-off-req gfxoff-on-req >100ms -> gfxoff-on > >>>> > >>>> > >>>> Is it required for the SDMA activity to cause SMU message traffic in > >>>> this case, or is the powerdomain implied to be on (GFXOFF cannot turn it > >>>> off while GFX is active)? > >>>> > >>>> This is the case I measured latency spikes. While the GFX load was > >>>> running I was seeing SDMA->run_job() spike and traced it to the SMU > >>>> communication. > >>>> > >>>> Hence the idea from the patch - prevent adev->gfx.gfx_off_req_count > >>>> dropping to zero until both GFX and SDMA are idle. > >>>> > >>>> https://imgshare.cc/rdxu2bjl > >>>> > >>>> Above is visual representation of these latencies. Y is SDMA run_job() > >>>> duration in micro-seconds, X is seconds wall time. Blue is stock kernel, > >>>> orange is with this patch. X goes for ~60 seconds, which is how long > >>>> Cyberpunk 2077 benchmark is. > >>>> > >>>> Regards, > >>>> > >>>> Tvrtko > >>>> > >>>>>> If 1) and 2) hold true, we can improve on the workaround by; a) only > >>>>>> re-enabling GFXOFF once the job had actually completed*, and b) apply > >>>>>> the > >>>>>> same workaround on other rings which share the same GFXOFF powergating > >>>>>> domain. > >>>>> > >>>>> The point of GFXOFF is to turn GFX on/off *without* kernel driver > >>>>> interaction. Otherwise we don't need it in the first place. > >>>>> > >>>>> We just have a hack for the SDMA because that moved into the GFXOFF > >>>>> domain with Navi and is broken on some HW generations IIRC. > >>>>> > >>>>>> > >>>>>> With these two applied, the GFXOFF re-enable requests are avoided > >>>>>> altogether during persistent activity on the GFX ring and simultaneous > >>>>>> sporadic activity on the SDMA ring. > >>>>>> > >>>>>> This has a positive effect of drastically reducing SDMA submission > >>>>>> latencies. For example during the Cyberpunk 2077 benchmark, they are > >>>>>> reduced from an average of 64us (stdev 60) to 9us (stdev 6). Or more > >>>>>> importantly the worst case latency, averaged to a one second window, is > >>>>>> reduced from 305us to 30us**. > >>>>>> > >>>>>> *) For ease of implementation we put the re-enable at the job free > >>>>>> stage, > >>>>>> since doing it on actual completion is problematic in terms of locking. > >>>>> > >>>>> Absolutely clear NAK to this. Never ever base anything on the job > >>>>> livetime! > >>>>> > >>>>> We already had enough trouble with that. > >>>>> > >>>>>> > >>>>>> **) Submission latency ewma averaged (DECLARE_EWMA(latency, 6, 4)) - > >>>>>> Approximately 30 SDMA submissions per second, ewma average logged once > >>>>>> per second therefore significantly hides the worst case latency. Eg. > >>>>>> the real improvement in max submission latency is severely understated > >>>>>> by > >>>>>> these numbers. > >>>>> > >>>>> Well that would indeed be quite nice to have. > >>>>> > >>>>> Regards, > >>>>> Christian. > >>>>> > >>>>>> > >>>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursu...@igalia.com> > >>>>>> References: 94b1e028e15c ("drm/amdgpu/sdma5.2: add begin/end_use ring > >>>>>> callbacks") > >>>>>> Cc: Mario Limonciello <mario.limoncie...@amd.com> > >>>>>> Cc: Christian König <christian.koe...@amd.com> > >>>>>> Cc: Alex Deucher <alexander.deuc...@amd.com> > >>>>>> --- > >>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h | 1 + > >>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c | 8 ++++++++ > >>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 7 +++++++ > >>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.h | 2 ++ > >>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 3 +++ > >>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 1 + > >>>>>> drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 1 + > >>>>>> 7 files changed, 23 insertions(+) > >>>>>> > >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h > >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h > >>>>>> index 08f268dab8f5..eee40f385793 100644 > >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h > >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h > >>>>>> @@ -475,6 +475,7 @@ struct amdgpu_gfx { > >>>>>> uint32_t compute_supported_reset; > >>>>>> > >>>>>> /* gfx off */ > >>>>>> + bool gfx_off_held; /* true: > >>>>>> rings hold gfx_off */ > >>>>>> bool gfx_off_state; /* true: > >>>>>> enabled, false: disabled */ > >>>>>> struct mutex gfx_off_mutex; /* mutex to > >>>>>> change gfxoff state */ > >>>>>> uint32_t gfx_off_req_count; /* default > >>>>>> 1, enable gfx off: dec 1, disable gfx off: add 1 */ > >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c > >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c > >>>>>> index 206b70acb29a..bf9bffe40235 100644 > >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c > >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c > >>>>>> @@ -191,6 +191,14 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, > >>>>>> unsigned int num_ibs, > >>>>>> return r; > >>>>>> } > >>>>>> > >>>>>> + if (job && adev->gfx.gfx_off_held && > >>>>>> + (ring->funcs->type == AMDGPU_RING_TYPE_GFX || > >>>>>> + ring->funcs->type == AMDGPU_RING_TYPE_COMPUTE || > >>>>>> + ring->funcs->type == AMDGPU_RING_TYPE_SDMA)) { > >>>>>> + amdgpu_gfx_off_ctrl(adev, false); > >>>>>> + job->gfx_off_held = true; > >>>>>> + } > >>>>>> + > >>>>>> need_ctx_switch = ring->current_ctx != fence_ctx; > >>>>>> if (ring->funcs->emit_pipeline_sync && job && > >>>>>> ((tmp = amdgpu_sync_get_fence(&job->explicit_sync)) || > >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > >>>>>> index 2b58e353cca1..4cfd175ac6df 100644 > >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > >>>>>> @@ -191,6 +191,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, > >>>>>> struct amdgpu_vm *vm, > >>>>>> if (!*job) > >>>>>> return -ENOMEM; > >>>>>> > >>>>>> + (*job)->adev = adev; > >>>>>> (*job)->vm = vm; > >>>>>> > >>>>>> amdgpu_sync_create(&(*job)->explicit_sync); > >>>>>> @@ -268,6 +269,9 @@ static void amdgpu_job_free_cb(struct > >>>>>> drm_sched_job *s_job) > >>>>>> > >>>>>> amdgpu_sync_free(&job->explicit_sync); > >>>>>> > >>>>>> + if (job->gfx_off_held) > >>>>>> + amdgpu_gfx_off_ctrl(job->adev, true); > >>>>>> + > >>>>> > >>>>> > >>>>> > >>>>> > >>>>>> /* only put the hw fence if has embedded fence */ > >>>>>> if (!job->hw_fence.base.ops) > >>>>>> kfree(job); > >>>>>> @@ -301,6 +305,9 @@ void amdgpu_job_free(struct amdgpu_job *job) > >>>>>> if (job->gang_submit != &job->base.s_fence->scheduled) > >>>>>> dma_fence_put(job->gang_submit); > >>>>>> > >>>>>> + if (job->gfx_off_held) > >>>>>> + amdgpu_gfx_off_ctrl(job->adev, true); > >>>>>> + > >>>>>> if (!job->hw_fence.base.ops) > >>>>>> kfree(job); > >>>>>> else > >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h > >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h > >>>>>> index 2f302266662b..d4ab832ac193 100644 > >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h > >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h > >>>>>> @@ -46,6 +46,7 @@ enum amdgpu_ib_pool_type; > >>>>>> > >>>>>> struct amdgpu_job { > >>>>>> struct drm_sched_job base; > >>>>>> + struct amdgpu_device *adev; > >>>>>> struct amdgpu_vm *vm; > >>>>>> struct amdgpu_sync explicit_sync; > >>>>>> struct amdgpu_fence hw_fence; > >>>>>> @@ -55,6 +56,7 @@ struct amdgpu_job { > >>>>>> bool vm_needs_flush; > >>>>>> bool gds_switch_needed; > >>>>>> bool spm_update_needed; > >>>>>> + bool gfx_off_held; > >>>>>> uint64_t vm_pd_addr; > >>>>>> unsigned vmid; > >>>>>> unsigned pasid; > >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c > >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c > >>>>>> index 426834806fbf..22cac94e2f2a 100644 > >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c > >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c > >>>>>> @@ -350,6 +350,9 @@ int amdgpu_ring_init(struct amdgpu_device *adev, > >>>>>> struct amdgpu_ring *ring, > >>>>>> ring->max_dw = max_dw; > >>>>>> ring->hw_prio = hw_prio; > >>>>>> > >>>>>> + if (ring->funcs->gfx_off_held) > >>>>>> + adev->gfx.gfx_off_held = true; > >>>>>> + > >>>>>> if (!ring->no_scheduler && ring->funcs->type < > >>>>>> AMDGPU_HW_IP_NUM) { > >>>>>> hw_ip = ring->funcs->type; > >>>>>> num_sched = &adev->gpu_sched[hw_ip][hw_prio].num_scheds; > >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h > >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h > >>>>>> index 784ba2ec354c..afaf951b0b78 100644 > >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h > >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h > >>>>>> @@ -202,6 +202,7 @@ struct amdgpu_ring_funcs { > >>>>>> bool support_64bit_ptrs; > >>>>>> bool no_user_fence; > >>>>>> bool secure_submission_supported; > >>>>>> + bool gfx_off_held; > >>>>>> unsigned extra_dw; > >>>>>> > >>>>>> /* ring read/write ptr handling */ > >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c > >>>>>> b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c > >>>>>> index 42a25150f83a..c88de65e82bc 100644 > >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c > >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c > >>>>>> @@ -1944,6 +1944,7 @@ static const struct amdgpu_ring_funcs > >>>>>> sdma_v5_2_ring_funcs = { > >>>>>> .nop = SDMA_PKT_NOP_HEADER_OP(SDMA_OP_NOP), > >>>>>> .support_64bit_ptrs = true, > >>>>>> .secure_submission_supported = true, > >>>>>> + .gfx_off_held = true, > >>>>>> .get_rptr = sdma_v5_2_ring_get_rptr, > >>>>>> .get_wptr = sdma_v5_2_ring_get_wptr, > >>>>>> .set_wptr = sdma_v5_2_ring_set_wptr, > >>>>> > >>>> > >> >