sdma5.2: Avoid latencies caused by the powergating workaround

Alex Deucher Fri, 11 Jul 2025 09:51:59 -0700

On Fri, Jul 11, 2025 at 12:07 PM Tvrtko Ursulin
<tvrtko.ursu...@igalia.com> wrote:
>
>
> On 11/07/2025 16:51, Alex Deucher wrote:
> > On Fri, Jul 11, 2025 at 9:58 AM Tvrtko Ursulin
> > <tvrtko.ursu...@igalia.com> wrote:
> >>
> >>
> >> On 11/07/2025 14:39, Alex Deucher wrote:
> >>> On Fri, Jul 11, 2025 at 9:22 AM Tvrtko Ursulin
> >>> <tvrtko.ursu...@igalia.com> wrote:
> >>>>
> >>>>
> >>>> On 11/07/2025 13:45, Christian König wrote:
> >>>>> On 11.07.25 14:23, Tvrtko Ursulin wrote:
> >>>>>> Commit
> >>>>>> 94b1e028e15c ("drm/amdgpu/sdma5.2: add begin/end_use ring callbacks")
> >>>>>> added a workaround which disables GFXOFF for the duration of the job
> >>>>>> submit stage (with a 100ms trailing hysteresis).
> >>>>>>
> >>>>>> Empirically the GFXOFF disable/enable request can suffer from 
> >>>>>> significant
> >>>>>> latencies (2ms is easily seen) which are then inserted onto the
> >>>>>> amdgpu_job_run() path, which slows down the CPU submission of ready 
> >>>>>> jobs.
> >>>>>>
> >>>>>> 1)
> >>>>>> If the premise of the GFXOFF workaround is to keep it disabled while 
> >>>>>> the
> >>>>>> SDMA engine is active, the current workaround achieves that only
> >>>>>> partially, for submissions and jobs which take less than 100ms (the 
> >>>>>> GFXOFF
> >>>>>> re-enable hysteresis), counting from the ring write phase, up to
> >>>>>> completion.
> >>>>>>
> >>>>>> 2)
> >>>>>> If disabling GFXOFF affects the GFX engine too, basing the workaround
> >>>>>> solely on the SDMA activity creates, at minimum, a needless "chatter" 
> >>>>>> on
> >>>>>> the SMU communication channel.
> >>>>>
> >>>>> IIRC that is intentional. This "needless" chatter is what the 
> >>>>> workaround was all about.
> >>>>
> >>>> I tried to gather knowledge to how the hardware works from the comment
> >>>> in sdma_v5_2_ring_begin_use(). Maybe I got it wrong so bear with me 
> >>>> please.
> >>>>
> >>>> To try and explain my questions better. If the GFX ring/engine is busy
> >>>> is there a point for SDMA to be requesting GFXOFF enable/disable? Or
> >>>> maybe with diagrams...
> >>>>
> >>>> 1)
> >>>>
> >>>> SDMA:
> >>>>
> >>>>       ring-write     ring-commit       job-execute       job-done
> >>>>     gfxoff-off-req  gfxoff-on-req  >100ms -> gfxoff-on
> >>>>
> >>>> Was the workaround prematurely dropped in this case (aka is
> >>>> ring->funcs->end_use() the right place to drop it from)? Probably
> >>>> theoretical that a SDMA job takes more than 100ms but I am trying to
> >>>> understand it all.
> >>>>
> >>>
> >>> The firmware controls the power to a subset of the chip which contains
> >>> both gfx and sdma.  Normally the firmware dynamically powers up and
> >>> down gfx transparently when doorbells come in or the engines go idle
> >>> for either engine.  amdgpu_gfx_off_ctrl() tells the firmware to allow
> >>> or disallow gfxoff entry.  So what this workaround does is disallow
> >>> gfxoff (which results in gfx being powered up) before we touch SDMA.
> >>> Once SDMA is active, we can allow gfxoff again as it will dynamically
> >>
> >> Hmm so it is "once" and not "while", as the comment says:
> >>
> >>          /* SDMA 5.2.3 (RMB) FW doesn't seem to properly
> >>           * disallow GFXOFF in some cases leading to
> >>           * hangs in SDMA.  Disallow GFXOFF while SDMA is active.
> >>
> >> ?
> >>
> >> And for "once active" amdgpu_ring_commit() is what it counts?
> >
> > Yes, amdgpu_ring_commit() rings the doorbell and at that point the
> > engine starts running and SDMA is active.
>
> Should I submit a patch to change this comment to say "Disallow GFXOFF
> until SDMA is active (doorbell was rang)"?


Sure.

>
> >>
> >>> be disabled once GFX/SDMA is no longer active.  In this particular
> >>> case there was a race condition somewhere in the internal handshaking
> >>> with SDMA which led to SDMA missing doorbells sometimes and not
> >>> executing the job even if there was work in the ring.
> >>
> >> Thank you, more or less than what I assumed.
> >>
> >> But in this case there should be no harm in holding GFXOFF disabled
> >> until the job completes (like this patch)? Only a win to avoid the SMU
> >> communication latencies while unit is powered on anyway.
> >
> > The extra latency is only on the CPU side, once the
> > amdgpu_ring_commit() is called the SDMA engine is already working.
>
> It is on the CPU side but can create bubbles in the pipeline, no? Is
> there no scope with AMD to have GFX and SDMA jobs depend on each other?
> Because, as said, I've seen some high latencies from the GFXOFF disable
> calls.

The SDMA job is already executing at that point.  The allow gfxoff
message to the firmware shouldn't come until later because it's
handled by a delayed work thread from end_use().  If you have multiple
submissions to SDMA within the delay window, the begin_use() and
end_use() will just be ref count handling and won't actually talk to
the firmware.

>
> > Plus, the sooner you allow gfxoff again, the sooner it can start
> > kicking in again.
>
>  From this I read that even while GFX or SDMA rings are executing jobs
> GFXOFF can still power down stuff?

Only if all of the blocks are idle which would likely not be the case
if you have outstanding jobs.

Alex

>
> Regards,
>
> Tvrtko
>
> >
> >>
> >> Regards,
> >>
> >> Tvrtko
> >>
> >>>> 2)
> >>>>
> >>>>
> >>>> GFX:
> >>>>
> >>>>       +-----  job executing --------------------------------------+
> >>>>
> >>>> SDMA:
> >>>>
> >>>>       ring-write     ring-commit       job-execute       job-done
> >>>>     gfxoff-off-req  gfxoff-on-req  >100ms -> gfxoff-on
> >>>>
> >>>>
> >>>> Is it required for the SDMA activity to cause SMU message traffic in
> >>>> this case, or is the powerdomain implied to be on (GFXOFF cannot turn it
> >>>> off while GFX is active)?
> >>>>
> >>>> This is the case I measured latency spikes. While the GFX load was
> >>>> running I was seeing SDMA->run_job() spike and traced it to the SMU
> >>>> communication.
> >>>>
> >>>> Hence the idea from the patch - prevent adev->gfx.gfx_off_req_count
> >>>> dropping to zero until both GFX and SDMA are idle.
> >>>>
> >>>> https://imgshare.cc/rdxu2bjl
> >>>>
> >>>> Above is visual representation of these latencies. Y is SDMA run_job()
> >>>> duration in micro-seconds, X is seconds wall time. Blue is stock kernel,
> >>>> orange is with this patch. X goes for ~60 seconds, which is how long
> >>>> Cyberpunk 2077 benchmark is.
> >>>>
> >>>> Regards,
> >>>>
> >>>> Tvrtko
> >>>>
> >>>>>> If 1) and 2) hold true, we can improve on the workaround by; a) only
> >>>>>> re-enabling GFXOFF once the job had actually completed*, and b) apply 
> >>>>>> the
> >>>>>> same workaround on other rings which share the same GFXOFF powergating
> >>>>>> domain.
> >>>>>
> >>>>> The point of GFXOFF is to turn GFX on/off *without* kernel driver 
> >>>>> interaction. Otherwise we don't need it in the first place.
> >>>>>
> >>>>> We just have a hack for the SDMA because that moved into the GFXOFF 
> >>>>> domain with Navi and is broken on some HW generations IIRC.
> >>>>>
> >>>>>>
> >>>>>> With these two applied, the GFXOFF re-enable requests are avoided
> >>>>>> altogether during persistent activity on the GFX ring and simultaneous
> >>>>>> sporadic activity on the SDMA ring.
> >>>>>>
> >>>>>> This has a positive effect of drastically reducing SDMA submission
> >>>>>> latencies. For example during the Cyberpunk 2077 benchmark, they are
> >>>>>> reduced from an average of 64us (stdev 60) to 9us (stdev 6). Or more
> >>>>>> importantly the worst case latency, averaged to a one second window, is
> >>>>>> reduced from 305us to 30us**.
> >>>>>>
> >>>>>> *) For ease of implementation we put the re-enable at the job free 
> >>>>>> stage,
> >>>>>> since doing it on actual completion is problematic in terms of locking.
> >>>>>
> >>>>> Absolutely clear NAK to this. Never ever base anything on the job 
> >>>>> livetime!
> >>>>>
> >>>>> We already had enough trouble with that.
> >>>>>
> >>>>>>
> >>>>>> **) Submission latency ewma averaged (DECLARE_EWMA(latency, 6, 4)) -
> >>>>>> Approximately 30 SDMA submissions per second, ewma average logged once
> >>>>>> per second therefore significantly hides the worst case latency. Eg.
> >>>>>> the real improvement in max submission latency is severely understated 
> >>>>>> by
> >>>>>> these numbers.
> >>>>>
> >>>>> Well that would indeed be quite nice to have.
> >>>>>
> >>>>> Regards,
> >>>>> Christian.
> >>>>>
> >>>>>>
> >>>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursu...@igalia.com>
> >>>>>> References: 94b1e028e15c ("drm/amdgpu/sdma5.2: add begin/end_use ring 
> >>>>>> callbacks")
> >>>>>> Cc: Mario Limonciello <mario.limoncie...@amd.com>
> >>>>>> Cc: Christian König <christian.koe...@amd.com>
> >>>>>> Cc: Alex Deucher <alexander.deuc...@amd.com>
> >>>>>> ---
> >>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h  | 1 +
> >>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c   | 8 ++++++++
> >>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_job.c  | 7 +++++++
> >>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_job.h  | 2 ++
> >>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 3 +++
> >>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 1 +
> >>>>>>     drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c   | 1 +
> >>>>>>     7 files changed, 23 insertions(+)
> >>>>>>
> >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h 
> >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
> >>>>>> index 08f268dab8f5..eee40f385793 100644
> >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
> >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
> >>>>>> @@ -475,6 +475,7 @@ struct amdgpu_gfx {
> >>>>>>        uint32_t                        compute_supported_reset;
> >>>>>>
> >>>>>>        /* gfx off */
> >>>>>> +    bool                            gfx_off_held;       /* true: 
> >>>>>> rings hold gfx_off */
> >>>>>>        bool                            gfx_off_state;      /* true: 
> >>>>>> enabled, false: disabled */
> >>>>>>        struct mutex                    gfx_off_mutex;      /* mutex to 
> >>>>>> change gfxoff state */
> >>>>>>        uint32_t                        gfx_off_req_count;  /* default 
> >>>>>> 1, enable gfx off: dec 1, disable gfx off: add 1 */
> >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c 
> >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
> >>>>>> index 206b70acb29a..bf9bffe40235 100644
> >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
> >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
> >>>>>> @@ -191,6 +191,14 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, 
> >>>>>> unsigned int num_ibs,
> >>>>>>                return r;
> >>>>>>        }
> >>>>>>
> >>>>>> +    if (job && adev->gfx.gfx_off_held &&
> >>>>>> +        (ring->funcs->type == AMDGPU_RING_TYPE_GFX ||
> >>>>>> +         ring->funcs->type == AMDGPU_RING_TYPE_COMPUTE ||
> >>>>>> +         ring->funcs->type == AMDGPU_RING_TYPE_SDMA)) {
> >>>>>> +            amdgpu_gfx_off_ctrl(adev, false);
> >>>>>> +            job->gfx_off_held = true;
> >>>>>> +    }
> >>>>>> +
> >>>>>>        need_ctx_switch = ring->current_ctx != fence_ctx;
> >>>>>>        if (ring->funcs->emit_pipeline_sync && job &&
> >>>>>>            ((tmp = amdgpu_sync_get_fence(&job->explicit_sync)) ||
> >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
> >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >>>>>> index 2b58e353cca1..4cfd175ac6df 100644
> >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >>>>>> @@ -191,6 +191,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, 
> >>>>>> struct amdgpu_vm *vm,
> >>>>>>        if (!*job)
> >>>>>>                return -ENOMEM;
> >>>>>>
> >>>>>> +    (*job)->adev = adev;
> >>>>>>        (*job)->vm = vm;
> >>>>>>
> >>>>>>        amdgpu_sync_create(&(*job)->explicit_sync);
> >>>>>> @@ -268,6 +269,9 @@ static void amdgpu_job_free_cb(struct 
> >>>>>> drm_sched_job *s_job)
> >>>>>>
> >>>>>>        amdgpu_sync_free(&job->explicit_sync);
> >>>>>>
> >>>>>> +    if (job->gfx_off_held)
> >>>>>> +            amdgpu_gfx_off_ctrl(job->adev, true);
> >>>>>> +
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>>        /* only put the hw fence if has embedded fence */
> >>>>>>        if (!job->hw_fence.base.ops)
> >>>>>>                kfree(job);
> >>>>>> @@ -301,6 +305,9 @@ void amdgpu_job_free(struct amdgpu_job *job)
> >>>>>>        if (job->gang_submit != &job->base.s_fence->scheduled)
> >>>>>>                dma_fence_put(job->gang_submit);
> >>>>>>
> >>>>>> +    if (job->gfx_off_held)
> >>>>>> +            amdgpu_gfx_off_ctrl(job->adev, true);
> >>>>>> +
> >>>>>>        if (!job->hw_fence.base.ops)
> >>>>>>                kfree(job);
> >>>>>>        else
> >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h 
> >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> >>>>>> index 2f302266662b..d4ab832ac193 100644
> >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> >>>>>> @@ -46,6 +46,7 @@ enum amdgpu_ib_pool_type;
> >>>>>>
> >>>>>>     struct amdgpu_job {
> >>>>>>        struct drm_sched_job    base;
> >>>>>> +    struct amdgpu_device    *adev;
> >>>>>>        struct amdgpu_vm        *vm;
> >>>>>>        struct amdgpu_sync      explicit_sync;
> >>>>>>        struct amdgpu_fence     hw_fence;
> >>>>>> @@ -55,6 +56,7 @@ struct amdgpu_job {
> >>>>>>        bool                    vm_needs_flush;
> >>>>>>        bool                    gds_switch_needed;
> >>>>>>        bool                    spm_update_needed;
> >>>>>> +    bool                    gfx_off_held;
> >>>>>>        uint64_t                vm_pd_addr;
> >>>>>>        unsigned                vmid;
> >>>>>>        unsigned                pasid;
> >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
> >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> >>>>>> index 426834806fbf..22cac94e2f2a 100644
> >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> >>>>>> @@ -350,6 +350,9 @@ int amdgpu_ring_init(struct amdgpu_device *adev, 
> >>>>>> struct amdgpu_ring *ring,
> >>>>>>        ring->max_dw = max_dw;
> >>>>>>        ring->hw_prio = hw_prio;
> >>>>>>
> >>>>>> +    if (ring->funcs->gfx_off_held)
> >>>>>> +            adev->gfx.gfx_off_held = true;
> >>>>>> +
> >>>>>>        if (!ring->no_scheduler && ring->funcs->type < 
> >>>>>> AMDGPU_HW_IP_NUM) {
> >>>>>>                hw_ip = ring->funcs->type;
> >>>>>>                num_sched = &adev->gpu_sched[hw_ip][hw_prio].num_scheds;
> >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h 
> >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> >>>>>> index 784ba2ec354c..afaf951b0b78 100644
> >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> >>>>>> @@ -202,6 +202,7 @@ struct amdgpu_ring_funcs {
> >>>>>>        bool                    support_64bit_ptrs;
> >>>>>>        bool                    no_user_fence;
> >>>>>>        bool                    secure_submission_supported;
> >>>>>> +    bool                    gfx_off_held;
> >>>>>>        unsigned                extra_dw;
> >>>>>>
> >>>>>>        /* ring read/write ptr handling */
> >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c 
> >>>>>> b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> >>>>>> index 42a25150f83a..c88de65e82bc 100644
> >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> >>>>>> @@ -1944,6 +1944,7 @@ static const struct amdgpu_ring_funcs 
> >>>>>> sdma_v5_2_ring_funcs = {
> >>>>>>        .nop = SDMA_PKT_NOP_HEADER_OP(SDMA_OP_NOP),
> >>>>>>        .support_64bit_ptrs = true,
> >>>>>>        .secure_submission_supported = true,
> >>>>>> +    .gfx_off_held = true,
> >>>>>>        .get_rptr = sdma_v5_2_ring_get_rptr,
> >>>>>>        .get_wptr = sdma_v5_2_ring_get_wptr,
> >>>>>>        .set_wptr = sdma_v5_2_ring_set_wptr,
> >>>>>
> >>>>
> >>
>

Re: [RFC] drm/amdgpu/sdma5.2: Avoid latencies caused by the powergating workaround

Reply via email to