sdma5.2: Avoid latencies caused by the powergating workaround

Tvrtko Ursulin Wed, 16 Jul 2025 05:51:17 -0700


On 11/07/2025 17:51, Alex Deucher wrote:

On Fri, Jul 11, 2025 at 12:07 PM Tvrtko Ursulin
<tvrtko.ursu...@igalia.com> wrote:



On 11/07/2025 16:51, Alex Deucher wrote:

On Fri, Jul 11, 2025 at 9:58 AM Tvrtko Ursulin
<tvrtko.ursu...@igalia.com> wrote:



On 11/07/2025 14:39, Alex Deucher wrote:

On Fri, Jul 11, 2025 at 9:22 AM Tvrtko Ursulin
<tvrtko.ursu...@igalia.com> wrote:



On 11/07/2025 13:45, Christian König wrote:

On 11.07.25 14:23, Tvrtko Ursulin wrote:

Commit
94b1e028e15c ("drm/amdgpu/sdma5.2: add begin/end_use ring callbacks")
added a workaround which disables GFXOFF for the duration of the job
submit stage (with a 100ms trailing hysteresis).

Empirically the GFXOFF disable/enable request can suffer from significant
latencies (2ms is easily seen) which are then inserted onto the
amdgpu_job_run() path, which slows down the CPU submission of ready jobs.

1)
If the premise of the GFXOFF workaround is to keep it disabled while the
SDMA engine is active, the current workaround achieves that only
partially, for submissions and jobs which take less than 100ms (the GFXOFF
re-enable hysteresis), counting from the ring write phase, up to
completion.

2)
If disabling GFXOFF affects the GFX engine too, basing the workaround
solely on the SDMA activity creates, at minimum, a needless "chatter" on
the SMU communication channel.


IIRC that is intentional. This "needless" chatter is what the workaround was 
all about.


I tried to gather knowledge to how the hardware works from the comment
in sdma_v5_2_ring_begin_use(). Maybe I got it wrong so bear with me please.

To try and explain my questions better. If the GFX ring/engine is busy
is there a point for SDMA to be requesting GFXOFF enable/disable? Or
maybe with diagrams...

1)

SDMA:

       ring-write     ring-commit       job-execute       job-done
     gfxoff-off-req  gfxoff-on-req  >100ms -> gfxoff-on

Was the workaround prematurely dropped in this case (aka is
ring->funcs->end_use() the right place to drop it from)? Probably
theoretical that a SDMA job takes more than 100ms but I am trying to
understand it all.


The firmware controls the power to a subset of the chip which contains
both gfx and sdma.  Normally the firmware dynamically powers up and
down gfx transparently when doorbells come in or the engines go idle
for either engine.  amdgpu_gfx_off_ctrl() tells the firmware to allow
or disallow gfxoff entry.  So what this workaround does is disallow
gfxoff (which results in gfx being powered up) before we touch SDMA.
Once SDMA is active, we can allow gfxoff again as it will dynamically


Hmm so it is "once" and not "while", as the comment says:

          /* SDMA 5.2.3 (RMB) FW doesn't seem to properly
           * disallow GFXOFF in some cases leading to
           * hangs in SDMA.  Disallow GFXOFF while SDMA is active.

?

And for "once active" amdgpu_ring_commit() is what it counts?


Yes, amdgpu_ring_commit() rings the doorbell and at that point the
engine starts running and SDMA is active.


Should I submit a patch to change this comment to say "Disallow GFXOFF
until SDMA is active (doorbell was rang)"?


Sure.


Okay, will do.

be disabled once GFX/SDMA is no longer active.  In this particular
case there was a race condition somewhere in the internal handshaking
with SDMA which led to SDMA missing doorbells sometimes and not
executing the job even if there was work in the ring.


Thank you, more or less than what I assumed.

But in this case there should be no harm in holding GFXOFF disabled
until the job completes (like this patch)? Only a win to avoid the SMU
communication latencies while unit is powered on anyway.


The extra latency is only on the CPU side, once the
amdgpu_ring_commit() is called the SDMA engine is already working.


It is on the CPU side but can create bubbles in the pipeline, no? Is
there no scope with AMD to have GFX and SDMA jobs depend on each other?
Because, as said, I've seen some high latencies from the GFXOFF disable
calls.


The SDMA job is already executing at that point.  The allow gfxoff
message to the firmware shouldn't come until later because it's
handled by a delayed work thread from end_use().  If you have multiple
submissions to SDMA within the delay window, the begin_use() and
end_use() will just be ref count handling and won't actually talk to
the firmware.

I followed up with testing a bunch more games, and is it turns out,Cyberpunk 2077 is the only one which has this submission patterns wheredefault GFX_OFF_DELAY_ENABLE is regularly defeated.

There, around 1.2 times per second the SDMA submissions miss that 100mshysteresis and cause a CPU latency over 100us (I only measured when>100us and ignored the rest). Average latency is ~400us and max is~2ms. So IMHO quite bad.

And the vast majority of those latencies come from the SMU request. Onlyvery rarely someone hits the mutex contention path.

So that was the motivation for the RFC. I suppose I could have alsoproposed to increase the hysteresis, but holding the GFXOFF disabled forthe duration of the job sounded preferable for power consmuption.

Anyway, given I only found Cyberpunk 2077 suffers from this I guess itmaybe isn't to interesting to upstream for you guys. Then again it islimited to specific old SKU so maybe it should not be that controversialeither? Only that Christian NAKed tying it to job lifetime. So I don'tknow, AMDs call.


Regards,

Tvrtko

Plus, the sooner you allow gfxoff again, the sooner it can start
kicking in again.


  From this I read that even while GFX or SDMA rings are executing jobs
GFXOFF can still power down stuff?


Only if all of the blocks are idle which would likely not be the case
if you have outstanding jobs.

Alex


Regards,

Tvrtko


Regards,

Tvrtko

2)


GFX:

       +-----  job executing --------------------------------------+

SDMA:

       ring-write     ring-commit       job-execute       job-done
     gfxoff-off-req  gfxoff-on-req  >100ms -> gfxoff-on


Is it required for the SDMA activity to cause SMU message traffic in
this case, or is the powerdomain implied to be on (GFXOFF cannot turn it
off while GFX is active)?

This is the case I measured latency spikes. While the GFX load was
running I was seeing SDMA->run_job() spike and traced it to the SMU
communication.

Hence the idea from the patch - prevent adev->gfx.gfx_off_req_count
dropping to zero until both GFX and SDMA are idle.

https://imgshare.cc/rdxu2bjl

Above is visual representation of these latencies. Y is SDMA run_job()
duration in micro-seconds, X is seconds wall time. Blue is stock kernel,
orange is with this patch. X goes for ~60 seconds, which is how long
Cyberpunk 2077 benchmark is.

Regards,

Tvrtko

If 1) and 2) hold true, we can improve on the workaround by; a) only
re-enabling GFXOFF once the job had actually completed*, and b) apply the
same workaround on other rings which share the same GFXOFF powergating
domain.


The point of GFXOFF is to turn GFX on/off *without* kernel driver interaction. 
Otherwise we don't need it in the first place.

We just have a hack for the SDMA because that moved into the GFXOFF domain with 
Navi and is broken on some HW generations IIRC.


With these two applied, the GFXOFF re-enable requests are avoided
altogether during persistent activity on the GFX ring and simultaneous
sporadic activity on the SDMA ring.

This has a positive effect of drastically reducing SDMA submission
latencies. For example during the Cyberpunk 2077 benchmark, they are
reduced from an average of 64us (stdev 60) to 9us (stdev 6). Or more
importantly the worst case latency, averaged to a one second window, is
reduced from 305us to 30us**.

*) For ease of implementation we put the re-enable at the job free stage,
since doing it on actual completion is problematic in terms of locking.


Absolutely clear NAK to this. Never ever base anything on the job livetime!

We already had enough trouble with that.


**) Submission latency ewma averaged (DECLARE_EWMA(latency, 6, 4)) -
Approximately 30 SDMA submissions per second, ewma average logged once
per second therefore significantly hides the worst case latency. Eg.
the real improvement in max submission latency is severely understated by
these numbers.


Well that would indeed be quite nice to have.

Regards,
Christian.


Signed-off-by: Tvrtko Ursulin <tvrtko.ursu...@igalia.com>
References: 94b1e028e15c ("drm/amdgpu/sdma5.2: add begin/end_use ring 
callbacks")
Cc: Mario Limonciello <mario.limoncie...@amd.com>
Cc: Christian König <christian.koe...@amd.com>
Cc: Alex Deucher <alexander.deuc...@amd.com>
---
     drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h  | 1 +
     drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c   | 8 ++++++++
     drivers/gpu/drm/amd/amdgpu/amdgpu_job.c  | 7 +++++++
     drivers/gpu/drm/amd/amdgpu/amdgpu_job.h  | 2 ++
     drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 3 +++
     drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 1 +
     drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c   | 1 +
     7 files changed, 23 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
index 08f268dab8f5..eee40f385793 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
@@ -475,6 +475,7 @@ struct amdgpu_gfx {
        uint32_t                        compute_supported_reset;

        /* gfx off */
+    bool                            gfx_off_held;       /* true: rings hold 
gfx_off */
        bool                            gfx_off_state;      /* true: enabled, 
false: disabled */
        struct mutex                    gfx_off_mutex;      /* mutex to change 
gfxoff state */
        uint32_t                        gfx_off_req_count;  /* default 1, 
enable gfx off: dec 1, disable gfx off: add 1 */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
index 206b70acb29a..bf9bffe40235 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
@@ -191,6 +191,14 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned 
int num_ibs,
                return r;
        }

+    if (job && adev->gfx.gfx_off_held &&
+        (ring->funcs->type == AMDGPU_RING_TYPE_GFX ||
+         ring->funcs->type == AMDGPU_RING_TYPE_COMPUTE ||
+         ring->funcs->type == AMDGPU_RING_TYPE_SDMA)) {
+            amdgpu_gfx_off_ctrl(adev, false);
+            job->gfx_off_held = true;
+    }
+
        need_ctx_switch = ring->current_ctx != fence_ctx;
        if (ring->funcs->emit_pipeline_sync && job &&
            ((tmp = amdgpu_sync_get_fence(&job->explicit_sync)) ||
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 2b58e353cca1..4cfd175ac6df 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -191,6 +191,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, struct 
amdgpu_vm *vm,
        if (!*job)
                return -ENOMEM;

+    (*job)->adev = adev;
        (*job)->vm = vm;

        amdgpu_sync_create(&(*job)->explicit_sync);
@@ -268,6 +269,9 @@ static void amdgpu_job_free_cb(struct drm_sched_job *s_job)

        amdgpu_sync_free(&job->explicit_sync);

+    if (job->gfx_off_held)
+            amdgpu_gfx_off_ctrl(job->adev, true);
+

        /* only put the hw fence if has embedded fence */
        if (!job->hw_fence.base.ops)
                kfree(job);
@@ -301,6 +305,9 @@ void amdgpu_job_free(struct amdgpu_job *job)
        if (job->gang_submit != &job->base.s_fence->scheduled)
                dma_fence_put(job->gang_submit);

+    if (job->gfx_off_held)
+            amdgpu_gfx_off_ctrl(job->adev, true);
+
        if (!job->hw_fence.base.ops)
                kfree(job);
        else
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
index 2f302266662b..d4ab832ac193 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
@@ -46,6 +46,7 @@ enum amdgpu_ib_pool_type;

     struct amdgpu_job {
        struct drm_sched_job    base;
+    struct amdgpu_device    *adev;
        struct amdgpu_vm        *vm;
        struct amdgpu_sync      explicit_sync;
        struct amdgpu_fence     hw_fence;
@@ -55,6 +56,7 @@ struct amdgpu_job {
        bool                    vm_needs_flush;
        bool                    gds_switch_needed;
        bool                    spm_update_needed;
+    bool                    gfx_off_held;
        uint64_t                vm_pd_addr;
        unsigned                vmid;
        unsigned                pasid;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 426834806fbf..22cac94e2f2a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -350,6 +350,9 @@ int amdgpu_ring_init(struct amdgpu_device *adev, struct 
amdgpu_ring *ring,
        ring->max_dw = max_dw;
        ring->hw_prio = hw_prio;

+    if (ring->funcs->gfx_off_held)
+            adev->gfx.gfx_off_held = true;
+
        if (!ring->no_scheduler && ring->funcs->type < AMDGPU_HW_IP_NUM) {
                hw_ip = ring->funcs->type;
                num_sched = &adev->gpu_sched[hw_ip][hw_prio].num_scheds;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index 784ba2ec354c..afaf951b0b78 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -202,6 +202,7 @@ struct amdgpu_ring_funcs {
        bool                    support_64bit_ptrs;
        bool                    no_user_fence;
        bool                    secure_submission_supported;
+    bool                    gfx_off_held;
        unsigned                extra_dw;

        /* ring read/write ptr handling */
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c 
b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
index 42a25150f83a..c88de65e82bc 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
@@ -1944,6 +1944,7 @@ static const struct amdgpu_ring_funcs 
sdma_v5_2_ring_funcs = {
        .nop = SDMA_PKT_NOP_HEADER_OP(SDMA_OP_NOP),
        .support_64bit_ptrs = true,
        .secure_submission_supported = true,
+    .gfx_off_held = true,
        .get_rptr = sdma_v5_2_ring_get_rptr,
        .get_wptr = sdma_v5_2_ring_get_wptr,
        .set_wptr = sdma_v5_2_ring_set_wptr,

Re: [RFC] drm/amdgpu/sdma5.2: Avoid latencies caused by the powergating workaround

Reply via email to