On 16/07/2025 14:00, Christian König wrote:
On 16.07.25 14:51, Tvrtko Ursulin wrote:
be disabled once GFX/SDMA is no longer active. In this particular
case there was a race condition somewhere in the internal handshaking
with SDMA which led to SDMA missing doorbells sometimes and not
executing the job even if there was work in the ring.
Thank you, that is more or less what I assumed.
But in this case there should be no harm in holding GFXOFF disabled
until the job completes (like this patch)? Only a win to avoid the SMU
communication latencies while the unit is powered on anyway.
The extra latency is only on the CPU side, once the
amdgpu_ring_commit() is called the SDMA engine is already working.
It is on the CPU side but can create bubbles in the pipeline, no? Is
there no scope with AMD to have GFX and SDMA jobs depend on each other?
Because, as said, I've seen some high latencies from the GFXOFF disable
calls.
The SDMA job is already executing at that point. The allow gfxoff
message to the firmware shouldn't come until later because it's
handled by a delayed work thread from end_use(). If you have multiple
submissions to SDMA within the delay window, the begin_use() and
end_use() will just be ref count handling and won't actually talk to
the firmware.
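
The ref counting plus delayed "allow" described above can be sketched as a
small userspace model. This is only an illustration under stated
assumptions: every name here (gfxoff_model, model_begin_use(),
MODEL_DELAY_MS and so on) is hypothetical and not the amdgpu API, time is
passed in by hand instead of using a real delayed work item, and the 100 ms
window simply mirrors the hysteresis mentioned in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model only; none of these names exist in amdgpu. */
#define MODEL_DELAY_MS 100UL

struct gfxoff_model {
	unsigned int req_count;    /* outstanding "keep GFXOFF disabled" requests */
	bool gfxoff_allowed;       /* what the (simulated) firmware was last told */
	bool allow_pending;        /* a delayed "allow" is queued */
	unsigned long allow_at_ms; /* when the queued "allow" would run */
	unsigned int fw_messages;  /* count of simulated firmware messages */
};

void model_init(struct gfxoff_model *m)
{
	m->req_count = 0;
	m->gfxoff_allowed = true;
	m->allow_pending = false;
	m->allow_at_ms = 0;
	m->fw_messages = 0;
}

/* Stand-in for the delayed work: runs only once its deadline has passed. */
void model_tick(struct gfxoff_model *m, unsigned long now_ms)
{
	if (m->allow_pending && now_ms >= m->allow_at_ms) {
		m->allow_pending = false;
		if (m->req_count == 0 && !m->gfxoff_allowed) {
			m->gfxoff_allowed = true;
			m->fw_messages++; /* "allow GFXOFF" message */
		}
	}
}

void model_begin_use(struct gfxoff_model *m, unsigned long now_ms)
{
	model_tick(m, now_ms);
	m->req_count++;
	m->allow_pending = false; /* cancel a still-queued "allow" */
	if (m->gfxoff_allowed) {
		m->gfxoff_allowed = false;
		m->fw_messages++; /* slow path: "disallow GFXOFF" message */
	}
}

void model_end_use(struct gfxoff_model *m, unsigned long now_ms)
{
	assert(m->req_count > 0);
	if (--m->req_count == 0) {
		/* Do not talk to the firmware yet, only arm the delay. */
		m->allow_pending = true;
		m->allow_at_ms = now_ms + MODEL_DELAY_MS;
	}
}
```

In this model a second submission inside the window costs no firmware
message at all, while one that arrives after the window has expired pays
for both the "allow" that already ran and a fresh "disallow".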
I followed up with testing a bunch more games and, as it turns out, Cyberpunk
2077 is the only one which has this submission pattern where the default
GFX_OFF_DELAY_ENABLE is regularly defeated.
There, around 1.2 times per second the SDMA submissions miss that 100ms hysteresis
and cause a CPU latency over 100us (I only measured when >100us and ignored the
rest). Average latency is ~400us and max is ~2ms. So IMHO quite bad.
What exactly does Cyberpunk do to hit that? Are those SDMA page table updates,
clears or userspace submissions?
I will have to look into that to provide an answer.
And the vast majority of those latencies come from the SMU request. Only very
rarely does someone hit the mutex contention path.
So that was the motivation for the RFC. I suppose I could have also proposed to
increase the hysteresis, but holding the GFXOFF disabled for the duration of
the job sounded preferable for power consumption.
Anyway, given I only found Cyberpunk 2077 suffers from this, I guess it maybe
isn't too interesting to upstream for you guys. Then again it is limited to
a specific old SKU so maybe it should not be that controversial either? Only that
Christian NAKed tying it to job lifetime. So I don't know, AMD's call.
Well what you could do is to take a look if we couldn't simplify the SMU and/or
adjust the GFX_OFF_DELAY_ENABLE.
SMU stuff, as far as I can follow it, ends up with simply sending some
messages to the firmware. So I am not sure what and how could be
optimised there.
Increasing GFX_OFF_DELAY_ENABLE would work, if large enough, but I
think it could be bad for power usage, depending on the workload.
On the other hand why does it help to keep GFXOFF disabled while running the
SDMA job?
Only because I tied it to both GFX and SDMA.
RFC does this:
1) Marks SDMA as "needs GFXOFF workaround".
2) Propagates "needs GFXOFF workaround" to adev if any active ring has
it set.
3) If adev has it set, it grabs an extra GFXOFF disable for GFX,
COMPUTE and SDMA submissions, and marks those jobs as "hold GFXOFF".
4) Releases the GFXOFF when marked jobs are "completed" (well freed,
since completion is IRQ context so hard).
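
To make the four steps above concrete, here is a rough userspace sketch.
It is not the RFC code: the struct and function names (model_ring,
model_dev, model_job_submit() and friends) are made up for illustration,
and the "extra GFXOFF disable" is reduced to a plain counter instead of a
call into the real driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types for illustration; not the amdgpu structures. */
enum model_ring_type {
	MODEL_RING_GFX,
	MODEL_RING_COMPUTE,
	MODEL_RING_SDMA,
	MODEL_RING_OTHER,
};

struct model_ring {
	enum model_ring_type type;
	bool active;
	bool needs_gfxoff_wa;             /* step 1: set for SDMA rings */
};

struct model_dev {
	bool needs_gfxoff_wa;             /* step 2: propagated from rings */
	unsigned int gfxoff_disable_refs; /* extra disables currently held */
};

struct model_job {
	struct model_dev *dev;
	bool holds_gfxoff;                /* step 3: job is marked "hold GFXOFF" */
};

/* Step 2: the device needs the workaround if any active ring does. */
void model_dev_update_wa(struct model_dev *dev,
			 const struct model_ring *rings, size_t n)
{
	size_t i;

	dev->needs_gfxoff_wa = false;
	for (i = 0; i < n; i++) {
		if (rings[i].active && rings[i].needs_gfxoff_wa)
			dev->needs_gfxoff_wa = true;
	}
}

/* Step 3: GFX, COMPUTE and SDMA submissions take an extra disable. */
void model_job_submit(struct model_job *job, struct model_dev *dev,
		      const struct model_ring *ring)
{
	job->dev = dev;
	job->holds_gfxoff = false;

	if (!dev->needs_gfxoff_wa)
		return;

	switch (ring->type) {
	case MODEL_RING_GFX:
	case MODEL_RING_COMPUTE:
	case MODEL_RING_SDMA:
		dev->gfxoff_disable_refs++;
		job->holds_gfxoff = true;
		break;
	default:
		break;
	}
}

/* Step 4: the reference is dropped when the job is freed. */
void model_job_free(struct model_job *job)
{
	if (job->holds_gfxoff) {
		assert(job->dev->gfxoff_disable_refs > 0);
		job->dev->gfxoff_disable_refs--;
		job->holds_gfxoff = false;
	}
}
```

The release sits in the free path rather than in a completion callback
because, as noted in step 4, completion runs in IRQ context.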
AFAIU from what Alex said, the parts of the chip handling
GFX and SDMA (not sure about compute) are under the same "power gating
domain" (right name?).
What would you suggest to log power use during the game? Something like
once per second or so?
Regards,
Tvrtko