Hi Lijo,
On 8/14/2025 2:11 PM, Lazar, Lijo wrote:
[Public]
We already have a per-instance power state that can be tracked. What about
something like the attached?
This also has concurrent access to the power state: vcn.inst[i].cur_state is
not protected by workload_profile_mutex everywhere, so it can still change
while you are holding workload_profile_mutex and checking it.
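To illustrate, roughly this interleaving (sketch only, assuming cur_state is
read under workload_profile_mutex in the attached patch but written under the
per-instance vcn_pg_lock):

/* idle worker on inst0, deciding the global profile from cur_state */
mutex_lock(&adev->vcn.workload_profile_mutex);
busy = false;
for (i = 0; i < adev->vcn.num_vcn_inst; ++i)
        if (adev->vcn.inst[i].cur_state != AMD_PG_STATE_GATE)
                busy = true;
/* begin_use on another instance can take its vcn_pg_lock here and
 * move cur_state to UNGATE -- the decision below uses a stale view
 */
if (!busy)
        amdgpu_dpm_switch_power_profile(adev, PP_SMC_POWER_PROFILE_VIDEO, false);
mutex_unlock(&adev->vcn.workload_profile_mutex);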
Regards,
Sathish
Thanks,
Lijo
-----Original Message-----
From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of Sundararaju,
Sathishkumar
Sent: Thursday, August 14, 2025 4:43 AM
To: Alex Deucher <alexdeuc...@gmail.com>
Cc: Wu, David <david....@amd.com>; Deucher, Alexander
<alexander.deuc...@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu/vcn: fix video profile race condition (v3)
On 8/14/2025 3:38 AM, Alex Deucher wrote:
On Wed, Aug 13, 2025 at 5:16 PM Sundararaju, Sathishkumar
<sathishkumar.sundarar...@amd.com> wrote:
On 8/14/2025 2:33 AM, Alex Deucher wrote:
On Wed, Aug 13, 2025 at 4:58 PM Sundararaju, Sathishkumar
<sathishkumar.sundarar...@amd.com> wrote:
On 8/14/2025 1:35 AM, Alex Deucher wrote:
On Wed, Aug 13, 2025 at 2:23 PM Sundararaju, Sathishkumar
<sathishkumar.sundarar...@amd.com> wrote:
Hi Alex, Hi David,
I see David's concern, but his suggestion still won't solve the
problem, and neither does the current form. The reason:
The emitted fence count and total submission count are fast
transients which frequently become 0 in between video decodes
(between jobs), so even with the atomics and locks there can be a
switch of the video power profile. In the current form of the patch
that window is minimized, but it can still happen under stress
testing. But the power state of any instance becoming zero
Can you explain how this can happen? I'm not seeing it.
Consider this situation: inst0 and inst1 are actively decoding, inst0's
decode completes, and its delayed idle work starts.
The inst0 idle handler can read 0 total fences and 0 total submission
count even while inst1 is actively decoding, if it happens to run
between jobs:
- begin_use increments vcn.total_submission_cnt and end_use
decrements it, so between jobs it can be 0.
- if the outstanding fences have all signalled and no new fence has
been emitted yet, between jobs the fence count can also be 0.
- neither of the above conditions means video decode is
complete on inst1; it is still actively decoding.
How can there be active decoding without an outstanding fence? In
that case, total_fences (fences from both instances) would be non-0.
I mean that on inst1 the scheduled job is already complete, so there are 0
outstanding fences, and the next job is yet to be scheduled
and committed to the ring (inst1). This doesn't mean decode has stopped
on
inst1, right? (I am saying the timing of inst0's idle work could coincide with
this.)
Or am I wrong in assuming this? Can this ever happen? Please
correct my understanding here.
The flow looks like:
begin_use(inst)
emit_fence(inst)
end_use(inst)
...later
fence signals
...later
work handler
In begin_use we increment the global and per-instance submission counts.
This protects the power gating and profile until end_use. In end_use
we decrement the submission counts because we don't need to protect anything
any more: we have the fence that was submitted via the ring, and that
fence won't signal until the job is complete.
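As a rough sketch of how the counters and fences are meant to cover the whole
window (pseudocode, not the exact source):

begin_use(ring):
        atomic_inc(&vcn_inst->total_submission_cnt);    /* window opens */
ring emits fence:
        /* counted by amdgpu_fence_count_emitted() until it signals */
end_use(ring):
        atomic_dec(&vcn_inst->total_submission_cnt);    /* fence now covers the job */
idle worker:
        if (!amdgpu_fence_count_emitted(&vcn_inst->ring_dec) &&
            !atomic_read(&vcn_inst->total_submission_cnt))
                /* only then gate the instance / drop the profile */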
Is the next begin_use always guaranteed to run before the current job's fence
signals?
If not, then both the total submission count and the total fence count are
zero, for example with delayed job/packet submissions from user space, or
when the next job is scheduled only after the current job's fence signals,
roughly the timeline sketched below.
If this is never possible, then (v3) is perfect.
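For example (hypothetical timing, illustration only):

/* inst1: job N fence signals      -> 0 outstanding fences, 0 submissions
 *        (userspace is slow, job N+1 not submitted yet)
 * inst0: idle work runs           -> sees 0 total fences and 0 total
 *                                    submissions, drops the VIDEO profile
 * inst1: begin_use() for job N+1  -> switches the profile back on, but the
 *                                    stream briefly ran without it
 */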
Regards,
Sathish
For power gating, we
only care about the submission count and fences for that instance; for
the profile, we care about the submission count and fences across all
instances. Once the fences have signalled, the outstanding fence count
will be 0 and there won't be any active work.
Alex
Regards,
Sathish
Alex
Whereas if an instance is powered off, we are sure the idle time has
elapsed and it really is off, so there is no possibility of active video
decode on it. When all instances are off we can safely assume there is no
active decode, and the global lock protects against a new begin_use on any
instance. The only distant concern is a global common lock w.r.t. perf,
but since we already have a global workload profile mutex, there shouldn't
be any drop in perf with just one single global lock for all
instances.
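Something along these lines is what I mean (rough sketch only; a single
global adev->vcn.pg_lock is assumed here instead of the existing
per-instance vcn_pg_lock):

/* idle path: drop the profile only when every instance is really gated */
mutex_lock(&adev->vcn.pg_lock);                 /* assumed global lock */
all_gated = true;
for (i = 0; i < adev->vcn.num_vcn_inst; ++i)
        if (adev->vcn.inst[i].cur_state != AMD_PG_STATE_GATE)
                all_gated = false;
if (all_gated && adev->vcn.workload_profile_active) {
        amdgpu_dpm_switch_power_profile(adev, PP_SMC_POWER_PROFILE_VIDEO, false);
        adev->vcn.workload_profile_active = false;
}
mutex_unlock(&adev->vcn.pg_lock);
/* begin_use takes the same adev->vcn.pg_lock before ungating, so no
 * instance can power up between the check and the profile switch
 */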
I am sending out a patch with this fix and will leave it to you to
decide the right method. If you think the outstanding total fences can
never be 0 during decode, then your previous version (v3) itself is
good; there is no real benefit in splitting the handlers as such.
Regards,
Sathish
If it is possible, maybe it would be easier to just split the
profile and powergating into separate handlers. The profile one
would be global and the powergating one would be per instance.
See the attached patches.
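Roughly this shape (sketch of the idea only, not the attached patches; the
fence-count helpers and the profile_work field are made up for illustration,
locking omitted):

/* per-instance handler: power gating for this instance only */
static void amdgpu_vcn_idle_work_handler(struct work_struct *work)
{
        struct amdgpu_vcn_inst *vcn_inst =
                container_of(work, struct amdgpu_vcn_inst, idle_work.work);

        if (!vcn_inst_fence_count(vcn_inst) &&          /* hypothetical helper */
            !atomic_read(&vcn_inst->total_submission_cnt))
                vcn_inst->set_pg_state(vcn_inst, AMD_PG_STATE_GATE);
}

/* global handler: workload profile, looks at all instances */
static void amdgpu_vcn_profile_work_handler(struct work_struct *work)
{
        struct amdgpu_device *adev =
                container_of(work, struct amdgpu_device, vcn.profile_work.work); /* hypothetical field */

        if (!vcn_total_fence_count(adev) &&             /* hypothetical helper */
            !atomic_read(&adev->vcn.total_submission_cnt))
                amdgpu_dpm_switch_power_profile(adev, PP_SMC_POWER_PROFILE_VIDEO, false);
}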
Alex
can be a sure-shot indication of a break in the video decode. The
mistake in my patch was using a per-instance mutex; I should have
used a common global mutex, which would cover the situation David is trying to
bring out.
Using one global vcn.pg_lock for idle and begin_use, and using
flags to track the power state, could help us avoid this situation entirely.
Regards,
Sathish
On 8/13/2025 11:46 PM, Wu, David wrote:
On 8/13/2025 12:51 PM, Alex Deucher wrote:
On Wed, Aug 13, 2025 at 12:39 PM Wu, David <david...@amd.com> wrote:
Hi Alex,
The addition of total_submission_cnt should work, in that it
is unlikely to have a context switch right after the begin_use().
The suggestion of moving it inside the lock (which I prefer, in
case someone adds more code before the lock that isn't reviewed
thoroughly)
is up to you to decide.
Reviewed-by: David (Ming Qiang) Wu <david....@amd.com>
Thanks,
David
On 8/13/2025 9:45 AM, Alex Deucher wrote:
If there are multiple instances of the VCN running, we may
end up switching the video profile while another instance is
active because we only take into account the current
instance's submissions. Look at all outstanding fences for
the video profile.
v2: drop early exit in begin_use()
v3: handle possible race between begin_use() and the work handler
Fixes: 3b669df92c85 ("drm/amdgpu/vcn: adjust workload profile handling")
Reviewed-by: Sathishkumar S <sathishkumar.sundarar...@amd.com> (v1)
Signed-off-by: Alex Deucher <alexander.deuc...@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 40 ++++++++++++-------------
drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h | 1 +
2 files changed, 21 insertions(+), 20 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
index 9a76e11d1c184..593c1ddf8819b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
@@ -415,19 +415,25 @@ static void amdgpu_vcn_idle_work_handler(struct work_struct *work)
struct amdgpu_vcn_inst *vcn_inst =
container_of(work, struct amdgpu_vcn_inst, idle_work.work);
struct amdgpu_device *adev = vcn_inst->adev;
- unsigned int fences = 0, fence[AMDGPU_MAX_VCN_INSTANCES] = {0};
- unsigned int i = vcn_inst->inst, j;
+ unsigned int total_fences = 0, fence[AMDGPU_MAX_VCN_INSTANCES] = {0};
+ unsigned int i, j;
int r = 0;
- if (adev->vcn.harvest_config & (1 << i))
+ if (adev->vcn.harvest_config & (1 << vcn_inst->inst))
return;
- for (j = 0; j < adev->vcn.inst[i].num_enc_rings; ++j)
- fence[i] += amdgpu_fence_count_emitted(&vcn_inst->ring_enc[j]);
+ for (i = 0; i < adev->vcn.num_vcn_inst; ++i) {
+ struct amdgpu_vcn_inst *v = &adev->vcn.inst[i];
+
+ for (j = 0; j < v->num_enc_rings; ++j)
+ fence[i] += amdgpu_fence_count_emitted(&v->ring_enc[j]);
+ fence[i] += amdgpu_fence_count_emitted(&v->ring_dec);
+ total_fences += fence[i];
+ }
/* Only set DPG pause for VCN3 or below, VCN4 and above will be handled by FW */
if (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG &&
- !adev->vcn.inst[i].using_unified_queue) {
+ !vcn_inst->using_unified_queue) {
struct dpg_pause_state new_state;
if (fence[i] ||
@@ -436,18 +442,18 @@ static void amdgpu_vcn_idle_work_handler(struct work_struct *work)
else
new_state.fw_based = VCN_DPG_STATE__UNPAUSE;
- adev->vcn.inst[i].pause_dpg_mode(vcn_inst, &new_state);
+ vcn_inst->pause_dpg_mode(vcn_inst, &new_state);
}
- fence[i] += amdgpu_fence_count_emitted(&vcn_inst->ring_dec);
- fences += fence[i];
-
- if (!fences && !atomic_read(&vcn_inst->total_submission_cnt)) {
+ if (!fence[vcn_inst->inst] && !atomic_read(&vcn_inst->total_submission_cnt)) {
+ /* This is specific to this instance */
mutex_lock(&vcn_inst->vcn_pg_lock);
vcn_inst->set_pg_state(vcn_inst, AMD_PG_STATE_GATE);
mutex_unlock(&vcn_inst->vcn_pg_lock);
mutex_lock(&adev->vcn.workload_profile_mutex);
- if (adev->vcn.workload_profile_active) {
+ /* This is global and depends on all VCN instances */
+ if (adev->vcn.workload_profile_active && !total_fences &&
+ !atomic_read(&adev->vcn.total_submission_cnt)) {
r = amdgpu_dpm_switch_power_profile(adev, PP_SMC_POWER_PROFILE_VIDEO,
false);
if (r)
@@ -467,16 +473,10 @@ void amdgpu_vcn_ring_begin_use(struct amdgpu_ring *ring)
int r = 0;
atomic_inc(&vcn_inst->total_submission_cnt);
+ atomic_inc(&adev->vcn.total_submission_cnt);
move this addition down inside the mutex lock
cancel_delayed_work_sync(&vcn_inst->idle_work);
- /* We can safely return early here because we've cancelled the
- * the delayed work so there is no one else to set it to false
- * and we don't care if someone else sets it to true.
- */
- if (adev->vcn.workload_profile_active)
- goto pg_lock;
-
mutex_lock(&adev->vcn.workload_profile_mutex);
move to here:
atomic_inc(&adev->vcn.total_submission_cnt);
I think this should work for multiple instances.
Why does this need to be protected by the mutex?
hmm.. OK - no need and it is actually better before the mutex.
David
Alex
David
if (!adev->vcn.workload_profile_active) {
r = amdgpu_dpm_switch_power_profile(adev,
PP_SMC_POWER_PROFILE_VIDEO,
@@ -487,7 +487,6 @@ void amdgpu_vcn_ring_begin_use(struct amdgpu_ring *ring)
}
mutex_unlock(&adev->vcn.workload_profile_mutex);
-pg_lock:
mutex_lock(&vcn_inst->vcn_pg_lock);
vcn_inst->set_pg_state(vcn_inst, AMD_PG_STATE_UNGATE);
@@ -528,6 +527,7 @@ void amdgpu_vcn_ring_end_use(struct amdgpu_ring *ring)
atomic_dec(&ring->adev->vcn.inst[ring->me].dpg_enc_submission_cnt);
atomic_dec(&ring->adev->vcn.inst[ring->me].total_submission_cnt);
+ atomic_dec(&ring->adev->vcn.total_submission_cnt);
schedule_delayed_work(&ring->adev->vcn.inst[ring->me].idle_work,
VCN_IDLE_TIMEOUT);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
index b3fb1d0e43fc9..febc3ce8641ff 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
@@ -352,6 +352,7 @@ struct amdgpu_vcn {
uint16_t inst_mask;
uint8_t num_inst_per_aid;
+ atomic_t total_submission_cnt;
/* IP reg dump */
uint32_t *ip_dump;