On Wed, Aug 13, 2025 at 4:58 PM Sundararaju, Sathishkumar
<sathishkumar.sundarar...@amd.com> wrote:
>
>
> On 8/14/2025 1:35 AM, Alex Deucher wrote:
> > On Wed, Aug 13, 2025 at 2:23 PM Sundararaju, Sathishkumar
> > <sathishkumar.sundarar...@amd.com> wrote:
> >> Hi Alex, Hi David,
> >>
> >> I see David's concern, but his suggestion still won't solve the problem,
> >> and neither does the current form. The reason:
> >>
> >> The emitted fence count and total submission count are fast transients
> >> which frequently become 0 in between video decodes (between jobs). Even
> >> with the atomics and locks there can be a switch of the video power
> >> profile; in the current form of the patch that window is minimized, but
> >> it can still happen if stress tested. But the power state of any
> >> instance becoming zero
> > Can you explain how this can happen?  I'm not seeing it.
> Consider this situation: inst0 and inst1 are actively decoding, inst0's
> decode completes, and its delayed idle work starts.
> inst0's idle handler can read 0 total fences and 0 total submission count
> even while inst1 is actively decoding, because the read can land between
> jobs:
> - begin_use increments vcn.total_submission_cnt and end_use decrements
>   vcn.total_submission_cnt, so it can be 0.
> - if the outstanding fences are cleared and no new fence has been emitted
>   between jobs, it can be 0.
> - neither of the above conditions means video decode is complete on
>   inst1; it is actively decoding.
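To keep the moving parts in view, this is the counting pattern at issue,
condensed from the patch quoted further down (an illustrative sketch only;
the DPG pause handling, profile switching and error paths are omitted):

/* amdgpu_vcn_ring_begin_use(), condensed: */
	atomic_inc(&vcn_inst->total_submission_cnt);   /* per-instance counter */
	atomic_inc(&adev->vcn.total_submission_cnt);   /* new global counter */
	cancel_delayed_work_sync(&vcn_inst->idle_work);
	/* ... enable the VIDEO profile and ungate this instance ... */

/* amdgpu_vcn_ring_end_use(), condensed: */
	atomic_dec(&vcn_inst->total_submission_cnt);
	atomic_dec(&adev->vcn.total_submission_cnt);
	schedule_delayed_work(&vcn_inst->idle_work, VCN_IDLE_TIMEOUT);

/*
 * The claimed window: after end_use() of job N and before begin_use() of
 * job N+1 on a still-busy instance, another instance's idle handler could
 * sample both the global submission count and the per-instance emitted
 * fence totals as zero.
 */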
How can there be active decoding without an outstanding fence?  In that
case, total_fences (fences from both instances) would be non-0.

Alex

>
> Whereas if the instances are powered off, we are sure the idle time has
> passed and the instance is powered off, with no possible way of an active
> video decode. When all instances are off we can safely assume there is no
> active decode, and a global lock protects against a new begin_use on any
> instance. The only distant concern is global common locks w.r.t. perf, but
> we already have a global workload-profile mutex, so there shouldn't be any
> drop in perf with just one single global lock for all instances.
>
> Just sending out a patch with this fix; I will leave it to you to decide
> the right method. If you think the outstanding total fences can never be
> 0 during decode, then your previous version (v3) itself is good, and
> there is no real benefit in splitting the handlers as such.
>
> Regards,
> Sathish
>
>
> > If it is possible, maybe it would be easier to just split the profile
> > and powergating into separate handlers.  The profile one would be
> > global and the powergating one would be per instance.  See the
> > attached patches.
> >
> > Alex
> >
> >> can be a sure-shot indication of a break in a video decode. The mistake
> >> in my patch was using a per-instance mutex; I should have used a common
> >> global mutex, then that covers the situation David is trying to bring
> >> out.
> >>
> >> Using one global vcn.pg_lock for idle and begin_use, and using flags to
> >> track the power state, could help us totally avoid this situation.
> >>
> >> Regards,
> >>
> >> Sathish
> >>
> >> On 8/13/2025 11:46 PM, Wu, David wrote:
> >>> On 8/13/2025 12:51 PM, Alex Deucher wrote:
> >>>> On Wed, Aug 13, 2025 at 12:39 PM Wu, David <david...@amd.com> wrote:
> >>>>> Hi Alex,
> >>>>>
> >>>>> The addition of total_submission_cnt should work - in that
> >>>>> it is unlikely to have a context switch right after the begin_use().
> >>>>> The suggestion of moving it inside the lock (which I prefer in case
> >>>>> someone adds more before the lock and it is not reviewed thoroughly)
> >>>>> - up to you to decide.
> >>>>>
> >>>>> Reviewed-by: David (Ming Qiang) Wu <david....@amd.com>
> >>>>>
> >>>>> Thanks,
> >>>>> David
> >>>>> On 8/13/2025 9:45 AM, Alex Deucher wrote:
> >>>>>> If there are multiple instances of the VCN running,
> >>>>>> we may end up switching the video profile while another
> >>>>>> instance is active because we only take into account
> >>>>>> the current instance's submissions.  Look at all
> >>>>>> outstanding fences for the video profile.
> >>>>>>
> >>>>>> v2: drop early exit in begin_use()
> >>>>>> v3: handle possible race between begin_use() and work handler
> >>>>>>
> >>>>>> Fixes: 3b669df92c85 ("drm/amdgpu/vcn: adjust workload profile handling")
> >>>>>> Reviewed-by: Sathishkumar S <sathishkumar.sundarar...@amd.com> (v1)
> >>>>>> Signed-off-by: Alex Deucher <alexander.deuc...@amd.com>
> >>>>>> ---
> >>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 40 ++++++++++++-------------
> >>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h |  1 +
> >>>>>>  2 files changed, 21 insertions(+), 20 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> >>>>>> index 9a76e11d1c184..593c1ddf8819b 100644
> >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> >>>>>> @@ -415,19 +415,25 @@ static void amdgpu_vcn_idle_work_handler(struct work_struct *work)
> >>>>>>          struct amdgpu_vcn_inst *vcn_inst =
> >>>>>>                  container_of(work, struct amdgpu_vcn_inst, idle_work.work);
> >>>>>>          struct amdgpu_device *adev = vcn_inst->adev;
> >>>>>> -        unsigned int fences = 0, fence[AMDGPU_MAX_VCN_INSTANCES] = {0};
> >>>>>> -        unsigned int i = vcn_inst->inst, j;
> >>>>>> +        unsigned int total_fences = 0, fence[AMDGPU_MAX_VCN_INSTANCES] = {0};
> >>>>>> +        unsigned int i, j;
> >>>>>>          int r = 0;
> >>>>>>
> >>>>>> -        if (adev->vcn.harvest_config & (1 << i))
> >>>>>> +        if (adev->vcn.harvest_config & (1 << vcn_inst->inst))
> >>>>>>                  return;
> >>>>>>
> >>>>>> -        for (j = 0; j < adev->vcn.inst[i].num_enc_rings; ++j)
> >>>>>> -                fence[i] += amdgpu_fence_count_emitted(&vcn_inst->ring_enc[j]);
> >>>>>> +        for (i = 0; i < adev->vcn.num_vcn_inst; ++i) {
> >>>>>> +                struct amdgpu_vcn_inst *v = &adev->vcn.inst[i];
> >>>>>> +
> >>>>>> +                for (j = 0; j < v->num_enc_rings; ++j)
> >>>>>> +                        fence[i] += amdgpu_fence_count_emitted(&v->ring_enc[j]);
> >>>>>> +                fence[i] += amdgpu_fence_count_emitted(&v->ring_dec);
> >>>>>> +                total_fences += fence[i];
> >>>>>> +        }
> >>>>>>
> >>>>>>          /* Only set DPG pause for VCN3 or below, VCN4 and above will be handled by FW */
> >>>>>>          if (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG &&
> >>>>>> -            !adev->vcn.inst[i].using_unified_queue) {
> >>>>>> +            !vcn_inst->using_unified_queue) {
> >>>>>>                  struct dpg_pause_state new_state;
> >>>>>>
> >>>>>>                  if (fence[i] ||
> >>>>>> @@ -436,18 +442,18 @@ static void amdgpu_vcn_idle_work_handler(struct work_struct *work)
> >>>>>>                  else
> >>>>>>                          new_state.fw_based = VCN_DPG_STATE__UNPAUSE;
> >>>>>>
> >>>>>> -                adev->vcn.inst[i].pause_dpg_mode(vcn_inst, &new_state);
> >>>>>> +                vcn_inst->pause_dpg_mode(vcn_inst, &new_state);
> >>>>>>          }
> >>>>>>
> >>>>>> -        fence[i] += amdgpu_fence_count_emitted(&vcn_inst->ring_dec);
> >>>>>> -        fences += fence[i];
> >>>>>> -
> >>>>>> -        if (!fences && !atomic_read(&vcn_inst->total_submission_cnt)) {
> >>>>>> +        if (!fence[vcn_inst->inst] && !atomic_read(&vcn_inst->total_submission_cnt)) {
> >>>>>> +                /* This is specific to this instance */
> >>>>>>                  mutex_lock(&vcn_inst->vcn_pg_lock);
> >>>>>>                  vcn_inst->set_pg_state(vcn_inst, AMD_PG_STATE_GATE);
> >>>>>>                  mutex_unlock(&vcn_inst->vcn_pg_lock);
> >>>>>>                  mutex_lock(&adev->vcn.workload_profile_mutex);
> >>>>>> -                if (adev->vcn.workload_profile_active) {
> >>>>>> +                /* This is global and depends on all VCN instances */
> >>>>>> +                if (adev->vcn.workload_profile_active && !total_fences &&
> >>>>>> +                    !atomic_read(&adev->vcn.total_submission_cnt)) {
> >>>>>>                          r = amdgpu_dpm_switch_power_profile(adev,
> >>>>>>                                                              PP_SMC_POWER_PROFILE_VIDEO,
> >>>>>>                                                              false);
> >>>>>>                          if (r)
> >>>>>> @@ -467,16 +473,10 @@ void amdgpu_vcn_ring_begin_use(struct amdgpu_ring *ring)
> >>>>>>          int r = 0;
> >>>>>>
> >>>>>>          atomic_inc(&vcn_inst->total_submission_cnt);
> >>>>>> +        atomic_inc(&adev->vcn.total_submission_cnt);
> >>>>> move this addition down inside the mutex lock
> >>>>>>          cancel_delayed_work_sync(&vcn_inst->idle_work);
> >>>>>>
> >>>>>> -        /* We can safely return early here because we've cancelled the
> >>>>>> -         * the delayed work so there is no one else to set it to false
> >>>>>> -         * and we don't care if someone else sets it to true.
> >>>>>> -         */
> >>>>>> -        if (adev->vcn.workload_profile_active)
> >>>>>> -                goto pg_lock;
> >>>>>> -
> >>>>>>          mutex_lock(&adev->vcn.workload_profile_mutex);
> >>>>> move to here:
> >>>>>          atomic_inc(&adev->vcn.total_submission_cnt);
> >>>>> I think this should work for multiple instances.
> >>>> Why does this need to be protected by the mutex?
> >>> hmm.. OK - no need and it is actually better before the mutex.
> >>> David
> >>>> Alex
> >>>>
> >>>>> David
> >>>>>>          if (!adev->vcn.workload_profile_active) {
> >>>>>>                  r = amdgpu_dpm_switch_power_profile(adev,
> >>>>>>                                                      PP_SMC_POWER_PROFILE_VIDEO,
> >>>>>> @@ -487,7 +487,6 @@ void amdgpu_vcn_ring_begin_use(struct amdgpu_ring *ring)
> >>>>>>          }
> >>>>>>          mutex_unlock(&adev->vcn.workload_profile_mutex);
> >>>>>>
> >>>>>> -pg_lock:
> >>>>>>          mutex_lock(&vcn_inst->vcn_pg_lock);
> >>>>>>          vcn_inst->set_pg_state(vcn_inst, AMD_PG_STATE_UNGATE);
> >>>>>>
> >>>>>> @@ -528,6 +527,7 @@ void amdgpu_vcn_ring_end_use(struct amdgpu_ring *ring)
> >>>>>>          atomic_dec(&ring->adev->vcn.inst[ring->me].dpg_enc_submission_cnt);
> >>>>>>
> >>>>>>          atomic_dec(&ring->adev->vcn.inst[ring->me].total_submission_cnt);
> >>>>>> +        atomic_dec(&ring->adev->vcn.total_submission_cnt);
> >>>>>>
> >>>>>>          schedule_delayed_work(&ring->adev->vcn.inst[ring->me].idle_work,
> >>>>>>                                VCN_IDLE_TIMEOUT);
> >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
> >>>>>> index b3fb1d0e43fc9..febc3ce8641ff 100644
> >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
> >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
> >>>>>> @@ -352,6 +352,7 @@ struct amdgpu_vcn {
> >>>>>>
> >>>>>>          uint16_t inst_mask;
> >>>>>>          uint8_t num_inst_per_aid;
> >>>>>> +        atomic_t total_submission_cnt;
> >>>>>>
> >>>>>>          /* IP reg dump */
> >>>>>>          uint32_t *ip_dump;
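For concreteness, the alternative Sathish describes above (only drop the
VIDEO profile once every instance has actually been power gated, all under
one global lock) would look roughly like the sketch below. This is an
illustration of the idea rather than the patch he mentions sending, and it
assumes the per-instance cur_state field reflects the current powergating
state:

/* Sketch: drop the workload profile only when all instances are gated. */
static void vcn_idle_profile_sketch(struct amdgpu_device *adev)
{
	bool all_gated = true;
	unsigned int i;
	int r;

	mutex_lock(&adev->vcn.workload_profile_mutex);	/* one global lock */
	for (i = 0; i < adev->vcn.num_vcn_inst; ++i) {
		if (adev->vcn.harvest_config & (1 << i))
			continue;
		if (adev->vcn.inst[i].cur_state != AMD_PG_STATE_GATE) {
			all_gated = false;
			break;
		}
	}
	if (all_gated && adev->vcn.workload_profile_active) {
		/* no instance can be decoding; safe to drop the profile */
		r = amdgpu_dpm_switch_power_profile(adev,
						    PP_SMC_POWER_PROFILE_VIDEO,
						    false);
		if (!r)
			adev->vcn.workload_profile_active = false;
	}
	mutex_unlock(&adev->vcn.workload_profile_mutex);
}

begin_use() would take the same global lock before ungating an instance and
re-enabling the profile, which is what closes the race being discussed.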