On Wed, Aug 13, 2025 at 7:06 PM Wu, David <david...@amd.com> wrote:
>
> On 8/13/2025 6:11 PM, Alex Deucher wrote:
> > On Wed, Aug 13, 2025 at 5:47 PM Wu, David <david...@amd.com> wrote:
> >> On 8/13/2025 5:03 PM, Alex Deucher wrote:
> >>> On Wed, Aug 13, 2025 at 4:58 PM Sundararaju, Sathishkumar
> >>> <sathishkumar.sundarar...@amd.com> wrote:
> >>>> On 8/14/2025 1:35 AM, Alex Deucher wrote:
> >>>>> On Wed, Aug 13, 2025 at 2:23 PM Sundararaju, Sathishkumar
> >>>>> <sathishkumar.sundarar...@amd.com> wrote:
> >>>>>> Hi Alex, Hi David,
> >>>>>>
> >>>>>> I see David's concern, but his suggestion won't solve the problem
> >>>>>> either, nor does the current form. The reason:
> >>>>>>
> >>>>>> The emitted fence count and total submission count are fast transients
> >>>>>> which frequently become 0 between video decodes (between jobs). Even
> >>>>>> with the atomics and locks there can be a switch of the video power
> >>>>>> profile. In the current form of the patch that window is minimized,
> >>>>>> but it can still happen if stress tested. But the power state of any
> >>>>>> instance becoming zero
> >>>>> Can you explain how this can happen?  I'm not seeing it.
> >>>> Consider this situation: inst0 and inst1 are actively decoding, inst0's
> >>>> decode completes, and the delayed idle work starts.
> >>>> inst0's idle handler can read 0 total fences and 0 total submission
> >>>> count even if inst1 is actively decoding, because between jobs:
> >>>>    - begin_use increments vcn.total_submission_cnt and end_use
> >>>>      decrements it, so it can be 0.
> >>>>    - if the outstanding fences have signaled and no new fence has been
> >>>>      emitted yet, the fence count can be 0.
> >>>>    - neither of the above means video decode is complete on inst1;
> >>>>      it is actively decoding.
> >>> How can there be active decoding without an outstanding fence?  In
> >>> that case, total_fences (fences from both instances) would be non-0.
> >> I think it should be non-0.
> >> I do see a possible hiccup, i.e. the power switching from ON to OFF and
> >> then back ON in the middle of decoding: inst0's idle handler turns it
> >> off and then inst1 turns it on.
> > How would that happen?  As long as the submission cnt is non-0 and
> > there are outstanding fences on any instance, the video profile will
> > stay active.
> there could be no jobs but it hasn't timed out yet, and new jobs will
> come in within a few ms - note all fences are done at this time. The
> idle handler sees no fences and no jobs so it turns off the power - but
> just ms later a new job is submitted from the same decode session, which
> could be mpv player as it does not need to submit jobs without delays.
> This will turn on the power.
I'm not following.  Every submission will start with begin_use().
(Rough sketch of the ordering I have in mind at the bottom of this mail.)

Alex

> David
> > Alex
> >
> >> We should avoid this glitch. This requires the idle handler to set/clear
> >> a "done" flag for its instance, as in Sathish's original patch. When all
> >> instances have set/cleared the flag, then we can safely power off.
> >> David
> >>> Alex
> >>>
> >>>> Whereas if the instances are powered off we are sure the idle time has
> >>>> passed and they are powered off, with no possible way of an active
> >>>> video decode. When all instances are off we can safely assume there is
> >>>> no active decode, and the global lock protects it against a new
> >>>> begin_use on any instance. The only distant concern is global common
> >>>> locks w.r.t. perf, but we already have a global workprofile mutex, so
> >>>> there shouldn't be any drop in perf with just one single global lock
> >>>> for all instances.
> >>>>
> >>>> Just sending out a patch with this fix; will leave it to you to decide
> >>>> the right method. If you think the outstanding total fences can never
> >>>> be 0 during decode, then your previous version (v3) itself is good and
> >>>> there is no real benefit in splitting the handlers as such.
> >>>>
> >>>> Regards,
> >>>> Sathish
> >>>>> If it is possible, maybe it would be easier to just split the profile
> >>>>> and powergating into separate handlers.  The profile one would be
> >>>>> global and the powergating one would be per instance.  See the
> >>>>> attached patches.
> >>>>>
> >>>>> Alex
> >>>>>
> >>>>>> can be a sure shot indication of a break in a video decode. The
> >>>>>> mistake in my patch was using a per instance mutex; I should have
> >>>>>> used a common global mutex, then that covers the situation David is
> >>>>>> trying to bring out.
> >>>>>>
> >>>>>> Using one global vcn.pg_lock for idle and begin_use and using flags
> >>>>>> to track power state could help us totally avoid this situation.
> >>>>>>
> >>>>>> Regards,
> >>>>>>
> >>>>>> Sathish
> >>>>>>
> >>>>>> On 8/13/2025 11:46 PM, Wu, David wrote:
> >>>>>>> On 8/13/2025 12:51 PM, Alex Deucher wrote:
> >>>>>>>> On Wed, Aug 13, 2025 at 12:39 PM Wu, David <david...@amd.com> wrote:
> >>>>>>>>> Hi Alex,
> >>>>>>>>>
> >>>>>>>>> The addition of total_submission_cnt should work - in that
> >>>>>>>>> it is unlikely to have a context switch right after the begin_use().
> >>>>>>>>> The suggestion of moving it inside the lock (which I prefer in case
> >>>>>>>>> someone adds more before the lock and it is not reviewed thoroughly)
> >>>>>>>>> - up to you to decide.
> >>>>>>>>>
> >>>>>>>>> Reviewed-by: David (Ming Qiang) Wu <david....@amd.com>
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> David
> >>>>>>>>> On 8/13/2025 9:45 AM, Alex Deucher wrote:
> >>>>>>>>>> If there are multiple instances of the VCN running,
> >>>>>>>>>> we may end up switching the video profile while another
> >>>>>>>>>> instance is active because we only take into account
> >>>>>>>>>> the current instance's submissions.  Look at all
> >>>>>>>>>> outstanding fences for the video profile.
> >>>>>>>>>>
> >>>>>>>>>> v2: drop early exit in begin_use()
> >>>>>>>>>> v3: handle possible race between begin_use() work handler
> >>>>>>>>>>
> >>>>>>>>>> Fixes: 3b669df92c85 ("drm/amdgpu/vcn: adjust workload profile handling")
> >>>>>>>>>> Reviewed-by: Sathishkumar S <sathishkumar.sundarar...@amd.com> (v1)
> >>>>>>>>>> Signed-off-by: Alex Deucher <alexander.deuc...@amd.com>
> >>>>>>>>>> ---
> >>>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 40 ++++++++++++-------------
> >>>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h |  1 +
> >>>>>>>>>>  2 files changed, 21 insertions(+), 20 deletions(-)
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> >>>>>>>>>> index 9a76e11d1c184..593c1ddf8819b 100644
> >>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> >>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> >>>>>>>>>> @@ -415,19 +415,25 @@ static void amdgpu_vcn_idle_work_handler(struct work_struct *work)
> >>>>>>>>>>      struct amdgpu_vcn_inst *vcn_inst =
> >>>>>>>>>>              container_of(work, struct amdgpu_vcn_inst, idle_work.work);
> >>>>>>>>>>      struct amdgpu_device *adev = vcn_inst->adev;
> >>>>>>>>>> -    unsigned int fences = 0, fence[AMDGPU_MAX_VCN_INSTANCES] = {0};
> >>>>>>>>>> -    unsigned int i = vcn_inst->inst, j;
> >>>>>>>>>> +    unsigned int total_fences = 0, fence[AMDGPU_MAX_VCN_INSTANCES] = {0};
> >>>>>>>>>> +    unsigned int i, j;
> >>>>>>>>>>      int r = 0;
> >>>>>>>>>>
> >>>>>>>>>> -    if (adev->vcn.harvest_config & (1 << i))
> >>>>>>>>>> +    if (adev->vcn.harvest_config & (1 << vcn_inst->inst))
> >>>>>>>>>>              return;
> >>>>>>>>>>
> >>>>>>>>>> -    for (j = 0; j < adev->vcn.inst[i].num_enc_rings; ++j)
> >>>>>>>>>> -            fence[i] += amdgpu_fence_count_emitted(&vcn_inst->ring_enc[j]);
> >>>>>>>>>> +    for (i = 0; i < adev->vcn.num_vcn_inst; ++i) {
> >>>>>>>>>> +            struct amdgpu_vcn_inst *v = &adev->vcn.inst[i];
> >>>>>>>>>> +
> >>>>>>>>>> +            for (j = 0; j < v->num_enc_rings; ++j)
> >>>>>>>>>> +                    fence[i] += amdgpu_fence_count_emitted(&v->ring_enc[j]);
> >>>>>>>>>> +            fence[i] += amdgpu_fence_count_emitted(&v->ring_dec);
> >>>>>>>>>> +            total_fences += fence[i];
> >>>>>>>>>> +    }
> >>>>>>>>>>
> >>>>>>>>>>      /* Only set DPG pause for VCN3 or below, VCN4 and above will be handled by FW */
> >>>>>>>>>>      if (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG &&
> >>>>>>>>>> -        !adev->vcn.inst[i].using_unified_queue) {
> >>>>>>>>>> +        !vcn_inst->using_unified_queue) {
> >>>>>>>>>>              struct dpg_pause_state new_state;
> >>>>>>>>>>
> >>>>>>>>>>              if (fence[i] ||
> >>>>>>>>>> @@ -436,18 +442,18 @@ static void amdgpu_vcn_idle_work_handler(struct work_struct *work)
> >>>>>>>>>>              else
> >>>>>>>>>>                      new_state.fw_based = VCN_DPG_STATE__UNPAUSE;
> >>>>>>>>>>
> >>>>>>>>>> -            adev->vcn.inst[i].pause_dpg_mode(vcn_inst, &new_state);
> >>>>>>>>>> +            vcn_inst->pause_dpg_mode(vcn_inst, &new_state);
> >>>>>>>>>>      }
> >>>>>>>>>>
> >>>>>>>>>> -    fence[i] += amdgpu_fence_count_emitted(&vcn_inst->ring_dec);
> >>>>>>>>>> -    fences += fence[i];
> >>>>>>>>>> -
> >>>>>>>>>> -    if (!fences && !atomic_read(&vcn_inst->total_submission_cnt)) {
> >>>>>>>>>> +    if (!fence[vcn_inst->inst] && !atomic_read(&vcn_inst->total_submission_cnt)) {
> >>>>>>>>>> +            /* This is specific to this instance */
> >>>>>>>>>>              mutex_lock(&vcn_inst->vcn_pg_lock);
> >>>>>>>>>>              vcn_inst->set_pg_state(vcn_inst, AMD_PG_STATE_GATE);
> >>>>>>>>>>              mutex_unlock(&vcn_inst->vcn_pg_lock);
> >>>>>>>>>>              mutex_lock(&adev->vcn.workload_profile_mutex);
> >>>>>>>>>> -            if (adev->vcn.workload_profile_active) {
> >>>>>>>>>> +            /* This is global and depends on all VCN instances */
> >>>>>>>>>> +            if (adev->vcn.workload_profile_active && !total_fences &&
> >>>>>>>>>> +                !atomic_read(&adev->vcn.total_submission_cnt)) {
> >>>>>>>>>>                      r = amdgpu_dpm_switch_power_profile(adev, PP_SMC_POWER_PROFILE_VIDEO,
> >>>>>>>>>>                                                          false);
> >>>>>>>>>>                      if (r)
> >>>>>>>>>> @@ -467,16 +473,10 @@ void amdgpu_vcn_ring_begin_use(struct amdgpu_ring *ring)
> >>>>>>>>>>      int r = 0;
> >>>>>>>>>>
> >>>>>>>>>>      atomic_inc(&vcn_inst->total_submission_cnt);
> >>>>>>>>>> +    atomic_inc(&adev->vcn.total_submission_cnt);
> >>>>>>>>> move this addition down inside the mutex lock
> >>>>>>>>>>      cancel_delayed_work_sync(&vcn_inst->idle_work);
> >>>>>>>>>>
> >>>>>>>>>> -    /* We can safely return early here because we've cancelled the
> >>>>>>>>>> -     * the delayed work so there is no one else to set it to false
> >>>>>>>>>> -     * and we don't care if someone else sets it to true.
> >>>>>>>>>> -     */
> >>>>>>>>>> -    if (adev->vcn.workload_profile_active)
> >>>>>>>>>> -            goto pg_lock;
> >>>>>>>>>> -
> >>>>>>>>>>      mutex_lock(&adev->vcn.workload_profile_mutex);
> >>>>>>>>> move to here:
> >>>>>>>>>       atomic_inc(&adev->vcn.total_submission_cnt);
> >>>>>>>>> I think this should work for multiple instances.
> >>>>>>>> Why does this need to be protected by the mutex?
> >>>>>>> hmm.. OK - no need and it is actually better before the mutex.
> >>>>>>> David
> >>>>>>>> Alex
> >>>>>>>>
> >>>>>>>>> David
> >>>>>>>>>>      if (!adev->vcn.workload_profile_active) {
> >>>>>>>>>>              r = amdgpu_dpm_switch_power_profile(adev, PP_SMC_POWER_PROFILE_VIDEO,
> >>>>>>>>>> @@ -487,7 +487,6 @@ void amdgpu_vcn_ring_begin_use(struct amdgpu_ring *ring)
> >>>>>>>>>>      }
> >>>>>>>>>>      mutex_unlock(&adev->vcn.workload_profile_mutex);
> >>>>>>>>>>
> >>>>>>>>>> -pg_lock:
> >>>>>>>>>>      mutex_lock(&vcn_inst->vcn_pg_lock);
> >>>>>>>>>>      vcn_inst->set_pg_state(vcn_inst, AMD_PG_STATE_UNGATE);
> >>>>>>>>>>
> >>>>>>>>>> @@ -528,6 +527,7 @@ void amdgpu_vcn_ring_end_use(struct amdgpu_ring *ring)
> >>>>>>>>>>              atomic_dec(&ring->adev->vcn.inst[ring->me].dpg_enc_submission_cnt);
> >>>>>>>>>>
> >>>>>>>>>>      atomic_dec(&ring->adev->vcn.inst[ring->me].total_submission_cnt);
> >>>>>>>>>> +    atomic_dec(&ring->adev->vcn.total_submission_cnt);
> >>>>>>>>>>
> >>>>>>>>>>      schedule_delayed_work(&ring->adev->vcn.inst[ring->me].idle_work,
> >>>>>>>>>>                            VCN_IDLE_TIMEOUT);
> >>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
> >>>>>>>>>> index b3fb1d0e43fc9..febc3ce8641ff 100644
> >>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
> >>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
> >>>>>>>>>> @@ -352,6 +352,7 @@ struct amdgpu_vcn {
> >>>>>>>>>>
> >>>>>>>>>>      uint16_t inst_mask;
> >>>>>>>>>>      uint8_t num_inst_per_aid;
> >>>>>>>>>> +    atomic_t total_submission_cnt;
> >>>>>>>>>>
> >>>>>>>>>>      /* IP reg dump */
> >>>>>>>>>>      uint32_t *ip_dump;
>
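
To spell out the ordering I'm relying on in v3, here is a condensed sketch
of the two paths. It uses the same function and field names as the patch
above, but error handling, the fence counting loop, and the DPG pause path
are trimmed, so read it as pseudo-code of the intent rather than the
literal diff:

  void amdgpu_vcn_ring_begin_use(struct amdgpu_ring *ring)
  {
          /* The global count goes up before the pending idle work is
           * cancelled, so any idle handler that runs after this point sees
           * a non-zero count and leaves the video profile alone; a handler
           * already in flight is waited for by cancel_delayed_work_sync().
           */
          atomic_inc(&vcn_inst->total_submission_cnt);
          atomic_inc(&adev->vcn.total_submission_cnt);
          cancel_delayed_work_sync(&vcn_inst->idle_work);

          mutex_lock(&adev->vcn.workload_profile_mutex);
          if (!adev->vcn.workload_profile_active) {
                  amdgpu_dpm_switch_power_profile(adev, PP_SMC_POWER_PROFILE_VIDEO, true);
                  adev->vcn.workload_profile_active = true;
          }
          mutex_unlock(&adev->vcn.workload_profile_mutex);

          /* per-instance power up stays under the per-instance pg lock */
          mutex_lock(&vcn_inst->vcn_pg_lock);
          vcn_inst->set_pg_state(vcn_inst, AMD_PG_STATE_UNGATE);
          mutex_unlock(&vcn_inst->vcn_pg_lock);
  }

  static void amdgpu_vcn_idle_work_handler(struct work_struct *work)
  {
          /* fence[] and total_fences are counted across all instances,
           * exactly as in the patch above.
           */

          /* per-instance decision: gate only this instance */
          if (!fence[vcn_inst->inst] && !atomic_read(&vcn_inst->total_submission_cnt))
                  vcn_inst->set_pg_state(vcn_inst, AMD_PG_STATE_GATE);

          /* global decision: drop the profile only when no instance has
           * outstanding fences or submissions in flight
           */
          mutex_lock(&adev->vcn.workload_profile_mutex);
          if (adev->vcn.workload_profile_active && !total_fences &&
              !atomic_read(&adev->vcn.total_submission_cnt)) {
                  amdgpu_dpm_switch_power_profile(adev, PP_SMC_POWER_PROFILE_VIDEO, false);
                  adev->vcn.workload_profile_active = false;
          }
          mutex_unlock(&adev->vcn.workload_profile_mutex);
  }

In that sketch the only way the profile gets dropped while a session is
still alive is the between-jobs gap you describe (all fences signaled and
the counts at zero when the timeout fires), and the next begin_use()
switches it back on.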