> So you have a pipeline sync when you don't need one, and that is really, really 
> bad for things shared between processes, e.g. X/Wayland and its clients.

Oh, that may explain what I'm seeing here: my environment is a no-X-window system 
(a customer's cloud gaming use case), so I don't launch X at all, and there is 
only one process running there, which is 3dmark_vulkan.

> But my main question is why do you see any issues with quark? That is a 
> workaround for an issue in Vulkan sync handling and should only surface when 
> a specific test is run many, many times.

Quark is only used to hang the gfx ring; the missing explicit sync is from 
other processes (Vulkan CTS, vk_example, 3dmark). And I did make some changes that 
drop the unnecessary vm-flush/pipeline sync after GPU recovery, so that part is 
different from drm-next ... thanks for the reminder.

BTW: could we let the job remember the hw fence seq that it needs to sync up 
to? E.g. in "drm_sched_entity_clear_dep" we not only wake up the scheduler but 
also record the hw fence seq number in the job (keeping the biggest one seen), 
so that in the end, in amdgpu_ib_schedule(), we know exactly the last seq value 
we need to pipeline sync to and can WAIT_REG_MEM on just that value (so we 
would need to change the pipeline_sync routine for gfx to take the seq value as 
a parameter and use a ">=" compare instead of "==").
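
Roughly what I have in mind, as a sketch only (the "sync_seq" field and the 
seq-taking pipeline sync hook below don't exist today, the names are just for 
illustration, and the wraparound question for the ">=" compare would of course 
still need an answer):

/* Sketch: record in the job the highest hw fence seqno that its explicit
 * dependencies resolved to, e.g. from drm_sched_entity_clear_dep() when a
 * dependency signals (keeping the biggest value seen so far).
 */
job->sync_seq = max(job->sync_seq, dep_seq);

/* Sketch: in amdgpu_ib_schedule(), pipeline sync against exactly that seqno;
 * the gfx pipeline_sync routine would take the seqno as a parameter and emit
 * WAIT_REG_MEM with a ">=" compare instead of "==".
 */
if (job->sync_seq)
	ring->funcs->emit_pipeline_sync_seq(ring, job->sync_seq);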

/Monk
-----Original Message-----
From: Koenig, Christian 
Sent: Monday, November 5, 2018 3:48 PM
To: Liu, Monk <monk....@amd.com>; amd-gfx@lists.freedesktop.org; Zhou, 
David(ChunMing) <david1.z...@amd.com>
Subject: Re: [PATCH 2/3] drm/amdgpu: drop the sched_sync

On 05.11.18 at 08:24, Liu, Monk wrote:
>> David Zhou had a use case which saw a >10% performance drop the last time 
>> he tried it.
> I really don't believe that, because if you insert a WAIT_MEM on an already 
> signaled fence, it only costs the GPU a couple of clocks to move on, right? 
> There is no reason to slow down by up to 10% ... with the 3dmark Vulkan test, 
> the performance is barely different with my patch applied ...

Why do you think that we insert a WAIT_MEM on an already signaled fence? 
The pipeline sync always waits for the last fence value (because we can't 
handle wraparounds in PM4).
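
A standalone (user space, not kernel) illustration of the wraparound problem: 
the compares WAIT_REG_MEM offers work on the raw 32-bit value, so they cannot 
be made wraparound-safe the way a CPU-side check can, which is why we simply 
wait for the last emitted fence value rather than the exact value a job 
depends on:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* roughly what a WAIT_REG_MEM with the ">=" function does */
static bool pm4_ge(uint32_t mem, uint32_t ref)
{
	return mem >= ref;
}

/* a wraparound-safe compare, possible in software but not in PM4 */
static bool seq_signaled(uint32_t mem, uint32_t ref)
{
	return (int32_t)(mem - ref) >= 0;
}

int main(void)
{
	uint32_t ref = 0xfffffff0;	/* fence value we want to wait for */
	uint32_t mem = 0x00000010;	/* fence memory after a wraparound */

	printf("pm4_ge: %d\n", pm4_ge(mem, ref));		/* 0 -> would wait forever */
	printf("seq_signaled: %d\n", seq_signaled(mem, ref));	/* 1 -> already signaled */
	return 0;
}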

So you have a pipeline sync when you don't need one, and that is really, really 
bad for things shared between processes, e.g. X/Wayland and its clients.

I also expect that this doesn't affect 3dmark at all, but everything that 
runs in a window composited by X could be slowed down massively.

David, do you remember which use case was affected when you tried to drop this 
optimization?

>> When a reset happens we flush the VMIDs when re-submitting the jobs to the 
>> rings and while doing so we also always do a pipeline sync.
> I will check that point in my branch; I didn't use drm-next, so maybe 
> there is a gap in this part.

We have had that logic for a very long time now, but we recently simplified it. 
It could be that a bug was introduced in doing so.

Maybe we should add a specific flag to run_job() to note that we are re-running 
a job and then always add VM flushes/pipeline syncs?
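
Purely as a sketch of what I mean (the extra parameter doesn't exist today and 
the names are only illustrative):

/* hypothetical: the scheduler tells the driver when a job is being
 * re-submitted after a reset ...
 */
struct dma_fence *(*run_job)(struct drm_sched_job *sched_job,
			     bool resubmit);

/* ... and amdgpu_job_run() could then force the VM flush, which brings the
 * pipeline sync with it, e.g. something like:
 */
if (resubmit)
	job->vm_needs_flush = true;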

But my main question is why do you see any issues with quark? That is a 
workaround for an issue in Vulkan sync handling and should only surface when a 
specific test is run many, many times.

Regards,
Christian.

>
> /Monk
> -----Original Message-----
> From: Koenig, Christian
> Sent: Monday, November 5, 2018 3:02 AM
> To: Liu, Monk <monk....@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 2/3] drm/amdgpu: drop the sched_sync
>
>> Can you tell me which game/benchmark will see a performance drop with this 
>> fix, in your understanding?
> When you sync between submissions, things like composing X windows are slowed 
> down massively.
>
> David Zhou had a use case which saw a >10% performance drop the last time he 
> tried it.
>
>> The problem I hit is during the massive stress test against 
>> multi-process + quark: if the quark process hangs the engine while there are 
>> another two jobs following the bad job, then after the TDR these two jobs 
>> lose their explicit dependency and the pipeline sync is also lost.
> Well that is really strange. This workaround is only for a very specific 
> Vulkan CTS test which we are still not 100% sure is actually valid.
>
> When a reset happens we flush the VMIDs when re-submitting the jobs to the 
> rings and while doing so we also always do a pipeline sync.
>
> So you should never ever run into any issues in quark with that, even when we 
> completely disable this workaround.
>
> Regards,
> Christian.
>
> On 04.11.18 at 01:48, Liu, Monk wrote:
>>> NAK, that would result in a severe performance drop.
>>> We need the fence here to determine if we actually need to do the pipeline 
>>> sync or not.
>>> E.g. the explicit requested fence could already be signaled.
>> For the performance issue: only inserting a WAIT_REG_MEM on the GFX/compute 
>> ring *doesn't* give a "severe" drop (it's minimal, in fact) ... At least I 
>> didn't observe any performance drop with the 3dmark benchmark (I also tested 
>> the Vulkan CTS). Can you tell me which game/benchmark will see a performance 
>> drop with this fix, in your understanding? Let me check it.
>>
>> The problem I hit is during the massive stress test against 
>> multi-process + quark: if the quark process hangs the engine while there are 
>> another two jobs following the bad job, then after the TDR these two jobs 
>> lose their explicit dependency and the pipeline sync is also lost.
>>
>>
>> BTW: with the original logic, the pipeline sync has another corner case.
>> Assume JobC depends on JobA with the explicit flag, and there is a JobB 
>> inserted in the ring:
>>
>> JobA -> JobB -> (pipe sync) JobC
>>
>> If JobA takes a really long time to finish, then in the
>> amdgpu_ib_schedule() stage you will insert a pipeline sync for JobC against 
>> its explicit dependency, which is JobA; but there is a JobB between A and C, 
>> and the pipeline sync before JobC will wrongly wait on JobB as well ...
>>
>> While it is not a big issue, it is obviously not necessary: C has no 
>> relation to B.
>>
>> /Monk
>>
>>
>>
>> -----Original Message-----
>> From: Christian König <ckoenig.leichtzumer...@gmail.com>
>> Sent: Sunday, November 4, 2018 3:50 AM
>> To: Liu, Monk <monk....@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 2/3] drm/amdgpu: drop the sched_sync
>>
>> On 03.11.18 at 06:33, Monk Liu wrote:
>>> Reasons to drop it:
>>>
>>> 1) It simplifies the code: just introducing the field member 
>>> "need_pipe_sync" in the job is good enough to tell if the explicit 
>>> dependency fence needs to be followed by a pipeline sync.
>>>
>>> 2) After GPU recovery the explicit fence from sched_sync will not come 
>>> back, so the required pipeline_sync following it is missed. Consider the 
>>> scenario below:
>>>> now on ring buffer:
>>> Job-A -> pipe_sync -> Job-B
>>>> TDR occurred on Job-A, and after GPU recovery:
>>>> now on ring buffer:
>>> Job-A -> Job-B
>>>
>>> Because the fence from sched_sync is used and freed after the first 
>>> ib_schedule, it will never come back; with this patch the issue is 
>>> avoided.
>> NAK, that would result in a severe performance drop.
>>
>> We need the fence here to determine if we actually need to do the pipeline 
>> sync or not.
>>
>> E.g. the explicit requested fence could already be signaled.
>>
>> Christian.
>>
>>> Signed-off-by: Monk Liu <monk....@amd.com>
>>> ---
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c  | 16 ++++++----------
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 14 +++-----------
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_job.h |  3 +--
>>>     3 files changed, 10 insertions(+), 23 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
>>> index c48207b3..ac7d2da 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
>>> @@ -122,7 +122,6 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, 
>>> unsigned num_ibs,
>>>     {
>>>             struct amdgpu_device *adev = ring->adev;
>>>             struct amdgpu_ib *ib = &ibs[0];
>>> -   struct dma_fence *tmp = NULL;
>>>             bool skip_preamble, need_ctx_switch;
>>>             unsigned patch_offset = ~0;
>>>             struct amdgpu_vm *vm;
>>> @@ -166,16 +165,13 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, 
>>> unsigned num_ibs,
>>>             }
>>>     
>>>             need_ctx_switch = ring->current_ctx != fence_ctx;
>>> -   if (ring->funcs->emit_pipeline_sync && job &&
>>> -       ((tmp = amdgpu_sync_get_fence(&job->sched_sync, NULL)) ||
>>> -        (amdgpu_sriov_vf(adev) && need_ctx_switch) ||
>>> -        amdgpu_vm_need_pipeline_sync(ring, job))) {
>>> -           need_pipe_sync = true;
>>>     
>>> -           if (tmp)
>>> -                   trace_amdgpu_ib_pipe_sync(job, tmp);
>>> -
>>> -           dma_fence_put(tmp);
>>> +   if (ring->funcs->emit_pipeline_sync && job) {
>>> +           if ((need_ctx_switch && amdgpu_sriov_vf(adev)) ||
>>> +                   amdgpu_vm_need_pipeline_sync(ring, job))
>>> +                   need_pipe_sync = true;
>>> +           else if (job->need_pipe_sync)
>>> +                   need_pipe_sync = true;
>>>             }
>>>     
>>>             if (ring->funcs->insert_start)
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> index 1d71f8c..dae997d 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> @@ -71,7 +71,6 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned 
>>> num_ibs,
>>>             (*job)->num_ibs = num_ibs;
>>>     
>>>             amdgpu_sync_create(&(*job)->sync);
>>> -   amdgpu_sync_create(&(*job)->sched_sync);
>>>             (*job)->vram_lost_counter = 
>>> atomic_read(&adev->vram_lost_counter);
>>>             (*job)->vm_pd_addr = AMDGPU_BO_INVALID_OFFSET;
>>>     
>>> @@ -117,7 +116,6 @@ static void amdgpu_job_free_cb(struct drm_sched_job 
>>> *s_job)
>>>             amdgpu_ring_priority_put(ring, s_job->s_priority);
>>>             dma_fence_put(job->fence);
>>>             amdgpu_sync_free(&job->sync);
>>> -   amdgpu_sync_free(&job->sched_sync);
>>>             kfree(job);
>>>     }
>>>     
>>> @@ -127,7 +125,6 @@ void amdgpu_job_free(struct amdgpu_job *job)
>>>     
>>>             dma_fence_put(job->fence);
>>>             amdgpu_sync_free(&job->sync);
>>> -   amdgpu_sync_free(&job->sched_sync);
>>>             kfree(job);
>>>     }
>>>     
>>> @@ -182,14 +179,9 @@ static struct dma_fence *amdgpu_job_dependency(struct 
>>> drm_sched_job *sched_job,
>>>             bool need_pipe_sync = false;
>>>             int r;
>>>     
>>> -   fence = amdgpu_sync_get_fence(&job->sync, &need_pipe_sync);
>>> -   if (fence && need_pipe_sync) {
>>> -           if (drm_sched_dependency_optimized(fence, s_entity)) {
>>> -                   r = amdgpu_sync_fence(ring->adev, &job->sched_sync,
>>> -                                         fence, false);
>>> -                   if (r)
>>> -                           DRM_ERROR("Error adding fence (%d)\n", r);
>>> -           }
>>> +   if (fence && need_pipe_sync && drm_sched_dependency_optimized(fence, 
>>> s_entity)) {
>>> +           trace_amdgpu_ib_pipe_sync(job, fence);
>>> +           job->need_pipe_sync = true;
>>>             }
>>>     
>>>             while (fence == NULL && vm && !job->vmid) { diff --git 
>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>> index e1b46a6..c1d00f0 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>> @@ -41,7 +41,6 @@ struct amdgpu_job {
>>>             struct drm_sched_job    base;
>>>             struct amdgpu_vm        *vm;
>>>             struct amdgpu_sync      sync;
>>> -   struct amdgpu_sync      sched_sync;
>>>             struct amdgpu_ib        *ibs;
>>>             struct dma_fence        *fence; /* the hw fence */
>>>             uint32_t                preamble_status;
>>> @@ -59,7 +58,7 @@ struct amdgpu_job {
>>>             /* user fence handling */
>>>             uint64_t                uf_addr;
>>>             uint64_t                uf_sequence;
>>> -
>>> +   bool            need_pipe_sync; /* require a pipeline sync for this job 
>>> */
>>>     };
>>>     
>>>     int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned 
>>> num_ibs,
