Re: [PATCH] drm/sched: fix the bug of time out calculation(v3)

Christian König Thu, 26 Aug 2021 05:37:49 -0700

Am 26.08.21 um 13:55 schrieb Liu, Monk:

[AMD Official Use Only]

I'm not sure if the work_tdr is initialized when a maximum timeout is 
specified. Please double check.

Ok, will do

BTW: Can we please drop the "tdr" naming from the scheduler? That is just a 
timeout functionality and not related to recovery in any way.

We even do not start hardware recovery in a lot of cases now (when wave kill is 
successfully).

Umm, sounds reasonable, I can rename it to "to" with another patch

Maybe more like job_timeout or timeout_work or something into thatdirection.


Christian.


Thanks

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------

-----Original Message-----
From: Christian König <ckoenig.leichtzumer...@gmail.com>
Sent: Thursday, August 26, 2021 6:09 PM
To: Liu, Monk <monk....@amd.com>; amd-gfx@lists.freedesktop.org
Cc: dri-de...@lists.freedesktop.org
Subject: Re: [PATCH] drm/sched: fix the bug of time out calculation(v3)

Am 26.08.21 um 06:55 schrieb Monk Liu:

issue:
in cleanup_job the cancle_delayed_work will cancel a TO timer even the
its corresponding job is still running.

Yeah, that makes a lot more sense.

fix:
do not cancel the timer in cleanup_job, instead do the cancelling only
when the heading job is signaled, and if there is a "next" job we
start_timeout again.

v2:
further cleanup the logic, and do the TDR timer cancelling if the
signaled job is the last one in its scheduler.

v3:
change the issue description
remove the cancel_delayed_work in the begining of the cleanup_job
recover the implement of drm_sched_job_begin.

TODO:
1)introduce pause/resume scheduler in job_timeout to serial the
handling of scheduler and job_timeout.
2)drop the bad job's del and insert in scheduler due to above
serialization (no race issue anymore with the serialization)

Signed-off-by: Monk Liu <monk....@amd.com>
---
   drivers/gpu/drm/scheduler/sched_main.c | 25 ++++++++++---------------
   1 file changed, 10 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c
b/drivers/gpu/drm/scheduler/sched_main.c
index a2a9536..ecf8140 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -676,13 +676,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
   {
        struct drm_sched_job *job, *next;

- /*

-        * Don't destroy jobs while the timeout worker is running  OR thread
-        * is being parked and hence assumed to not touch pending_list
-        */
-       if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
-           !cancel_delayed_work(&sched->work_tdr)) ||
-           kthread_should_park())
+       if (kthread_should_park())
                return NULL;

spin_lock(&sched->job_list_lock);

@@ -693,17 +687,21 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
        if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
                /* remove job from pending_list */
                list_del_init(&job->list);
+
+               /* cancel this job's TO timer */
+               cancel_delayed_work(&sched->work_tdr);

I'm not sure if the work_tdr is initialized when a maximum timeout is 
specified. Please double check.

BTW: Can we please drop the "tdr" naming from the scheduler? That is just a 
timeout functionality and not related to recovery in any way.

We even do not start hardware recovery in a lot of cases now (when wave kill is 
successfully).

Regards,
Christian.

                /* make the scheduled timestamp more accurate */
                next = list_first_entry_or_null(&sched->pending_list,
                                                typeof(*next), list);
-               if (next)
+
+               if (next) {
                        next->s_fence->scheduled.timestamp =
                                job->s_fence->finished.timestamp;
-
+                       /* start TO timer for next job */
+                       drm_sched_start_timeout(sched);
+               }
        } else {
                job = NULL;
-               /* queue timeout for next job */
-               drm_sched_start_timeout(sched);
        }

spin_unlock(&sched->job_list_lock);

@@ -791,11 +789,8 @@ static int drm_sched_main(void *param)
                                          (entity = 
drm_sched_select_entity(sched))) ||
                                         kthread_should_stop());

- if (cleanup_job) {

+               if (cleanup_job)
                        sched->ops->free_job(cleanup_job);
-                       /* queue timeout for next job */
-                       drm_sched_start_timeout(sched);
-               }

if (!entity)

                        continue;

Re: [PATCH] drm/sched: fix the bug of time out calculation(v3)

Reply via email to