[PATCH] drm/sched: fix the bug of time out calculation(v3)

Monk Liu Wed, 25 Aug 2021 22:28:18 -0700

issue:
in cleanup_job the cancle_delayed_work will cancel a TO timer
even the its corresponding job is still running.


fix:
do not cancel the timer in cleanup_job, instead do the cancelling
only when the heading job is signaled, and if there is a "next" job
we start_timeout again.

v2:
further cleanup the logic, and do the TDR timer cancelling if the signaled job
is the last one in its scheduler.

v3:
change the issue description
remove the cancel_delayed_work in the begining of the cleanup_job
recover the implement of drm_sched_job_begin.

TODO:
1)introduce pause/resume scheduler in job_timeout to serial the handling
of scheduler and job_timeout.
2)drop the bad job's del and insert in scheduler due to above serialization
(no race issue anymore with the serialization)

Signed-off-by: Monk Liu <monk....@amd.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 25 ++++++++++---------------
 1 file changed, 10 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c
index a2a9536..ecf8140 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -676,13 +676,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
 {
        struct drm_sched_job *job, *next;
 
-       /*
-        * Don't destroy jobs while the timeout worker is running  OR thread
-        * is being parked and hence assumed to not touch pending_list
-        */
-       if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
-           !cancel_delayed_work(&sched->work_tdr)) ||
-           kthread_should_park())
+       if (kthread_should_park())
                return NULL;
 
        spin_lock(&sched->job_list_lock);
@@ -693,17 +687,21 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
        if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
                /* remove job from pending_list */
                list_del_init(&job->list);
+
+               /* cancel this job's TO timer */
+               cancel_delayed_work(&sched->work_tdr);
                /* make the scheduled timestamp more accurate */
                next = list_first_entry_or_null(&sched->pending_list,
                                                typeof(*next), list);
-               if (next)
+
+               if (next) {
                        next->s_fence->scheduled.timestamp =
                                job->s_fence->finished.timestamp;
-
+                       /* start TO timer for next job */
+                       drm_sched_start_timeout(sched);
+               }
        } else {
                job = NULL;
-               /* queue timeout for next job */
-               drm_sched_start_timeout(sched);
        }
 
        spin_unlock(&sched->job_list_lock);
@@ -791,11 +789,8 @@ static int drm_sched_main(void *param)
                                          (entity = 
drm_sched_select_entity(sched))) ||
                                         kthread_should_stop());
 
-               if (cleanup_job) {
+               if (cleanup_job)
                        sched->ops->free_job(cleanup_job);
-                       /* queue timeout for next job */
-                       drm_sched_start_timeout(sched);
-               }
 
                if (!entity)
                        continue;
-- 
2.7.4

[PATCH] drm/sched: fix the bug of time out calculation(v3)

Reply via email to