sched: serialize job_timeout and scheduler

Andrey Grodzovsky Tue, 31 Aug 2021 06:53:48 -0700

It says patch [2/2] but I can't find patch 1.

On 2021-08-31 6:35 a.m., Monk Liu wrote:

tested-by: jingwen chen <[email protected]>
Signed-off-by: Monk Liu <[email protected]>
Signed-off-by: jingwen chen <[email protected]>
---
  drivers/gpu/drm/scheduler/sched_main.c | 24 ++++--------------------
  1 file changed, 4 insertions(+), 20 deletions(-)


diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index ecf8140..894fdb24 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -319,19 +319,17 @@ static void drm_sched_job_timedout(struct work_struct *work)
        sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);

        /* Protects against concurrent deletion in drm_sched_get_cleanup_job */

+       if (!__kthread_should_park(sched->thread))
+               kthread_park(sched->thread);
+

As mentioned before, without serializing against other TDR handlers from other schedulers you just race here against them, e.g. you parked it now, but another one in progress will unpark it as part of calling drm_sched_start for other rings [1].

Unless I am missing something, since I haven't found patch [1/2].

[1] - https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L5041
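
To make the interleaving concrete, here is a minimal user-space sketch of the race described above. It is a toy model, not kernel code: `toy_sched`, `toy_device`, `toy_start_all` and the `reset_locked` flag are all hypothetical names, and the "lock" is a plain flag stepped through by hand rather than a real mutex. It assumes only what the comment states: one TDR handler parks its scheduler while another handler, already in progress, restarts every ring.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NUM_RINGS 2

/* Hypothetical stand-in for a scheduler: only tracks the parked state. */
struct toy_sched {
    bool parked;
};

/* Hypothetical device with several rings and one device-wide reset lock. */
struct toy_device {
    struct toy_sched rings[TOY_NUM_RINGS];
    bool reset_locked;
};

static void toy_park(struct toy_sched *s)
{
    s->parked = true;
}

/* Models the end of a reset: every ring is restarted (unparked). */
static void toy_start_all(struct toy_device *dev)
{
    for (size_t i = 0; i < TOY_NUM_RINGS; i++)
        dev->rings[i].parked = false;
}

static bool toy_reset_trylock(struct toy_device *dev)
{
    if (dev->reset_locked)
        return false;
    dev->reset_locked = true;
    return true;
}

static void toy_reset_unlock(struct toy_device *dev)
{
    assert(dev->reset_locked);
    dev->reset_locked = false;
}

/* No serialization: returns whether ring 0 is still parked for handler A. */
bool toy_unserialized(void)
{
    struct toy_device dev = {0};

    toy_park(&dev.rings[0]);     /* handler A parks its own scheduler */
    toy_start_all(&dev);         /* handler B, already running, restarts all rings */
    return dev.rings[0].parked;  /* handler A continues, assuming it is parked */
}

/* Same steps, but both handlers must hold the device-wide lock first. */
bool toy_serialized(void)
{
    struct toy_device dev = {0};
    bool b_has_lock = toy_reset_trylock(&dev);  /* handler B got there first */
    bool a_has_lock = toy_reset_trylock(&dev);  /* handler A has to wait */

    if (a_has_lock)
        toy_park(&dev.rings[0]);
    toy_start_all(&dev);                        /* handler B finishes its reset */
    if (b_has_lock)
        toy_reset_unlock(&dev);

    a_has_lock = toy_reset_trylock(&dev);       /* handler A retries */
    if (a_has_lock)
        toy_park(&dev.rings[0]);
    return a_has_lock && dev.rings[0].parked;
}
```

In this model `toy_unserialized()` ends with ring 0 unparked even though handler A parked it, while `toy_serialized()` keeps it parked because handler A cannot park until handler B has finished.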


Andrey

        spin_lock(&sched->job_list_lock);
        job = list_first_entry_or_null(&sched->pending_list,
                                       struct drm_sched_job, list);

        if (job) {

-               /*
-                * Remove the bad job so it cannot be freed by concurrent
-                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
-                * is parked at which point it's safe.
-                */
-               list_del_init(&job->list);
                spin_unlock(&sched->job_list_lock);

+               /* vendor's timeout_job should call drm_sched_start() */

                status = job->sched->ops->timedout_job(job);

                /*

@@ -393,20 +391,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
        kthread_park(sched->thread);

        /*

-        * Reinsert back the bad job here - now it's safe as
-        * drm_sched_get_cleanup_job cannot race against us and release the
-        * bad job at this point - we parked (waited for) any in progress
-        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
-        * now until the scheduler thread is unparked.
-        */
-       if (bad && bad->sched == sched)
-               /*
-                * Add at the head of the queue to reflect it was the earliest
-                * job extracted.
-                */
-               list_add(&bad->list, &sched->pending_list);
-
-       /*
         * Iterate the job list from later to  earlier one and either deactive
         * their HW callbacks or remove them from pending list if they already
         * signaled.

Re: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
