sched: Prevent any job recoveries after device is unplugged.

Christian König Tue, 24 Nov 2020 09:40:30 -0800

On 24.11.20 at 18:11, Luben Tuikov wrote:

On 2020-11-24 2:50 a.m., Christian König wrote:

On 24.11.20 at 02:12, Luben Tuikov wrote:

On 2020-11-23 3:06 a.m., Christian König wrote:

On 23.11.20 at 06:37, Andrey Grodzovsky wrote:

On 11/22/20 6:57 AM, Christian König wrote:

On 21.11.20 at 06:21, Andrey Grodzovsky wrote:

No point to try recovery if device is gone, it's meaningless.

I think that this should go into the device specific recovery
function and not in the scheduler.

The timeout timer is rearmed here, so this prevents any new recovery
work from restarting from here after drm_dev_unplug was executed from
amdgpu_pci_remove. It will not cover other places like job cleanup or
starting a new job, but those should stop once the scheduler thread is
stopped later.
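
For readers following the argument, here is a minimal user-space sketch of the guard described above. It is only a model: every name in it (model_device, model_sched, model_dev_enter, model_dev_exit, model_job_timedout) is hypothetical, and the real drm_dev_enter()/drm_dev_exit() pair also takes an index argument that is omitted here. The point it illustrates is that returning early when the device is gone skips both the recovery and the timer rearm.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct drm_device: only tracks unplug state. */
struct model_device {
    bool unplugged;
};

/* Hypothetical stand-in for the scheduler; mirrors the new ddev pointer. */
struct model_sched {
    struct model_device *ddev;
    int recoveries;   /* how many times recovery actually ran */
    bool timer_armed; /* whether the timeout timer is currently armed */
};

/* Models drm_dev_enter(): returns false once the device is unplugged. */
static bool model_dev_enter(const struct model_device *dev)
{
    return !dev->unplugged;
}

/* Models drm_dev_exit(): nothing to release in this simplified model. */
static void model_dev_exit(const struct model_device *dev)
{
    (void)dev;
}

/* Models the patched timeout handler: bail out before any recovery work. */
static void model_job_timedout(struct model_sched *sched)
{
    assert(sched != NULL && sched->ddev != NULL);

    sched->timer_armed = false; /* the timer has just fired */

    if (!model_dev_enter(sched->ddev))
        return; /* device gone: no recovery and no rearm */

    sched->recoveries++;       /* placeholder for the real recovery path */
    sched->timer_armed = true; /* rearm only while the device is present */

    model_dev_exit(sched->ddev);
}
```

In this model, once unplugged is set, a firing timer leaves recoveries unchanged and the timer stays disarmed, which is the behaviour the paragraph above describes.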

Yeah, but this is rather unclean. We should probably instead return an
error code that indicates whether the timer should be rearmed or not.
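
A rough user-space sketch of that alternative, assuming the driver's timeout callback reports a status and the scheduler core decides about rearming. The enum values, struct, and function names below are hypothetical illustrations and are not part of the patch under review.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status values returned by a driver's timeout callback. */
enum sketch_stat {
    SKETCH_STAT_NOMINAL, /* recovery ran; the timer may be rearmed */
    SKETCH_STAT_ENODEV,  /* device is gone; do not rearm */
};

/* Hypothetical scheduler model with a driver-supplied timeout callback. */
struct sketch_sched {
    bool device_present;
    bool timer_armed;
    enum sketch_stat (*timedout_job)(struct sketch_sched *sched);
};

/* Driver side: knows about the device, so it decides what to report. */
static enum sketch_stat sketch_driver_timedout(struct sketch_sched *sched)
{
    if (!sched->device_present)
        return SKETCH_STAT_ENODEV;

    /* A real driver would reset the hardware here. */
    return SKETCH_STAT_NOMINAL;
}

/* Scheduler side: stays device-agnostic and only acts on the status. */
static void sketch_job_timedout(struct sketch_sched *sched)
{
    assert(sched != NULL && sched->timedout_job != NULL);

    sched->timer_armed = false; /* the timer has just fired */

    if (sched->timedout_job(sched) == SKETCH_STAT_NOMINAL)
        sched->timer_armed = true;
}
```

With this split the scheduler would not need a pointer to the device at all; the device-specific check lives in the driver callback, as suggested earlier in the thread.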

Christian, this is exactly my work I told you about
last week on Wednesday in our weekly meeting. And
which I wrote to you in an email last year about this
time.

Yeah, that's why I'm suggesting it here as well.

It seems you're suggesting that Andrey do it, while
all too well you know I've been working on this
for some time now.

Changing the return value is just a minimal change and I didn't want to block Andrey in any way.


I wrote you about this last year same time
in an email. And I discussed it on the Wednesday
meeting.

You could've mentioned that here the first time.

So what do we do now?

Split your patches into smaller parts and submit them chunk by chunk.

E.g. renames first and then functional changes grouped by area they change.

I have, but my final patch, a tiny one but which implements
the core reason for the change seems buggy, and I'm looking
for a way to debug it.

Just send it out in chunks, e.g. non-functional changes like renames shouldn't cause any problems, and having them in the branch early minimizes conflicts with work from others.


Regards,
Christian.


Regards,
Luben

Regards,
Christian.

I can submit those changes without the last part,
which builds on this change.

I'm still testing the last part and was hoping
to submit it all in one sequence of patches,
after my testing.

Regards,
Luben

Christian.

Andrey

Christian.

Signed-off-by: Andrey Grodzovsky <andrey.grodzov...@amd.com>
---
    drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c |  2 +-
    drivers/gpu/drm/etnaviv/etnaviv_sched.c   |  3 ++-
    drivers/gpu/drm/lima/lima_sched.c         |  3 ++-
    drivers/gpu/drm/panfrost/panfrost_job.c   |  2 +-
    drivers/gpu/drm/scheduler/sched_main.c    | 15 ++++++++++++++-
    drivers/gpu/drm/v3d/v3d_sched.c           | 15 ++++++++++-----
    include/drm/gpu_scheduler.h               |  6 +++++-
    7 files changed, 35 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index d56f402..d0b0021 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -487,7 +487,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
              r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
                       num_hw_submission, amdgpu_job_hang_limit,
-                   timeout, ring->name);
+                   timeout, ring->name, &adev->ddev);
            if (r) {
                DRM_ERROR("Failed to create scheduler on ring %s.\n",
                      ring->name);
diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
index cd46c88..7678287 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
@@ -185,7 +185,8 @@ int etnaviv_sched_init(struct etnaviv_gpu *gpu)
          ret = drm_sched_init(&gpu->sched, &etnaviv_sched_ops,
                     etnaviv_hw_jobs_limit, etnaviv_job_hang_limit,
-                 msecs_to_jiffies(500), dev_name(gpu->dev));
+                 msecs_to_jiffies(500), dev_name(gpu->dev),
+                 gpu->drm);
        if (ret)
            return ret;
diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
index dc6df9e..8a7e5d7ca 100644
--- a/drivers/gpu/drm/lima/lima_sched.c
+++ b/drivers/gpu/drm/lima/lima_sched.c
@@ -505,7 +505,8 @@ int lima_sched_pipe_init(struct lima_sched_pipe *pipe, const char *name)
          return drm_sched_init(&pipe->base, &lima_sched_ops, 1,
                      lima_job_hang_limit, msecs_to_jiffies(timeout),
-                  name);
+                  name,
+                  pipe->ldev->ddev);
    }
      void lima_sched_pipe_fini(struct lima_sched_pipe *pipe)
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index 30e7b71..37b03b01 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -520,7 +520,7 @@ int panfrost_job_init(struct panfrost_device *pfdev)
            ret = drm_sched_init(&js->queue[j].sched,
                         &panfrost_sched_ops,
                         1, 0, msecs_to_jiffies(500),
-                     "pan_js");
+                     "pan_js", pfdev->ddev);
            if (ret) {
                dev_err(pfdev->dev, "Failed to create scheduler: %d.", ret);
                goto err_sched;
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index c3f0bd0..95db8c6 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -53,6 +53,7 @@
    #include <drm/drm_print.h>
    #include <drm/gpu_scheduler.h>
    #include <drm/spsc_queue.h>
+#include <drm/drm_drv.h>
      #define CREATE_TRACE_POINTS
    #include "gpu_scheduler_trace.h"
@@ -283,8 +284,16 @@ static void drm_sched_job_timedout(struct work_struct *work)
        struct drm_gpu_scheduler *sched;
        struct drm_sched_job *job;
+    int idx;
+
        sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
+    if (!drm_dev_enter(sched->ddev, &idx)) {
+        DRM_INFO("%s - device unplugged skipping recovery on scheduler:%s",
+             __func__, sched->name);
+        return;
+    }
+
        /* Protects against concurrent deletion in drm_sched_get_cleanup_job */
        spin_lock(&sched->job_list_lock);
        job = list_first_entry_or_null(&sched->ring_mirror_list,
@@ -316,6 +325,8 @@ static void drm_sched_job_timedout(struct work_struct *work)
        spin_lock(&sched->job_list_lock);
        drm_sched_start_timeout(sched);
        spin_unlock(&sched->job_list_lock);
+
+    drm_dev_exit(idx);
    }
       /**
@@ -845,7 +856,8 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
               unsigned hw_submission,
               unsigned hang_limit,
               long timeout,
-           const char *name)
+           const char *name,
+           struct drm_device *ddev)
    {
        int i, ret;
        sched->ops = ops;
@@ -853,6 +865,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
        sched->name = name;
        sched->timeout = timeout;
        sched->hang_limit = hang_limit;
+    sched->ddev = ddev;
        for (i = DRM_SCHED_PRIORITY_MIN; i < DRM_SCHED_PRIORITY_COUNT; i++)
            drm_sched_rq_init(sched, &sched->sched_rq[i]);
diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
index 0747614..f5076e5 100644
--- a/drivers/gpu/drm/v3d/v3d_sched.c
+++ b/drivers/gpu/drm/v3d/v3d_sched.c
@@ -401,7 +401,8 @@ v3d_sched_init(struct v3d_dev *v3d)
                     &v3d_bin_sched_ops,
                     hw_jobs_limit, job_hang_limit,
                     msecs_to_jiffies(hang_limit_ms),
-                 "v3d_bin");
+                 "v3d_bin",
+                 &v3d->drm);
        if (ret) {
            dev_err(v3d->drm.dev, "Failed to create bin scheduler: %d.", ret);
            return ret;
@@ -411,7 +412,8 @@ v3d_sched_init(struct v3d_dev *v3d)
                     &v3d_render_sched_ops,
                     hw_jobs_limit, job_hang_limit,
                     msecs_to_jiffies(hang_limit_ms),
-                 "v3d_render");
+                 "v3d_render",
+                 &v3d->drm);
        if (ret) {
            dev_err(v3d->drm.dev, "Failed to create render scheduler: %d.",
                ret);
@@ -423,7 +425,8 @@ v3d_sched_init(struct v3d_dev *v3d)
                     &v3d_tfu_sched_ops,
                     hw_jobs_limit, job_hang_limit,
                     msecs_to_jiffies(hang_limit_ms),
-                 "v3d_tfu");
+                 "v3d_tfu",
+                 &v3d->drm);
        if (ret) {
            dev_err(v3d->drm.dev, "Failed to create TFU scheduler: %d.",
                ret);
@@ -436,7 +439,8 @@ v3d_sched_init(struct v3d_dev *v3d)
                         &v3d_csd_sched_ops,
                         hw_jobs_limit, job_hang_limit,
                         msecs_to_jiffies(hang_limit_ms),
-                     "v3d_csd");
+                     "v3d_csd",
+                     &v3d->drm);
            if (ret) {
                dev_err(v3d->drm.dev, "Failed to create CSD scheduler: %d.",
                    ret);
@@ -448,7 +452,8 @@ v3d_sched_init(struct v3d_dev *v3d)
                         &v3d_cache_clean_sched_ops,
                         hw_jobs_limit, job_hang_limit,
                         msecs_to_jiffies(hang_limit_ms),
-                     "v3d_cache_clean");
+                     "v3d_cache_clean",
+                     &v3d->drm);
            if (ret) {
                dev_err(v3d->drm.dev, "Failed to create CACHE_CLEAN scheduler: %d.",
                    ret);
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 9243655..a980709 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -32,6 +32,7 @@
      struct drm_gpu_scheduler;
    struct drm_sched_rq;
+struct drm_device;
      /* These are often used as an (initial) index
     * to an array, and as such should start at 0.
@@ -267,6 +268,7 @@ struct drm_sched_backend_ops {
     * @score: score to help loadbalancer pick a idle sched
     * @ready: marks if the underlying HW is ready to work
     * @free_guilty: A hit to time out handler to free the guilty job.
+ * @ddev: Pointer to drm device of this scheduler.
     *
     * One scheduler is implemented for each hardware ring.
     */
@@ -288,12 +290,14 @@ struct drm_gpu_scheduler {
        atomic_t                        score;
        bool                ready;
        bool                free_guilty;
+    struct drm_device        *ddev;
    };
      int drm_sched_init(struct drm_gpu_scheduler *sched,
               const struct drm_sched_backend_ops *ops,
               uint32_t hw_submission, unsigned hang_limit, long timeout,
-           const char *name);
+           const char *name,
+           struct drm_device *ddev);
      void drm_sched_fini(struct drm_gpu_scheduler *sched);
    int drm_sched_job_init(struct drm_sched_job *job,

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
dri-devel mailing list
dri-de...@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [PATCH v3 07/12] drm/sched: Prevent any job recoveries after device is unplugged.