This was fixed here:

commit 03877d621db082610c9b7602c6e8cd6ebcb75a8f
Author: Christian König <christian.koe...@amd.com>
Date:   Thu Apr 27 14:05:43 2023 +0200

    drm/scheduler: mark jobs without fence as canceled

    When no hw fence is provided for a job that means that the job didn't executed.

    Signed-off-by: Christian König <christian.koe...@amd.com>
    Reviewed-by: Luben Tuikov <luben.tui...@amd.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20230427122726.1290170-1-christian.koe...@amd.com

Could be that the patch hasn't been merged into the internal branches yet.

Regards,
Christian.

On 23.08.23 at 10:12, Yin, ZhenGuo (Chris) wrote:
[AMD Official Use Only - General]

Ping..

Actually, I have prepared a patch that aims to fix this issue,
but I'm not sure whether it is the proper approach for drm/scheduler.

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 9654e8942382..35dc0b86a18e 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -463,6 +463,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
                                               &s_job->cb)) {
                         dma_fence_put(s_job->s_fence->parent);
                         s_job->s_fence->parent = NULL;
+                       dma_fence_set_error(&s_job->s_fence->finished, -EHWPOISON);
                         atomic_dec(&sched->hw_rq_count);
                 } else {
                         /*
Best,
Zhenguo
Cloud-GPU Core team, SRDC

-----Original Message-----
From: Yin, ZhenGuo (Chris)
Sent: Thursday, August 17, 2023 4:17 PM
To: Christian König <ckoenig.leichtzumer...@gmail.com>; 
amd-gfx@lists.freedesktop.org
Cc: Tuikov, Luben <luben.tui...@amd.com>; Chen, JingWen (Wayne) <jingwen.ch...@amd.com>; Liu, 
Monk <monk....@amd.com>; Li, Chong(Alan) <chong...@amd.com>; cao, lin <lin....@amd.com>
Subject: RE: [PATCH 1/8] drm/scheduler: properly forward fence errors

Hi, @Christian König

Any update on the fix?
We recently found that a page fault occurs after FLR (function level reset),
because an SDMA job in the pending list was dropped without its fence error
being forwarded.


Best,
Zhenguo
Cloud-GPU Core team, SRDC

-----Original Message-----
From: Christian König <ckoenig.leichtzumer...@gmail.com>
Sent: Thursday, April 27, 2023 8:05 PM
To: Yin, ZhenGuo (Chris) <zhenguo....@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Tuikov, Luben <luben.tui...@amd.com>; Chen, JingWen (Wayne) 
<jingwen.ch...@amd.com>; Liu, Monk <monk....@amd.com>
Subject: Re: [PATCH 1/8] drm/scheduler: properly forward fence errors

Well good point, but as part of the effort of the Intel team to move the 
scheduler over to a work item based design those two functions are probably 
about to be removed.

Since we will probably have that in the internal package for a bit longer I'm 
going to send a fix for this.

Regards,
Christian.

On 27.04.23 at 12:35, Yin, ZhenGuo (Chris) wrote:
[AMD Official Use Only - General]

Hi, Christian

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index fcd4bfef7415..649fac2e1ccb 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -533,12 +533,12 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
                       r = dma_fence_add_callback(fence, &s_job->cb,
                                                  drm_sched_job_done_cb);
                       if (r == -ENOENT)
-                             drm_sched_job_done(s_job);
+                             drm_sched_job_done(s_job, fence->error);
                       else if (r)
                               DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n",
                                         r);
               } else
-                     drm_sched_job_done(s_job);
+                     drm_sched_job_done(s_job, 0);
       }

       if (full_recovery) {

I believe that the finished fence of some jobs skipped during FLR has NOT been
set to -ECANCELED.
In drm_sched_stop(), the callback is removed from the hw fence and
s_fence->parent is set to NULL, see commit
45ecaea738830b9d521c93520c8f201359dcbd95 ("drm/sched: Partial revert of
'drm/sched: Keep s_fence->parent pointer'").
In drm_sched_start(), jobs in the pending list are then completed as if they
had finished without any error (drm_sched_job_done(s_job, 0)).
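
To make the difference concrete, here is a minimal user-space sketch (not
kernel code) of that restart path. The mock_* types and functions are
hypothetical stand-ins for dma_fence, drm_sched_job and
drm_sched_fence_finished(). The sketch assumes that a hw fence which is still
attached has already signaled, and that -ECANCELED is the error to report for a
job whose hw fence was dropped.

```c
/* User-space mock for illustration only; every mock_* name is hypothetical. */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct mock_fence {
	bool signaled;
	int error;	/* 0 or a negative errno, similar to dma_fence::error */
};

struct mock_job {
	struct mock_fence *parent;	/* hw fence; NULL once a reset dropped it */
	struct mock_fence finished;	/* what waiters of the job observe */
};

/* Same shape as the patched drm_sched_fence_finished(): set error, then signal. */
static void mock_fence_finished(struct mock_fence *f, int result)
{
	assert(f != NULL);
	assert(result <= 0);	/* errors are negative errno values */

	if (result)
		f->error = result;
	f->signaled = true;
}

/* Restart path: pick the result the finished fence is completed with. */
static void mock_restart_job(struct mock_job *job)
{
	if (job->parent)
		/* Assumed already signaled: forward whatever the hw reported. */
		mock_fence_finished(&job->finished, job->parent->error);
	else
		/* No hw fence, so the job never ran; do not report success. */
		mock_fence_finished(&job->finished, -ECANCELED);
}
```

With a result of 0 in the else branch, a waiter on the finished fence cannot
tell a dropped job from one that actually ran, which is the situation described
above.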


Best,
Zhenguo
Cloud-GPU Core team, SRDC

-----Original Message-----
From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of
Christian König
Sent: Thursday, April 20, 2023 7:58 PM
To: amd-gfx@lists.freedesktop.org
Cc: Tuikov, Luben <luben.tui...@amd.com>
Subject: [PATCH 1/8] drm/scheduler: properly forward fence errors

When a hw fence is signaled with an error properly forward that to the finished 
fence.

Signed-off-by: Christian König <christian.koe...@amd.com>
---
   drivers/gpu/drm/scheduler/sched_entity.c |  4 +---
   drivers/gpu/drm/scheduler/sched_fence.c  |  4 +++-
   drivers/gpu/drm/scheduler/sched_main.c   | 18 ++++++++----------
   include/drm/gpu_scheduler.h              |  2 +-
   4 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 15d04a0ec623..eaf71fe15ed3 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -144,7 +144,7 @@ static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
   {
       struct drm_sched_job *job = container_of(wrk, typeof(*job), work);

-     drm_sched_fence_finished(job->s_fence);
+     drm_sched_fence_finished(job->s_fence, -ESRCH);
       WARN_ON(job->s_fence->parent);
       job->sched->ops->free_job(job);
   }
@@ -195,8 +195,6 @@ static void drm_sched_entity_kill(struct drm_sched_entity *entity)
       while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
               struct drm_sched_fence *s_fence = job->s_fence;

-             dma_fence_set_error(&s_fence->finished, -ESRCH);
-
               dma_fence_get(&s_fence->finished);
               if (!prev || dma_fence_add_callback(prev, &job->finish_cb,
                                          drm_sched_entity_kill_jobs_cb))
diff --git a/drivers/gpu/drm/scheduler/sched_fence.c b/drivers/gpu/drm/scheduler/sched_fence.c
index 7fd869520ef2..1a6bea98c5cc 100644
--- a/drivers/gpu/drm/scheduler/sched_fence.c
+++ b/drivers/gpu/drm/scheduler/sched_fence.c
@@ -53,8 +53,10 @@ void drm_sched_fence_scheduled(struct drm_sched_fence *fence)
       dma_fence_signal(&fence->scheduled);
   }

-void drm_sched_fence_finished(struct drm_sched_fence *fence)
+void drm_sched_fence_finished(struct drm_sched_fence *fence, int result)
   {
+     if (result)
+             dma_fence_set_error(&fence->finished, result);
       dma_fence_signal(&fence->finished);
   }

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index fcd4bfef7415..649fac2e1ccb 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -257,7 +257,7 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
    *
    * Finish the job's fence and wake up the worker thread.
    */
-static void drm_sched_job_done(struct drm_sched_job *s_job)
+static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
   {
       struct drm_sched_fence *s_fence = s_job->s_fence;
       struct drm_gpu_scheduler *sched = s_fence->sched;
@@ -268,7 +268,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job)
       trace_drm_sched_process_job(s_fence);

       dma_fence_get(&s_fence->finished);
-     drm_sched_fence_finished(s_fence);
+     drm_sched_fence_finished(s_fence, result);
       dma_fence_put(&s_fence->finished);
       wake_up_interruptible(&sched->wake_up_worker);
   }
@@ -282,7 +282,7 @@ static void drm_sched_job_done_cb(struct dma_fence *f, struct dma_fence_cb *cb)
   {
       struct drm_sched_job *s_job = container_of(cb, struct drm_sched_job, cb);

-     drm_sched_job_done(s_job);
+     drm_sched_job_done(s_job, f->error);
   }

   /**
@@ -533,12 +533,12 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
                       r = dma_fence_add_callback(fence, &s_job->cb,
                                                  drm_sched_job_done_cb);
                       if (r == -ENOENT)
-                             drm_sched_job_done(s_job);
+                             drm_sched_job_done(s_job, fence->error);
                       else if (r)
                               DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n",
                                         r);
               } else
-                     drm_sched_job_done(s_job);
+                     drm_sched_job_done(s_job, 0);
       }

       if (full_recovery) {
@@ -1010,15 +1010,13 @@ static int drm_sched_main(void *param)
                       r = dma_fence_add_callback(fence, &sched_job->cb,
                                                  drm_sched_job_done_cb);
                       if (r == -ENOENT)
-                             drm_sched_job_done(sched_job);
+                             drm_sched_job_done(sched_job, fence->error);
                       else if (r)
                               DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n",
                                         r);
               } else {
-                     if (IS_ERR(fence))
-                             dma_fence_set_error(&s_fence->finished, PTR_ERR(fence));
-
-                     drm_sched_job_done(sched_job);
+                     drm_sched_job_done(sched_job, IS_ERR(fence) ?
+                                        PTR_ERR(fence) : 0);
               }

               wake_up(&sched->job_scheduled);
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index ca857ec9e7eb..5c1df6b12ced 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -569,7 +569,7 @@ void drm_sched_fence_init(struct drm_sched_fence *fence,
   void drm_sched_fence_free(struct drm_sched_fence *fence);

   void drm_sched_fence_scheduled(struct drm_sched_fence *fence);
-void drm_sched_fence_finished(struct drm_sched_fence *fence);
+void drm_sched_fence_finished(struct drm_sched_fence *fence, int result);

   unsigned long drm_sched_suspend_timeout(struct drm_gpu_scheduler *sched);
   void drm_sched_resume_timeout(struct drm_gpu_scheduler *sched,
--
2.34.1
