Re: [PATCH v3 1/5] drm/scheduler: rework job destruction
On 4/16/19 12:00 PM, Koenig, Christian wrote: > Am 16.04.19 um 17:42 schrieb Grodzovsky, Andrey: >> On 4/16/19 10:58 AM, Grodzovsky, Andrey wrote: >>> On 4/16/19 10:43 AM, Koenig, Christian wrote: Am 16.04.19 um 16:36 schrieb Grodzovsky, Andrey: > On 4/16/19 5:47 AM, Christian König wrote: >> Am 15.04.19 um 23:17 schrieb Eric Anholt: >>> Andrey Grodzovsky writes: >>> From: Christian König We now destroy finished jobs from the worker thread to make sure that we never destroy a job currently in timeout processing. By this we avoid holding lock around ring mirror list in drm_sched_stop which should solve a deadlock reported by a user. v2: Remove unused variable. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109692 Signed-off-by: Christian König Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 17 ++-- drivers/gpu/drm/etnaviv/etnaviv_dump.c | 4 - drivers/gpu/drm/etnaviv/etnaviv_sched.c | 9 +- drivers/gpu/drm/scheduler/sched_main.c | 138 + drivers/gpu/drm/v3d/v3d_sched.c | 9 +- >>> Missing corresponding panfrost and lima updates. You should probably >>> pull in drm-misc for hacking on the scheduler. >>> diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c index ce7c737b..8efb091 100644 --- a/drivers/gpu/drm/v3d/v3d_sched.c +++ b/drivers/gpu/drm/v3d/v3d_sched.c @@ -232,11 +232,18 @@ v3d_gpu_reset_for_timeout(struct v3d_dev *v3d, struct drm_sched_job *sched_job) /* block scheduler */ for (q = 0; q < V3D_MAX_QUEUES; q++) - drm_sched_stop(>queue[q].sched); + drm_sched_stop(>queue[q].sched, sched_job); if(sched_job) drm_sched_increase_karma(sched_job); + /* + * Guilty job did complete and hence needs to be manually removed + * See drm_sched_stop doc. + */ + if (list_empty(_job->node)) + sched_job->sched->ops->free_job(sched_job); >>> If the if (sched_job) is necessary up above, then this should clearly be >>> under it. >>> >>> But, can we please have a core scheduler thing we call here instead of >>> drivers all replicating it? >> Yeah that's also something I noted before. >> >> Essential problem is that we remove finished jobs from the mirror list >> and so need to destruct them because we otherwise leak them. >> >> Alternative approach here would be to keep the jobs on the ring mirror >> list, but not submit them again. >> >> Regards, >> Christian. > I really prefer to avoid this, it means adding extra flag to sched_job > to check in each iteration of the ring mirror list. Mhm, why actually? We just need to check if the scheduler fence is signaled. >>> OK, i see it's equivalent but this still en extra check for all the >>> iterations. >>> > What about changing > signature of drm_sched_backend_ops.timedout_job to return drm_sched_job* > instead of void, this way we can return the guilty job back from the > driver specific handler to the generic drm_sched_job_timedout and > release it there. Well the timeout handler already has the job, so returning it doesn't make much sense. The problem is rather that the timeout handler doesn't know if it should destroy the job or not. >>> But the driver specific handler does, and actually returning back either >>> the pointer to the job or null will give an indication of that. We can >>> even return bool. >>> >>> Andrey >> Thinking a bit more about this - the way this check is done now "if >> (list_empty(_job->node)) then free the sched_job" actually makes >> it possible to just move this as is from driver specific callbacks into >> drm_sched_job_timeout without any other changes. > Oh, well that sounds like a good idea off hand. > > Need to see the final code, but at least the best idea so far. > > Christian. Unfortunately looks like it's not that good idea at the end, take a look at the attached KASAN print - sched thread's cleanup function races against TDR handler and removes the guilty job from mirror list and we have no way of differentiating if the job was removed from within the TDR handler or from the sched. thread's clean-up function. So looks like we either need 'keep the jobs on the ring mirror list, but not submit them again' as you suggested before or add a flag to sched_job to hint to drm_sched_job_timedout that guilty job requires manual removal. Your suggestion implies we will need an extra check
Re: [PATCH v3 1/5] drm/scheduler: rework job destruction
Am 16.04.19 um 17:42 schrieb Grodzovsky, Andrey: > On 4/16/19 10:58 AM, Grodzovsky, Andrey wrote: >> On 4/16/19 10:43 AM, Koenig, Christian wrote: >>> Am 16.04.19 um 16:36 schrieb Grodzovsky, Andrey: On 4/16/19 5:47 AM, Christian König wrote: > Am 15.04.19 um 23:17 schrieb Eric Anholt: >> Andrey Grodzovsky writes: >> >>> From: Christian König >>> >>> We now destroy finished jobs from the worker thread to make sure that >>> we never destroy a job currently in timeout processing. >>> By this we avoid holding lock around ring mirror list in drm_sched_stop >>> which should solve a deadlock reported by a user. >>> >>> v2: Remove unused variable. >>> >>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109692 >>> >>> Signed-off-by: Christian König >>> Signed-off-by: Andrey Grodzovsky >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 17 ++-- >>> drivers/gpu/drm/etnaviv/etnaviv_dump.c | 4 - >>> drivers/gpu/drm/etnaviv/etnaviv_sched.c | 9 +- >>> drivers/gpu/drm/scheduler/sched_main.c | 138 >>> + >>> drivers/gpu/drm/v3d/v3d_sched.c | 9 +- >> Missing corresponding panfrost and lima updates. You should probably >> pull in drm-misc for hacking on the scheduler. >> >>> diff --git a/drivers/gpu/drm/v3d/v3d_sched.c >>> b/drivers/gpu/drm/v3d/v3d_sched.c >>> index ce7c737b..8efb091 100644 >>> --- a/drivers/gpu/drm/v3d/v3d_sched.c >>> +++ b/drivers/gpu/drm/v3d/v3d_sched.c >>> @@ -232,11 +232,18 @@ v3d_gpu_reset_for_timeout(struct v3d_dev *v3d, >>> struct drm_sched_job *sched_job) >>> /* block scheduler */ >>> for (q = 0; q < V3D_MAX_QUEUES; q++) >>> - drm_sched_stop(>queue[q].sched); >>> + drm_sched_stop(>queue[q].sched, sched_job); >>> if(sched_job) >>> drm_sched_increase_karma(sched_job); >>> + /* >>> + * Guilty job did complete and hence needs to be manually removed >>> + * See drm_sched_stop doc. >>> + */ >>> + if (list_empty(_job->node)) >>> + sched_job->sched->ops->free_job(sched_job); >> If the if (sched_job) is necessary up above, then this should clearly be >> under it. >> >> But, can we please have a core scheduler thing we call here instead of >> drivers all replicating it? > Yeah that's also something I noted before. > > Essential problem is that we remove finished jobs from the mirror list > and so need to destruct them because we otherwise leak them. > > Alternative approach here would be to keep the jobs on the ring mirror > list, but not submit them again. > > Regards, > Christian. I really prefer to avoid this, it means adding extra flag to sched_job to check in each iteration of the ring mirror list. >>> Mhm, why actually? We just need to check if the scheduler fence is signaled. >> OK, i see it's equivalent but this still en extra check for all the >> iterations. >> What about changing signature of drm_sched_backend_ops.timedout_job to return drm_sched_job* instead of void, this way we can return the guilty job back from the driver specific handler to the generic drm_sched_job_timedout and release it there. >>> Well the timeout handler already has the job, so returning it doesn't >>> make much sense. >>> >>> The problem is rather that the timeout handler doesn't know if it should >>> destroy the job or not. >> But the driver specific handler does, and actually returning back either >> the pointer to the job or null will give an indication of that. We can >> even return bool. >> >> Andrey > > Thinking a bit more about this - the way this check is done now "if > (list_empty(_job->node)) then free the sched_job" actually makes > it possible to just move this as is from driver specific callbacks into > drm_sched_job_timeout without any other changes. Oh, well that sounds like a good idea off hand. Need to see the final code, but at least the best idea so far. Christian. > > Andrey > >>> Christian. >>> Andrey >>> + >>> /* get the GPU back into the init state */ >>> v3d_reset(v3d); >>> ___ >>> amd-gfx mailing list >>> amd-...@lists.freedesktop.org >>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >> ___ >> amd-gfx mailing list >> amd-...@lists.freedesktop.org >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH v3 1/5] drm/scheduler: rework job destruction
On 4/16/19 10:58 AM, Grodzovsky, Andrey wrote: > On 4/16/19 10:43 AM, Koenig, Christian wrote: >> Am 16.04.19 um 16:36 schrieb Grodzovsky, Andrey: >>> On 4/16/19 5:47 AM, Christian König wrote: Am 15.04.19 um 23:17 schrieb Eric Anholt: > Andrey Grodzovsky writes: > >> From: Christian König >> >> We now destroy finished jobs from the worker thread to make sure that >> we never destroy a job currently in timeout processing. >> By this we avoid holding lock around ring mirror list in drm_sched_stop >> which should solve a deadlock reported by a user. >> >> v2: Remove unused variable. >> >> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109692 >> >> Signed-off-by: Christian König >> Signed-off-by: Andrey Grodzovsky >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 17 ++-- >> drivers/gpu/drm/etnaviv/etnaviv_dump.c | 4 - >> drivers/gpu/drm/etnaviv/etnaviv_sched.c | 9 +- >> drivers/gpu/drm/scheduler/sched_main.c | 138 >> + >> drivers/gpu/drm/v3d/v3d_sched.c | 9 +- > Missing corresponding panfrost and lima updates. You should probably > pull in drm-misc for hacking on the scheduler. > >> diff --git a/drivers/gpu/drm/v3d/v3d_sched.c >> b/drivers/gpu/drm/v3d/v3d_sched.c >> index ce7c737b..8efb091 100644 >> --- a/drivers/gpu/drm/v3d/v3d_sched.c >> +++ b/drivers/gpu/drm/v3d/v3d_sched.c >> @@ -232,11 +232,18 @@ v3d_gpu_reset_for_timeout(struct v3d_dev *v3d, >> struct drm_sched_job *sched_job) >> /* block scheduler */ >> for (q = 0; q < V3D_MAX_QUEUES; q++) >> - drm_sched_stop(>queue[q].sched); >> + drm_sched_stop(>queue[q].sched, sched_job); >> if(sched_job) >> drm_sched_increase_karma(sched_job); >> + /* >> + * Guilty job did complete and hence needs to be manually removed >> + * See drm_sched_stop doc. >> + */ >> + if (list_empty(_job->node)) >> + sched_job->sched->ops->free_job(sched_job); > If the if (sched_job) is necessary up above, then this should clearly be > under it. > > But, can we please have a core scheduler thing we call here instead of > drivers all replicating it? Yeah that's also something I noted before. Essential problem is that we remove finished jobs from the mirror list and so need to destruct them because we otherwise leak them. Alternative approach here would be to keep the jobs on the ring mirror list, but not submit them again. Regards, Christian. >>> I really prefer to avoid this, it means adding extra flag to sched_job >>> to check in each iteration of the ring mirror list. >> Mhm, why actually? We just need to check if the scheduler fence is signaled. > OK, i see it's equivalent but this still en extra check for all the > iterations. > >>> What about changing >>> signature of drm_sched_backend_ops.timedout_job to return drm_sched_job* >>> instead of void, this way we can return the guilty job back from the >>> driver specific handler to the generic drm_sched_job_timedout and >>> release it there. >> Well the timeout handler already has the job, so returning it doesn't >> make much sense. >> >> The problem is rather that the timeout handler doesn't know if it should >> destroy the job or not. > > But the driver specific handler does, and actually returning back either > the pointer to the job or null will give an indication of that. We can > even return bool. > > Andrey Thinking a bit more about this - the way this check is done now "if (list_empty(_job->node)) then free the sched_job" actually makes it possible to just move this as is from driver specific callbacks into drm_sched_job_timeout without any other changes. Andrey > >> Christian. >> >>> Andrey >>> >> + >> /* get the GPU back into the init state */ >> v3d_reset(v3d); >> ___ >> amd-gfx mailing list >> amd-...@lists.freedesktop.org >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx > ___ > amd-gfx mailing list > amd-...@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/amd-gfx ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH v3 1/5] drm/scheduler: rework job destruction
On 4/16/19 10:43 AM, Koenig, Christian wrote: > Am 16.04.19 um 16:36 schrieb Grodzovsky, Andrey: >> On 4/16/19 5:47 AM, Christian König wrote: >>> Am 15.04.19 um 23:17 schrieb Eric Anholt: Andrey Grodzovsky writes: > From: Christian König > > We now destroy finished jobs from the worker thread to make sure that > we never destroy a job currently in timeout processing. > By this we avoid holding lock around ring mirror list in drm_sched_stop > which should solve a deadlock reported by a user. > > v2: Remove unused variable. > > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109692 > > Signed-off-by: Christian König > Signed-off-by: Andrey Grodzovsky > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 17 ++-- > drivers/gpu/drm/etnaviv/etnaviv_dump.c | 4 - > drivers/gpu/drm/etnaviv/etnaviv_sched.c | 9 +- > drivers/gpu/drm/scheduler/sched_main.c | 138 > + > drivers/gpu/drm/v3d/v3d_sched.c | 9 +- Missing corresponding panfrost and lima updates. You should probably pull in drm-misc for hacking on the scheduler. > diff --git a/drivers/gpu/drm/v3d/v3d_sched.c > b/drivers/gpu/drm/v3d/v3d_sched.c > index ce7c737b..8efb091 100644 > --- a/drivers/gpu/drm/v3d/v3d_sched.c > +++ b/drivers/gpu/drm/v3d/v3d_sched.c > @@ -232,11 +232,18 @@ v3d_gpu_reset_for_timeout(struct v3d_dev *v3d, > struct drm_sched_job *sched_job) > /* block scheduler */ > for (q = 0; q < V3D_MAX_QUEUES; q++) > - drm_sched_stop(>queue[q].sched); > + drm_sched_stop(>queue[q].sched, sched_job); > if(sched_job) > drm_sched_increase_karma(sched_job); > + /* > + * Guilty job did complete and hence needs to be manually removed > + * See drm_sched_stop doc. > + */ > + if (list_empty(_job->node)) > + sched_job->sched->ops->free_job(sched_job); If the if (sched_job) is necessary up above, then this should clearly be under it. But, can we please have a core scheduler thing we call here instead of drivers all replicating it? >>> Yeah that's also something I noted before. >>> >>> Essential problem is that we remove finished jobs from the mirror list >>> and so need to destruct them because we otherwise leak them. >>> >>> Alternative approach here would be to keep the jobs on the ring mirror >>> list, but not submit them again. >>> >>> Regards, >>> Christian. >> I really prefer to avoid this, it means adding extra flag to sched_job >> to check in each iteration of the ring mirror list. > Mhm, why actually? We just need to check if the scheduler fence is signaled. OK, i see it's equivalent but this still en extra check for all the iterations. > >> What about changing >> signature of drm_sched_backend_ops.timedout_job to return drm_sched_job* >> instead of void, this way we can return the guilty job back from the >> driver specific handler to the generic drm_sched_job_timedout and >> release it there. > Well the timeout handler already has the job, so returning it doesn't > make much sense. > > The problem is rather that the timeout handler doesn't know if it should > destroy the job or not. But the driver specific handler does, and actually returning back either the pointer to the job or null will give an indication of that. We can even return bool. Andrey > > Christian. > >> Andrey >> > + > /* get the GPU back into the init state */ > v3d_reset(v3d); > ___ > amd-gfx mailing list > amd-...@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/amd-gfx ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH v3 1/5] drm/scheduler: rework job destruction
Am 16.04.19 um 16:36 schrieb Grodzovsky, Andrey: > On 4/16/19 5:47 AM, Christian König wrote: >> Am 15.04.19 um 23:17 schrieb Eric Anholt: >>> Andrey Grodzovsky writes: >>> From: Christian König We now destroy finished jobs from the worker thread to make sure that we never destroy a job currently in timeout processing. By this we avoid holding lock around ring mirror list in drm_sched_stop which should solve a deadlock reported by a user. v2: Remove unused variable. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109692 Signed-off-by: Christian König Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 17 ++-- drivers/gpu/drm/etnaviv/etnaviv_dump.c | 4 - drivers/gpu/drm/etnaviv/etnaviv_sched.c | 9 +- drivers/gpu/drm/scheduler/sched_main.c | 138 + drivers/gpu/drm/v3d/v3d_sched.c | 9 +- >>> Missing corresponding panfrost and lima updates. You should probably >>> pull in drm-misc for hacking on the scheduler. >>> diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c index ce7c737b..8efb091 100644 --- a/drivers/gpu/drm/v3d/v3d_sched.c +++ b/drivers/gpu/drm/v3d/v3d_sched.c @@ -232,11 +232,18 @@ v3d_gpu_reset_for_timeout(struct v3d_dev *v3d, struct drm_sched_job *sched_job) /* block scheduler */ for (q = 0; q < V3D_MAX_QUEUES; q++) - drm_sched_stop(>queue[q].sched); + drm_sched_stop(>queue[q].sched, sched_job); if(sched_job) drm_sched_increase_karma(sched_job); + /* + * Guilty job did complete and hence needs to be manually removed + * See drm_sched_stop doc. + */ + if (list_empty(_job->node)) + sched_job->sched->ops->free_job(sched_job); >>> If the if (sched_job) is necessary up above, then this should clearly be >>> under it. >>> >>> But, can we please have a core scheduler thing we call here instead of >>> drivers all replicating it? >> Yeah that's also something I noted before. >> >> Essential problem is that we remove finished jobs from the mirror list >> and so need to destruct them because we otherwise leak them. >> >> Alternative approach here would be to keep the jobs on the ring mirror >> list, but not submit them again. >> >> Regards, >> Christian. > > I really prefer to avoid this, it means adding extra flag to sched_job > to check in each iteration of the ring mirror list. Mhm, why actually? We just need to check if the scheduler fence is signaled. > What about changing > signature of drm_sched_backend_ops.timedout_job to return drm_sched_job* > instead of void, this way we can return the guilty job back from the > driver specific handler to the generic drm_sched_job_timedout and > release it there. Well the timeout handler already has the job, so returning it doesn't make much sense. The problem is rather that the timeout handler doesn't know if it should destroy the job or not. Christian. > > Andrey > + /* get the GPU back into the init state */ v3d_reset(v3d); ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH v3 1/5] drm/scheduler: rework job destruction
On 4/16/19 5:47 AM, Christian König wrote: > Am 15.04.19 um 23:17 schrieb Eric Anholt: >> Andrey Grodzovsky writes: >> >>> From: Christian König >>> >>> We now destroy finished jobs from the worker thread to make sure that >>> we never destroy a job currently in timeout processing. >>> By this we avoid holding lock around ring mirror list in drm_sched_stop >>> which should solve a deadlock reported by a user. >>> >>> v2: Remove unused variable. >>> >>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109692 >>> >>> Signed-off-by: Christian König >>> Signed-off-by: Andrey Grodzovsky >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 17 ++-- >>> drivers/gpu/drm/etnaviv/etnaviv_dump.c | 4 - >>> drivers/gpu/drm/etnaviv/etnaviv_sched.c | 9 +- >>> drivers/gpu/drm/scheduler/sched_main.c | 138 >>> + >>> drivers/gpu/drm/v3d/v3d_sched.c | 9 +- >> Missing corresponding panfrost and lima updates. You should probably >> pull in drm-misc for hacking on the scheduler. >> >>> diff --git a/drivers/gpu/drm/v3d/v3d_sched.c >>> b/drivers/gpu/drm/v3d/v3d_sched.c >>> index ce7c737b..8efb091 100644 >>> --- a/drivers/gpu/drm/v3d/v3d_sched.c >>> +++ b/drivers/gpu/drm/v3d/v3d_sched.c >>> @@ -232,11 +232,18 @@ v3d_gpu_reset_for_timeout(struct v3d_dev *v3d, >>> struct drm_sched_job *sched_job) >>> /* block scheduler */ >>> for (q = 0; q < V3D_MAX_QUEUES; q++) >>> - drm_sched_stop(>queue[q].sched); >>> + drm_sched_stop(>queue[q].sched, sched_job); >>> if(sched_job) >>> drm_sched_increase_karma(sched_job); >>> + /* >>> + * Guilty job did complete and hence needs to be manually removed >>> + * See drm_sched_stop doc. >>> + */ >>> + if (list_empty(_job->node)) >>> + sched_job->sched->ops->free_job(sched_job); >> If the if (sched_job) is necessary up above, then this should clearly be >> under it. >> >> But, can we please have a core scheduler thing we call here instead of >> drivers all replicating it? > > Yeah that's also something I noted before. > > Essential problem is that we remove finished jobs from the mirror list > and so need to destruct them because we otherwise leak them. > > Alternative approach here would be to keep the jobs on the ring mirror > list, but not submit them again. > > Regards, > Christian. I really prefer to avoid this, it means adding extra flag to sched_job to check in each iteration of the ring mirror list. What about changing signature of drm_sched_backend_ops.timedout_job to return drm_sched_job* instead of void, this way we can return the guilty job back from the driver specific handler to the generic drm_sched_job_timedout and release it there. Andrey > >> >>> + >>> /* get the GPU back into the init state */ >>> v3d_reset(v3d); > ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH v3 1/5] drm/scheduler: rework job destruction
Am 15.04.19 um 23:17 schrieb Eric Anholt: Andrey Grodzovsky writes: From: Christian König We now destroy finished jobs from the worker thread to make sure that we never destroy a job currently in timeout processing. By this we avoid holding lock around ring mirror list in drm_sched_stop which should solve a deadlock reported by a user. v2: Remove unused variable. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109692 Signed-off-by: Christian König Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 17 ++-- drivers/gpu/drm/etnaviv/etnaviv_dump.c | 4 - drivers/gpu/drm/etnaviv/etnaviv_sched.c| 9 +- drivers/gpu/drm/scheduler/sched_main.c | 138 + drivers/gpu/drm/v3d/v3d_sched.c| 9 +- Missing corresponding panfrost and lima updates. You should probably pull in drm-misc for hacking on the scheduler. diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c index ce7c737b..8efb091 100644 --- a/drivers/gpu/drm/v3d/v3d_sched.c +++ b/drivers/gpu/drm/v3d/v3d_sched.c @@ -232,11 +232,18 @@ v3d_gpu_reset_for_timeout(struct v3d_dev *v3d, struct drm_sched_job *sched_job) /* block scheduler */ for (q = 0; q < V3D_MAX_QUEUES; q++) - drm_sched_stop(>queue[q].sched); + drm_sched_stop(>queue[q].sched, sched_job); if(sched_job) drm_sched_increase_karma(sched_job); + /* +* Guilty job did complete and hence needs to be manually removed +* See drm_sched_stop doc. +*/ + if (list_empty(_job->node)) + sched_job->sched->ops->free_job(sched_job); If the if (sched_job) is necessary up above, then this should clearly be under it. But, can we please have a core scheduler thing we call here instead of drivers all replicating it? Yeah that's also something I noted before. Essential problem is that we remove finished jobs from the mirror list and so need to destruct them because we otherwise leak them. Alternative approach here would be to keep the jobs on the ring mirror list, but not submit them again. Regards, Christian. + /* get the GPU back into the init state */ v3d_reset(v3d); ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH v3 1/5] drm/scheduler: rework job destruction
Andrey Grodzovsky writes: > From: Christian König > > We now destroy finished jobs from the worker thread to make sure that > we never destroy a job currently in timeout processing. > By this we avoid holding lock around ring mirror list in drm_sched_stop > which should solve a deadlock reported by a user. > > v2: Remove unused variable. > > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109692 > > Signed-off-by: Christian König > Signed-off-by: Andrey Grodzovsky > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 17 ++-- > drivers/gpu/drm/etnaviv/etnaviv_dump.c | 4 - > drivers/gpu/drm/etnaviv/etnaviv_sched.c| 9 +- > drivers/gpu/drm/scheduler/sched_main.c | 138 > + > drivers/gpu/drm/v3d/v3d_sched.c| 9 +- Missing corresponding panfrost and lima updates. You should probably pull in drm-misc for hacking on the scheduler. > diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c > index ce7c737b..8efb091 100644 > --- a/drivers/gpu/drm/v3d/v3d_sched.c > +++ b/drivers/gpu/drm/v3d/v3d_sched.c > @@ -232,11 +232,18 @@ v3d_gpu_reset_for_timeout(struct v3d_dev *v3d, struct > drm_sched_job *sched_job) > > /* block scheduler */ > for (q = 0; q < V3D_MAX_QUEUES; q++) > - drm_sched_stop(>queue[q].sched); > + drm_sched_stop(>queue[q].sched, sched_job); > > if(sched_job) > drm_sched_increase_karma(sched_job); > > + /* > + * Guilty job did complete and hence needs to be manually removed > + * See drm_sched_stop doc. > + */ > + if (list_empty(_job->node)) > + sched_job->sched->ops->free_job(sched_job); If the if (sched_job) is necessary up above, then this should clearly be under it. But, can we please have a core scheduler thing we call here instead of drivers all replicating it? > + > /* get the GPU back into the init state */ > v3d_reset(v3d); > signature.asc Description: PGP signature ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
[PATCH v3 1/5] drm/scheduler: rework job destruction
From: Christian König We now destroy finished jobs from the worker thread to make sure that we never destroy a job currently in timeout processing. By this we avoid holding lock around ring mirror list in drm_sched_stop which should solve a deadlock reported by a user. v2: Remove unused variable. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109692 Signed-off-by: Christian König Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 17 ++-- drivers/gpu/drm/etnaviv/etnaviv_dump.c | 4 - drivers/gpu/drm/etnaviv/etnaviv_sched.c| 9 +- drivers/gpu/drm/scheduler/sched_main.c | 138 + drivers/gpu/drm/v3d/v3d_sched.c| 9 +- include/drm/gpu_scheduler.h| 6 +- 6 files changed, 108 insertions(+), 75 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index d46675b..8c41a9f 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -3341,15 +3341,23 @@ static int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev, if (!ring || !ring->sched.thread) continue; - drm_sched_stop(>sched); + drm_sched_stop(>sched, >base); /* after all hw jobs are reset, hw fence is meaningless, so force_completion */ amdgpu_fence_driver_force_completion(ring); } - if(job) + if(job) { drm_sched_increase_karma(>base); + /* +* Guilty job did complete and hence needs to be manually removed +* See drm_sched_stop doc. +*/ + if (list_empty(>base.node)) + job->base.sched->ops->free_job(>base); + } + if (!amdgpu_sriov_vf(adev)) { @@ -3489,8 +3497,7 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive, return r; } -static void amdgpu_device_post_asic_reset(struct amdgpu_device *adev, - struct amdgpu_job *job) +static void amdgpu_device_post_asic_reset(struct amdgpu_device *adev) { int i; @@ -3630,7 +3637,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev, /* Post ASIC reset for all devs .*/ list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) { - amdgpu_device_post_asic_reset(tmp_adev, tmp_adev == adev ? job : NULL); + amdgpu_device_post_asic_reset(tmp_adev); if (r) { /* bad news, how to tell it to userspace ? */ diff --git a/drivers/gpu/drm/etnaviv/etnaviv_dump.c b/drivers/gpu/drm/etnaviv/etnaviv_dump.c index 3fbb485..8434715 100644 --- a/drivers/gpu/drm/etnaviv/etnaviv_dump.c +++ b/drivers/gpu/drm/etnaviv/etnaviv_dump.c @@ -135,13 +135,11 @@ void etnaviv_core_dump(struct etnaviv_gpu *gpu) mmu_size + gpu->buffer.size; /* Add in the active command buffers */ - spin_lock_irqsave(>sched.job_list_lock, flags); list_for_each_entry(s_job, >sched.ring_mirror_list, node) { submit = to_etnaviv_submit(s_job); file_size += submit->cmdbuf.size; n_obj++; } - spin_unlock_irqrestore(>sched.job_list_lock, flags); /* Add in the active buffer objects */ list_for_each_entry(vram, >mmu->mappings, mmu_node) { @@ -183,14 +181,12 @@ void etnaviv_core_dump(struct etnaviv_gpu *gpu) gpu->buffer.size, etnaviv_cmdbuf_get_va(>buffer)); - spin_lock_irqsave(>sched.job_list_lock, flags); list_for_each_entry(s_job, >sched.ring_mirror_list, node) { submit = to_etnaviv_submit(s_job); etnaviv_core_dump_mem(, ETDUMP_BUF_CMD, submit->cmdbuf.vaddr, submit->cmdbuf.size, etnaviv_cmdbuf_get_va(>cmdbuf)); } - spin_unlock_irqrestore(>sched.job_list_lock, flags); /* Reserve space for the bomap */ if (n_bomap_pages) { diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c b/drivers/gpu/drm/etnaviv/etnaviv_sched.c index 67ae266..8795c19 100644 --- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c +++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c @@ -109,11 +109,18 @@ static void etnaviv_sched_timedout_job(struct drm_sched_job *sched_job) } /* block scheduler */ - drm_sched_stop(>sched); + drm_sched_stop(>sched, sched_job); if(sched_job) drm_sched_increase_karma(sched_job); + /* +* Guilty job did complete and hence needs to be manually removed +* See drm_sched_stop doc. +*/ + if (list_empty(_job->node)) + sched_job->sched->ops->free_job(sched_job); + /* get the GPU back into the