[AMD Official Use Only - AMD Internal Distribution Only]

Hi Philipp,

I updated the commit message to better clarify the issue. Would you mind 
reviewing if this revised description adequately explains the bug and the 
rationale behind the fix?

When an entity from application B is killed, drm_sched_entity_kill()
removes all jobs belonging to that entity through
drm_sched_entity_kill_jobs_work(). If application A's job depends on a
scheduled fence from application B's job, and that fence is not properly
signaled during the killing process, application A's dependency cannot be
cleared.

This leads to application A hanging indefinitely while waiting for a
dependency that will never be resolved. Fix this issue by ensuring that
scheduled fences are properly signaled when an entity is killed, allowing
dependent applications to continue execution.

Thanks,
Lin

________________________________
From: Koenig, Christian <christian.koe...@amd.com>
Sent: Thursday, May 15, 2025 20:39
To: pha...@kernel.org <pha...@kernel.org>; cao, lin <lin....@amd.com>; 
dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>; 
aamd-...@lists.freedesktop.org <aamd-...@lists.freedesktop.org>
Cc: Chang, HaiJun <haijun.ch...@amd.com>; Yin, ZhenGuo (Chris) 
<zhenguo....@amd.com>; Danilo Krummrich <d...@kernel.org>
Subject: Re: [PATCH] drm/scheduler: signal scheduled fence when kill job

On 5/15/25 11:05, Philipp Stanner wrote:
> On Thu, 2025-05-15 at 10:48 +0200, Christian König wrote:
>> Explicitly adding the scheduler maintainers.
>>
>> On 5/15/25 04:07, Lin.Cao wrote:
>>> Previously we only signaled finished fence which may cause some
>>> submission's dependency cannot be cleared the cause benchmark hang.
>>> Signal both scheduled fence and finished fence could fix this
>>> issue.
>
> Code seems legit to me; but be so kind and also pimp up the commit
> message a bit, Christian. It's not very clear what the bug is and why
> setting the parent to NULL solves it. Or is the issue simply that the
> fence might be dropped unsignaled, being a bug by definition? Needs to
> be written down.

The later, we simply forgot to signal the scheduled fence when an application 
was killed.

> Grammar is also a bit too broken.
>
> And running the unit tests before pushing is probably also a good idea.

And maybe even writing a new unit test for that.

Regards,
Christian.

>
>>>
>>> Signed-off-by: Lin.Cao <linca...@amd.com>
>
> Acked-by: Philipp Stanner <pha...@kernel.org>
>
>>
>> Reviewed-by: Christian König <christian.koe...@amd.com>
>>
>> Danilo & Philipp can we quickly get an rb for that? I'm volunteering
>> to push it to drm-misc-fixes and add the necessary stable tags since
>> this is a fix for a rather ugly bug.
>>
>> Regards,
>> Christian.
>>
>>
>>> ---
>>>  drivers/gpu/drm/scheduler/sched_entity.c | 1 +
>>>  1 file changed, 1 insertion(+)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>> index bd39db7bb240..e671aa241720 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>> @@ -176,6 +176,7 @@ static void
>>> drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
>>>  {
>>>     struct drm_sched_job *job = container_of(wrk,
>>> typeof(*job), work);
>>>
>>> +   drm_sched_fence_scheduled(job->s_fence, NULL);
>>>     drm_sched_fence_finished(job->s_fence, -ESRCH);
>>>     WARN_ON(job->s_fence->parent);
>>>     job->sched->ops->free_job(job);
>>
>

Reply via email to