Re: [PATCH v4] drm/panthor: Make the timeout per-queue instead of per-job

Boris Brezillon Mon, 26 May 2025 00:25:49 -0700

On Sat, 24 May 2025 16:03:37 +0100
Daniel Stone <dan...@fooishbar.org> wrote:


> Hi Ashley,
> 
> On Fri, 23 May 2025 at 16:10, Ashley Smith <ashley.sm...@collabora.com> wrote:
> > The timeout logic provided by drm_sched leads to races when we try
> > to suspend it while the drm_sched workqueue queues more jobs. Let's
> > overhaul the timeout handling in panthor to have our own delayed work
> > that's resumed/suspended when a group is resumed/suspended. When an
> > actual timeout occurs, we call drm_sched_fault() to report it
> > through drm_sched, still. But otherwise, the drm_sched timeout is
> > disabled (set to MAX_SCHEDULE_TIMEOUT), which leaves us in control of
> > how we protect modifications on the timer.
> >
> > One issue seems to be when we call drm_sched_suspend_timeout() from
> > both queue_run_job() and tick_work() which could lead to races due to
> > drm_sched_suspend_timeout() not having a lock. Another issue seems to
> > be in queue_run_job() if the group is not scheduled, we suspend the
> > timeout again which undoes what drm_sched_job_begin() did when calling
> > drm_sched_start_timeout(). So the timeout does not reset when a job
> > is finished.
> >
> > Co-developed-by: Boris Brezillon <boris.brezil...@collabora.com>
> > Signed-off-by: Boris Brezillon <boris.brezil...@collabora.com>
> > Tested-by: Daniel Stone <dani...@collabora.com>
> > Fixes: de8548813824 ("drm/panthor: Add the scheduler logical block")  
> 
> Unfortunately I have to revoke my T-b as we're seeing a pile of
> failures in a CI stress test with this, e.g.
> https://gitlab.freedesktop.org/daniels/mesa/-/jobs/77004047

Note that you need [1] too, which I don't see in your tree. Ashley, a
note for next time: when you have dependencies between patches, as is
the case here, it's usually better to post them in the same patchset,
so that:

1. They are applied in the right order.
2. Cherry-pickers/reviewers know that they need to consider both to
   have a working branch.

Regards,

Boris

[1] https://lkml.org/lkml/2025/5/15/742
