Hi Ashley,

On Fri, 23 May 2025 at 16:10, Ashley Smith <ashley.sm...@collabora.com> wrote:
> The timeout logic provided by drm_sched leads to races when we try
> to suspend it while the drm_sched workqueue queues more jobs. Let's
> overhaul the timeout handling in panthor to have our own delayed work
> that's resumed/suspended when a group is resumed/suspended. When an
> actual timeout occurs, we call drm_sched_fault() to report it
> through drm_sched, still. But otherwise, the drm_sched timeout is
> disabled (set to MAX_SCHEDULE_TIMEOUT), which leaves us in control of
> how we protect modifications on the timer.
>
> One issue seems to be when we call drm_sched_suspend_timeout() from
> both queue_run_job() and tick_work() which could lead to races due to
> drm_sched_suspend_timeout() not having a lock. Another issue seems to
> be in queue_run_job() if the group is not scheduled, we suspend the
> timeout again which undoes what drm_sched_job_begin() did when calling
> drm_sched_start_timeout(). So the timeout does not reset when a job
> is finished.
>
> Co-developed-by: Boris Brezillon <boris.brezil...@collabora.com>
> Signed-off-by: Boris Brezillon <boris.brezil...@collabora.com>
> Tested-by: Daniel Stone <dani...@collabora.com>
> Fixes: de8548813824 ("drm/panthor: Add the scheduler logical block")

Unfortunately I have to revoke my T-b as we're seeing a pile of
failures in a CI stress test with this, e.g.
https://gitlab.freedesktop.org/daniels/mesa/-/jobs/77004047

Cheers,
Daniel