On Sat, 24 May 2025 16:03:37 +0100 Daniel Stone <dan...@fooishbar.org> wrote:
> Hi Ashley,
>
> On Fri, 23 May 2025 at 16:10, Ashley Smith <ashley.sm...@collabora.com> wrote:
> > The timeout logic provided by drm_sched leads to races when we try
> > to suspend it while the drm_sched workqueue queues more jobs. Let's
> > overhaul the timeout handling in panthor to have our own delayed work
> > that's resumed/suspended when a group is resumed/suspended. When an
> > actual timeout occurs, we call drm_sched_fault() to report it
> > through drm_sched, still. But otherwise, the drm_sched timeout is
> > disabled (set to MAX_SCHEDULE_TIMEOUT), which leaves us in control of
> > how we protect modifications on the timer.
> >
> > One issue seems to be when we call drm_sched_suspend_timeout() from
> > both queue_run_job() and tick_work() which could lead to races due to
> > drm_sched_suspend_timeout() not having a lock. Another issue seems to
> > be in queue_run_job() if the group is not scheduled, we suspend the
> > timeout again which undoes what drm_sched_job_begin() did when calling
> > drm_sched_start_timeout(). So the timeout does not reset when a job
> > is finished.
> >
> > Co-developed-by: Boris Brezillon <boris.brezil...@collabora.com>
> > Signed-off-by: Boris Brezillon <boris.brezil...@collabora.com>
> > Tested-by: Daniel Stone <dani...@collabora.com>
> > Fixes: de8548813824 ("drm/panthor: Add the scheduler logical block")
>
> Unfortunately I have to revoke my T-b as we're seeing a pile of
> failures in a CI stress test with this, e.g.
> https://gitlab.freedesktop.org/daniels/mesa/-/jobs/77004047

Note that you need [1] too, which I don't see in your tree.

Ashley, a note for next time: when you have dependencies between
patches, as is the case here, it's usually better to post them in the
same patchset, so that:

1. They are applied in the right order.
2. Cherry-pickers/reviewers know that they need to consider both to
   have a working branch.

Regards,

Boris

[1] https://lkml.org/lkml/2025/5/15/742