On Sun, Mar 15, 2026 at 09:05:20PM -0700, Matthew Brost wrote:
> On Thu, Mar 05, 2026 at 10:47:32AM +0100, Philipp Stanner wrote:
>
Obviously this was intended as a private communication — I hit the wrong
button. I apologize to anyone I offended here.

Matt

> Off the list... I don't think airing our personal attacks publicly is
> a good look. I'm going to be blunt here in an effort to help you.
>
> > On Thu, 2026-03-05 at 01:10 -0800, Matthew Brost wrote:
> > > On Thu, Mar 05, 2026 at 09:38:16AM +0100, Philipp Stanner wrote:
> > > > On Thu, 2026-03-05 at 09:27 +0100, Boris Brezillon wrote:
> > > > >
> > > > > […]
> > > > >
> > > > > Honestly, I'm not thrilled by this fast-path/call-run_job-directly
> > > > > idea you're describing. There are just so many things we could
> > > > > forget that would lead to races/ordering issues that end up being
> > > > > hard to trigger and debug.
> > > >
> > > > +1
> > > >
> > > > I'm not thrilled either. More like the opposite of thrilled,
> > > > actually.
> > > >
> > > > Even if we could get that to work, this is more of a
> > > > maintainability issue.
> > > >
> > > > The scheduler is full of insane performance hacks for this or that
> > > > driver. Lockless accesses, a special lockless queue only used by
> > > > that one party in the kernel (a lockless queue which is nowadays,
> > > > after N reworks, being used with a lock. Ah well).
> > >
> > > This is not relevant to this discussion — see below. In general, I
> > > agree that the lockless tricks in the scheduler are not great, nor
> > > is the fact that the scheduler became a dumping ground for
> > > driver-specific features. But again, that is not what we're talking
> > > about here — see below.
> > >
> > > > In past discussions, Danilo and I made it clear that major new
> > > > features in _new_ patch series aimed at getting merged into
> > > > drm/sched must be preceded by cleanup work to address some of the
> > > > scheduler's major problems.
> > >
> > > Ah, we've moved to dictatorship quickly. Noted.
> >
> > I prefer the term "benevolent presidency" /s
> >
> > Or even better: s/dictatorship/accountability enforcement.
>
> It's very hard to take this seriously when I reply to threads saying
> something breaks dma-fence rules and the response is, "what are
> dma-fence rules?" Or I read through the jobqueue thread and see you
> asking why a dma-fence would come from anywhere other than your own
> driver — that's the entire point of dma-fence; it's a cross-driver
> contract. I could go on, but I'd encourage you to take a hard look at
> your understanding of DRM, and at whether your responses — to me and
> to others — are backed by the necessary technical knowledge.
>
> Even better — what first annoyed me was your XDC presentation. You
> gave an example of my driver modifying the pending list without a lock
> while scheduling was stopped, and claimed you fixed a bug. That was
> not a bug - Xe would explode if it were, as we test our code. The
> pending list can be modified without a lock if scheduling is stopped.
> I almost grabbed the mic to correct you. Yes, it's a layering
> violation, but presenting it as a bug shows a clear lack of
> understanding.
>
> > How is it that everyone is here and ready so quickly when it
>
> I've suggested ideas to fix DRM sched (refcounting, clear teardown
> flows), but they were immediately met with resistance — typically from
> Christian, with you agreeing. My willingness to fight with Christian
> is low; I really don't need another person to argue with.
>
> > comes to new use cases and features, yet I never saw anyone except
> > for Tvrtko and Maíra investing even 15 minutes to write a simple
> > patch to address some of the *various* significant issues in that
> > code base?
> >
> > You were on CC on all the discussions we've had here over the last
> > few years, afair, but I rarely saw you participate. And you know
> > what it's like:
>
> I'll admit I'm busy with many other things, so my bandwidth is
> limited.
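The pending-list claim above, that the list may be modified without its lock while scheduling is stopped, can be sketched in miniature. This is a hypothetical userspace model, not drm/sched code; `mini_sched`, `mini_job`, and every function name here are invented for illustration of the pattern only:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature of the pattern under discussion: while the
 * scheduler is stopped (worker parked), the caller is the sole owner
 * of pending_list and may walk/modify it without taking the lock. */

struct mini_job {
	struct mini_job *next;
	int id;
};

struct mini_sched {
	atomic_flag lock;	/* protects pending_list while running */
	bool stopped;		/* worker parked: single-owner access  */
	struct mini_job *pending_list;
};

static void mini_lock(struct mini_sched *s)
{
	while (atomic_flag_test_and_set(&s->lock))
		;		/* spin; a real scheduler uses a spinlock */
}

static void mini_unlock(struct mini_sched *s)
{
	atomic_flag_clear(&s->lock);
}

void mini_sched_init(struct mini_sched *s)
{
	atomic_flag_clear(&s->lock);
	s->stopped = false;
	s->pending_list = NULL;
}

/* Normal path: lock required because the worker may be running. */
void mini_sched_add_job(struct mini_sched *s, struct mini_job *job)
{
	mini_lock(s);
	job->next = s->pending_list;
	s->pending_list = job;
	mini_unlock(s);
}

/* Loosely analogous to stopping the scheduler: after this, nothing
 * else touches pending_list until it is restarted. */
void mini_sched_stop(struct mini_sched *s)
{
	mini_lock(s);
	s->stopped = true;
	mini_unlock(s);
}

/* Lockless removal, legal only while stopped; returns jobs removed. */
int mini_sched_prune_while_stopped(struct mini_sched *s, int id)
{
	struct mini_job **pp = &s->pending_list;
	int removed = 0;

	assert(s->stopped);	/* the whole point: only valid when parked */
	while (*pp) {
		if ((*pp)->id == id) {
			*pp = (*pp)->next;
			removed++;
		} else {
			pp = &(*pp)->next;
		}
	}
	return removed;
}
```

Whether bypassing the lock this way is a bug or merely a layering violation is exactly what the two sides above disagree on; the sketch only shows why it is not a data race when the worker is provably parked.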
> But again, if I chime in and explain how I solved something in Xe
> (e.g., refcounting) and it's met with resistance, I'll likely move
> on — I've already solved it, and I'll just let you fail (see
> cancel_job).
>
> > who doesn't speak up silently agrees in open source.
> >
> > But tell me one thing, if you can be so kind:
>
> I'm glad you asked this, and it inspired me to fix it; more below [1].
>
> > What is your theory as to why drm/sched came to be in such horrible
> > shape?
>
> drm/sched was ported from AMDGPU into common code. It carried many
> AMDGPU-specific hacks, had no object-lifetime model thought out for a
> common component, and included teardown nightmares that "worked" but
> that other drivers immediately had to work around. With Christian
> involved — who is notoriously hostile — everyone did their best to
> paper over issues driver-side rather than get into fights and fix
> things properly. Asahi Linux publicly aired grievances about this
> situation years ago.
>
> > What circumstances, what human behavioral patterns have caused this?
>
> See above.
>
> > The DRM subsystem has a bad reputation regarding stability among
> > Linux users, as far as I have sensed. How can we do better?
>
> Write sane code and test it. FWIW, Google shared a doc with me
> indicating that Xe has unprecedented stability, and to be honest, when
> I first wrote Xe I barely knew what I was doing — but I did know how
> to test. I've since cleaned up most of my mistakes, though.
>
> So how can we do better? We can [1].
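The refcounting idea mentioned above (a kref-style object-lifetime model, where the last put frees the object instead of a teardown path racing against a worker) can be sketched as follows. This is an illustrative userspace model with invented names (`sched_obj_get`/`sched_obj_put`), not the actual Xe or drm/sched implementation:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical sketch of a kref-style lifetime model: every scheduler
 * object carries a refcount, and only the final put runs release. */

struct sched_obj {
	atomic_int refcount;
	bool *freed_flag;	/* test hook: records that release ran */
};

static void sched_obj_release(struct sched_obj *obj)
{
	if (obj->freed_flag)
		*obj->freed_flag = true;
	free(obj);
}

struct sched_obj *sched_obj_create(bool *freed_flag)
{
	struct sched_obj *obj = calloc(1, sizeof(*obj));

	atomic_init(&obj->refcount, 1);	/* caller holds the first ref */
	obj->freed_flag = freed_flag;
	return obj;
}

void sched_obj_get(struct sched_obj *obj)
{
	atomic_fetch_add_explicit(&obj->refcount, 1, memory_order_relaxed);
}

/* Returns true when this put dropped the final reference. */
bool sched_obj_put(struct sched_obj *obj)
{
	if (atomic_fetch_sub_explicit(&obj->refcount, 1,
				      memory_order_acq_rel) == 1) {
		sched_obj_release(obj);
		return true;
	}
	return false;
}
```

The design point is that teardown stops being a special case: an in-flight job simply holds a reference, so the entity or scheduler object cannot disappear underneath it, whichever side finishes last.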
> I started on [1] after you asked what the problems in DRM sched are,
> which got me thinking about what it would look like if we took the
> good parts (stop/start control plane, dependency tracking, ordering,
> finished fences, etc.), dropped the bad parts (no object-lifetime
> model, no refcounting, overly complex queue teardown, messy fence
> manipulation, hardware-scheduling baggage, lack of annotations, etc.),
> and wrote something that addresses all of these problems from the
> start, specifically for firmware-scheduling models.
>
> It turned out pretty well.
>
> Main patch [2].
>
> Xe is fully converted, tested, and working. amdxdna and Panthor are
> compiling. Nouveau and PVR seem like good candidates to convert as
> well. Rust bindings are also possible given the clear object model
> with refcounting and well-defined object lifetimes.
>
> Thinking further, hardware schedulers should be implementable on top
> of this by embedding the objects in [2] and layering a backend/API on
> top.
>
> Let me know if you have any feedback (off-list) before I share this
> publicly. So far, Dave, Sima, Danilo, and the other Xe maintainers
> have been looped in.
>
> Matt
>
> [1] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svn-perf-6-15-2025/-/tree/local_dev/new_scheduler.post?ref_type=heads
> [2] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svn-perf-6-15-2025/-/commit/0538a3bc2a3b562dc0427a5922958189e0be8271
>
> > >
> > > I can't say I agree with either of you here.
> > >
> > > In about an hour, I seemingly have a bypass path working in DRM
> > > sched + Xe, and my diff is:
> > >
> > > 108 insertions(+), 31 deletions(-)
> >
> > LOC is a bad metric for complexity.
> >
> > > About 40 lines of the insertions are kernel-doc, so I'm not buying
> > > that this is a maintenance issue or a major feature - it is
> > > literally a single new function.
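For reference, the bypass/fast-path idea being debated is roughly this: run the job inline only when nothing is queued ahead of it and all dependencies have signaled, otherwise fall back to the normal queue. A hypothetical sketch with invented names (`submit_job`, `entity`, etc.), and with none of the real locking or dma-fence handling the objections above are about:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a submission fast path: names and structure
 * are illustrative, not the drm/sched API. */

struct job {
	int unresolved_deps;	/* fences not yet signaled */
	bool ran_inline;	/* test hook: which path executed */
};

struct entity {
	int queued;		/* jobs already waiting in the queue */
};

static void run_job(struct job *job)
{
	job->ran_inline = true;	/* stand-in for the backend run_job() */
}

static void queue_job(struct entity *e, struct job *job)
{
	(void)job;
	e->queued++;		/* worker will pick it up later */
}

/* Returns true if the job was executed on the fast path. */
bool submit_job(struct entity *e, struct job *job)
{
	/* Ordering: inline execution is only safe when nothing is
	 * queued ahead of this job and all dependencies resolved;
	 * otherwise we would run jobs out of order. */
	if (e->queued == 0 && job->unresolved_deps == 0) {
		run_job(job);
		return true;
	}
	queue_job(e, job);
	return false;
}
```

The two-line condition is the entire source of the disagreement: Boris and Philipp argue that everything the condition must really cover (fences, stopped state, TDR, pending-list bookkeeping) is easy to get subtly wrong, while Matthew argues it is one well-documented function a driver can opt out of.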
> > > I understand a bypass path can create issues—for example, on
> > > certain queues in Xe I definitely can't use the bypass path, so Xe
> > > simply wouldn't use it in those cases. It is the driver's choice
> > > whether to use it or not. If a driver doesn't know how to use the
> > > scheduler, well, that's on the driver. Providing a simple,
> > > documented function as a fast path really isn't some crazy idea.
> >
> > We're effectively talking about a deviation from the default
> > submission mechanism, and all of that seems to be desired for a
> > luxury feature.
> >
> > You then end up with two submission mechanisms, whose future
> > correctness relies on someone remembering what the background was,
> > why it was added, and what the rules are.
> >
> > The current scheduler's rules are / were often not even documented,
> > and sometimes even Christian took a few weeks to remember why
> > something had been added – and whether it can now be removed again
> > or not.
> >
> > > The alternative—asking for RT workqueues or changing the design to
> > > use kthread_worker—actually is.
> > >
> > > > That's especially true if it's features aimed at performance
> > > > buffs.
> > >
> > > With the above mindset, I'm actually very confused why this series
> > > [1] would even be considered, as it is an order of magnitude
> > > greater in complexity than my suggestion here.
> > >
> > > Matt
> > >
> > > [1] https://patchwork.freedesktop.org/series/159025/
> >
> > The discussions about Tvrtko's CFS series were precisely the point
> > where Danilo brought up that, once it is merged, future rework of
> > the scheduler must focus on addressing some of the pending
> > fundamental issues.
> >
> > The background is that Tvrtko has already worked on that series for
> > well over a year, it actually simplifies some things in the sense of
> > removing unused code (obviously it's a complex series, no argument
> > about that), and we agreed at XDC that it can be merged.
> > So this is a question of fairness to the contributor.
> >
> > But at some point you have to finally draw a line. No one will ever
> > address the major scheduler issues unless we demand it. Even very
> > experienced devs usually prefer to hack around the central design
> > issues in their drivers instead of fixing the shared infrastructure.
> >
> >
> > P.
