On Wed Feb 11, 2026 at 11:20 AM CET, Boris Brezillon wrote:
> On Wed, 11 Feb 2026 10:57:27 +0100
> "Danilo Krummrich" <[email protected]> wrote:
>
>> (Cc: Xe maintainers)
>>
>> On Tue Feb 10, 2026 at 12:40 PM CET, Alice Ryhl wrote:
>> > On Tue, Feb 10, 2026 at 11:46:44AM +0100, Christian König wrote:
>> >> On 2/10/26 11:36, Danilo Krummrich wrote:
>> >> > On Tue Feb 10, 2026 at 11:15 AM CET, Alice Ryhl wrote:
>> >> >> One way you can see this is by looking at what we require of the
>> >> >> workqueue. For all this to work, it's pretty important that we
>> >> >> never schedule anything on the workqueue that's not signalling
>> >> >> safe, since otherwise you could have a deadlock where the
>> >> >> workqueue is executing some random job calling
>> >> >> kmalloc(GFP_KERNEL) and then blocks on our fence, meaning that
>> >> >> the VM_BIND job never gets scheduled since the workqueue is
>> >> >> never freed up. Deadlock.
>> >> >
>> >> > Yes, I also pointed this out multiple times in the past in the
>> >> > context of C GPU scheduler discussions. It really depends on the
>> >> > workqueue and how it is used.
>> >> >
>> >> > In the C GPU scheduler the driver can pass its own workqueue to
>> >> > the scheduler, which means that the driver has to ensure that at
>> >> > least one out of the wq->max_active works is free for the
>> >> > scheduler to make progress on the scheduler's run and free job
>> >> > work.
>> >> >
>> >> > Or in other words, there must be no more than wq->max_active - 1
>> >> > works that execute code violating the DMA fence signalling rules.
>> >
>> > Ouch, is that really the best way to do that? Why not two workqueues?
>>
>> Most drivers making use of this re-use the same workqueue for multiple
>> GPU scheduler instances in firmware scheduling mode (i.e. 1:1
>> relationship between scheduler and entity). This is equivalent to the
>> JobQ use-case.
>>
>> Note that we will have one JobQ instance per userspace queue, so
>> sharing the workqueue between JobQ instances can make sense.
>
> Definitely, but I think that's orthogonal to allowing this common
> workqueue to be used for work items that don't comply with the
> dma-fence signalling rules, isn't it?

Yes and no. If we allow passing around shared WQs without a corresponding
type abstraction, we open the door for drivers to abuse it to schedule
their own work. I.e. sharing a workqueue between JobQs is fine, but we
have to ensure it can't be used for anything else. (Rough sketch of what
I have in mind at the end of this mail.)

>> Besides that, IIRC Xe was re-using the workqueue for something else,
>> but that doesn't seem to be the case anymore. I can only find [1],
>> which seems more like some custom GPU scheduler extension [2] to me...
>
> Yep, I think it can be the problematic case. It doesn't mean we can't
> schedule work items that don't signal fences, but I think it'd be
> simpler if we were forcing those to follow the same rules (no blocking
> alloc, no locks taken that are also taken in other paths where blocking
> allocs happen, etc) regardless of this wq->max_active value.
>
>> [1] https://elixir.bootlin.com/linux/v6.18.6/source/drivers/gpu/drm/xe/xe_gpu_scheduler.c#L40
>> [2] https://elixir.bootlin.com/linux/v6.18.6/source/drivers/gpu/drm/xe/xe_gpu_scheduler_types.h#L28
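
Roughly what I have in mind, as a plain Rust sketch. All the names
(SchedWq, JobQueue, RawWq) are made up, and std::sync::Arc is just a
userspace stand-in so the example compiles on its own; it is not meant
to mirror the actual kernel workqueue API. The point is only that the
raw workqueue never leaves the jobq module, so a driver can clone the
handle and feed it to multiple JobQ instances, but can't queue its own
work items on it:

mod jobq {
    use std::sync::Arc;

    /// Stand-in for the real workqueue type. Private to this module.
    struct RawWq;

    impl RawWq {
        fn queue_work(&self, work: impl FnOnce() + Send + 'static) {
            // A real implementation would queue `work` as a work item;
            // here it just runs inline to keep the sketch self-contained.
            work();
        }
    }

    /// Shareable handle to a scheduler workqueue. The inner `RawWq` is
    /// not reachable from outside this module, so a driver can clone the
    /// handle and pass it to multiple JobQueues, but cannot queue its
    /// own work on it.
    #[derive(Clone)]
    pub struct SchedWq(Arc<RawWq>);

    impl SchedWq {
        pub fn new() -> Self {
            SchedWq(Arc::new(RawWq))
        }
    }

    pub struct JobQueue {
        wq: SchedWq,
    }

    impl JobQueue {
        pub fn new(wq: SchedWq) -> Self {
            JobQueue { wq }
        }

        /// Only JobQueue-internal (fence signalling safe) work ever
        /// reaches the underlying workqueue.
        pub fn submit(&self) {
            self.wq.0.queue_work(|| { /* run/free job work */ });
        }
    }
}

fn main() {
    let wq = jobq::SchedWq::new();
    let q0 = jobq::JobQueue::new(wq.clone());
    let q1 = jobq::JobQueue::new(wq);
    q0.submit();
    q1.submit();
    // There is no way for the caller to queue arbitrary work items on
    // the shared workqueue, since RawWq never leaks out of the jobq
    // module.
}

If only JobQ-internal work can ever land on the queue, the
wq->max_active - 1 bookkeeping from the C scheduler discussion above
should become a non-issue.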
