On Wed, 11 Feb 2026 12:00:30 +0100 "Danilo Krummrich" <[email protected]> wrote:
> On Wed Feb 11, 2026 at 11:20 AM CET, Boris Brezillon wrote:
> > On Wed, 11 Feb 2026 10:57:27 +0100 "Danilo Krummrich" <[email protected]> wrote:
> > 
> >> (Cc: Xe maintainers)
> >> 
> >> On Tue Feb 10, 2026 at 12:40 PM CET, Alice Ryhl wrote:
> >> > On Tue, Feb 10, 2026 at 11:46:44AM +0100, Christian König wrote:
> >> >> On 2/10/26 11:36, Danilo Krummrich wrote:
> >> >> > On Tue Feb 10, 2026 at 11:15 AM CET, Alice Ryhl wrote:
> >> >> >> One way you can see this is by looking at what we require of the
> >> >> >> workqueue. For all this to work, it's pretty important that we never
> >> >> >> schedule anything on the workqueue that's not signalling safe, since
> >> >> >> otherwise you could have a deadlock where the workqueue executes some
> >> >> >> random job calling kmalloc(GFP_KERNEL) and then blocks on our fence,
> >> >> >> meaning that the VM_BIND job never gets scheduled since the workqueue
> >> >> >> is never freed up. Deadlock.
> >> >> > 
> >> >> > Yes, I also pointed this out multiple times in the past in the
> >> >> > context of C GPU scheduler discussions. It really depends on the
> >> >> > workqueue and how it is used.
> >> >> > 
> >> >> > In the C GPU scheduler the driver can pass its own workqueue to the
> >> >> > scheduler, which means that the driver has to ensure that at least
> >> >> > one out of the wq->max_active works is free for the scheduler to
> >> >> > make progress on the scheduler's run and free job work.
> >> >> > 
> >> >> > Or in other words, there must be no more than wq->max_active - 1
> >> >> > works that execute code violating the DMA fence signalling rules.
> >> > 
> >> > Ouch, is that really the best way to do that? Why not two workqueues?
> >> 
> >> Most drivers making use of this re-use the same workqueue for multiple
> >> GPU scheduler instances in firmware scheduling mode (i.e. 1:1
> >> relationship between scheduler and entity). This is equivalent to the
> >> JobQ use-case.
> >> 
> >> Note that we will have one JobQ instance per userspace queue, so
> >> sharing the workqueue between JobQ instances can make sense.
> > 
> > Definitely, but I think that's orthogonal to allowing this common
> > workqueue to be used for work items that don't comply with the
> > dma-fence signalling rules, isn't it?
> 
> Yes and no. If we allow passing around shared WQs without a corresponding
> type abstraction, we open the door for drivers to abuse it to schedule
> their own work.
> 
> I.e. sharing a workqueue between JobQs is fine, but we have to ensure
> they can't be used for anything else.

Totally agree with that, and that's where I was going with this special
DmaFenceWorkqueue wrapper/abstraction, which would only accept scheduling
MaySignalDmaFencesWorkItem objects (rough sketch at the end of this mail).

> 
> >> Besides that, IIRC Xe was re-using the workqueue for something else,
> >> but that doesn't seem to be the case anymore. I can only find [1],
> >> which more seems like some custom GPU scheduler extension [2] to me...
> > 
> > Yep, I think it can be the problematic case. It doesn't mean we can't
> > schedule work items that don't signal fences, but I think it'd be
> > simpler if we were forcing those to follow the same rules (no blocking
> > alloc, no locks taken that are also taken in other paths where blocking
> > allocs happen, etc) regardless of this wq->max_active value.
> 
> >> 
> >> [1] https://elixir.bootlin.com/linux/v6.18.6/source/drivers/gpu/drm/xe/xe_gpu_scheduler.c#L40
> >> [2] https://elixir.bootlin.com/linux/v6.18.6/source/drivers/gpu/drm/xe/xe_gpu_scheduler_types.h#L28
> >> 
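To illustrate what I mean by that wrapper, here's a very rough plain-Rust
sketch. All names below (DmaFenceWorkqueue, MaySignalDmaFencesWorkItem,
RunJob, ...) are made up for this mail; this is not the actual kernel
workqueue abstraction, just a model of the type-level restriction I have
in mind:

use std::sync::Mutex;

/// Marker trait: implementers promise that run() is DMA fence signalling
/// safe, i.e. it never blocks on anything that may itself wait on a fence
/// (no GFP_KERNEL-style blocking allocations, no locks shared with paths
/// that do such allocations, ...).
trait MaySignalDmaFencesWorkItem: Send + 'static {
    fn run(self: Box<Self>);
}

/// Wrapper around a (possibly shared) workqueue. The only way to queue
/// work is enqueue(), which is bounded by the marker trait, so a driver
/// can't sneak arbitrary, non signalling-safe work onto the queue.
struct DmaFenceWorkqueue {
    pending: Mutex<Vec<Box<dyn MaySignalDmaFencesWorkItem>>>,
}

impl DmaFenceWorkqueue {
    fn new() -> Self {
        Self { pending: Mutex::new(Vec::new()) }
    }

    /// Only signalling-safe work items are accepted here.
    fn enqueue<W: MaySignalDmaFencesWorkItem>(&self, item: W) {
        self.pending.lock().unwrap().push(Box::new(item));
    }

    /// Stand-in for the worker draining the queue.
    fn run_pending(&self) {
        let items: Vec<_> = std::mem::take(&mut *self.pending.lock().unwrap());
        for item in items {
            item.run();
        }
    }
}

/// Example "run job" work item: submits to the ring and signals the HW
/// fence, without any blocking allocation.
struct RunJob {
    id: u64,
}

impl MaySignalDmaFencesWorkItem for RunJob {
    fn run(self: Box<Self>) {
        println!("running job {}", self.id);
    }
}

fn main() {
    let wq = DmaFenceWorkqueue::new();
    wq.enqueue(RunJob { id: 1 });
    // wq.enqueue(SomeRandomDriverWork { .. });  // rejected at compile time
    //                                           // unless it implements the
    //                                           // marker trait.
    wq.run_pending();
}

The point being that the signalling-safety contract lives in the type
system: sharing the underlying workqueue between JobQs stays possible,
but nothing else can be scheduled on it, so the wq->max_active accounting
problem can't be re-introduced by accident.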
