On Wed, 11 Feb 2026 12:00:30 +0100
"Danilo Krummrich" <[email protected]> wrote:

> On Wed Feb 11, 2026 at 11:20 AM CET, Boris Brezillon wrote:
> > On Wed, 11 Feb 2026 10:57:27 +0100
> > "Danilo Krummrich" <[email protected]> wrote:
> >  
> >> (Cc: Xe maintainers)
> >> 
> >> On Tue Feb 10, 2026 at 12:40 PM CET, Alice Ryhl wrote:  
> >> > On Tue, Feb 10, 2026 at 11:46:44AM +0100, Christian König wrote:    
> >> >> On 2/10/26 11:36, Danilo Krummrich wrote:    
> >> >> > On Tue Feb 10, 2026 at 11:15 AM CET, Alice Ryhl wrote:    
> >> >> >> One way you can see this is by looking at what we require of the
> >> >> >> workqueue. For all this to work, it's pretty important that we never
> >> >> >> schedule anything on the workqueue that's not signalling safe, since
> >> >> >> otherwise you could have a deadlock where the workqueue executes some
> >> >> >> random job calling kmalloc(GFP_KERNEL) and then blocks on our fence,
> >> >> >> meaning that the VM_BIND job never gets scheduled since the workqueue
> >> >> >> is never freed up. Deadlock.    
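
FWIW, a quick userspace analogy of that deadlock, just to make its shape
concrete (plain std Rust, obviously not kernel code; the single worker
thread stands in for a workqueue whose only free execution slot is taken
by a non signalling-safe job):

use std::sync::mpsc;
use std::thread;
use std::time::Duration;

fn main() {
    // One worker thread == workqueue with a single free execution slot.
    let (wq_tx, wq_rx) = mpsc::channel::<Box<dyn FnOnce() + Send>>();
    thread::spawn(move || {
        for work in wq_rx {
            work();
        }
    });

    // The "fence" that the VM_BIND job would signal.
    let (fence_tx, fence_rx) = mpsc::channel::<()>();

    // Work 1: some random, non signalling-safe job; the blocking recv
    // stands in for kmalloc(GFP_KERNEL) ending up waiting on the fence.
    wq_tx.send(Box::new(move || {
        match fence_rx.recv_timeout(Duration::from_secs(1)) {
            Ok(()) => println!("fence signalled"),
            Err(_) => println!("deadlock: fence never signalled"),
        }
    })).unwrap();

    // Work 2: the VM_BIND job that would signal the fence. It never runs
    // in time, because the only execution slot is blocked in work 1.
    wq_tx.send(Box::new(move || {
        let _ = fence_tx.send(());
    })).unwrap();

    thread::sleep(Duration::from_secs(2));
}
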
> >> >> > 
> >> >> > Yes, I also pointed this out multiple times in the past in the
> >> >> > context of C GPU scheduler discussions. It really depends on the
> >> >> > workqueue and how it is used.
> >> >> > 
> >> >> > In the C GPU scheduler the driver can pass its own workqueue to the
> >> >> > scheduler, which means that the driver has to ensure that at least one
> >> >> > out of the wq->max_active works is free for the scheduler to make
> >> >> > progress on the scheduler's run and free job work.
> >> >> > 
> >> >> > Or in other words, there must be no more than wq->max_active - 1 works
> >> >> > that execute code violating the DMA fence signalling rules.
> >> >
> >> > Ouch, is that really the best way to do that? Why not two workqueues?    
> >> 
> >> Most drivers making use of this re-use the same workqueue for multiple GPU
> >> scheduler instances in firmware scheduling mode (i.e. 1:1 relationship
> >> between scheduler and entity). This is equivalent to the JobQ use-case.
> >> 
> >> Note that we will have one JobQ instance per userspace queue, so sharing
> >> the workqueue between JobQ instances can make sense.
> >
> > Definitely, but I think that's orthogonal to allowing this common
> > workqueue to be used for work items that don't comply with the
> > dma-fence signalling rules, isn't it?  
> 
> Yes and no. If we allow passing around shared WQs without a corresponding type
> abstraction we open the door for drivers to abuse it to schedule their own
> work.
> 
> I.e. sharing a workqueue between JobQs is fine, but we have to ensure they
> can't be used for anything else.

Totally agree with that, and that's where I was going with this special
DmaFenceWorkqueue wrapper/abstraction, which would only accept
MaySignalDmaFencesWorkItem objects for scheduling.
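
Roughly something along these lines (just a hand-written sketch of the
type-level restriction; Queue and WorkItem below are placeholders for
whatever the underlying workqueue abstraction ends up being, not the
current kernel::workqueue bindings):

/// Placeholder for the underlying workqueue abstraction.
pub struct Queue { /* ... */ }

impl Queue {
    pub fn enqueue<W: WorkItem>(&self, work: W) {
        let _ = work; // hand-off to the actual workqueue elided
    }
}

/// Placeholder for the generic work item trait.
pub trait WorkItem {
    fn run(self);
}

/// Marker trait for work items whose run() only executes code that is
/// safe on the DMA fence signalling path: no blocking (GFP_KERNEL)
/// allocations, no locks shared with paths doing such allocations, etc.
///
/// Unsafe to implement: the implementer promises to uphold the above.
pub unsafe trait MaySignalDmaFencesWorkItem: WorkItem {}

/// Workqueue wrapper that only accepts signalling-safe work items, so a
/// queue shared between JobQ instances can't be (ab)used for arbitrary
/// driver work.
pub struct DmaFenceWorkqueue {
    queue: Queue,
}

impl DmaFenceWorkqueue {
    pub fn enqueue<W: MaySignalDmaFencesWorkItem>(&self, work: W) {
        self.queue.enqueue(work)
    }
}

A work item that doesn't implement the marker trait then simply doesn't
compile when enqueued on a DmaFenceWorkqueue, which should cover the
"can't be used for anything else" part.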

> 
> >> Besides that, IIRC Xe was re-using the workqueue for something else, but
> >> that doesn't seem to be the case anymore. I can only find [1], which seems
> >> more like some custom GPU scheduler extension [2] to me...
> >
> > Yep, I think that can be the problematic case. It doesn't mean we can't
> > schedule work items that don't signal fences, but I think it'd be
> > simpler if we forced those to follow the same rules (no blocking
> > allocs, no locks that are also taken in other paths where blocking
> > allocs happen, etc) regardless of this wq->max_active value.
> >  
> >> 
> >> [1] 
> >> https://elixir.bootlin.com/linux/v6.18.6/source/drivers/gpu/drm/xe/xe_gpu_scheduler.c#L40
> >> [2] 
> >> https://elixir.bootlin.com/linux/v6.18.6/source/drivers/gpu/drm/xe/xe_gpu_scheduler_types.h#L28
> >>   
> 
