On Wed, 11 Feb 2026 10:57:27 +0100
"Danilo Krummrich" <[email protected]> wrote:

> (Cc: Xe maintainers)
> 
> On Tue Feb 10, 2026 at 12:40 PM CET, Alice Ryhl wrote:
> > On Tue, Feb 10, 2026 at 11:46:44AM +0100, Christian König wrote:  
> >> On 2/10/26 11:36, Danilo Krummrich wrote:  
> >> > On Tue Feb 10, 2026 at 11:15 AM CET, Alice Ryhl wrote:  
> >> >> One way you can see this is by looking at what we require of the
> >> >> workqueue. For all this to work, it's pretty important that we never
> >> >> schedule anything on the workqueue that's not signalling safe, since
> >> >> otherwise you could have a deadlock where the workqueue executes some
> >> >> random job calling kmalloc(GFP_KERNEL) and then blocks on our fence,
> >> >> meaning that the VM_BIND job never gets scheduled since the workqueue
> >> >> is never freed up. Deadlock.  
> >> > 
> >> > Yes, I also pointed this out multiple times in the past in the
> >> > context of C GPU scheduler discussions. It really depends on the
> >> > workqueue and how it is used.
> >> > 
> >> > In the C GPU scheduler the driver can pass its own workqueue to the
> >> > scheduler, which means that the driver has to ensure that at least
> >> > one out of the wq->max_active works is free for the scheduler to
> >> > make progress on the scheduler's run and free job work.
> >> > 
> >> > Or in other words, there must be no more than wq->max_active - 1
> >> > works that execute code violating the DMA fence signalling rules.
> >
> > Ouch, is that really the best way to do that? Why not two workqueues?  
> 
> Most drivers making use of this re-use the same workqueue for multiple GPU
> scheduler instances in firmware scheduling mode (i.e. 1:1 relationship between
> scheduler and entity). This is equivalent to the JobQ use-case.
> 
> Note that we will have one JobQ instance per userspace queue, so sharing the
> workqueue between JobQ instances can make sense.

Definitely, but that's orthogonal to whether this common workqueue can
also be used for work items that don't comply with the dma-fence
signalling rules, isn't it?

> 
> Besides that, IIRC Xe was re-using the workqueue for something else, but
> that doesn't seem to be the case anymore. I can only find [1], which
> seems more like some custom GPU scheduler extension [2] to me...

Yep, I think that could be a problematic case. It doesn't mean we can't
schedule work items that don't signal fences, but I think it'd be
simpler if we forced those to follow the same rules (no blocking
allocations, no locks that are also taken in other paths where blocking
allocations happen, etc.) regardless of the wq->max_active value.

> 
> [1] https://elixir.bootlin.com/linux/v6.18.6/source/drivers/gpu/drm/xe/xe_gpu_scheduler.c#L40
> [2] https://elixir.bootlin.com/linux/v6.18.6/source/drivers/gpu/drm/xe/xe_gpu_scheduler_types.h#L28
