On Wed Feb 11, 2026 at 11:20 AM CET, Boris Brezillon wrote:
> On Wed, 11 Feb 2026 10:57:27 +0100
> "Danilo Krummrich" <[email protected]> wrote:
>
>> (Cc: Xe maintainers)
>> 
>> On Tue Feb 10, 2026 at 12:40 PM CET, Alice Ryhl wrote:
>> > On Tue, Feb 10, 2026 at 11:46:44AM +0100, Christian König wrote:  
>> >> On 2/10/26 11:36, Danilo Krummrich wrote:  
>> >> > On Tue Feb 10, 2026 at 11:15 AM CET, Alice Ryhl wrote:  
>> >> >> One way you can see this is by looking at what we require of the
>> >> >> workqueue. For all this to work, it's pretty important that we never
>> >> >> schedule anything on the workqueue that's not signalling safe, since
>> >> >> otherwise you could have a deadlock where the workqueue executes some
>> >> >> random job calling kmalloc(GFP_KERNEL) and then blocks on our fence,
>> >> >> meaning that the VM_BIND job never gets scheduled since the workqueue
>> >> >> is never freed up. Deadlock.  
>> >> > 
>> >> > Yes, I also pointed this out multiple times in the past in the context
>> >> > of C GPU scheduler discussions. It really depends on the workqueue and
>> >> > how it is used.
>> >> > 
>> >> > In the C GPU scheduler the driver can pass its own workqueue to the
>> >> > scheduler, which means that the driver has to ensure that at least one
>> >> > out of the wq->max_active works is free for the scheduler to make
>> >> > progress on the scheduler's run and free job work.
>> >> > 
>> >> > Or in other words, there must be no more than wq->max_active - 1 works
>> >> > that execute code violating the DMA fence signalling rules.
>> >
>> > Ouch, is that really the best way to do that? Why not two workqueues?  
>> 
>> Most drivers making use of this re-use the same workqueue for multiple GPU
>> scheduler instances in firmware scheduling mode (i.e. 1:1 relationship
>> between scheduler and entity). This is equivalent to the JobQ use-case.
>> 
>> Note that we will have one JobQ instance per userspace queue, so sharing the
>> workqueue between JobQ instances can make sense.
>
> Definitely, but I think that's orthogonal to allowing this common
> workqueue to be used for work items that don't comply with the
> dma-fence signalling rules, isn't it?

Yes and no. If we allow passing around shared WQs without a corresponding type
abstraction, we open the door for drivers to abuse it to schedule their own
work.

I.e. sharing a workqueue between JobQs is fine, but we have to ensure it can't
be used for anything else.
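
To make that a bit more concrete, here is a rough userspace sketch of what such
a type abstraction could look like. Everything below is made up for
illustration (SignallingSafeWq, JobQ::submit, etc.) and is not the actual kernel
workqueue bindings; the point is just that the enqueue path stays private to the
abstraction, so a driver holding a reference to the shared queue can't push its
own work items onto it.

use std::sync::{mpsc, Arc, Mutex};
use std::thread;

type Work = Box<dyn FnOnce() + Send + 'static>;

/// A shared workqueue that may only carry fence-signalling-safe work.
/// The enqueue path is private to this module, so drivers can create and
/// share the queue, but cannot put arbitrary work items on it.
pub struct SignallingSafeWq {
    tx: Mutex<mpsc::Sender<Work>>,
}

impl SignallingSafeWq {
    pub fn new() -> Arc<Self> {
        let (tx, rx) = mpsc::channel::<Work>();
        // Single worker thread standing in for the kernel workqueue.
        thread::spawn(move || {
            for work in rx {
                work();
            }
        });
        Arc::new(Self { tx: Mutex::new(tx) })
    }

    // Not public: only JobQ (same module) can reach this.
    fn enqueue(&self, work: Work) {
        let _ = self.tx.lock().unwrap().send(work);
    }
}

/// Stand-in for the JobQ abstraction, the only type allowed to queue work
/// on a SignallingSafeWq.
pub struct JobQ {
    wq: Arc<SignallingSafeWq>,
}

impl JobQ {
    pub fn new(wq: Arc<SignallingSafeWq>) -> Self {
        Self { wq }
    }

    /// run-job style path; the closure must follow the signalling rules.
    pub fn submit(&self, job: impl FnOnce() + Send + 'static) {
        self.wq.enqueue(Box::new(job));
    }
}

Multiple JobQ instances can share one such queue, but nothing else can queue
work on it, which is the property we need for the dma-fence signalling rules.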

>> Besides that, IIRC Xe was re-using the workqueue for something else, but that
>> doesn't seem to be the case anymore. I can only find [1], which seems more
>> like some custom GPU scheduler extension [2] to me...
>
> Yep, I think it can be the problematic case. It doesn't mean we can't
> schedule work items that don't signal fences, but I think it'd be
> simpler if we were forcing those to follow the same rules (no blocking
> alloc, no locks taken that are also taken in other paths where blocking
> allocs happen, etc) regardless of this wq->max_active value.
>
>> 
>> [1] 
>> https://elixir.bootlin.com/linux/v6.18.6/source/drivers/gpu/drm/xe/xe_gpu_scheduler.c#L40
>> [2] 
>> https://elixir.bootlin.com/linux/v6.18.6/source/drivers/gpu/drm/xe/xe_gpu_scheduler_types.h#L28
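
For reference, here is a minimal userspace model of the deadlock Alice describes
further up, just to make the wq->max_active argument concrete. Nothing below is
kernel API: a single worker thread stands in for a workqueue with
max_active == 1 and a plain channel stands in for the dma_fence.

use std::sync::mpsc;
use std::thread;
use std::time::Duration;

type Work = Box<dyn FnOnce() + Send>;

fn main() {
    // "Workqueue" with max_active == 1: one worker draining a queue of work items.
    let (wq_tx, wq_rx) = mpsc::channel::<Work>();
    thread::spawn(move || {
        for work in wq_rx {
            work();
        }
    });

    // "Fence": only signalled by a later work item on the same queue.
    let (fence_tx, fence_rx) = mpsc::channel::<()>();

    // Work item A: not signalling safe, e.g. think of it blocking in a
    // GFP_KERNEL allocation whose reclaim waits on the fence.
    wq_tx
        .send(Box::new(move || {
            println!("A: waiting for the fence");
            let _ = fence_rx.recv(); // never returns, B never gets to run
        }))
        .unwrap();

    // Work item B: the run-job path that would signal the fence. It never
    // executes because A occupies the only active slot.
    wq_tx
        .send(Box::new(move || {
            let _ = fence_tx.send(());
            println!("B: fence signalled");
        }))
        .unwrap();

    thread::sleep(Duration::from_secs(1));
    println!("deadlock: B never ran, so the fence A waits on is never signalled");
}

Reserving one slot (i.e. allowing at most max_active - 1 non-compliant works)
is exactly what keeps B runnable; with no free slot it never executes and the
fence is never signalled.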
