On 2/17/26 15:28, Philipp Stanner wrote:
> On Tue, 2026-02-17 at 15:22 +0100, Christian König wrote:
>> On 2/17/26 15:09, Alice Ryhl wrote:
>>> On Tue, Feb 17, 2026 at 3:04 PM Philipp Stanner <[email protected]> wrote:
>>>>>>>
>>>>>>>
> 
> […]
> 
>>>>>>> Thinking more about it you should probably enforce that there is only
>>>>>>> one signaling path for each fence signaling.
>>>>>>
>>>>>> I'm not really convinced by this.
>>>>>>
>>>>>> First, the timeout path must be a fence signalling path because the
>>>>>> reason you have a timeout in the first place is because the hw might
>>>>>> never signal the fence. So if the timeout path deadlocks on a
>>>>>> kmalloc(GFP_KERNEL) and the hw never comes around to wake you up, boom.
>>>>>
>>>>> Mhm, good point. On the other hand the timeout handling should probably 
>>>>> be considered part of the normal signaling path.
>>>>
>>>>
>>>> Why would anyone want to allocate in a timeout path in the first place – 
>>>> especially for jobqueue?
>>>>
>>>> Timeout -> close the associated ring. Done.
>>>> JobQueue will signal the done_fences with -ECANCELED.
>>>>
>>>> What would the driver want to allocate in its timeout path, i.e.: timeout 
>>>> callback.
>>>
>>> Maybe you need an allocation to hold the struct delayed_work_struct
>>> field that you use to enqueue the timeout?
>>
>> And the workqueue were you schedule the delayed_work on must have the 
>> reclaim bit set.
>>
>> Otherwise it can be that the workqueue finds all kthreads busy and tries to 
>> start a new one, e.g. allocating task structure......
> 
> OK, maybe I'm lost, but what delayed_work?
> 
> The jobqueue's delayed work item gets either created on JQ::new() or in
> jq.submit_job(). Why would anyone – that is: any driver – implement a
> delayed work in its timeout callback?
> 
> That doesn't make sense.
> 
> JQ notifies the driver from its delayed_work through
> timeout_callback(), and in that callback the driver closes the
> associated firmware ring.
> 
> And it drops the JQ. So it is gone. A new JQ will get a new timeout
> work item.
> 
> That's basically all the driver must ever do. Maybe some logging and
> stuff.
> 
> With firmware scheduling it should really be that simple.
> 
> And signalling / notifying userspace gets done by jobqueue.
> 
> Right?

Correct, I just wanted to point that jobqueue needs to keep the workqueue rules 
in mind as well.

But you really need to double check what drivers are doing. We had more than 
one kmalloc() added because we had a warning of to many variables on the stack 
in DAL/DC...

This turns into a debugging nightmare if you need to re-init the display during 
timeout handling.

>>
>> You also potentially want device core dumps. Those usually use GFP_NOWAIT so 
>> that they can't cycle back and wait for some fence. The down side is that 
>> they can trivially fail under even light memory pressure.
> 
> Simply logging into dmesg should do the trick, shouldn't it?

Nope, not even remotely. A device core dump can easily be hundreds of megabyte 
in size.

In other words it's the HW state you usually attach to a crash report to figure 
out what's going on.

Christian.

> 
> 
> P.

Reply via email to