On 11/25/25 09:13, Philipp Stanner wrote:
> On Tue, 2025-11-25 at 09:03 +0100, Christian König wrote:
>> On 11/25/25 08:55, Philipp Stanner wrote:
>>>>
>>>> +/**
>>>> + * define DMA_FENCE_MAX_REASONABLE_TIMEOUT - max reasonable signaling timeout
>>>> + *
>>>> + * The dma_fence object has a deep inter dependency with core memory
>>>> + * management, for a detailed explanation see section DMA Fences under
>>>> + * Documentation/driver-api/dma-buf.rst.
>>>> + *
>>>> + * Because of this all dma_fence implementations must guarantee that each
>>>> + * fence completes in a finite time. This define here now gives a reasonable
>>>> + * value for the timeout to use. It is possible to use a longer timeout in an
>>>> + * implementation but that should taint the kernel.
>>>> + */
>>>> +#define DMA_FENCE_MAX_REASONABLE_TIMEOUT (2*HZ)
>>>
>>> HZ can change depending on the config. Is that really a good choice? I
>>> could see racy situations arising in some configs vs others
>>
>> 2*HZ is always two seconds expressed in number of jiffies, I can use
>> msecs_to_jiffies(2000) to make that more obvious.
>
> On AMD64 maybe. What about the other architectures?

HZ is defined as jiffies per second, so even if it changes to 10, 100 or
1000 depending on the architecture, 2*HZ is always two seconds expressed
in jiffies. The HZ define is actually there to make it architecture
independent.
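To make that explicit, the define could just as well be spelled with the
standard conversion helper, as I mentioned above (just a sketch of that
alternative, same value as 2*HZ):

    #include <linux/jiffies.h>

    /* Two seconds expressed in jiffies, independent of CONFIG_HZ. */
    #define DMA_FENCE_MAX_REASONABLE_TIMEOUT msecs_to_jiffies(2000)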
>>
>> The GPU scheduler has a very similar define, MAX_WAIT_SCHED_ENTITY_Q_EMPTY
>> which is currently just 1 second.
>>
>> The real question is what is the maximum amount of time we can wait for the
>> HW before we should trigger a timeout?
>
> That's a question only the drivers can answer, which is why I like to
> think that setting global constants constraining all parties is not the
> right thing to do.

Exactly, that's the reason why I bring this up. I think the idea that
drivers should be in charge of timeouts is the wrong approach.

See, the reason why we have the timeout (and documented that it is a must
have) is that both core memory management and desktop responsiveness
depend on it.
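To make that concrete, here is a sketch (not part of the patch, and the
helper name and fence variable are made up) of a core path waiting on a
fence with the capped timeout; this bound is exactly what memory
management relies on:

    #include <linux/dma-fence.h>

    /*
     * Sketch only: dma_fence_wait_timeout() returns 0 when the timeout
     * expires without the fence signaling, a negative value on error,
     * and the remaining jiffies otherwise.
     */
    static bool wait_for_fence_capped(struct dma_fence *fence)
    {
            signed long ret;

            ret = dma_fence_wait_timeout(fence, false,
                                         DMA_FENCE_MAX_REASONABLE_TIMEOUT);
            return ret > 0; /* 0 == timed out, < 0 == error */
    }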
> What is even your motivation? What problem does this solve? Is the OOM
> killer currently hanging for anyone? Can you link a bug report?

I'm not sure if we have an external bug report (we have an internal one),
but for amdgpu there were customer complaints that 10 seconds is too
long. So we changed it to 2 seconds for amdgpu, and now there are
complaints from internal AMD teams that 2 seconds is too short.

While working on that I realized that the timeout is actually not driver
dependent at all. What can maybe be argued is that a desktop system
should have a shorter timeout than some server, but that one driver needs
a different timeout than another driver doesn't really make sense to me.
I mean, what is actually HW dependent about the requirement that I need a
responsive desktop system?

>>
>> Some AMD internal team is pushing for 10 seconds, but that also means that
>> for example we wait 10 seconds for the OOM killer to do something. That
>> sounds like way too long.
>>
>
> Nouveau has timeout = 10 seconds. AFAIK we've never seen bugs because
> of that. Have you seen some?

Thanks for that info. And to answer the question, yes certainly.

Regards,
Christian.

>
>
> P.