fence: give some reasonable maximum signaling timeout

Lucas Stach Wed, 26 Nov 2025 07:59:57 -0800

On Wednesday, 26.11.2025 at 16:44 +0100, Philipp Stanner wrote:
> On Wed, 2025-11-26 at 16:03 +0100, Christian König wrote:
> > 
> > 
> > On 11/26/25 13:37, Philipp Stanner wrote:
> > > On Wed, 2025-11-26 at 13:31 +0100, Christian König wrote:
> > > > 
> 
> […]
> 
> > > > Well the question is how do you detect *reliably* that there is
> > > > still forward progress?
> > > 
> > > My understanding is that that's impossible since the internals of
> > > command submissions are only really understood by userspace, who
> > > submits them.
> > 
> > Right, but we can still try to do our best in the kernel to mitigate
> > the situation.
> > 
> > I think for now amdgpu will implement something like checking if the
> > HW still makes progress after a timeout but only a limited number of
> > re-tries until we say that's it and reset anyway.
> 
> Oh oh, isn't that our dear hang_limit? :)


Not really. The hang limit is the limit on how many times a hanging
submit might be retried.

Limiting the number of timeout extensions is more of a safety net
against workloads which might appear to make progress to the kernel
driver but in reality are stuck. After all, the kernel driver can only
have limited knowledge of the GPU state and any progress check will
have limited precision with false positives/negatives being a part of
reality we have to deal with.
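
To illustrate the distinction, here is a minimal sketch of such a cap on
timeout extensions. It is not code from any driver: the struct, the enum
and handle_timeout() are hypothetical names, and the actual limit would be
a driver policy decision.

```c
#include <stdbool.h>

/* Hypothetical outcome of a timeout handler invocation. */
enum timeout_action { TIMEOUT_EXTEND, TIMEOUT_RESET };

/* Hypothetical per-job bookkeeping; not taken from any real driver. */
struct timeout_state {
	unsigned int extensions;     /* extensions granted so far */
	unsigned int max_extensions; /* assumed cap, a policy choice */
};

/*
 * Extend the timeout only while the progress check is positive AND the
 * cap has not been reached. A job that merely looks like it is making
 * progress therefore still ends up in reset after max_extensions rounds.
 */
static enum timeout_action handle_timeout(struct timeout_state *st,
					  bool progress_seen)
{
	if (progress_seen && st->extensions < st->max_extensions) {
		st->extensions++;
		return TIMEOUT_EXTEND;
	}
	return TIMEOUT_RESET;
}
```

Note that this counter is per running job and independent of any hang
limit, which would count how often a job is resubmitted after a reset.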

> 
> We agree that you can never really know whether userspace just submitted
> a while(true) job, don't we? Even if some GPU register still indicates
> "progress".

Yeah, this really depends on the hardware and on what you can read at
runtime.

For etnaviv we define "progress" as the command frontend moving towards
the end of the command buffer. As a single draw call in valid workloads
can blow through our timeout, we also use debug registers to look at the
current primitive ID within a draw call.
If userspace submits a workload that requires more than 500ms per
primitive to finish, we consider this an invalid workload and go through
the reset/recovery motions.
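
As a rough sketch of that check (not the actual etnaviv code: the struct,
its field names and made_progress() are made up for illustration, and it
assumes one sample is taken each time the timeout fires):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of GPU state, sampled when the timeout fires. */
struct progress_sample {
	uint32_t fe_addr;      /* command frontend position in the cmdbuf */
	uint32_t primitive_id; /* current primitive within the draw call */
};

/*
 * Returns true if the GPU moved since the previous sample, in which case
 * the timeout would be extended. If neither value changed over a whole
 * timeout period, the job is treated as hung and goes to reset/recovery.
 */
static bool made_progress(const struct progress_sample *prev,
			  const struct progress_sample *cur)
{
	/* Frontend advanced to a different command. */
	if (cur->fe_addr != prev->fe_addr)
		return true;

	/* Same draw call, but a later primitive is being processed. */
	return cur->primitive_id != prev->primitive_id;
}
```

With a sampling period of 500ms, a primitive that takes longer than that
shows up as two identical samples, which is the "invalid workload" case.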

Regards,
Lucas

Re: [PATCH 1/4] dma-buf/fence: give some reasonable maximum signaling timeout
