On 11/25/25 11:56, Philipp Stanner wrote:
>>>>
>>>> The GPU scheduler has a very similar define, MAX_WAIT_SCHED_ENTITY_Q_EMPTY 
>>>> which is currently just 1 second.
>>>>
>>>> The real question is what is the maximum amount of time we can wait for 
>>>> the HW before we should trigger a timeout?
>>>
>>> That's a question only the drivers can answer, which is why I like to
>>> think that setting global constants constraining all parties is not the
>>> right thing to do.
>>
>> Exactly, that's the reason why I bring this up. I think that letting drivers 
>> be in charge of timeouts is the wrong approach.
>>
>> See, the reason why we have the timeout (and documented that it is a must 
>> have) is that both core memory management and desktop responsiveness depend 
>> on it.
> 
> Good and well, but then patch 4 becomes even more problematic:
> 
> So we'd just have drivers fire warnings, and then they would still have
> the freedom to set timeouts for drm/sched, as long as those timeouts
> are smaller than your new global constant.
> 
> Why then not remove drm/sched's timeout parameter API completely and
> always use your maximum value internally in drm/sched? Or maybe
> truncate it with a warning?

I have considered that as well, but then thought that we should at least give 
end users the possibility to override the timeout while still tainting the 
kernel so that we know about this in bug reports, core dumps etc...

> "Maximum timeout parameter exceeded, truncating to %ld.\n"
> 
> I suppose some drivers want even higher responsiveness than those 2
> seconds.

As far as I know, some medical use cases for example have timeouts of around 
100-200ms. But again, that is determined by the use case and not by the driver.

> I do believe that more of the driver folks should be made aware of this
> intended change.

I have no real intention of actually pushing those patches, at least not as 
they are. I just wanted to kick off some discussion.

>>
>>> What is even your motivation? What problem does this solve? Is the OOM
>>> killer currently hanging for anyone? Can you link a bug report?
>>
>> I'm not sure if we have an external bug report (we have an internal one), 
>> but for amdgpu there were customer complaints that 10 seconds is too long.
>>
>> So we changed it to 2 seconds for amdgpu, and now there are complaints from 
>> internal AMD teams that 2 seconds is too short.
>>
>> While working on that I realized that the timeout is actually not driver 
>> dependent at all.
>>
>> What can maybe be argued is that a desktop system should have a shorter 
>> timeout than some server, but that one driver needs a different timeout than 
>> another doesn't really make sense to me.
>>
>> I mean, what is actually HW dependent about the requirement that I need a 
>> responsive desktop system?
> 
> I suppose some drivers are indeed only used for server hardware. And
> for compute you might not care about responsiveness as long as your
> result drops off at some point. But there's cloud gaming, too..

Good point with the cloud gaming.

> I agree that distinguishing the use case that way is not ideal.
> However, who has the knowledge of how the hardware is being used by
> customers / users, if not the driver?

Well, the end user.

Maybe we should move the whole timeout topic into the DRM layer or the 
scheduler component.

Something like a 2 second default (which BTW is the default on Windows as well), 
which can be overridden on a global, per-device, or per-queue-name basis.

And a 10 second maximum with only a warning that a non-default timeout is used; 
everything above 10 seconds taints the kernel and should really only be used 
for testing/debugging.
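
To make that a bit more concrete, here is a rough sketch of what such a clamp 
could look like in the scheduler. The helper name and the 2s/10s constants 
below are made up for illustration, nothing like this exists in drm/sched 
today:

/*
 * Sketch only: drm_sched_timeout_pick() and the constants are hypothetical,
 * they are not existing drm/sched API.
 */
#include <linux/device.h>
#include <linux/jiffies.h>
#include <linux/kernel.h>

#define DRM_SCHED_TIMEOUT_DEFAULT	msecs_to_jiffies(2000)	/* 2s default */
#define DRM_SCHED_TIMEOUT_MAX		msecs_to_jiffies(10000)	/* 10s maximum */

/* Pick the effective job timeout (in jiffies) from what was requested. */
static long drm_sched_timeout_pick(struct device *dev, long requested)
{
	if (requested <= 0)
		return DRM_SCHED_TIMEOUT_DEFAULT;

	if (requested != DRM_SCHED_TIMEOUT_DEFAULT)
		dev_warn(dev, "non-default GPU job timeout of %u ms in use\n",
			 jiffies_to_msecs(requested));

	if (requested > DRM_SCHED_TIMEOUT_MAX) {
		/* Above the maximum is testing/debugging only: taint. */
		dev_warn(dev, "GPU job timeout above %u ms, tainting kernel\n",
			 jiffies_to_msecs(DRM_SCHED_TIMEOUT_MAX));
		add_taint(TAINT_USER, LOCKDEP_STILL_OK);
	}

	return requested;
}

The per-device/per-queue override would then only be a question of where the 
"requested" value comes from (module parameter, sysfs, debugfs, ...).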

Thoughts?

Regards,
Christian.


> 
> 
> P.
