Hi,

On 4/23/2025 4:23 PM, Jeff Hugo wrote:
> On 4/23/2025 1:23 AM, Jacek Lawrynowicz wrote:
>> Hi,
>>
>> On 4/18/2025 5:27 PM, Jeffrey Hugo wrote:
>>> On 4/16/2025 4:25 AM, Maciej Falkowski wrote:
>>>> From: Karol Wachowski <karol.wachow...@intel.com>
>>>>
>>>> Introduce a heartbeat-based Timeout Detection and Recovery (TDR) mechanism.
>>>> The enhancement aims to improve the reliability of device hang detection by
>>>> monitoring heartbeat updates.
>>>>
>>>> Each progressing inference will update heartbeat counter allowing driver to
>>>> monitor its progression. Limit maximum number of reschedules when heartbeat
>>>> indicates progression to 30.
>>>
>>> Code looks good.  However, why 30?  This would artificially limit how long 
>>> a job could run, no?
>>
>> Yes, we still need a time based limit. There may be workloads that are stuck 
>> in infinite loop for example.
>> With this patch the max time the job can run is extended from 2 to 60 
>> seconds.
>> We are not aware of any workloads that exceed this timeout at the moment.
> 
> Infinite loop vs something that just happens to be running long by design is 
> a difficult problem.  60 seconds does not seem all that long to me.  Perhaps 
> consider some kind of override so that if/when a workload comes along that 
> needs more than 60 seconds, the user doesn't need to recompile their kernel 
> to make it work?  I suspect that would be outside the scope of this change.
> 
> For this change, I think it would be good to add some info from your 
> response.  I think the commit text would be improved by stating this 
> increases the max runtime from 2 seconds to 60, and that this covers all 
> known workloads.  Also, I think a comment on PM_TDR_HEARTBEAT_LIMIT that 
> tells how long the limit is (60 seconds) would be helpful to future readers, 
> instead of needing to parse through multiple functions and how they all 
> interact.
> 
> With the commit text update -
> 
> Reviewed-by: Jeff Hugo <jeff.h...@oss.qualcomm.com>
> 
> The PM_TDR_HEARTBEAT_LIMIT comment is optional to me.

OK, I will update the commit message.

There is a module param that can be used to increase the overall timeout 
(tdr_timeout_ms).
The total timeout is (tdr_timeout_ms * 30) but this is not too intuitive after 
this change.
I will figure better solution that's easier to use.


Regards,
Jacek

Reply via email to