Hi, On 4/23/2025 4:23 PM, Jeff Hugo wrote: > On 4/23/2025 1:23 AM, Jacek Lawrynowicz wrote: >> Hi, >> >> On 4/18/2025 5:27 PM, Jeffrey Hugo wrote: >>> On 4/16/2025 4:25 AM, Maciej Falkowski wrote: >>>> From: Karol Wachowski <karol.wachow...@intel.com> >>>> >>>> Introduce a heartbeat-based Timeout Detection and Recovery (TDR) mechanism. >>>> The enhancement aims to improve the reliability of device hang detection by >>>> monitoring heartbeat updates. >>>> >>>> Each progressing inference will update heartbeat counter allowing driver to >>>> monitor its progression. Limit maximum number of reschedules when heartbeat >>>> indicates progression to 30. >>> >>> Code looks good. However, why 30? This would artificially limit how long >>> a job could run, no? >> >> Yes, we still need a time based limit. There may be workloads that are stuck >> in infinite loop for example. >> With this patch the max time the job can run is extended from 2 to 60 >> seconds. >> We are not aware of any workloads that exceed this timeout at the moment. > > Infinite loop vs something that just happens to be running long by design is > a difficult problem. 60 seconds does not seem all that long to me. Perhaps > consider some kind of override so that if/when a workload comes along that > needs more than 60 seconds, the user doesn't need to recompile their kernel > to make it work? I suspect that would be outside the scope of this change. > > For this change, I think it would be good to add some info from your > response. I think the commit text would be improved by stating this > increases the max runtime from 2 seconds to 60, and that this covers all > known workloads. Also, I think a comment on PM_TDR_HEARTBEAT_LIMIT that > tells how long the limit is (60 seconds) would be helpful to future readers, > instead of needing to parse through multiple functions and how they all > interact. > > With the commit text update - > > Reviewed-by: Jeff Hugo <jeff.h...@oss.qualcomm.com> > > The PM_TDR_HEARTBEAT_LIMIT comment is optional to me.
OK, I will update the commit message. There is a module param that can be used to increase the overall timeout (tdr_timeout_ms). The total timeout is (tdr_timeout_ms * 30) but this is not too intuitive after this change. I will figure better solution that's easier to use. Regards, Jacek