> On Aug 24, 2018, at 12:05 PM, Eric Covener wrote:
>
> On Fri, Aug 24, 2018 at 11:57 AM Christophe JAILLET wrote:
>>
>> On 24/08/2018 at 16:40, Jim Jagielski wrote:
>>> I was wondering if someone wanted to provide a sanity check
>>> on the above PR and what's "expected" by the health check code.
>>>
>>> It would be very easy to adjust so that hcinterval was not
>>> the time between successive checks but the interval between
>>> the end of one and the start of another, but I'm not sure that
>>> is as useful. In other words, I think the current behavior
>>> is right (but think the docs need to be updated), but am
>>> willing to have my mind changed :)
>>>
>> Hi Jim,
>>
>> the current behavior is also what I would expect.
>> If I configure a check every 10s, I would expect 6 checks each minute,
>> even if the test itself takes time to perform.
>
>
> Bug describes something else IIUC. Because the watchdog calls us 10
> times per second, it continuously sees that the worker hasn't been
> health checked within the desired interval and queues up a check; it
> doesn't know one is already queued.
But that is only an issue, afaict, if the time taken to do the health check is greater than the interval chosen... Or am I misunderstanding? That is, if the interval is 200ms and the health check takes 100ms, all is fine: we get 5 checks a second.

I guess what we could do is emit a warning if, when a check is queued, we already have one queued or in process. This would give some info to the sysadmin. We could also track the time taken to perform a check and make that available via mod_status as well. But these all assume that the underlying logic, and how it's implemented, is sane.
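A toy model of the reasoning above, in Python for brevity (mod_proxy_hcheck itself is C). It is a sketch under stated assumptions, not how the module is actually implemented: it assumes the watchdog ticks every 100ms, that the per-worker timestamp is stamped when a check is queued (start-to-start semantics), and that a queued check starts immediately. `WorkerHC` and `simulate` are hypothetical names, not httpd APIs. The "warning" is the proposal from this mail: flag a check that is queued while another is still in process, and remember how long the last check took.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WorkerHC:
    """Hypothetical per-worker health-check state (not httpd's real struct)."""
    interval_ms: int
    last_start_ms: Optional[int] = None  # when the most recent check was queued
    in_flight_until_ms: int = 0          # when the last running check will finish
    checks_started: int = 0
    overlaps: int = 0                    # checks queued while one was still running
    last_duration_ms: int = 0            # what could be exposed via mod_status
    warnings: List[str] = field(default_factory=list)


def simulate(interval_ms: int, check_ms: int,
             run_ms: int = 1000, tick_ms: int = 100) -> WorkerHC:
    """Run the assumed watchdog loop for run_ms with a fixed check duration."""
    w = WorkerHC(interval_ms=interval_ms)
    for now in range(0, run_ms, tick_ms):
        # Watchdog tick: is this worker due for a check (start-to-start)?
        due = w.last_start_ms is None or now - w.last_start_ms >= interval_ms
        if not due:
            continue
        if now < w.in_flight_until_ms:
            # Proposed: tell the sysadmin a check is already in process.
            w.overlaps += 1
            w.warnings.append(
                f"t={now}ms: check queued while previous one still in process")
        w.last_start_ms = now
        w.in_flight_until_ms = max(w.in_flight_until_ms, now + check_ms)
        w.last_duration_ms = check_ms
        w.checks_started += 1
    return w


fast = simulate(interval_ms=200, check_ms=100)
slow = simulate(interval_ms=200, check_ms=300)
print(fast.checks_started, fast.overlaps)
print(slow.checks_started, slow.overlaps)
```

Under these assumptions, a 200ms interval with a 100ms check starts 5 checks in one second with no overlaps. Raising the check time to 300ms still starts 5 checks, but 4 of them begin while the previous one is still running, which is exactly where the proposed warning would fire.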
