Hi Kevin,

On Tue, May 29, 2012 at 03:08:17PM -0400, Kevin M Lange wrote:
> I've been monitoring our service availability check (http head of a 
> resource that truly provides availability status of the application).  
> Under normal circumstances, the check takes 2-3 seconds.  We found 
> periods of time where the application would take 15+seconds and fail (I 
> did not capture HTTP code, but I'm pretty sure it was a 500 series from 
> what I've been looking through).  These failure periods match the times 
> where haproxy was indicating timeouts of 1002ms.  So, it looks like 
> haproxy is doing its job.  Is this then a bug in the logging of the 
> timeout value (reporting 1002ms vs 15000+ms)?

This is the strange part, as I could not reproduce this report on
my test platform. Would you be willing to send me privately a network
capture of a series of checks that were mis-reported? Depending on
how the traffic is segmented and aborted, I might get a clue about what
is happening.

> We haven't had any problems since 25 May, but we're keeping watch.

It reminds me of the old days of the early Opterons, where the clocks
were not synchronized between cores and time would jump back and forth,
causing early timeouts and wrong timer reports. The issue has come back
now that VMs are used everywhere. This led me to implement the internal
monotonic clock, which compensates for such jumps so that they cannot
exceed 1s now. But even a 1s jump does not explain 15000 -> 1002ms, so
right now I'm a bit stuck.
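
To illustrate the idea of a clock that compensates for jumps, here is a
minimal sketch in Python. It is not haproxy's actual code: the class name,
the method name and the MAX_JUMP constant are hypothetical, and it assumes
the raw clock is sampled more often than once per second, so that any
larger step can be treated as a jump rather than real elapsed time.

```python
MAX_JUMP = 1.0  # seconds; assumed bound, mirroring the 1s limit mentioned above


class CompensatedClock:
    """Turn a wall clock that may jump into a non-decreasing time source."""

    def __init__(self, first_reading):
        self.offset = 0.0          # correction added to every raw reading
        self.last = first_reading  # last compensated value handed out

    def update(self, raw):
        """Feed a raw wall-clock reading and return the compensated time."""
        adjusted = raw + self.offset
        delta = adjusted - self.last
        if delta < 0 or delta > MAX_JUMP:
            # Backward step, or a forward step too large to be plausible:
            # absorb it into the offset and report no progress this time.
            self.offset = self.last - raw
            adjusted = self.last
        self.last = adjusted
        return adjusted
```

With such a scheme a single bad reading can distort a measured duration by
at most MAX_JUMP, which is why a jump alone would not turn 15000ms into
1002ms.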

Regards,
Willy

