On Tue, Feb 12, 2013 at 12:10:00PM +0100, Finn Arne Gangstad wrote:
> New haproxy running, may take some hours before it happens again. I'll
> see if I can get it nailed down to the second it happens. In the
> meantime, I did some digging into the CPU logs, the last incident
> happened at 00:02:00 +- 15 seconds, here is the log output from
> haproxy around that time (excluding all valid results).
> 
> The names are edited, backend/s3 is the relevant server that has the
> file descriptor that epoll returns.

Thank you for the details.

I think the scenario looks approximately like this:

  1) server is up
  2) a request is sent to the server
  3) check task starts an async connect()
  4) the request from 2) causes an error, which forces the server down
     because of "observe layer7".
  5) connect() from 3) succeeds and wakes the check task up
  6) the check task sees the server is in failed state and aborts the check
  7) the connection is released but the FD is never closed
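
To make steps 5) to 7) concrete, here is a minimal sketch in C of the
kind of bug I suspect. It is not HAProxy's actual code: the struct and
function names (server, check, check_start, check_wakeup_leaky,
check_wakeup_fixed, fd_is_open) are all hypothetical, and a plain
AF_UNIX socket() call stands in for the check's async connect().

```c
#include <fcntl.h>
#include <stdbool.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical, simplified stand-ins; not HAProxy's real structures. */
struct server { bool failed; };
struct check  { int fd; struct server *srv; };

/* Returns true if fd still refers to an open descriptor in this process. */
bool fd_is_open(int fd)
{
    return fcntl(fd, F_GETFD) != -1;
}

/* Step 3: the check task obtains a descriptor. No real connect() is
 * issued here; the socket only models the FD owned by the check. */
int check_start(struct check *c, struct server *s)
{
    c->fd = socket(AF_UNIX, SOCK_STREAM, 0);
    c->srv = s;
    return c->fd < 0 ? -1 : 0;
}

/* Steps 5-7 as suspected: the check wakes up, sees the server has
 * already failed, and drops its reference without calling close(). */
void check_wakeup_leaky(struct check *c)
{
    if (c->srv->failed) {
        c->fd = -1;          /* reference lost, descriptor still open */
        return;
    }
    /* ... normal check processing would go here ... */
}

/* Same abort path, but the descriptor is closed before being forgotten. */
void check_wakeup_fixed(struct check *c)
{
    if (c->srv->failed) {
        if (c->fd >= 0)
            close(c->fd);
        c->fd = -1;
        return;
    }
    /* ... normal check processing would go here ... */
}
```

With the leaky variant the descriptor stays open after the abort, so
the kernel can keep reporting events on it through epoll even though
nothing in the process tracks it any more. The fixed variant only
shows the general shape of a fix, assuming the scenario is confirmed.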

I'm still auditing the code to ensure the scenario above is realistic.
If I can confirm this behaviour, I'll send you a patch.

Regards,
Willy

