Hi Mark,
On Mon, Mar 06, 2017 at 02:49:28PM -0500, Mark S wrote:
> As for the timing issue, I can add to the discussion with a few related data
> points. In short, system uptime does not seem to be a commonality to my
> situation.
thanks!
> 1) I had this issue affect 6 servers, spread across 5 data centers (only 2
> servers are in the same facility.) All servers stopped processing requests
> at roughly the same moment, certainly within the same minute. All servers
> running FreeBSD 11.0-RELEASE-p2 with HAProxy compiled locally against
> OpenSSL-1.0.2k
OK.
> 2) System uptime was not at all similar across these servers, although
> chances are most servers HAProxy process start time would be similar. The
> servers with the highest system uptime were at about 27 days at the time of
> the incident, while the shortest were under a day or two.
OK, so that means haproxy could have hung within a day or two, which makes
your case much more common than some of the other reports. If your front LB
distributes requests fairly between the 6 servers, this could be related to
a total number of requests or connections or something like this.
> 3) HAProxy configurations are similar, but not exactly consistent between
> servers - different IPs on the frontend, different ACLs and backends.
OK.
> 4) The only synchronized application common to all of these servers is
> OpenNTPd.
Is there any risk that ntpd causes time jumps into the future or the past
for whatever reason? Maybe there's something with kqueue and time jumps in
recent versions?
> 5) I have since upgraded to HAProxy-1.7.3, same build process: the full
> version output is below - and will of course report any observed issues.
>
> haproxy -vv
> HA-Proxy version 1.7.3 2017/02/28
(...)
Everything there looks pretty standard. If it dies again, it would be good
to try "nokqueue" in the global section (or to start haproxy with -dk) to
disable kqueue and switch to poll. It will eat a bit more CPU, so don't
do this on all nodes at once.
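For reference, a minimal sketch of the global section with kqueue disabled
("nokqueue" is the documented keyword; the comment line stands for whatever
settings you already have there):

```
global
    nokqueue
    # ... keep your existing global settings (maxconn, user, etc.)
```

The -dk command-line flag achieves the same thing at startup without
touching the configuration file.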
I'm thinking about a few other things:
 - if you're doing a lot of SSL, we could imagine an issue with random
   generation using /dev/random instead of /dev/urandom. I met this
   issue a long time ago on some apache servers where all the entropy
   was progressively consumed until it was no longer possible to
   establish a connection.
 - it could be useful to run "netstat -an" on a dead node before killing
   haproxy and to archive the output for later analysis. It may reveal
   that all file descriptors were used by CLOSE_WAIT connections
   (indicating a close bug in haproxy) or something like this. If
   instead you see a lot of FIN_WAIT1 or FIN_WAIT2, it may indicate an
   issue with some external firewall or pf blocking some final traffic
   and leading to socket space exhaustion.
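On the entropy point above: FreeBSD's /dev/random normally stops blocking
once seeded at boot, so the check below mostly applies to Linux boxes like
the apache servers I mentioned, where the kernel exposes its entropy
estimate through procfs:

```shell
# Linux-only sketch: print the kernel's current entropy estimate.
# Readers of /dev/random may block when this value gets close to zero.
cat /proc/sys/kernel/random/entropy_avail
```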
If you have the same issue that was reported, with kevent() being called
in loops and returning an error, you may well see tons of CLOSE_WAIT
sockets, and that would indicate an issue with this poller, though I have
no idea what exactly, especially since this code doesn't change often and
*seems* to work with previous versions.
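To make that netstat check concrete, here's a quick sketch of how I'd
summarize the connection states; the sample lines are fabricated so the
snippet is self-contained (on a real node you'd feed it the archived
"netstat -an" output instead):

```shell
# Fabricated sample of "netstat -an" output so the pipeline can run as-is
printf '%s\n' \
  'tcp4  0  0  10.0.0.1.443  10.0.0.2.50001  CLOSE_WAIT' \
  'tcp4  0  0  10.0.0.1.443  10.0.0.2.50002  CLOSE_WAIT' \
  'tcp4  0  0  10.0.0.1.443  10.0.0.2.50003  FIN_WAIT_2' > netstat.out

# Count connections per TCP state (state is the last column of tcp lines)
awk '/^tcp/ {states[$NF]++} END {for (s in states) print s, states[s]}' netstat.out | sort
# -> CLOSE_WAIT 2
#    FIN_WAIT_2 1
```

A large CLOSE_WAIT count here would point at the close bug scenario above.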
Best regards,
Willy