Willy,

Per your comment on /dev/random exhaustion: I think running haveged on servers doing crypto work is, or should be, best practice.
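For reference, a quick way to check whether the entropy pool is actually being drained on a Linux box (this /proc path is Linux-only and does not exist on the FreeBSD nodes discussed below):

```shell
# Available entropy in bits. Values that stay very low (a few hundred
# or less) under TLS load suggest /dev/random consumers are starving,
# which is the situation haveged is meant to mitigate.
cat /proc/sys/kernel/random/entropy_avail
```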

jerry
On 3/6/17 12:02 PM, Willy Tarreau wrote:
Hi Mark,

On Mon, Mar 06, 2017 at 02:49:28PM -0500, Mark S wrote:
As for the timing issue, I can add a few related data points to the
discussion.  In short, system uptime does not seem to be a common factor
in my situation.
thanks!

1) I had this issue affect 6 servers, spread across 5 data centers (only 2
servers are in the same facility.)  All servers stopped processing requests
at roughly the same moment, certainly within the same minute.  All servers
running FreeBSD 11.0-RELEASE-p2 with HAProxy compiled locally against
OpenSSL-1.0.2k
OK.

2) System uptime was not at all similar across these servers, although
chances are most servers' HAProxy process start times would have been
similar.  The servers with the highest system uptime were at about 27 days
at the time of the incident, while the shortest were under a day or two.
OK, so that means haproxy could have hung in a day or two, in which case
your situation is much more common than one of the other reports. If your
front LB is fair between the 6 servers, it could be related to a total
number of requests or connections or something like that.

3) HAProxy configurations are similar, but not exactly consistent between
servers - different IPs on the frontend, different ACLs and backends.
OK.

4) The only synchronized application common to all of these servers is
OpenNTPd.
Is there any risk that ntpd causes time jumps into the future or the
past for whatever reason? Maybe there's something with kqueue and
time jumps in recent versions?

5) I have since upgraded to HAProxy-1.7.3 with the same build process; the
full version output is below. I will of course report any observed issues.

haproxy -vv
HA-Proxy version 1.7.3 2017/02/28
(...)

Everything there looks pretty standard. If it dies again, it could be good
to try with "nokqueue" in the global section (or to start haproxy with -dk)
to disable kqueue and switch to poll. It will eat a bit more CPU, so don't
do this on all nodes at once.
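As a sketch, the suggestion above would look like this in the configuration
(the file path in the comment is illustrative, not from this thread):

```
global
    nokqueue    # disable the kqueue poller; haproxy falls back to poll

# Equivalent one-off test without touching the config:
#   haproxy -dk -f /usr/local/etc/haproxy.conf
```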

I'm thinking about other things:
   - if you're doing a lot of SSL, we could imagine an issue with random
     generation using /dev/random instead of /dev/urandom. I met this
     issue a long time ago on some Apache servers, where all the entropy
     was progressively consumed until it was no longer possible to get
     a connection.

   - it could be useful to run "netstat -an" on a dead node before killing
     haproxy and to archive the output for later analysis. It may reveal
     that all file descriptors were used by CLOSE_WAIT connections
     (indicating a close bug in haproxy) or something like that. If instead
     you see a lot of FIN_WAIT1 or FIN_WAIT2, it may indicate an issue with
     some external firewall or pf blocking some final traffic and leading
     to socket space exhaustion.
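A sketch of how that capture and a quick state summary might look (assumes
a BSD- or Linux-style netstat where the TCP state is the last field of each
tcp line; the snapshot filename is just an example):

```shell
# Archive the raw socket table from the dead node for later analysis.
# The redirection creates the file even if netstat is unavailable.
netstat -an > netstat-snapshot.txt 2>/dev/null || true

# Summarize TCP states by count; a flood of CLOSE_WAIT or FIN_WAIT*
# entries at the top is the pattern to look for.
awk '/^tcp/ {print $NF}' netstat-snapshot.txt | sort | uniq -c | sort -rn
```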

If you have the same issue that was reported, with kevent() being called
in a loop and returning an error, you may well see tons of CLOSE_WAIT
connections, and that would indicate an issue with this poller, though I
have no idea which one, especially since it doesn't change often and
*seems* to work with previous versions.

Best regards,
Willy


--
Soundhound Devops
"What could possibly go wrong?"
