Hi Folks,
This is a strange one and I haven't yet been able to duplicate. But I
wanted to report the description of what did happen in case it was either
a known issue or one that would seem likely based on the code.
The servers in question are running HAProxy 1.7.1 on FreeBSD-11.
I made a change to a small text file used for ACL comparisons that, to my
understanding, is only read when HAPRoxy is started or reloaded. This
file was pushed out to all servers at essentially the same time, but there
was no corresponding reload command issued at that time, and no other
changes were made in this time frame - in fact, no actions taken at all to
these servers for about 10 hours prior.
This file is referenced like so in a "frontend" block:
acl aclname src -f /path/to/file.txt
Approximately 23 minutes after pushing out this change, all HAProxy
servers stopped accepting packets. TCP connections were being made at
least for some time, but this may have been due to the OS accept queue
rather than HAProxy actually accepting packets. In any case, no requests
were being handled or logged.
No errors were logged, nor were health checks being issued or logged once
things ground to a halt. Although the haproxy service was running and
therefore not generating any alerts or automated service restarts, our
external monitoring picked up the situation.
Seeing as these servers are in production my first action was to issue a
reload to all servers, which took care of the immediate problem, after
which I had time to dig through the logs looking for information about
exactly what had failed.
What I found in addition to the above, was that instead of a single
HAProxy process, there were two listed in ps:
92539 www 1 103 0 187M 133M CPU6 6 109:28 100.00%
haproxy
12993 www 1 29 0 147M 96636K kqread 12 5:57 16.06%
haproxy
truss showed no activity on the first PID, and killing that process had no
negative effect upon the second, running process.
So, to sum up, what I observed was
1) changed a file referenced by an ACL (actual change was moving the
location of a line within the file, not adding or deleting the total
number of lines)
2) HAProxy ceased operating about 23 minutes later, but was still listed
in 'ps'
3) after issuing a reload, there remained an HAProxy process, presumably
the one that had stopped accepting requests.
Is it possible that a modification to a file referenced by an ACL (without
an associated reload) could cause this sort of issue?
Best,
-=Mark