I've attached a haproxy.cfg that is as minimal as I felt comfortable making it. We are using admin sockets for dynamic configuration of backends, so I left the server-templating in, but no other application was configured to orchestrate haproxy at the time of testing.
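For reference, the shape of the relevant pieces is roughly the following. This is a sketch from memory rather than the attached file, so the backend name, addresses and resolvers section are placeholders (and the usual defaults/timeouts are omitted):

    global
        stats socket /var/run/haproxy.sock mode 600 level admin

    resolvers dns
        nameserver ns1 127.0.0.1:53

    backend app_be
        # 7 slots so servers can be managed at runtime over the admin socket
        server-template srv 7 app.service.local:8080 check resolvers dns init-addr none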
I've also attached output from
$ sudo timeout 60s strace -e setsockopt,close -p $(ps -lf -C haproxy | tail -n 1 | awk -e '{print $4}') 2>&1 | tee 1.8.17
which shows the significant decrease in the setting of SO_LINGER. I guess I
lied earlier when I said there were none, but over 60s it looks like 1.9.8
had roughly 1/17 the number of SO_LINGER setsockopt calls that 1.8.17 did.
Unfortunately the number of sockets sitting in TIME_WAIT fluctuates to the
point where it isn't a great metric to use. Counting SO_LINGER setsockopt
calls does appear to be consistent, though. I bet that if I spawned 700
backend instances instead of 7 the difference would be even more pronounced.
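For what it's worth, the counts above came from something like the following (the file names match the attached strace captures, and the 1/17 figure is just the ratio of the two counts; the ss output includes a header line):

$ grep -c SO_LINGER 1.8.17
$ grep -c SO_LINGER 1.9.8
$ ss -tan state time-wait | wc -l

The last one is how I was watching TIME_WAIT, and it jumps around too much to be a reliable comparison.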
I got perf stack traces for setsockopt from both versions on our production
servers, but inlining made those traces mostly useless.
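In case it matters, the perf traces were gathered roughly like this (I may not be remembering the exact options):

$ sudo perf record -e syscalls:sys_enter_setsockopt -g -p "$(pgrep -n haproxy)" -- sleep 60
$ sudo perf script

I could retry with --call-graph dwarf, but I doubt it will recover much of what got inlined.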
Let me know if there's anything else I can grab.
Dave.
On Thu, Jun 13, 2019 at 1:30 AM Willy Tarreau <[email protected]> wrote:
> On Wed, Jun 12, 2019 at 12:08:03PM -0500, Dave Chiluk wrote:
> > I did a bit more introspection on our TIME_WAIT connections. The
> > increase in sockets in TIME_WAIT is definitely from old connections to
> > our backend server instances. Considering the fact that this server
> > doesn't actually serve real traffic, we can make a reasonable
> > assumption that this is almost entirely due to increases in
> > healthchecks.
>
> Great!
>
> > Doing an strace on haproxy 1.8.17 we see
> > vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
> > sudo strace -e setsockopt,close -p 15743
> > strace: Process 15743 attached
> > setsockopt(17, SOL_TCP, TCP_NODELAY, [1], 4) = 0
> > setsockopt(17, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8) = 0
> > close(17) = 0
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >
> > Doing the same strace on 1.9.8 we see
> > vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
> > sudo strace -e setsockopt,close -p 6670
> > strace: Process 6670 attached
> > setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
> > close(4) = 0
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >
> > The calls to setsockopt(17, SOL_SOCKET, SO_LINGER, {onoff=1,
> > linger=0}, 8) = 0 appear to be missing.
>
> Awesome, that's exactly the info I was missing. I suspected that for
> whatever reason the lingering was not disabled, at least now we have
> a proof of this! Now the trick is to figure why :-/
>
> > We are running centos 7 with kernel 3.10.0-957.1.3.el7.x86_64.
>
> OK, and with the setsockopt it should behave properly.
>
> > I'll keep digging into this, and see if I can get stack traces that
> > result in the setsockopt calls on 1.8.17 so the stack can be more
> > closely inspected.
>
> Don't worry for this now, this is something we at least need to resolve
> before issuing 2.0 or it will cause some trouble. Then we'll backport the
> fix once the cause is figured out.
>
> However when I try here I don't have the problem, either in 1.9.8 or
> 2.0-dev7:
>
> 08:27:30.212570 connect(14, {sa_family=AF_INET, sin_port=htons(9003), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
> 08:27:30.212590 recvfrom(14, NULL, 2147483647, MSG_TRUNC|MSG_DONTWAIT|MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
> 08:27:30.212610 setsockopt(14, SOL_SOCKET, SO_LINGER, {l_onoff=1, l_linger=0}, 8) = 0
> 08:27:30.212630 close(14) = 0
> 08:27:30.212659 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=6993282}) = 0
>
> So it must depend on the type of check. Could you please share a
> minimalistic config that reproduces this?
>
> Thanks,
> Willy
>
Attachments:
  1.8.17 (binary data)
  domain_map (binary data)
  1.9.8 (binary data)
  haproxy.cfg (binary data)

