I did a bit more introspection on our TIME_WAIT connections.  The increase
in sockets in TIME_WAIT is definitely from old connections to our backend
server instances.  Considering that this server doesn't actually serve real
traffic, we can make a reasonable assumption that this is almost entirely
due to the healthchecks.
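
For anyone who wants to reproduce the attribution, grouping the TIME_WAIT
sockets by peer address should make it obvious whether they point at the
backend instances.  Something along these lines works; the $NF assumes the
usual ss column layout with the peer address last, so it may need adjusting
for other iproute2 versions:

$ sudo ss -tan state time-wait | awk 'NR>1 {print $NF}' | sort | uniq -c | sort -rn | head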

Doing an strace on haproxy 1.8.17, we see:
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
sudo strace -e setsockopt,close -p 15743
strace: Process 15743 attached
setsockopt(17, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(17, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8) = 0
close(17)                               = 0
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Doing the same strace on 1.9.8, we see:
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
sudo strace -e setsockopt,close -p 6670
strace: Process 6670 attached
setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
close(4)                                = 0
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The setsockopt(17, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8) calls
appear to be missing on 1.9.8.  That SO_LINGER call with linger=0 is what
makes the subsequent close() send a RST instead of leaving the socket in
TIME_WAIT, which would explain the increase we're seeing.

We are running CentOS 7 with kernel 3.10.0-957.1.3.el7.x86_64.

I'll keep digging into this and see if I can get the stack traces that lead
to the setsockopt calls on 1.8.17, so the call path can be more closely
inspected.
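
If the strace build on that box has stack-trace support, something like the
following should print the userspace call path for each setsockopt on the
1.8.17 process (same pid as the first trace above).  The -k option needs an
strace built with libunwind or libdw, so it may not be available everywhere:

$ sudo strace -k -e setsockopt -p 15743

The frames are only useful if the haproxy binary hasn't been stripped.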

Thanks for any help,
Dave


On Tue, Jun 11, 2019 at 2:29 AM Willy Tarreau <w...@1wt.eu> wrote:

> On Mon, Jun 10, 2019 at 04:01:27PM -0500, Dave Chiluk wrote:
> > We are in the process of evaluating upgrading to 1.9.8 from 1.8.17,
> > and we are seeing a roughly 70% increase in sockets in TIME_WAIT on
> > our haproxy servers with a mostly idle server cluster:
> > $ sudo netstat | grep 'TIME_WAIT' | wc -l
>
> Be careful, TIME_WAIT on the frontend is neither important nor
> representative of anything, only the backend counts.
>
> > Looking at the source/destination of this it seems likely that this
> > comes from healthchecks.  We also see a corresponding load increase on
> > the backend applications serving the healthchecks.
>
> It's very possible and problematic at the same time.
>
> > Checking the git logs for healthcheck was unfruitful.  Any clue what
> > might be going on?
>
> Normally we make lots of efforts to close health-check responses with
> a TCP RST (by disabling lingering before closing). I don't see why it
> wouldn't be done here. What OS are you running on and what do your
> health checks look like in the configuration ?
>
> Thanks,
> Willy
>
