Hi Richard,
On Tue, Jan 17, 2017 at 05:27:22PM +1300, Richard Gray wrote:
> tcp_fin_timeout is set to 60 seconds on my system, but as I understand it,
> this parameter only applies to orphaned connections. In this case, HAProxy
> appears to be holding the socket open, so still has responsibility for
> cleaning up the connection. Am I correct in thinking you use shutdown to
> close the write side of the socket, leaving the read side open in case the
> client still has data to send?
Yes that's it.
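To illustrate what such a half-close looks like from user space, here is a
minimal Python sketch. It is not haproxy's code: a local socketpair stands in
for the proxy/client connection, and the function name is made up for the
example.

```python
import socket


def half_close_demo():
    # A connected local pair stands in for the proxy <-> client connection.
    proxy_side, client_side = socket.socketpair()
    try:
        proxy_side.sendall(b"response")
        # Close only the write direction. On a TCP socket this is the
        # operation that emits the FIN while keeping the read side usable.
        proxy_side.shutdown(socket.SHUT_WR)

        # The peer gets the pending data, then an empty read (end of stream).
        data = client_side.recv(1024)
        eof = client_side.recv(1024)

        # The peer may still send, and the half-closed side can still read.
        client_side.sendall(b"late data")
        late = proxy_side.recv(1024)
        return data, eof, late
    finally:
        proxy_side.close()
        client_side.close()
```

The last recv() succeeding is the point: after shutdown(SHUT_WR) the socket
still belongs to the process, so it is not an orphan yet.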
> I've written a simple Python client that doesn't close a connection in
> response to the FIN, and with this I can occasionally reproduce the
> behaviour. The sequence of events is as follows, with the state of the
> client to HAProxy connection in parentheses:
>
> 1. Client opens connection to server via HAProxy. (ESTABLISHED)
> 2. Server closes connection, causing HAProxy to send a FIN to the client.
> (FIN_WAIT1)
> 3. Client ACKs the FIN, but does not send a FIN of its own. (FIN_WAIT2)
> 4. HAProxy timeout-tunnel period elapses (FIN_WAIT2 (orphaned))
> 5. tcp_fin_timeout period elapses. (Socket state is removed by the kernel)
>
> I do wonder if there's a race somewhere though, as sometimes at step 3 the
> client-fin timeout (30s in my case) seems to kick in, and the connection
> state is cleaned up quickly.
I'm surprised that it only kicks in "sometimes". That makes me wonder
whether this is where the problem lies.
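For anyone else trying to reproduce this, a client of the kind you describe
can be sketched as below. This is only my guess at its shape, not your actual
script: drain_until_fin is a hypothetical helper that reads until the peer's
FIN and then deliberately returns without closing the socket.

```python
import socket


def drain_until_fin(sock):
    """Read from a connected socket until the peer sends a FIN.

    The socket is intentionally left open on return. As long as the caller
    does not close it, no FIN goes back and the peer stays in FIN_WAIT2.
    """
    received = b""
    while True:
        chunk = sock.recv(4096)
        if not chunk:  # empty read: the peer has closed its write side
            break
        received += chunk
    return received
```

Connecting with socket.create_connection() to the frontend, calling this, and
then sleeping for longer than the timeouts involved should keep the proxy
side in FIN_WAIT2 for as long as the client process keeps the socket.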
I'm thinking about two things:
  - it would be useful to check (using netstat -atnp) whether these eternal
    FIN_WAIT2 sockets still belong to haproxy or are real orphans. I
    suspect that most of them are still attached to haproxy given your
    timeouts; if you manage to find some very old ones still attached
    to haproxy, then it would mean the issue is in haproxy.
  - your idea of a race is interesting. I'm wondering what happens if
    the connection spends more than 60s still attached to haproxy and
    only then becomes orphaned. It's possible that the kernel timeout
    only fires if the socket is already orphaned at the moment its age
    reaches the timeout.
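If you'd rather script the first check than read netstat output by eye, the
same information is available in /proc/net/tcp on Linux. Below is a rough
sketch under a few assumptions: IPv4 only, a little-endian host for the
address decoding, and inode 0 taken to mean that no process holds the socket
any more (which is when netstat shows "-" in the PID column). The function
names are made up for the example.

```python
import socket

FIN_WAIT2 = "05"  # hex state code for FIN_WAIT2 in /proc/net/tcp


def decode_addr(hex_addr):
    """Turn '0100007F:1F90' into '127.0.0.1:8080'.

    Assumes a little-endian host (e.g. x86), where the IPv4 address bytes
    appear in reversed order in the hex dump.
    """
    ip_hex, port_hex = hex_addr.split(":")
    ip = socket.inet_ntoa(bytes.fromhex(ip_hex)[::-1])
    return "%s:%d" % (ip, int(port_hex, 16))


def find_fin_wait2(proc_net_tcp_text):
    """Return (local, remote, inode, orphaned) for each FIN_WAIT2 row."""
    found = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 10 or fields[3] != FIN_WAIT2:
            continue
        inode = int(fields[9])
        # Assumption: inode 0 means the socket is no longer attached to
        # any process, i.e. it is an orphan handled by the kernel alone.
        found.append((decode_addr(fields[1]), decode_addr(fields[2]),
                      inode, inode == 0))
    return found
```

Feeding it the contents of /proc/net/tcp a few times over several minutes
should show whether the long-lived entries are the orphaned ones or the ones
that still have an inode.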
> > In your case I'd have a look at tcp_fin_timeout to possibly lower it, but
> > that's all. I wouldn't be worried by this number of FIN_WAIT2 connections
> > though I understand that at least the cause needs to be figured out and
> > possibly addressed.
> It's not a huge problem as the FIN_WAIT2 connections aren't using much
> system resource. It's more of an annoyance really, as they're throwing out
> my HAProxy stats. I.e. nearly 50% of the current sessions do not correspond
> to an active connection through the proxy.
I see. But sometimes such small annoyances can hide a real problem that
we'd rather figure out and fix while it's not yet dramatic!
> I'm going to do some more testing to see if I can figure out why it's not
> reliably reproducible, and perhaps try a 1.7 build to see if I get different
> results there.
OK fine!
Regards,
Willy