Hi,
Thanks Willy for looking into this !
On Tue, Oct 08, Willy Tarreau wrote:
> On Fri, Oct 04, 2019 at 07:28:15PM +0300, Jarno Huuskonen wrote:
> > I sent pcap/strace offlist.
>
> Thanks, that was very useful.
>
> > (strace -f -o -ttt, tcpdump -n -p -s 16384 -w ... host 127.0.0.1 and
> > port 8080).
> >
> > I think in packet capture the second health checks causes
> > "client_addr: 127.0.0.1 client_port: 2779] hr_read(): Connection reset by
> > peer [plugins/http/http.c line 917]"
> > (I think uswgi logs client_port incorrectly, ntohs(2779) gives 56074
> > (and port 56074 is in packet capture)).
>
> It's as usual, when you start to troubleshoot a bug you find a myriad of
> other ones around :-)
>
> > (haproxy version: HA-Proxy version 2.1-dev2 2019/10/01).
> >
> > I tried to reproduce with very minimal flask/uwsgi hello world app
> > and there hr_read happens very rarely.
> > With alerta(.io) app this happens more regularly (AFAIK not with every
> > check).
> > So maybe this is weird timing issue or bug in uwsgi.
>
> So it is indeed a timing issue. The check running on port 56062 shows the
> following sequence:
>
> 17:10:57.877876 IP localhost.56062 > localhost.8080: Flags [S]
> 17:10:57.877898 IP localhost.8080 > localhost.56062: Flags [S.]
> 17:10:57.877909 IP localhost.56062 > localhost.8080: Flags [.]
> 17:10:57.878065 IP localhost.56062 > localhost.8080: Flags [P.]
> 17:10:57.878078 IP localhost.8080 > localhost.56062: Flags [.]
> 17:10:57.879933 IP localhost.8080 > localhost.56062: Flags [P.]
> 17:10:57.879939 IP localhost.56062 > localhost.8080: Flags [.]
> 17:10:57.880008 IP localhost.8080 > localhost.56062: Flags [F.]
> 17:10:57.880333 IP localhost.56062 > localhost.8080: Flags [F.]
> 17:10:57.880341 IP localhost.8080 > localhost.56062: Flags [.]
>
> Note the FIN sent 75 microseconds after the response. This resulted in
> recvfrom() returning zero and the connection to be cleanly closed. Now
> regarding port 56074 that is causing excess logs :
>
> 17:11:04.132867 IP localhost.56074 > localhost.8080: Flags [S]
> 17:11:04.132890 IP localhost.8080 > localhost.56074: Flags [S.]
> 17:11:04.132904 IP localhost.56074 > localhost.8080: Flags [.]
> 17:11:04.133083 IP localhost.56074 > localhost.8080: Flags [P.]
> 17:11:04.133098 IP localhost.8080 > localhost.56074: Flags [.]
> 17:11:04.135101 IP localhost.8080 > localhost.56074: Flags [P.]
> 17:11:04.135107 IP localhost.56074 > localhost.8080: Flags [.]
> 17:11:04.135316 IP localhost.56074 > localhost.8080: Flags [R.]
>
> As you can see, even 215 microseconds after the response there is still
> no FIN. I've checked and in both cases the headers are the same. There
> is indeed a "connection: close" header emitted by haproxy, none is
> advertised in the response, but clearly it's a matter of delay. And
> this causes recvfrom() to return EAGAIN, so haproxy is forced to hard-
> close the connection (as if it were doing option http-server-close).
> Please also note that returned data are properly ACKed so there is not
> much more we can do there.
>
> Before the patch that you bisected, you didn't face this because of a
> bug that was preventing the hard close from working, and instead you
> were accumulating TIME_WAIT sockets on the client side. So I'm afraid
> to say that for now the best I can say is that you'll still have to
> live with these occasional logs :-/
Thanks. Glad to see this wasn't a bug on haproxy side.
I tested the alerta app with gunicorn and gunicorn doesn't seem to mind
that haproxy uses forced close on the health check, so switching
uwsgi->gunicorn is another option.
> I'd really like to see a massive rework being done on the checks. Your
> report made me realize that there is one feature we're currently missing,
> which is the ability to wait for a close once we've got a response. At
> the moment we cannot add this because checks work in 2 steps :
>
> 1) connect and send the "request"
> 2) receive the "reponse" and close.
And AFAIK this works just fine with most sane backend servers.
> There's no real state machine there, it's roughly "if request was not
> sent, send it, otherwise try to receive a response; if response is
> received, then process it and close otherwise wait for response". So
> we cannot remain in a "waiting for closing" state after we receive a
> response. I'm wondering if that couldn't be achieved using tcp-checks
> however. I quickly tried but couldn't find a reliable way of doing this
> but I'm thing that we could possibly extend the tcp-check rules to have
> "tcp-check expect close" that would wait for a read0.
Should there be a somekind of timeout how long to wait for close ? I'm
thinking about backend servers that have very long/infinite keepalive
(and/or don't respect Connection: close).
-Jarno
--
Jarno Huuskonen