On Thu, May 17, 2012 at 3:52 PM, Willy Tarreau <[email protected]> wrote:
>
> Hi,
>
> first, let me congratulate you for such a well documented analysis.
>
> On Thu, May 17, 2012 at 03:28:40PM -0700, Yehuda Sadeh wrote:
> > Hi,
> >
> >    We have a setup where there's haproxy (1.4.15) in front of a few apache
> > servers. We have a test suite that occasionally fails with 502 when going
> > through haproxy (this doesn't happen when going directly to one of the
> > apache servers). I was able to have a tcp capture while that was happening,
> > and I do see some strange behavior. From what I can see haproxy sends a
> > request to the apache server, then hangs up the tcp connection (while
> > response is arriving), and reopens a second tcp connection on which it
> > sends nothing. After 20 seconds apache closes down the second connection
> > and haproxy sends 502 to the client.
> >
> > The haproxy log shows the following:
> >
> > May 17 14:03:20 localhost haproxy[3695]:
> > 10.214.131.22:38909[17/May/2012:14:02:59.347] api-http-ipv4
> > radosgw-http/plana93-1
> > 0/0/2/-1/21020 502 204 - - SH-- 0/0/0/0/1 0/0 "GET
> > /jenkins-d15dxs9og2dxhfhlywf8i-182/testobj HTTP/1.1"
> >
> > Which means, if I decipher it correctly, that the server hung the
> > connection after more than 20 seconds.
>
> ... and a retry! I think this is the key of the mystery.
>
> > However, looking at the server logs
> > I couldn't find evidence for that happening. I do see the server responding
> > immediately.
> > I was taking a tcpdump while that was happening, and I do see some strange
> > behavior. The haproxy log shows the following (captured on the haproxy
> > machine):
> >
> > 10.214.131.22 client
> > 10.214.131.25 haproxy
> > 10.214.133.23 apache
> >
> >
> > 1. client <-> haproxy
> >
> >  Client sends request to haproxy
> >
> > 18073   2.787839 10.214.131.22 -> 10.214.131.25 TCP 38909 > http [SYN]
> > Seq=0 Win=14600 Len=0 MSS=1460 TSV=25089868 TSER=0 WS=7
> > 18074   2.787867 10.214.131.25 -> 10.214.131.22 TCP http > 38909 [SYN, ACK]
> > Seq=0 Ack=1 Win=14480 Len=0 MSS=1460 TSV=25177841 TSER=25089868 WS=7
> > 18075   2.787948 10.214.131.22 -> 10.214.131.25 TCP 38909 > http [ACK]
> > Seq=1 Ack=1 Win=14720 Len=0 TSV=25089868 TSER=25177841
> > 18076   2.788028 10.214.131.22 -> 10.214.131.25 HTTP GET
> > /jenkins-d15dxs9og2dxhfhlywf8i-182/testobj HTTP/1.1
> >
> >
> > 2. haproxy <-> apache (connection 1)
> >
> > haproxy sends request to apache
> >
> > 18077   2.788185 10.214.131.25 -> 10.214.133.23 TCP 39579 > http [SYN]
> > Seq=0 Win=14600 Len=0 MSS=1460 TSV=25177841 TSER=0 WS=7
> > 18078   2.789498 10.214.133.23 -> 10.214.131.25 TCP http > 39579 [SYN, ACK]
> > Seq=0 Ack=1 Win=14480 Len=0 MSS=1460 TSV=248773204 TSER=25177841 WS=7
> > 18079   2.789530 10.214.131.25 -> 10.214.133.23 TCP 39579 > http [ACK]
> > Seq=1 Ack=1 Win=14720 Len=0 TSV=25177841 TSER=248773204
> > 18080   2.789576 10.214.131.25 -> 10.214.133.23 HTTP GET
> > /jenkins-d15dxs9og2dxhfhlywf8i-182/testobj HTTP/1.1
> > 18081   2.789605 10.214.131.25 -> 10.214.133.23 TCP 39579 > http [FIN, ACK]
> > Seq=333 Ack=1 Win=14720 Len=0 TSV=25177841 TSER=248773204
>
> This one is really strange. The only reason why I could imagine the FIN here
> would be if the client had used one *and* there was "option abortonclose",
> but none of these hypothesis were true.
>
> Ah I just saw Cyril's discovery about your erroneous timeouts, he's plain
> right !
>
> > 3. haproxy <-> apache (connection 2)
> >
> > haproxy creates a second tcp connection to the same apache server
> >
> > 18082   2.789650 10.214.131.25 -> 10.214.133.23 TCP 39580 > http [SYN]
> > Seq=0 Win=14600 Len=0 MSS=1460 TSV=25177841 TSER=0 WS=7
> > 18083   2.790896 10.214.133.23 -> 10.214.131.25 TCP http > 39580 [SYN, ACK]
> > Seq=0 Ack=1 Win=14480 Len=0 MSS=1460 TSV=248773204 TSER=25177841 WS=7
> > 18084   2.790926 10.214.131.25 -> 10.214.133.23 TCP 39580 > http [ACK]
> > Seq=1 Ack=1 Win=14720 Len=0 TSV=25177841 TSER=248773204
>
> Here we have a bug in my opinion, despite the timeout issue. The new
> connection should not have been attempted since the request had already
> been sent. And if the connection remains open, it's precisely because
> there is no more request to send !
>
>
> I'll have to dig through the code to find how it is possible to get
> into this situation.
>
> Please in parallel recheck with more correct timeouts (at least 4s
> each).
>

Sure, checking now. It used to take a while to reproduce, so it might
take a bit.

Thanks,
Yehuda

Reply via email to