On Thu, May 17, 2012 at 4:29 PM, Yehuda Sadeh <[email protected]> wrote: > On Thu, May 17, 2012 at 3:52 PM, Willy Tarreau <[email protected]> wrote: >> >> Hi, >> >> first, let me congratulate you for such a well documented analysis. >> >> On Thu, May 17, 2012 at 03:28:40PM -0700, Yehuda Sadeh wrote: >> > Hi, >> > >> > We have a setup where there's haproxy (1.4.15) in front of a few apache >> > servers. We have a test suite that occasionally fails with 502 when going >> > through haproxy (this doesn't happen when going directly to one of the >> > apache servers). I was able to have a tcp capture while that was happening, >> > and I do see some strange behavior. From what I can see haproxy sends a >> > request to the apache server, then hangs up the tcp connection (while >> > response is arriving), and reopens a second tcp connection on which it >> > sends nothing. After 20 seconds apache closes down the second connection >> > and haproxy sends 502 to the client. >> > >> > The haproxy log shows the following: >> > >> > May 17 14:03:20 localhost haproxy[3695]: >> > 10.214.131.22:38909[17/May/2012:14:02:59.347] api-http-ipv4 >> > radosgw-http/plana93-1 >> > 0/0/2/-1/21020 502 204 - - SH-- 0/0/0/0/1 0/0 "GET >> > /jenkins-d15dxs9og2dxhfhlywf8i-182/testobj HTTP/1.1" >> > >> > Which means, if I decipher it correctly, that the server hung the >> > connection after more than 20 seconds. >> >> ... and a retry! I think this is the key of the mystery. >> >> > However, looking at the server logs >> > I couldn't find evidence for that happening. I do see the server responding >> > immediately. >> > I was taking a tcpdump while that was happening, and I do see some strange >> > behavior. The haproxy log shows the following (captured on the haproxy >> > machine): >> > >> > 10.214.131.22 client >> > 10.214.131.25 haproxy >> > 10.214.133.23 apache >> > >> > >> > 1. client <-> haproxy >> > >> > Client sends request to haproxy >> > >> > 18073 2.787839 10.214.131.22 -> 10.214.131.25 TCP 38909 > http [SYN] >> > Seq=0 Win=14600 Len=0 MSS=1460 TSV=25089868 TSER=0 WS=7 >> > 18074 2.787867 10.214.131.25 -> 10.214.131.22 TCP http > 38909 [SYN, ACK] >> > Seq=0 Ack=1 Win=14480 Len=0 MSS=1460 TSV=25177841 TSER=25089868 WS=7 >> > 18075 2.787948 10.214.131.22 -> 10.214.131.25 TCP 38909 > http [ACK] >> > Seq=1 Ack=1 Win=14720 Len=0 TSV=25089868 TSER=25177841 >> > 18076 2.788028 10.214.131.22 -> 10.214.131.25 HTTP GET >> > /jenkins-d15dxs9og2dxhfhlywf8i-182/testobj HTTP/1.1 >> > >> > >> > 2. haproxy <-> apache (connection 1) >> > >> > haproxy sends request to apache >> > >> > 18077 2.788185 10.214.131.25 -> 10.214.133.23 TCP 39579 > http [SYN] >> > Seq=0 Win=14600 Len=0 MSS=1460 TSV=25177841 TSER=0 WS=7 >> > 18078 2.789498 10.214.133.23 -> 10.214.131.25 TCP http > 39579 [SYN, ACK] >> > Seq=0 Ack=1 Win=14480 Len=0 MSS=1460 TSV=248773204 TSER=25177841 WS=7 >> > 18079 2.789530 10.214.131.25 -> 10.214.133.23 TCP 39579 > http [ACK] >> > Seq=1 Ack=1 Win=14720 Len=0 TSV=25177841 TSER=248773204 >> > 18080 2.789576 10.214.131.25 -> 10.214.133.23 HTTP GET >> > /jenkins-d15dxs9og2dxhfhlywf8i-182/testobj HTTP/1.1 >> > 18081 2.789605 10.214.131.25 -> 10.214.133.23 TCP 39579 > http [FIN, ACK] >> > Seq=333 Ack=1 Win=14720 Len=0 TSV=25177841 TSER=248773204 >> >> This one is really strange. The only reason why I could imagine the FIN here >> would be if the client had used one *and* there was "option abortonclose", >> but none of these hypothesis were true. >> >> Ah I just saw Cyril's discovery about your erroneous timeouts, he's plain >> right ! >> >> > 3. haproxy <-> apache (connection 2) >> > >> > haproxy creates a second tcp connection to the same apache server >> > >> > 18082 2.789650 10.214.131.25 -> 10.214.133.23 TCP 39580 > http [SYN] >> > Seq=0 Win=14600 Len=0 MSS=1460 TSV=25177841 TSER=0 WS=7 >> > 18083 2.790896 10.214.133.23 -> 10.214.131.25 TCP http > 39580 [SYN, ACK] >> > Seq=0 Ack=1 Win=14480 Len=0 MSS=1460 TSV=248773204 TSER=25177841 WS=7 >> > 18084 2.790926 10.214.131.25 -> 10.214.133.23 TCP 39580 > http [ACK] >> > Seq=1 Ack=1 Win=14720 Len=0 TSV=25177841 TSER=248773204 >> >> Here we have a bug in my opinion, despite the timeout issue. The new >> connection should not have been attempted since the request had already >> been sent. And if the connection remains open, it's precisely because >> there is no more request to send ! >> >> >> I'll have to dig through the code to find how it is possible to get >> into this situation. >> >> Please in parallel recheck with more correct timeouts (at least 4s >> each). >> > > Sure, checking now. It used to take a while to reproduce, so it might > take a bit. > Yeah, it definitely solved the issue.
Thanks! Yehuda

