Hi,

first, let me congratulate you for such a well documented analysis.

On Thu, May 17, 2012 at 03:28:40PM -0700, Yehuda Sadeh wrote:
> Hi,
> 
>    We have a setup where there's haproxy (1.4.15) in front of a few apache
> servers. We have a test suite that occasionally fails with 502 when going
> through haproxy (this doesn't happen when going directly to one of the
> apache servers). I was able to have a tcp capture while that was happening,
> and I do see some strange behavior. From what I can see haproxy sends a
> request to the apache server, then hangs up the tcp connection (while
> response is arriving), and reopens a second tcp connection on which it
> sends nothing. After 20 seconds apache closes down the second connection
> and haproxy sends 502 to the client.
> 
> The haproxy log shows the following:
> 
> May 17 14:03:20 localhost haproxy[3695]:
> 10.214.131.22:38909[17/May/2012:14:02:59.347] api-http-ipv4
> radosgw-http/plana93-1
> 0/0/2/-1/21020 502 204 - - SH-- 0/0/0/0/1 0/0 "GET
> /jenkins-d15dxs9og2dxhfhlywf8i-182/testobj HTTP/1.1"
> 
> Which means, if I decipher it correctly, that the server hung the
> connection after more than 20 seconds.

... and a retry! I think this is the key of the mystery.

> However, looking at the server logs
> I couldn't find evidence for that happening. I do see the server responding
> immediately.
> I was taking a tcpdump while that was happening, and I do see some strange
> behavior. The haproxy log shows the following (captured on the haproxy
> machine):
> 
> 10.214.131.22 client
> 10.214.131.25 haproxy
> 10.214.133.23 apache
> 
> 
> 1. client <-> haproxy
> 
>  Client sends request to haproxy
> 
> 18073   2.787839 10.214.131.22 -> 10.214.131.25 TCP 38909 > http [SYN]
> Seq=0 Win=14600 Len=0 MSS=1460 TSV=25089868 TSER=0 WS=7
> 18074   2.787867 10.214.131.25 -> 10.214.131.22 TCP http > 38909 [SYN, ACK]
> Seq=0 Ack=1 Win=14480 Len=0 MSS=1460 TSV=25177841 TSER=25089868 WS=7
> 18075   2.787948 10.214.131.22 -> 10.214.131.25 TCP 38909 > http [ACK]
> Seq=1 Ack=1 Win=14720 Len=0 TSV=25089868 TSER=25177841
> 18076   2.788028 10.214.131.22 -> 10.214.131.25 HTTP GET
> /jenkins-d15dxs9og2dxhfhlywf8i-182/testobj HTTP/1.1
> 
> 
> 2. haproxy <-> apache (connection 1)
> 
> haproxy sends request to apache
> 
> 18077   2.788185 10.214.131.25 -> 10.214.133.23 TCP 39579 > http [SYN]
> Seq=0 Win=14600 Len=0 MSS=1460 TSV=25177841 TSER=0 WS=7
> 18078   2.789498 10.214.133.23 -> 10.214.131.25 TCP http > 39579 [SYN, ACK]
> Seq=0 Ack=1 Win=14480 Len=0 MSS=1460 TSV=248773204 TSER=25177841 WS=7
> 18079   2.789530 10.214.131.25 -> 10.214.133.23 TCP 39579 > http [ACK]
> Seq=1 Ack=1 Win=14720 Len=0 TSV=25177841 TSER=248773204
> 18080   2.789576 10.214.131.25 -> 10.214.133.23 HTTP GET
> /jenkins-d15dxs9og2dxhfhlywf8i-182/testobj HTTP/1.1
> 18081   2.789605 10.214.131.25 -> 10.214.133.23 TCP 39579 > http [FIN, ACK]
> Seq=333 Ack=1 Win=14720 Len=0 TSV=25177841 TSER=248773204

This one is really strange. The only reason why I could imagine the FIN here
would be if the client had used one *and* there was "option abortonclose",
but none of these hypothesis were true.

Ah I just saw Cyril's discovery about your erroneous timeouts, he's plain
right !

> 3. haproxy <-> apache (connection 2)
> 
> haproxy creates a second tcp connection to the same apache server
> 
> 18082   2.789650 10.214.131.25 -> 10.214.133.23 TCP 39580 > http [SYN]
> Seq=0 Win=14600 Len=0 MSS=1460 TSV=25177841 TSER=0 WS=7
> 18083   2.790896 10.214.133.23 -> 10.214.131.25 TCP http > 39580 [SYN, ACK]
> Seq=0 Ack=1 Win=14480 Len=0 MSS=1460 TSV=248773204 TSER=25177841 WS=7
> 18084   2.790926 10.214.131.25 -> 10.214.133.23 TCP 39580 > http [ACK]
> Seq=1 Ack=1 Win=14720 Len=0 TSV=25177841 TSER=248773204

Here we have a bug in my opinion, despite the timeout issue. The new
connection should not have been attempted since the request had already
been sent. And if the connection remains open, it's precisely because
there is no more request to send !

I'll have to dig through the code to find how it is possible to get
into this situation.

Please in parallel recheck with more correct timeouts (at least 4s
each).

Thanks,
Willy


Reply via email to