Hi John,
On Thu, May 22, 2014 at 12:03:44PM +0000, JDzialo John wrote:
> Hi Willy,
>
> Our content-length is correct on the header side. This intermittent issue
> happens once in a while. In Fiddler it shows that the client is waiting for
> the right content-length but only gets a fraction of it. I am trying to
> determine if haproxy is dropping the server connection once in a while
> mid-stream or if the backend server is dropping packets.
At first glance I couldn't spot a blocked response involving a content-length,
but I'm a bit disturbed by seeing some responses which advertise chunked mode
and are not complete (the trailing 0 CRLF CRLF is missing). These are the ones you
noticed with a RST. For example, port 20799. The RST always happens 15-20s
after the last packet so it seems that it's a timeout experienced by the
server when pushing data. It could be that haproxy freezes on this transfer
or that the chunk stream is invalid and advertises an end earlier, but I
didn't find that pattern either.
Unfortunately we don't have the traffic from haproxy to the server so it's
hard to say if the data were ACKed. But basing my analysis on the echoed
TCP timestamps, it seems that 22 TCP segments were sent at once, totalling
31kB with no ACK in between. And seeing that the RST was sent with the same
sequence number as the first segment of this sequence tends to confirm that
there was no ACK during this period.
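
In case it helps to check other responses extracted from the capture, here
is a minimal sketch (not something haproxy provides) of what "complete"
means for a chunked body: each chunk is a hex size line followed by that
many bytes and a CRLF, and the body ends with a zero-size chunk followed by
an empty line. The function name is hypothetical and the input is assumed
to be the raw body bytes, after the response headers.

```python
HEX_DIGITS = b"0123456789abcdefABCDEF"


def chunked_body_is_complete(body: bytes) -> bool:
    """Return True if `body` is a fully terminated HTTP/1.1 chunked body."""
    pos = 0
    while True:
        eol = body.find(b"\r\n", pos)
        if eol < 0:
            return False  # truncated inside a chunk-size line
        # Ignore optional chunk extensions after ';'
        size_field = body[pos:eol].split(b";", 1)[0].strip()
        if not size_field or any(c not in HEX_DIGITS for c in size_field):
            return False  # not a valid hexadecimal chunk size
        size = int(size_field, 16)
        pos = eol + 2
        if size == 0:
            # Last chunk: skip optional trailer lines until the empty line
            while True:
                eol = body.find(b"\r\n", pos)
                if eol < 0:
                    return False  # final CRLF never arrived
                if eol == pos:
                    return True  # empty line: body is complete
                pos = eol + 2
        # Chunk data must be fully present and followed by CRLF
        if body[pos + size:pos + size + 2] != b"\r\n":
            return False
        pos += size + 2
```

A body cut off before the last "0 CRLF CRLF", like the ones ending with a
RST in the trace, makes this return False.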
> I also upgraded to 1.4.25, yet we are still seeing a problem.
>
> Our backend servers have a file server under heavy load and all the requests
> that fail usually have two things in common, haproxy and this file server.
That's interesting, do you think the network link between the two could
be congested from haproxy to the server? It could be that some ACKs are
lost and never reach the server. Losing an occasional ACK in that direction
is rarely noticeable since ACKs are cumulative, but losing a few in a row
will cause the session to time out on the server, which is exactly what we
observe. In this case,
you should also observe occasional retransmits during the connection setup,
represented as long connect times in the logs (eg: 3000 ms instead of 0
or 1 ms).
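
To spot these in the logs, something like the following could be used. It
is only a rough sketch which assumes the default HTTP log format ("option
httplog"), where the timers appear as Tq/Tw/Tc/Tr/Tt right after the
backend/server field. The function name, the 1000 ms threshold and the
sample lines in the test are made up for illustration.

```python
import re

# backend/server followed by the five timers Tq/Tw/Tc/Tr/Tt.
# Timers may be -1 on aborted sessions, and Tt may carry a '+' prefix.
TIMERS_RE = re.compile(
    r"\s(\S+)/(\S+)\s(-?\d+)/(-?\d+)/(-?\d+)/(-?\d+)/(\+?-?\d+)\s"
)


def slow_connects(lines, threshold_ms=1000):
    """Return (backend/server, Tc) for lines whose connect time is high."""
    found = []
    for line in lines:
        match = TIMERS_RE.search(line)
        if match is None:
            continue  # not an HTTP log line in the assumed format
        backend, server = match.group(1), match.group(2)
        connect_ms = int(match.group(5))  # Tc is the third timer
        if connect_ms >= threshold_ms:
            found.append((backend + "/" + server, connect_ms))
    return found
```

Feeding it the haproxy log file, for example with
slow_connects(open("haproxy.log")) where the file name is only a
placeholder, would list the servers for which a SYN was likely
retransmitted.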
> I ran a capture on the server side. See attached. I'm new to using tcpdump
> but used wireshark to read the results and I am getting some random [RST,
> ACK] in red in wireshark.
>
> Is this showing that we are randomly losing packets from our backend server
> (source) to haproxy (destination)?
It could be. Depending on where packets are lost (if at all), it could
even be a NIC losing packets due to a high load. If it mostly happens
between haproxy and a specific server, it could be related to that
server. For example I experienced a lot of problems with Broadcom
NetXtreme 2 NICs in various firmware flavors (versions 1.9 to 3),
always causing random drops at high loads. But I know that it's not
always easy to switch to another NIC to test.
Before trying another NIC, it would be safer to take a new
capture with traffic in both directions:
tcpdump -s0 -npi eth0 -w trace-server.cap host <server-ip>
That way we'll see if the TCP stack sends ACKs, indicating that
they're lost in between, or if it stops sending them, indicating
that haproxy is stuck.
Also, note that if one of your servers is more loaded than the
other ones, you can reduce its load by assigning a lower weight
than the other ones (using the "weight" attribute on the server
lines). It could also be an easy way to see if this issue is
related to the load or not.
Regards,
Willy