Hi Matt,

On Thu, Jun 09, 2011 at 01:50:11PM -0700, Matt Christiansen wrote:
> Hi Willy,
> 
> I agree the haproxy logs show that, but we also monitor the time spent
> processing the request which takes in to account, GC, reading data off
> the FS and a number of things inside the app and I see no 3sec times
> in there or anything near it. Also I have no 3 sec outliers in output
> from my test so that seems a little weird it says 3secs.

What I really hate about 3 sec is that it's the common TCP retransmit
timeout, and normally it indicates packet losses. I had implicitly excluded
that possibility since it runs well with nginx on the same machine, but it
still must not be ruled out.

Still, the time measured by application servers generally does not include
the time spent in queues, so you should be very careful with this. For all
components there will always be an unmonitored area. For instance, haproxy
cannot know the time spent by the request in the system's backlog, which can
be huge under a syn flood attack or when maxconn is too low.

> Also I have
> the connections set really high to prevent queueing for now, we
> usually only have around 1000-2000 connections open.
> 
> uname -a
> 
> Linux 2.6.18-194.17.1.el5 #1 SMP Wed Sep 29 12:50:31 EDT 2010 x86_64
> x86_64 x86_64 GNU/Linux

OK, RH5 so I agree you won't do TCP splicing on this one.

Could you check whether the number of TCP retransmits increases between two
runs (with netstat -s)? It's worth archiving a full copy before and
after the test so that we can look for anything else that shows up there.
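For what it's worth, the comparison could look like the rough sketch below. The function and file names are only examples, and it simply greps for "retrans" since the exact counter wording depends on the netstat version:

```shell
# Print only the retransmit-related counters from a saved `netstat -s` dump.
retrans_counters() {
    grep -i 'retrans' "$1"
}

# Compare the retransmit counters of two dumps; prints nothing if unchanged.
retrans_changes() {
    retrans_counters "$1" > "$1.retrans"
    retrans_counters "$2" > "$2.retrans"
    diff "$1.retrans" "$2.retrans"
}

# Typical use around one test run (file names are arbitrary):
#   netstat -s > netstat-before.txt
#   ... run the test ...
#   netstat -s > netstat-after.txt
#   retrans_changes netstat-before.txt netstat-after.txt
```

Keeping the two full dumps around still matters, because other counters may turn out to be interesting later.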

Also, would you happen to have nf_conntrack loaded (check with lsmod)?
When that is the case, we always get very ugly results, but it mainly
affects connect times, and in your case I saw large response times too.
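A quick way to do the lsmod check is sketched below. The helper name is made up, and it also matches ip_conntrack because, as far as I know, older kernels use that module name instead:

```shell
# Succeeds if a connection-tracking module appears in lsmod-style output
# read from stdin. Older kernels may name it ip_conntrack.
has_conntrack() {
    grep -Eq '^(nf|ip)_conntrack'
}

if lsmod 2>/dev/null | has_conntrack; then
    echo "conntrack is loaded"
else
    echo "conntrack not found in lsmod output"
fi
```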

> haproxy -vv
> 
> HA-Proxy version 1.4.15 2011/04/08
> Copyright 2000-2010 Willy Tarreau <[email protected]>
> 
> Build options :
>   TARGET  = linux26
>   CPU     = generic
>   CC      = gcc
>   CFLAGS  = -m64 -march=x86-64 -O2 -g -fno-strict-aliasing
>   OPTIONS = USE_PCRE=1

Everything's fine here.
(...)

> My config

Everything OK here too. You said that the numbers slightly improved
with tcp-smart-accept and tcp-smart-connect. Normally that can be
caused by a congested network or by losses. What really puzzles me
is that while those issues are very common, I don't see why they
wouldn't show up with nginx too.

Oh, one thing I forgot which can make a difference: buffer sizes.
The larger the buffer, the more smoothly losses are absorbed, because
they induce fewer timeouts/RTTs. I don't know what size nginx
uses, but I remember it has dynamic buffer sizes. Haproxy defaults
to 16 kB. You can try increasing it to 64 kB and see if it changes
anything:

   global
        tune.bufsize 65536

Maybe you should run a tcpdump between haproxy and the server, or
even better, on the haproxy machine AND on one of the servers (you
can disable a number of servers if it's a test config). That way
we'll see where the response time is being spent.
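As a rough sketch, the capture could be set up as below. The interface, server address, port and file name are placeholders to adapt, and the script only prints the command so it can be checked before being run as root on each machine:

```shell
# Placeholder values: adjust to the real interface, server address and port.
IFACE=eth0
SERVER_IP=192.0.2.10
SERVER_PORT=8080

# Build the capture filter for traffic between haproxy and one server.
capture_filter() {
    printf 'host %s and tcp port %s' "$1" "$2"
}

# Print (not run) the tcpdump command line.
# -s 0 captures full packets, -w writes a pcap file for later analysis.
capture_cmd() {
    printf 'tcpdump -i %s -s 0 -w %s %s\n' "$1" "$2" "$(capture_filter "$3" "$4")"
}

capture_cmd "$IFACE" haproxy-side.pcap "$SERVER_IP" "$SERVER_PORT"
```

Running the same filter on the server, with the haproxy address in place of the server address, gives two traces that can be compared packet by packet.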

Regards,
Willy

