On Fri, Jul 04, 2014 at 08:28:50AM +0200, Maxime Brugidou wrote:
> On Jul 4, 2014 8:00 AM, "Willy Tarreau" <[email protected]> wrote:
> >
> > On Fri, Jul 04, 2014 at 01:44:54AM +0200, Maxime Brugidou wrote:
> 
> > > * got back to the standard igb shipped in the kernel (it gets the 8
> > > virtual channels by default for the NIC)
> >
> > OK, but you need to ensure they never deliver IRQs to the second socket,
> > and if possible not to any hyperthread. So in practice a good approach
> > would be to spread them across cpus 0-3 for example and have haproxy on
> > cpu 4 or 5.
> 
> I did exactly that in a second test later before going to sleep and went up
> to 50k session/sec.

Ah great!
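For future readers, that spreading can be scripted. A minimal sketch in sh, where the IRQ numbers are placeholders to be taken from /proc/interrupts for your igb queues, and the script only prints the commands instead of running them:

```shell
#!/bin/sh
# Sketch: spread the 8 igb queue IRQs over cpus 0-3 (two queues per cpu),
# leaving cpu 4/5 free for haproxy. IRQ numbers are placeholders; take the
# real ones from /proc/interrupts. This only prints the commands (dry run).

# hex bitmask selecting a single cpu, as /proc/irq/*/smp_affinity expects
affinity_mask() {
    printf '%x' $((1 << $1))
}

cpu=0
for irq in 41 42 43 44 45 46 47 48; do   # hypothetical eth0-TxRx-0..7 IRQs
    echo "echo $(affinity_mask $cpu) > /proc/irq/$irq/smp_affinity"
    cpu=$(( (cpu + 1) % 4 ))             # wrap around cpus 0-3
done
```

Drop the outer "echo" once the IRQ numbers match your box.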

> I actually added a second haproxy process on cpu5 and it easily doubles the
> traffic which is nice.

For a 2 GHz CPU, I'm not much surprised, I must confess.
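For the archives, the two-process pinning is just a few global lines in haproxy 1.5. A sketch, with the process-to-cpu numbers from your test:

```
# haproxy.cfg excerpt (1.5 syntax) - two processes, one per free core
global
    nbproc 2
    cpu-map 1 4    # process 1 on cpu 4
    cpu-map 2 5    # process 2 on cpu 5
```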

> > > * set ethtool -C eth0 rx-usecs 500 on both nginx and haproxy hosts
> > > (it's 3 by default)
> >
> > 500 is huge and will result in delays which can fill the queues. It's
> > particularly important for incoming ACKs because a Tx packet will
> > occupy the TCP buffers for at least 500 us because of this. 3 is an
> > Intel-specific value meaning "auto-adaptive". It tends to work
> > well enough in most cases. Otherwise I use 30-100 depending on the
> > NICs and the workload. Larger values reduce CPU usage for large
> > objects but prevent you from reaching high connection rates. Lower
> > ones do the opposite.
> 
> I realized that too in my next tests and went back to 3. I forgot about the
> RX buffer which is at 256 by default. I tried raising it to 4096 (the
> maximum) with only a small improvement.

It generally does not help, since too-large buffers mean larger rings
that are less efficient to process (i.e. they take more cache space). 128-512
are generally the best options for TCP, with 256 often being optimal.
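For reference, the NIC settings discussed here boil down to two ethtool calls. A sketch that prints them as a dry run (values as discussed above; "3" is Intel's adaptive moderation, 256 is the default ring size):

```shell
#!/bin/sh
# Sketch: print the coalescing and ring settings discussed above (dry run).
nic_tuning_cmds() {
    dev=$1
    echo "ethtool -C $dev rx-usecs 3"     # back to Intel's adaptive moderation
    echo "ethtool -G $dev rx 256 tx 256"  # keep rings small enough to fit cache
}

nic_tuning_cmds eth0
```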

> > > So now the blocking part is not IRQs anymore (they run at worst at 60%
> > > of cpu0); now I get to 100% on cpu1 with haproxy, and latency spikes to
> > > 100ms (instead of 3ms without load).
> >
> > Note that you'll always reach 100% unless you have a huge issue. Haproxy
> > aggregates processing. For example, at a low load, you'll get one full
> > round of I/O and events processing for a single connection, but at higher
> > load if you can accept multiple connections at once, you'll process two
> > of them in a single round. So you can remain at 100% and continue to scale
> > for a while. To give you some metrics, using small objects, we're used to
> > seeing about 120k conn/s through haproxy on a single core of a Xeon 1620.
> 
> You are right and also I didn't realize that what I should also monitor is
> latency since when I run the benchmark it goes through the roof (and the
> nginx backend is not blocking since requesting it directly is super fast).
> 
> I will try to do further testing now that I have a better understanding of
> this. Not sure if there is any HTTP tool like siege that makes it easier to
> monitor object size and latency?

I'm used to using "inject", which I wrote many years ago. It provides one line
per second (like vmstat) with some metrics of req/s, data/s, avg resp time
and standard deviation. It supports neither SSL nor keep-alive, but I find it
useful enough not to switch to other tools. The legends are still in French,
but that should not be a problem for you :-)

  http://git.formilux.org/?p=people/willy/inject.git
  http://1wt.eu/tools/inject/  (for the doc)
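If you want to hack something up yourself in the meantime, inject's per-second output can be roughly approximated with curl and awk. A sketch only (the URL and the driver loop are illustrative; needs curl; the aggregation is split out so it can be fed canned timings):

```shell
#!/bin/sh
# Sketch: vmstat-like monitor, one line per elapsed second with the request
# count and average response time for that second.

aggregate() {
    # stdin: one "epoch_second time_total" pair per request, in order
    awk '$1 != s { if (s != "") printf "%s req/s=%d avg_ms=%.1f\n", s, n, 1000*t/n
                   s = $1; n = 0; t = 0 }
                 { n++; t += $2 }'
}

run_load() {   # hypothetical driver: hammer $1, emit "second time_total" lines
    while :; do
        printf '%s %s\n' "$(date +%s)" \
            "$(curl -so /dev/null -w '%{time_total}' "$1")"
    done
}

# usage: run_load http://backend:8080/obj | aggregate   (Ctrl-C to stop)
```

It's single-connection and sequential, so nowhere near inject, but enough to watch latency drift during a run.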

> > > I am still testing with small
> > > packets. The goal is to max out the session per sec not the bandwidth.
> >
> > OK but you need first to ensure that you *can* max out the bandwidth,
> > otherwise it definitely indicates a setup problem.
> 
> OK I'll try to do that with large objects first. I can also increase MTU
> with the backend, use splicing and maybe LRO; it should max out the
> bandwidth.

You should never need to increase MTU at such rates. Even at 40 Gbps I'm
working with 1500. Splicing tends to be slower than copying with many gig
NICs, so reserve it for 10G+ NICs unless your tests show that it's better.
LRO is useless at such low speeds.
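If you do test splicing on the large-object run anyway, it's a one-line toggle per proxy in haproxy 1.5, so it's easy to A/B. A sketch (the names and addresses are examples):

```
# haproxy.cfg excerpt - enable kernel TCP splicing for this proxy only
listen bench
    bind :8000
    server nginx1 192.168.0.10:80
    option splice-response   # splice server->client data; compare with it off
```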

> > > - have you disabled irq-balance ?
> > > no
> >
> > You should definitely, otherwise you bind your IRQs by hand, and when
> > you don't look, it changes them behind your back.
> 
> Sorry this was meant to be a yes. IRQ balance is not active and has never
> been.

OK.

> > Note: you're not the first one to encounter performance issues on a
> > low-frequency CPU. If we could pinpoint the limiting factor and document it
> > for future reference, it would be useful. It would also be useful if we
> > can indicate that some hardware combinations must be avoided (eg: NICs,
> > CPUs, RAM speed, etc).
> >
> 
> Yes. Thanks for the help. So I get that 2 GHz is slow, especially since
> we use a low-power chassis, as most of our hardware does not need this
> kind of performance in general.

Whenever you want low latency or high bandwidth, you need to test hardware
before selecting the one you'll need. It took me 6 months to find hardware
capable of 10 Gbps in 2009. 10G NICs will give you much better performance
even at rates below 1G. I know a few web sites very sensitive to response
time which have switched to 10G just for this reason. Myricom NICs will
provide you with a very low latency, but will hardly scale to 10G unless
you're mostly dealing with huge objects. Intel NICs will reach higher bit
rates, but come with a higher CPU usage so are not necessarily relevant
for rates of 1G or less.

> I am still a bit disappointed by the connections speed I reach.

You're too impatient :-)
After 1 day and 3 mails, you've doubled your performance!

> I'll update you later today trying out all the solutions and getting better
> data.

OK.

Willy

