On Jul 4, 2014 8:00 AM, "Willy Tarreau" <[email protected]> wrote:
>
> On Fri, Jul 04, 2014 at 01:44:54AM +0200, Maxime Brugidou wrote:

> > * got back to the standard igb shipped in the kernel (it gets the 8
> > virtual channels by default for the NIC)
>
> OK, but you need to ensure they never deliver IRQs to the second socket,
> and if possible not to any hyperthread. So in practice a good approach
> would be to spread them across cpus 0-3 for example and have haproxy on
> cpu 4 or 5.

I did exactly that in a second test before going to sleep and got up to
50k sessions/sec.

I actually added a second haproxy process on cpu5, and it easily doubled the
traffic, which is nice.

> > * set ethtool -C eth0 rx-usecs 500 on both nginx and haproxy hosts
> > (it's 3 by default)
>
> 500 is huge and will result in delays which can fill the queues. It's
> particularly important for incoming ACKs because a Tx packet will
> occupy the TCP buffers for at least 500 us because of this. 3 is an
> intel-specific value meaning "auto-adaptive". It tends to work
> well enough in most cases. Otherwise I use 30-100 depending on the
> NICs and the workload. Larger values reduce CPU usage for large
> objects but prevent you from reaching high connection rates. Lower
> ones do the opposite.

I realized that too in my next tests and went back to 3. I had forgotten about
the RX ring buffer, which is 256 entries by default. I tried raising it to 4096
(the maximum), with some small improvement.

>
> > So now the blocking part is not IRQs anymore (they run at worst at 60%
> > on cpu0); now I get 100% on cpu1 with haproxy, and latency spikes to
> > 100ms (instead of 3ms without load).
>
> Note that you'll always reach 100% unless you have a huge issue. Haproxy
> aggregates processing. For example, at a low load, you'll get one full
> round of I/O and events processing for a single connection, but at higher
> load if you can accept multiple connections at once, you'll process two
> of them in a single round. So you can remain at 100% and continue to scale
> for a while. To give you some metrics, using small objects, we're used to
> seeing about 120k conn/s through haproxy on a single core of a Xeon 1620.

You are right, and I also didn't realize that I should be monitoring latency:
when I run the benchmark it goes through the roof (and the nginx backend is not
the bottleneck, since requesting it directly is very fast).

I will try to do further testing now that I have a better understanding of
this. Is there an HTTP benchmarking tool like siege that makes it easier to
monitor object size and latency?
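
One option is wrk, which prints a latency distribution when asked; the URL,
thread/connection counts and duration below are placeholders, and object size
is varied simply by requesting differently sized files on the backend:

```shell
# 4 threads, 256 connections, 30s run, with latency percentiles reported
wrk -t4 -c256 -d30s --latency http://haproxy-host/small-object.html
```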

>
> > I am still testing with small
> > packets. The goal is to max out the session per sec not the bandwidth.
>
> OK but you need first to ensure that you *can* max out the bandwidth,
> otherwise it definitely indicates a setup problem.

OK, I'll try to do that with large objects first. I can also increase the MTU
toward the backend and use splicing and maybe LRO; that should max out the
bandwidth.
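
Roughly, that plan would translate to something like this; eth1 as the
backend-facing interface is an assumption, and LRO depends on NIC support:

```shell
ip link set dev eth1 mtu 9000       # jumbo frames toward the backend
ethtool -K eth1 lro on              # large receive offload, if supported
# Kernel TCP splicing is enabled in haproxy.cfg with one of:
#   option splice-auto
#   option splice-request
#   option splice-response
```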

> > - have you disabled irq-balance ?
> > no
>
> You should definitely, otherwise you bind your IRQs by hand, and when
> you don't look, it changes them behind your back.

Sorry, this was meant to be a yes. irqbalance is not active and never has
been.
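
For anyone following along, disabling it looks like this; the exact commands
depend on the distro's init system:

```shell
service irqbalance stop             # sysvinit-style systems
chkconfig irqbalance off            # keep it from coming back at boot
# or on systemd systems:
systemctl stop irqbalance && systemctl disable irqbalance
```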

> Note: you're not the first one to encounter performance issues on a low
> frequency CPU, if we could pin point the limiting factor and document it
> for future reference, it would be useful. It would also be useful if we
> can indicate that some hardware combinations must be avoided (eg: NICs,
> CPUs, RAM speed, etc).
>

Yes, thanks for the help. So I understand that 2GHz is slow, especially since
we use a low-power chassis because most of our hardware doesn't need this
level of performance.

I am still a bit disappointed by the connection rate I'm reaching.

I'll update you later today trying out all the solutions and getting better
data.
