On Fri, Jul 04, 2014 at 01:44:54AM +0200, Maxime Brugidou wrote:
> Actually i checked and the first 6 cores are the first threads of the
> first socket, then the 6 next are for the second socket then the next
> 12 cores are the hyperthreads.

OK.

> So all this should not be an issue although you are right i should
> deactivate hyperthreading anyway to simplify tests.

Yes it generally helps.

> >> I think that we have been looking way too deep in the problem and the
> >> solution must be right in front of us.
> >>
> >> Does anyone have ideas?
> >
> > Could you check your network card's traffic (ideally on the switch) in
> > terms of bit rate and packet rate in each direction ? At 15k hits/s it
> > depends a lot on the object size, especially when running on gigabit
> > NICs which are easily overloaded.
> 
> OK so i ran new tests with a separate nginx server (using multiple
> workers to handle the load).
> The bottleneck clearly seems to be on the network stack, especially the
> number of packets per second.

Not very surprising with a gig NIC. I don't use them anymore in high
speed tests: whenever you get "close" to the bandwidth limit, you
don't know what you're measuring and they quickly become the bottleneck.

> What i did:
> * got back to the standard igb shipped in the kernel (it gets the 8
> virtual channels by default for the NIC)

OK, but you need to ensure they never deliver IRQs to the second socket,
and if possible not to any hyperthread. So in practice a good approach
would be to spread them across cpus 0-3 for example and have haproxy on
cpu 4 or 5.
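Something like this could do it (an untested sketch; the interface name,
the "eth0-TxRx" IRQ naming and the queue count are assumptions, adjust
them to what /proc/interrupts shows on your box):

```shell
#!/bin/sh
# Spread the 8 igb queue IRQs across cpus 0-3 (two queues per core),
# leaving the core running haproxy free of NIC interrupts.
# smp_affinity takes a hex CPU bitmask: cpu0 -> 1, cpu1 -> 2, cpu3 -> 8...
i=0
for irq in $(awk -F: '/eth0-TxRx/ {gsub(/ /,"",$1); print $1}' /proc/interrupts); do
    cpu=$((i % 4))                                  # round-robin over cpus 0-3
    printf '%x\n' $((1 << cpu)) > /proc/irq/$irq/smp_affinity
    i=$((i + 1))
done
```

Needs root, obviously, and irqbalance must be off or it will undo this.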

> * went back to 4 siege clients trying both with and without
> keep-alive, using 800 concurrency per siege
> * removed all the smp_affinity settings of any IRQ (actually by default
> everything seems to go to cpu0)

So you're currently at risk of having some packets processed by socket 2
for the network and socket 1 for haproxy.

> * pinned the haproxy process to cpu1 using cpu-map config

just a hint, I find it easier to pin haproxy to the highest core of the
first socket (here it's 5) because that leaves me with contiguous CPU
masks for the IRQs' smp_affinity if I want to experiment with different
values.
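In config terms that would give something like this (a hypothetical
global section, assuming a single process and core 5 being the last
core of the first socket):

```
global
    nbproc 1
    cpu-map 1 5    # bind process 1 to cpu 5, last core of socket 0
```

Then all the remaining cores 0-4 form one contiguous mask for the IRQs.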

> * set ethtool -C eth0 rx-usecs 500 on both nginx and haproxy hosts
> (it's 3 by default)

500 is huge and will result in delays which can fill the queues. It's
particularly important for incoming ACKs because a Tx packet will
occupy the TCP buffers for at least 500 us because of this. 3 is an
Intel-specific value meaning "auto-adaptive". It tends to work
well enough in most cases. Otherwise I use 30-100 depending on the
NICs and the workload. Larger values reduce CPU usage for large
objects but prevent you from reaching high connection rates. Lower
ones do the opposite.
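Concretely, the values above translate to something like this (the
numbers are only illustrative starting points, measure what fits your
workload):

```shell
# Show the current coalescing settings first
ethtool -c eth0

# Intel "adaptive" mode, usually a sane default on igb/e1000e
ethtool -C eth0 rx-usecs 3

# Or a fixed moderate value: lower CPU usage on large objects while
# keeping connection rates reasonable
ethtool -C eth0 rx-usecs 50
```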

> * deactivate haproxy logging
> * deactivated splice
> * activated tcp-smart-connect and http-keep-alive
> 
> All in all it got me to around 25k request/sec without keep alive and
> 39k with keep alive.

OK that's much better now.

> It also solves the soft-interrupt CPU usage i was seeing on cpu1.
> 
> I couldn't check on the switch but used iptraf to get some stats:
> 
> * with very small packets we get at roughly 200kpacket/sec (100k RX
> 100k TX) at about 50Mbps

OK there's some margin left.

> * with larger response (a default index.html file for nginx on CentOS
> EPEL) we almost max out the 1G NIC at around ~800Mbps

Using large objects you should reach 948 Mbps of TCP payload, or
975 Mbps of IP payload, so your 800 Mbps are still too low, possibly
because of the too-large rx-usecs.
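For reference, those ceilings come straight from Ethernet framing
overhead; a quick back-of-the-envelope check (assuming a standard
1500-byte MTU and no TCP options, hence a full 1460-byte MSS):

```shell
# A full-size frame occupies 1538 bytes on the wire:
# preamble+SFD (8) + Ethernet header (14) + 1500 MTU + FCS (4) + inter-frame gap (12)
awk 'BEGIN {
    printf "IP payload : %.0f Mbps\n", 1000 * 1500 / 1538   # what the NIC counters can show
    printf "TCP payload: %.0f Mbps\n", 1000 * 1460 / 1538   # minus 20B IP + 20B TCP headers
}'
```

That gives ~975 Mbps of IP payload and ~949 Mbps of TCP payload, within
a Mbps of the figures quoted above (TCP options shave off a bit more).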

Here a quick connection test between my PC (sky2) and my laptop
(e1000e) using inject+httpterm gives me this :
   - small objects (empty) : 55000 connections per second, limited
     by local PC,
     225 kpps in, 170 kpps out, 180 Mbps in, 165 Mbps out

   - medium objects (2kB) : 47000 connections per second,
     233 kpps in, 188 kpps out, 180 Mbps in, 916 Mbps out

   - large objects (20kB) : 5700 connections per second,
     58 kpps in, 91 kpps out, 38 Mbps in, 977 Mbps out.

> So now the blocking part is not IRQs anymore (they run at worst at 60%
> of cpu0); now i get 100% on cpu1 with haproxy, and latency spikes to
> 100ms (instead of 3ms without load).

Note that you'll always reach 100% unless you have a huge issue. Haproxy
aggregates processing. For example, at a low load, you'll get one full
round of I/O and events processing for a single connection, but at higher
load if you can accept multiple connections at once, you'll process two
of them in a single round. So you can remain at 100% and continue to scale
for a while. To give you some metrics, using small objects, we typically
see about 120k conn/s through haproxy on a single core of a Xeon 1620.

> I am still testing with small
> packets. The goal is to max out the session per sec not the bandwidth.

OK but you need first to ensure that you *can* max out the bandwidth,
otherwise it definitely indicates a setup problem.

> I can still improve the IRQ part using the 5 cores i have left on the
> same socket since i have 8 virtual channels i can divide by 4 the
> number of interrupt per core.

Yes that was the idea above.

> However now the bottleneck is the haproxy process at 100% (mostly system).

What you can do is play with maxaccept and maxpollevents, you can reduce
them to 1. It will artificially limit haproxy's performance and make it
reach 100% faster. But it will give a measure that's almost independent
of the workload and will tell you if that's the system which is limiting
or not. Doing so gives me 29400 conn/s on the small objects test above,
with haproxy running at 100% (20% user, 55% system, 25% softirq). Without
these settings, the load climbs to 41000 conns/s, still at 100% (18% us,
56% sys, 26% si).
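For the test above, that's this kind of global section (a sketch; check
that your haproxy version supports these tune.* keywords):

```
global
    tune.maxaccept     1   # accept at most 1 connection per wakeup
    tune.maxpollevents 1   # process at most 1 event per polling round
```

Remember to remove them afterwards, they're only meant for measuring.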

So I expect that you must reach at least 29k conn/s on your server, given
that I'm getting that on a dual-core dual-thread laptop with irqs and
server polluting each other.

> I still think that getting 25krps without keep-alive is very low.

Yes I agree.

> For Baptiste questions:
> - what is the exact reference of your CPU ?
> - what is the frequency of your CPU ?
> 
> from DMI (The server is a HP Gen8 DL360e):
>         Version:  Intel(R) Xeon(R) CPU E5-2430L 0 @ 2.00GHz

OK, keep in mind that you can't expect to go very far with a 2 GHz CPU,
but still you should get more than what you get.

>         Voltage: 1.4 V
>         External Clock: 100 MHz
>         Max Speed: 4800 MHz
>         Current Speed: 2000 MHz
>         Status: Populated, Enabled
>         Upgrade: Socket LGA1356
> 
> 
> - what is the command line you run on the "client" side (siege)
> siege -b -c 800 -t30S http://my-lb/ (i use .siegerc for keep-alive too)
> 
> - have you disabled irq-balance ?
> no

You should definitely, otherwise you bind your IRQs by hand, and when
you don't look, it changes them behind your back.
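On a sysvinit distro such as your CentOS that would be (on a systemd
system, use systemctl stop/disable instead):

```shell
service irqbalance stop       # stop it now
chkconfig irqbalance off      # keep it off across reboots
```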

> - what type of network interface are you using? (and which driver)
> igb kernel driver version:        5.0.5-k
> 
> - are you benchmarking in keep-alive mode or not?
> i was not, keep-alive improves performance in terms of requests per
> second a bit but i try not to use it for now

I agree that's better without to get the minimal performance level you
can guarantee.

> Thanks for all the help, this is really interesting feedback that i got.

Note: you're not the first one to encounter performance issues on a low
frequency CPU. If we could pinpoint the limiting factor and document it
for future reference, it would be useful. It would also be useful if we
could indicate that some hardware combinations must be avoided (e.g. NICs,
CPUs, RAM speed, etc).

Willy
