On Jul 4, 2014 8:00 AM, "Willy Tarreau" <[email protected]> wrote:
>
> On Fri, Jul 04, 2014 at 01:44:54AM +0200, Maxime Brugidou wrote:
> > * got back to the standard igb shipped in the kernel (it gets the 8
> > virtual channels by default for the NIC)
>
> OK, but you need to ensure they never deliver IRQs to the second socket,
> and if possible not to any hyperthread. So in practice a good approach
> would be to spread them across cpus 0-3 for example and have haproxy on
> cpu 4 or 5.

I did exactly that in a second test later before going to sleep and went up
to 50k sessions/sec. I also added a second haproxy process on cpu5, which
easily doubled the traffic, which is nice.

> > * set ethtool -C eth0 rx-usecs 500 on both nginx and haproxy hosts
> > (it's 3 by default)
>
> 500 is huge and will result in delays which can fill the queues. It's
> particularly important for incoming ACKs because a Tx packet will
> occupy the TCP buffers for at least 500 us because of this. 3 is an
> intel-specific value meaning "auto-adaptive". It tends to work
> well enough in most cases. Otherwise I use 30-100 depending on the
> NICs and the workload. Larger values reduce CPU usage for large
> objects but prevent you from reaching high connection rates. Lower
> ones do the opposite.

I realized that too in my next tests and went back to 3. I had forgotten
about the RX ring buffer, which is 256 by default; I tried raising it to
4096 (the maximum) with some small improvement.

> > So now the blocking part is not IRQs anymore (they run at worst at 60%
> > of cpu0); now I get to 100% on cpu1 with haproxy, and latency spikes to
> > 100ms (instead of 3ms without load).
>
> Note that you'll always reach 100% unless you have a huge issue. Haproxy
> aggregates processing. For example, at a low load, you'll get one full
> round of I/O and events processing for a single connection, but at higher
> load if you can accept multiple connections at once, you'll process two
> of them in a single round. So you can remain at 100% and continue to scale
> for a while.
> To give you some metrics, using small objects, we're used to
> see about 120k conn/s through haproxy on a single core of a xeon 1620.

You are right, and I also realize that what I should be monitoring is
latency, since when I run the benchmark it goes through the roof (and the
nginx backend is not the bottleneck, since requesting it directly is super
fast). I will do further testing now that I have a better understanding of
this. Is there any HTTP tool like siege that makes it easier to monitor
object size and latency?

> > I am still testing with small
> > packets. The goal is to max out the sessions per sec, not the bandwidth.
>
> OK but you need first to ensure that you *can* max out the bandwidth,
> otherwise it definitely indicates a setup problem.

OK, I'll try that with large objects first. I can also increase the MTU
toward the backend, use splicing and maybe LRO; that should max out the
bandwidth.

> > - have you disabled irq-balance ?
> >
> > no
>
> You should definitely, otherwise you bind your IRQs by hand, and when
> you don't look, it changes them behind your back.

Sorry, that was meant to be a yes: irqbalance is not active and never has
been.

> Note: you're not the first one to encounter performance issues on a low
> frequency CPU. If we could pinpoint the limiting factor and document it
> for future reference, it would be useful. It would also be useful if we
> could indicate that some hardware combinations must be avoided (e.g.
> NICs, CPUs, RAM speed, etc.).

Yes, thanks for the help. I get that 2GHz is slow, especially since we use
a low-power chassis, as most of our hardware does not need this kind of
performance in general. I am still a bit disappointed by the connection
rate I reach. I'll update you later today after trying out all the
solutions and getting better data.
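The second haproxy process was added with something like the following
haproxy 1.5 configuration fragment (the cpu numbers match my layout above;
treat this as a sketch, not my exact config):

```
global
    nbproc 2
    # bind process 1 to cpu 4 and process 2 to cpu 5,
    # away from the IRQ cpus 0-3
    cpu-map 1 4
    cpu-map 2 5
```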
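For reference, here is roughly how I did the IRQ pinning to cpus 0-3. This
is only a sketch: the eth0-TxRx-* interrupt names are what igb shows in
/proc/interrupts on my hosts, so adjust the pattern to yours.

```shell
# Make sure irqbalance is stopped first, or it will move the IRQs back
# when you're not looking.
service irqbalance stop 2>/dev/null || true

# Affinity mask for cpus 0-3: one bit per cpu, so 0xf.
mask=$(printf '%x' $(( (1<<0) | (1<<1) | (1<<2) | (1<<3) )))
echo "mask=$mask"

# Apply the mask to every igb queue IRQ (lines like "eth0-TxRx-0").
for irq in $(awk -F: '/eth0-TxRx/ {gsub(/ /, "", $1); print $1}' /proc/interrupts); do
    echo "$mask" > "/proc/irq/$irq/smp_affinity"
done
```

With all queue IRQs confined to cpus 0-3, haproxy on cpu 4 or 5 never
competes with interrupt handling.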
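And these are the coalescing and ring-buffer settings discussed above, as
I'm applying them now (device name is of course host-specific):

```shell
# Back to the intel "auto-adaptive" interrupt coalescing value.
ethtool -C eth0 rx-usecs 3

# Check the current and maximum RX ring size, then raise it.
ethtool -g eth0
ethtool -G eth0 rx 4096
```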
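Partly answering my own question about latency monitoring: ab (ApacheBench)
already prints a mean latency and a per-request percentile table
("Percentage of the requests served within a certain time"), which should
make the spikes visible. Hostname and object path below are placeholders.

```shell
# 100k keep-alive requests at concurrency 250 against a small object;
# watch the percentile table at the end of the output for latency spikes.
ab -k -n 100000 -c 250 http://haproxy-host:80/small-object
```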
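For the bandwidth test with large objects, the tweaks I mentioned would
look something like this (assuming the backend network and NIC support
jumbo frames and LRO; eth0 is a placeholder):

```shell
# Jumbo frames toward the backend; both ends must agree on the MTU.
ip link set dev eth0 mtu 9000

# Large receive offload, if the NIC/driver supports it.
ethtool -K eth0 lro on
```

On the haproxy side, splicing can be enabled with "option splice-auto" (or
splice-request/splice-response) in the relevant sections.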

