Willy,

As always, you were spot on.
The vSphere 5 / no-swap / SMP-disabled build of Gentoo we have now (which
vSphere thinks is RHEL, so that we can use the vmxnet driver) is basically
running on par with the bare metal one.

Thanks for the help. I'll probably have to look at our other configs
(including some in Solaris x86 zones) and disable swap to see if that helps
(they don't have nearly the traffic).

matt

On Jan 14, 2012, at 12:24 AM, Willy Tarreau wrote:

> Hi John,
>
> On Fri, Jan 13, 2012 at 06:02:54PM -0500, John Lauro wrote:
>> There are all sorts of kernel tuning parameters under /proc that can make
>> a big difference, not to mention what type of virtual NIC you have in the
>> VM. Are they running the same kernel version and Gentoo release? Have
>> you compared sysctl.conf (or whatever Gentoo uses to customize settings in
>> /proc)?
>>
>> Generally I prefer to run haproxy (and only haproxy) in 1-CPU VMs (fewer
>> CPUs, lower latency from the VM scheduler); my only exception is if I
>> also want SSL and load is higher. When dealing with high rates and larger
>> numbers of connections, make sure you don't go low on RAM. Haproxy gets
>> exponentially worse as it starts to swap. In fact, running swapoff -a
>> isn't a bad idea, especially for bench testing... and it takes a lot more
>> RAM to support 8000 connections/sec than 300.
>>
>> In summary, check RAM, and /proc tuning.
>
> You're perfectly right with the RAM issues, as swapping should *NEVER*
> happen on any component involved in web processing. A single swapout/swapin
> cycle is noticeable by the user, and this gets worse with more users, to the
> point the site totally stops responding. This is why admins are always very
> careful not to push MaxClients too far on Apache.
>
> And I too run with swap disabled!
>
> I remember having spent weeks tracking down an issue where haproxy was
> logging network issues (retransmits when establishing connections and
> even receiving requests). In the end we found that a script on the machine
> was using curl to upload the daily logs, and this old version of curl used
> to buffer the whole file in RAM before sending it. Days of large traffic
> were causing many things to be swapped out, and TCP buffers to be shrunk,
> so that for about half a day after the event, drops and retransmits were
> still quite common, causing huge response time delays.
>
> Getting back to Matt's issue, it's nothing new that VMs are *much* slower
> than bare metal for latency-sensitive applications. If you look at how the
> CPU usage is spread in haproxy, you'll often see 15% user and 85% system,
> and the ratio can drift to 1%+99% when transferring large objects. In the
> system, haproxy only uses the network stack. So that's simple: on average,
> the network stack is responsible for 85% to 99% of the performance. That's
> why we try hard to reduce the number of system calls and to merge TCP
> segments when that's possible.
>
> When you add a hypervisor between the kernel and the hardware, there's no
> secret: you have to pass through 2 layers, and whatever optimizations have
> been performed in the kernel are lost due to this extra work.
>
> At Exceliance, we've spent a lot of time benchmarking hypervisors. It
> happens that vSphere 5 is much, much faster than ESX 3, around 5x, meaning
> a much lower overhead. But still it's around half the performance of the
> bare metal. Other hypervisors we've tried are even slower than ESX 3.
> Sometimes, a network driver can make a major change, and even changing the
> hardware NIC can make important changes.
>
> Virtualization is fine for CPU-intensive jobs where latency is not a
> problem, such as number crunching. SSL offloading is not much affected by
> virtualization since most of the work already takes maximum user-land CPU.
> Same for Java or Ruby apps. But if you need very high performance, you
> wouldn't want to run components such as a firewall, router, load balancer
> or proxy in a VM, unless you're ready to waste a lot of power. Note that
> many people do that for convenience reasons, but I still find it wasteful
> to consume twice the power for the same job.
>
> Please note that your numbers seem low, and I don't know if this is because
> of the object sizes or not. Session rate is measured on small (ideally
> empty) objects, and byte rate is measured on large objects. Small objects
> on a 2 GHz machine should be around 20,000-25,000, not 8000. But since you
> say that you're limited by the backends too, it can still be normal.
>
> And your 3500 in vSphere 4 seems low too, unless those are already large
> objects. I have memories of 6500 on ESX 3 with a Core2Duo 3 GHz. Check what
> NIC you're emulating, prefer vmxnet, and try with several hardware NICs in
> this machine (e1000e are fine in general). And also ensure that no other
> VM is started when you run the test, otherwise you'll never get acceptable
> numbers (which is the height of virtualization)!
>
> Regards,
> Willy
>
>>
>>> -----Original Message-----
>>> From: Matt Banks [mailto:[email protected]]
>>> Sent: Friday, January 13, 2012 2:40 PM
>>> To: [email protected]
>>> Subject: VM vs bare metal and threading
>>>
>>> All,
>>>
>>> I'm not sure what the issue is here, but I wanted to know if there was
>>> an easy explanation for this.
>>>
>>> We've been doing some load testing of HAProxy and have found the
>>> following:
>>>
>>> HAProxy (both 1.4.15 and 1.4.19 builds) running under Gentoo in a 2 vCPU
>>> VM (vSphere 4.x) running on a box with a Xeon X5675 (3.06 GHz current gen
>>> Westmere) maxes out (starts throwing 50x errors) at around a session
>>> rate of 3500.
>>>
>>> However, copies of the same binaries pointed at the same backend servers
>>> on a Gentoo box (bare metal) with 2x E5405 (2.00GHz - Q4, 2007 launch)
>>> top out at a session rate of around 8000 - at which point the back end
>>> servers start to fall over. And that HAProxy machine is doing LOTS of
>>> other things at the same time.
>>>
>>> Here's the reason for the query: We're not sure why, but the bare metal
>>> box seems to be balancing the load better across CPUs. (We're using the
>>> same config file, so nbproc is set to 1 for both setups.) Most of our
>>> HAProxy setups aren't really getting hit hard enough to tell if multiple
>>> CPUs are being used or not, as their session rates typically stay around
>>> 300-400.
>>>
>>> We know it's not virtualization in general because we have a virtual
>>> machine in the production version of this system that achieves higher
>>> numbers on lesser hardware.
>>>
>>> Just wondering if there is somewhere we should start looking.
>>>
>>> TIA.
>>> matt
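
For anyone wanting to check the two things discussed in this thread (whether a box is dipping into swap, and how CPU time splits between user and system), here is a minimal Python sketch. It is not part of HAProxy or anything posted above; the function names are made up for illustration. It assumes the usual Linux layout of /proc/meminfo and of the aggregate "cpu" line in /proc/stat, and it counts irq and softirq time as "system", which is a choice made here rather than something stated in the thread. Note also that it measures the whole machine, not just the haproxy process.

```python
def swap_in_use_kb(meminfo_text):
    """Return swap currently in use, in kB, given the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            fields[key.strip()] = int(parts[0])
    # SwapTotal - SwapFree is the amount swapped out; 0 if swap is disabled.
    return fields.get("SwapTotal", 0) - fields.get("SwapFree", 0)


def cpu_counters(stat_text):
    """Return (user, system) tick counters from the aggregate 'cpu' line.

    Assumed field order: user nice system idle iowait irq softirq ...
    'user' here is user + nice; 'system' is system + irq + softirq
    (an assumption of this sketch, so that network interrupt work is
    counted on the system side).
    """
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            values = [int(v) for v in line.split()[1:]]
            values += [0] * (7 - len(values))  # pad for very old kernels
            user = values[0] + values[1]
            system = values[2] + values[5] + values[6]
            return user, system
    raise ValueError("no aggregate 'cpu' line found")


def user_system_split(before, after):
    """Return (user %, system %) of busy CPU time between two samples."""
    d_user = after[0] - before[0]
    d_system = after[1] - before[1]
    busy = d_user + d_system
    if busy == 0:
        return 0.0, 0.0
    return 100.0 * d_user / busy, 100.0 * d_system / busy
```

To use it on a live Linux machine, read /proc/meminfo once and /proc/stat twice a few seconds apart during a load test, then pass the texts to the functions above. A non-zero result from swap_in_use_kb during the test is the situation the thread warns about.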

