Thanks for the response, Willy. I agree with what you are saying. I have load-tested a lot of different machines/systems, and VMs never perform as well as a physical machine. However, in this case we have to use Amazon, so the focus is on getting the most out of a single instance and then scaling out with more machines to get more performance.

Extra Large EC2 instances are "supposed" to be dedicated machines in the cloud; no one else should be using them except you. But if I can't get HAProxy to use the XL instance properly, it could be better to run more Large instances (2 cores) instead. That would reduce cost and make better use of the instances. Could stud then use one of the cores and HAProxy the other?

I have read about a lot of people who have tried stud. This example is interesting in this case because the author assigns the different processes to different cores with cpuset:
http://vincent.bernat.im/en/blog/2011-ssl-benchmark.html

In my case, would cpuset be the same as taskset?

/E

-----Original Message-----
From: Willy Tarreau [mailto:[email protected]]
Sent: 8 October 2011 00:11
To: Erik Torlen
Cc: [email protected]
Subject: Re: HAProxy, multicores and EC2

Hi Erik,

On Sat, Oct 08, 2011 at 01:04:39AM +0000, Erik Torlen wrote:
> Hi,
>
> I'm running HAProxy on a EC2 Xlarge instance together with stud.
>
> I'm having troubles to get HAProxy use all the cores on the machine,

You should not need to do that and should not do that at all.

> HAProxy with nbproc=3 is actually using more cores but the actual performance
> is not so much better. Running 6k concurrent users doing 3 reqs each

This means you're limited by something else.

> gave around 5100 req/s for nbproc=1 and 6200 req/s with nbproc=3.
> And this is only through http, not https with stud.

5100 or 6200 req/s is pretty low; this is less than what you get out of an Atom CPU with a single core. Most likely the limiting factor here is the virtualization layers, which considerably limit the packet rate. I also once met a case where conntrack was running on the host OS and was not even tuned. You can easily guess that if the host OS or the virtualization layer is the bottleneck, nothing will work faster in the VMs. And I suspect that when you're doing your tests, other people running VMs on the same hardware are seriously affected.

Anyway, 6k req/s for a VM is in the range of common values for large boxes. It's not a secret that networking is the worst thing a VM can do.

> I have done a lot of tuning on the OS and turned of conntracks which has
> given me good performance enhancements.
>
> I have read this :
> http://www.mail-archive.com/[email protected]/msg00891.html
> But can't actually figure out if it really gave the guy so much. He said he
> bound haproxy to cpu1, how is that done?

You can do this with "taskset" but I doubt it will help in your case.

> As you can see below there is a lot of % on the "si". What does that say?

It says that all the time is wasted processing packets at the network driver level, which is to be expected in such an environment.

> With nbproc =3 haproxy utilizes all the cores when checking (top -d 1).
>
> top - 20:01:58 up 2 days,  1:13,  1 user,  load average: 1.83, 1.48, 0.97
> Tasks:  83 total,   4 running,  79 sleeping,   0 stopped,   0 zombie
> Cpu0  : 12.0%us, 21.0%sy,  0.0%ni,  1.0%id,  0.0%wa,  0.0%hi, 66.0%si,  0.0%st
> Cpu1  :  4.8%us,  9.7%sy,  0.0%ni, 80.6%id,  0.0%wa,  0.0%hi,  4.8%si,  0.0%st
> Cpu2  : 11.1%us, 22.2%sy,  0.0%ni, 56.3%id,  0.0%wa,  0.0%hi,  9.6%si,  0.7%st
> Cpu3  :  4.7%us,  9.4%sy,  0.0%ni, 81.8%id,  0.0%wa,  0.0%hi,  4.1%si,  0.0%st
> Mem:  15374136k total,  1067816k used, 14306320k free,    37880k buffers
> Swap:        0k total,        0k used,        0k free,   236184k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  1430 haproxy   20   0 72908  31m  656 R 99.3  0.2   4:50.85 haproxy
>  1431 haproxy   20   0 75648  32m  652 R 64.5  0.2   4:56.19 haproxy
>  1432 haproxy   20   0 76204  32m  652 R 63.6  0.2   4:58.69 haproxy
>
> With nbproc = 1 haproxy is only utilizing one core.
>
> top - 20:57:24 up 2 days,  2:08,  1 user,  load average: 0.93, 0.65, 0.49
> Tasks:  81 total,   2 running,  79 sleeping,   0 stopped,   0 zombie
> Cpu0  :  0.0%us,  0.0%sy,  0.0%ni, 97.6%id,  0.0%wa,  0.0%hi,  0.8%si,  1.6%st
> Cpu1  : 27.0%us, 49.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi, 24.0%si,  0.0%st
> Cpu2  :  0.5%us,  0.0%sy,  0.0%ni, 99.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Mem:  15374136k total,  1147756k used, 14226380k free,    38100k buffers
> Swap:        0k total,        0k used,        0k free,   236244k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  1582 haproxy   20   0  242m 113m  656 R 100.2  0.8   7:17.05 haproxy

Interestingly, you can see that the %si is much lower with a single process, which probably means that contention between the processes' traffic involves locking in the driver and makes things even worse.

Quite frankly, if you need to achieve a load higher than a few thousand conns/s, I strongly discourage you from running your nodes in VMs. You should get bare metal; even the cheapest machines will do better and will cost you less. A dedicated Atom will do slightly better. A dedicated Core i3 or Phenom will do in the 40k range.

Regards,
Willy
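
On the question of how a process gets bound to one core: "taskset" (for example `taskset -c 1 <command>` or `taskset -cp 1 <pid>`) sets the Linux CPU affinity mask via sched_setaffinity. The following is only a minimal sketch of the same mechanism from Python, assuming a Linux host. The helper name `pin_to_cpu` is hypothetical and not part of HAProxy, stud, or taskset, and nothing here suggests pinning would fix the throughput limit discussed in this thread.

```python
import os


def pin_to_cpu(pid: int, cpu: int) -> set:
    """Restrict `pid` (0 means the calling process) to a single CPU.

    Hypothetical helper: roughly what `taskset -cp <cpu> <pid>` does.
    Linux only, since os.sched_setaffinity is not available elsewhere.
    """
    allowed = os.sched_getaffinity(pid)
    if cpu not in allowed:
        raise ValueError(f"CPU {cpu} is not in the allowed set {sorted(allowed)}")
    os.sched_setaffinity(pid, {cpu})
    # Read the mask back so the caller can confirm what the kernel applied.
    return os.sched_getaffinity(pid)


if __name__ == "__main__":
    original = os.sched_getaffinity(0)
    target = min(original)  # pick any CPU this process may already use
    print("pinned to:", pin_to_cpu(0, target))
    os.sched_setaffinity(0, original)  # restore the original mask
```

For a running haproxy process, the pid from top (1430, 1431, 1432 in the output above) would be passed instead of 0, which requires sufficient privileges over that process.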

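The %si comparison between the two runs can be made concrete by pulling the softirq column out of the "Cpu" lines of the top output quoted in this thread. This is only a rough sketch: the function name `softirq_by_cpu` is made up for illustration, and the regular expression assumes the older top layout shown above (`NN.N%si`), which newer versions of top format differently.

```python
import re

# "Cpu" lines copied from the two top snapshots quoted in this thread.
NBPROC_3 = """\
Cpu0 : 12.0%us, 21.0%sy, 0.0%ni, 1.0%id, 0.0%wa, 0.0%hi, 66.0%si, 0.0%st
Cpu1 : 4.8%us, 9.7%sy, 0.0%ni, 80.6%id, 0.0%wa, 0.0%hi, 4.8%si, 0.0%st
Cpu2 : 11.1%us, 22.2%sy, 0.0%ni, 56.3%id, 0.0%wa, 0.0%hi, 9.6%si, 0.7%st
Cpu3 : 4.7%us, 9.4%sy, 0.0%ni, 81.8%id, 0.0%wa, 0.0%hi, 4.1%si, 0.0%st
"""

NBPROC_1 = """\
Cpu0 : 0.0%us, 0.0%sy, 0.0%ni, 97.6%id, 0.0%wa, 0.0%hi, 0.8%si, 1.6%st
Cpu1 : 27.0%us, 49.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 24.0%si, 0.0%st
Cpu2 : 0.5%us, 0.0%sy, 0.0%ni, 99.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
"""

SI_PATTERN = re.compile(r"^(Cpu\d+)\s*:.*?([\d.]+)%si", re.MULTILINE)


def softirq_by_cpu(top_text: str) -> dict:
    """Map each CPU label to its %si (softirq) value."""
    return {cpu: float(si) for cpu, si in SI_PATTERN.findall(top_text)}


if __name__ == "__main__":
    for label, text in (("nbproc=3", NBPROC_3), ("nbproc=1", NBPROC_1)):
        si = softirq_by_cpu(text)
        busiest = max(si, key=si.get)
        print(label, si, "busiest:", busiest, "total %si:", round(sum(si.values()), 1))
    # Throughput figures reported in the thread: 5100 req/s vs 6200 req/s.
    print("req/s ratio, nbproc=3 over nbproc=1:", round(6200 / 5100, 2))
```

Summed over the four cores, softirq time is several times higher with three processes than with one, while the reported request rate rises only from 5100 to 6200 req/s.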
