Thanks for the response, Willy.

I agree with what you are saying.
I have load-tested a lot of different machines and systems, and VMs never
perform as well as physical machines. However, in this case we have to use
Amazon, so the focus is on getting the most out of a single instance and
then scaling out with more machines to get more performance.

Extra Large EC2 instances are "supposed" to be dedicated machines in the
cloud; no one else should be using them except you. But if I can't get
HAProxy to use the XL instance properly, it could be better to run more
Large instances (2 cores) instead. That would reduce cost and make better
use of the instances. Could I then have stud use one of the cores and
HAProxy the other?

I have read about a lot of people who have tried stud. This example is
interesting here because he assigns the different processes to different
cores with cpuset:
http://vincent.bernat.im/en/blog/2011-ssl-benchmark.html

In my case, would cpuset be the same as taskset? 

/E

-----Original Message-----
From: Willy Tarreau [mailto:[email protected]] 
Sent: 8 October 2011 00:11
To: Erik Torlen
Cc: [email protected]
Subject: Re: HAProxy, multicores and EC2

Hi Erik,

On Sat, Oct 08, 2011 at 01:04:39AM +0000, Erik Torlen wrote:
> Hi,
> 
> I'm running HAProxy on an EC2 Xlarge instance together with stud.
> 
> I'm having trouble getting HAProxy to use all the cores on the machine,

You should not need to do that and should not do that at all.

> HAProxy with nbproc=3 is actually using more cores but the actual performance 
> is not so much better. Running 6k concurrent users doing 3 reqs each

This means you're limited by something else.

> gave around 5100 req/s for nbproc=1 and 6200 req/s with nbproc=3.
> And this is only through http, not https with stud.

5100 or 6200 req/s is pretty low, this is less than what you get out
of an Atom CPU with a single core. Most likely the limiting factor
here is the virtualization layer, which considerably limits the
packet rate. I also once met a case where conntrack was running
on the host OS and was not even tuned. You can easily guess that if
the host OS or the virtualization layer is the bottleneck, nothing
will work faster in the VMs. And I suspect that when you're doing
your tests, other people running VMs on the same hardware are seriously
affected.

Anyway, 6k req/s for a VM is in the range of common values for large
boxes. It's not a secret that networking is the worst thing a VM can
do.

> I have done a lot of tuning on the OS and turned off conntrack, which has 
> given me good performance enhancements.
> 
> I have read this : 
> http://www.mail-archive.com/[email protected]/msg00891.html
> But can't actually figure out if it really gave the guy so much. He said he 
> bound haproxy to cpu1, how is that done?

You can do this with "taskset" but I doubt it will help in your case.
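
For illustration only, here is a minimal Python sketch of what taskset
does under the hood on Linux: it sets the CPU affinity mask through
sched_setaffinity(2), which Python exposes as os.sched_setaffinity. The
helper name pin_to_cpu is hypothetical, and the demo pins the current
process rather than a running haproxy.

```python
import os


def pin_to_cpu(pid, cpu):
    """Restrict process `pid` (0 means the calling process) to one CPU.

    Linux only. This is the same affinity system call that the
    taskset utility relies on.
    """
    os.sched_setaffinity(pid, {cpu})
    # Read the mask back so the caller can confirm what the kernel applied.
    return os.sched_getaffinity(pid)


if __name__ == "__main__":
    allowed = sorted(os.sched_getaffinity(0))
    print("allowed CPUs before:", allowed)
    # Pin to the first CPU this process is allowed to run on.
    print("allowed CPUs after: ", sorted(pin_to_cpu(0, allowed[0])))
```

To pin an already running process you would pass its PID instead of 0,
which requires permission to change that process's affinity.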

> As you can see below there is a lot of % on the "si". What does that say?

It says that all the time is wasted processing packets at the network
driver level, which is to be expected in such an environment.

> With nbproc =3 haproxy utilizes all the cores when checking (top -d 1).
> 
> top - 20:01:58 up 2 days,  1:13,  1 user,  load average: 1.83, 1.48, 0.97
> Tasks:  83 total,   4 running,  79 sleeping,   0 stopped,   0 zombie
> Cpu0  : 12.0%us, 21.0%sy,  0.0%ni,  1.0%id,  0.0%wa,  0.0%hi, 66.0%si,  0.0%st
> Cpu1  :  4.8%us,  9.7%sy,  0.0%ni, 80.6%id,  0.0%wa,  0.0%hi,  4.8%si,  0.0%st
> Cpu2  : 11.1%us, 22.2%sy,  0.0%ni, 56.3%id,  0.0%wa,  0.0%hi,  9.6%si,  0.7%st
> Cpu3  :  4.7%us,  9.4%sy,  0.0%ni, 81.8%id,  0.0%wa,  0.0%hi,  4.1%si,  0.0%st
> Mem:  15374136k total,  1067816k used, 14306320k free,    37880k buffers
> Swap:        0k total,        0k used,        0k free,   236184k cached
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 1430 haproxy   20   0 72908  31m  656 R 99.3  0.2   4:50.85 haproxy
> 1431 haproxy   20   0 75648  32m  652 R 64.5  0.2   4:56.19 haproxy
> 1432 haproxy   20   0 76204  32m  652 R 63.6  0.2   4:58.69 haproxy
> 
> With nbproc = 1 haproxy is only utilizing one core.
> 
> top - 20:57:24 up 2 days,  2:08,  1 user,  load average: 0.93, 0.65, 0.49
> Tasks:  81 total,   2 running,  79 sleeping,   0 stopped,   0 zombie
> Cpu0  :  0.0%us,  0.0%sy,  0.0%ni, 97.6%id,  0.0%wa,  0.0%hi,  0.8%si,  1.6%st
> Cpu1  : 27.0%us, 49.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi, 24.0%si,  0.0%st
> Cpu2  :  0.5%us,  0.0%sy,  0.0%ni, 99.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Mem:  15374136k total,  1147756k used, 14226380k free,    38100k buffers
> Swap:        0k total,        0k used,        0k free,   236244k cached
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 1582 haproxy   20   0  242m 113m  656 R 100.2  0.8   7:17.05 haproxy

Interestingly you can see that the %si is much lower with a single process,
which probably means that contention between all process traffic involves
locking in the driver and makes the thing even worse.
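
To make that comparison concrete, here is a rough Python sketch that
pulls the %si column out of the per-CPU lines of the two top snapshots
quoted above and sums it across cores. The function names and the two
sample constants are made up for this example; only the numbers come
from the snapshots.

```python
import re

CPU_LINE = re.compile(r"^Cpu(\d+)\s*:\s*(.*)$")
FIELD = re.compile(r"([\d.]+)%(\w+)")

# Per-CPU lines copied from the two top snapshots in this thread.
NBPROC_3 = """\
Cpu0  : 12.0%us, 21.0%sy,  0.0%ni,  1.0%id,  0.0%wa,  0.0%hi, 66.0%si,  0.0%st
Cpu1  :  4.8%us,  9.7%sy,  0.0%ni, 80.6%id,  0.0%wa,  0.0%hi,  4.8%si,  0.0%st
Cpu2  : 11.1%us, 22.2%sy,  0.0%ni, 56.3%id,  0.0%wa,  0.0%hi,  9.6%si,  0.7%st
Cpu3  :  4.7%us,  9.4%sy,  0.0%ni, 81.8%id,  0.0%wa,  0.0%hi,  4.1%si,  0.0%st
"""
NBPROC_1 = """\
Cpu0  :  0.0%us,  0.0%sy,  0.0%ni, 97.6%id,  0.0%wa,  0.0%hi,  0.8%si,  1.6%st
Cpu1  : 27.0%us, 49.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi, 24.0%si,  0.0%st
Cpu2  :  0.5%us,  0.0%sy,  0.0%ni, 99.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
"""


def parse_top_cpu(text):
    """Return {cpu_number: {field_name: percent}} for each 'CpuN :' line."""
    result = {}
    for line in text.splitlines():
        # Tolerate email quote markers ("> ") in front of the line.
        match = CPU_LINE.match(line.lstrip("> ").strip())
        if match:
            fields = FIELD.findall(match.group(2))
            result[int(match.group(1))] = {name: float(v) for v, name in fields}
    return result


def total_si(text):
    """Sum the softirq (%si) share over all CPUs, rounded to one decimal."""
    return round(sum(cpu["si"] for cpu in parse_top_cpu(text).values()), 1)


if __name__ == "__main__":
    print("nbproc=3 total %si:", total_si(NBPROC_3))
    print("nbproc=1 total %si:", total_si(NBPROC_1))
```

Summed over the four cores, this gives 84.5 for nbproc=3 against 24.8
for nbproc=1. The two runs did not serve the same request rate, so the
totals are only a rough indication.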

Quite frankly, if you need to achieve a load higher than a few thousand
conns/s, I strongly discourage you from running your nodes in VMs. You
should get bare metal: even the cheapest machines will do better and will
cost you less. A dedicated Atom will do slightly better. A dedicated
Core i3 or Phenom will do in the 40k range.

Regards,
Willy

