Hi Vikash,

On Sun, Oct 21, 2012 at 11:20:32PM +0530, freak 62 wrote:
> What should be the min. configuration of a m/c such that Haproxy running on
> it can hold up to 30K ~ 50K conn/sec for a total of 500000 connections?

You're in the high range here, so you absolutely need to run a
benchmark on your machine. First, you need to know the average object
size so that you can convert the 50k conn/s into bandwidth.
You can achieve 50kcps on a gig link if all you're returning is HTTP 304,
or if you're dealing with massive attacks and just close these connections.
But if you're transferring more than 1.4 kB of response headers + data, the
response will be composed of two TCP segments and the gig link will be too
tight, so you'll need 10G.
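
As a rough sketch of that conversion (the sample object sizes are
hypothetical, and handshake, ACK and framing overhead are ignored, so
real numbers will be somewhat higher):

```python
# Rough conversion from connection rate to response bandwidth.
# Assumption: one response per connection, protocol overhead ignored.

def response_bandwidth_mbps(conn_per_sec, avg_response_bytes):
    """Return the approximate response bandwidth in megabits per second."""
    return conn_per_sec * avg_response_bytes * 8 / 1e6

if __name__ == "__main__":
    # Hypothetical average response sizes (headers + data), in bytes.
    for size in (200, 1400, 3000):
        mbps = response_bandwidth_mbps(50_000, size)
        verdict = "fits on 1G" if mbps < 1000 else "needs 10G"
        print(f"{size} bytes -> {mbps:.0f} Mbps ({verdict})")
```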

Second, 500k connections will eat a lot of memory. Assuming that these
connections will mostly remain idle (long polling) connections, given
the ratio you're proposing, we can say that the kernel alone will require
at least 16kB per connection (4k read+4k write bufs per socket and per
side). And haproxy can be tuned to use about 17kB with 8kB buffers (for
normal HTTP traffic), or you can go down to 4kB buffers if you're only
doing small transfers. Let's stay on the safe side, 16k for the system
+ 17k for haproxy = 33kB per connection. For 500k connections, this is
16.5 GB of RAM. You definitely need some RAM for the system to work, and
I recommend that network buffers (kernel+haproxy) don't represent more
than 2/3 of the system's memory, so you need about 25 GB of RAM. Let's
go to 32 to be safe.
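
The same arithmetic as a small sketch, so you can plug in your own
numbers (the function names are only illustrative, and decimal units
are assumed to match the figures above):

```python
# Back-of-the-envelope RAM sizing for mostly idle connections.
# Decimal units (1 kB = 1000 bytes) are used to match the figures above.

def buffer_memory_gb(connections, kernel_kb=16, haproxy_kb=17):
    """Memory taken by kernel socket buffers plus haproxy buffers, in GB."""
    return connections * (kernel_kb + haproxy_kb) * 1000 / 1e9

def total_ram_gb(buffers_gb, max_buffer_share=2 / 3):
    """Smallest RAM size for which buffers stay within max_buffer_share."""
    return buffers_gb / max_buffer_share

if __name__ == "__main__":
    buffers = buffer_memory_gb(500_000)
    print(f"buffers: {buffers:.1f} GB, total RAM: {total_ram_gb(buffers):.2f} GB")
```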

You need a massive amount of system tuning too, to be able to support
1 million file descriptors (each of the 500k proxied connections uses
two sockets, one per side).
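
A minimal way to check how far the current per-process limit is from
that target, assuming a Unix system (the "spare" margin for listeners,
logs and so on is a hypothetical value, not a recommendation):

```python
# Compare the per-process file descriptor limit with what is needed.
import resource

def fds_needed(connections, fds_per_connection=2, spare=1000):
    """Two sockets per proxied connection, plus a hypothetical margin."""
    return connections * fds_per_connection + spare

def limit_is_sufficient(limit, needed):
    """RLIM_INFINITY means 'no limit' and is always sufficient."""
    return limit == resource.RLIM_INFINITY or limit >= needed

if __name__ == "__main__":
    needed = fds_needed(500_000)
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("needed:", needed, "soft limit:", soft, "hard limit:", hard)
    print("soft limit OK:", limit_is_sufficient(soft, needed))
    print("hard limit OK:", limit_is_sufficient(hard, needed))
```

This only covers the per-process limit; the system-wide limits and the
rest of the network stack need to be checked separately.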

You need to architect your site so that haproxy can spread the load
on enough servers so that the number of source ports does not become
the limiting factor. Consider 50k usable source ports. You'll need to
run on at least 10 servers and have haproxy manage the source ports itself
using the "source" parameter on each "server" line. If you have fewer servers,
then you need to have multiple source addresses on haproxy, or you need
it to transparently bind to the client's IP address and then run in
transparent mode and become the default gateway for your servers. This
also comes with a cost on packet rate.
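
To see where the 10 comes from, here is a small sketch of the port
arithmetic. It assumes each (source address, server) pair offers about
50k usable source ports; the helper names are only illustrative:

```python
# Source port budget: every (source address, server address:port) pair
# can carry at most `usable_ports` concurrent connections.

def ceil_div(a, b):
    """Integer ceiling division, avoiding floating point."""
    return -(-a // b)

def servers_needed(connections, usable_ports=50_000, source_addresses=1):
    """Servers required for a given number of source addresses."""
    return ceil_div(connections, usable_ports * source_addresses)

def source_addresses_needed(connections, servers, usable_ports=50_000):
    """Source addresses required when the number of servers is fixed."""
    return ceil_div(connections, usable_ports * servers)

if __name__ == "__main__":
    print("servers with one source address:", servers_needed(500_000))
    print("source addresses with 2 servers:",
          source_addresses_needed(500_000, servers=2))
```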

> I am using Dell Desktop  and configurations is:
>      Model name:  Intel(R) Core(TM) i7 CPU 930  @ 2.80GHz
>      No.of processor : 8
>      Memory : 4GB.
> 
> Is setting nbproc=8 will ensure that Haproxy will run on 8 cores?

In general yes, but it's the system's scheduler which decides. However,
the more cores you use, the lower the performance will be, because moving
data across CPU caches is extremely inefficient.

In practice, to obtain the highest connection rates, you have to pin
network IRQs to one core, and haproxy to another core, as close as
possible to the IRQ one, ideally sharing the same L2 cache, or if not
possible, the same L3. Don't put it on a core which does not share
a cache with the first one.
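
As an illustration only, on Linux the IRQ side is set by writing a hex
CPU bitmask to /proc/irq/<n>/smp_affinity, and a process can be bound
to a core with sched_setaffinity (or taskset from the shell). The core
numbers below are hypothetical, check your own cache topology first:

```python
# Sketch of CPU pinning helpers (Linux only for the actual pinning).
import os

def cpu_mask_hex(core):
    """Hex bitmask selecting a single core, as used by smp_affinity."""
    return format(1 << core, "x")

def pin_current_process(core):
    """Bind the calling process to one core (pid 0 means 'myself')."""
    os.sched_setaffinity(0, {core})

if __name__ == "__main__":
    irq_core, proxy_core = 0, 1  # hypothetical cores sharing a cache
    print("mask to write for the NIC IRQ:", cpu_mask_hex(irq_core))
    if hasattr(os, "sched_setaffinity") and proxy_core < (os.cpu_count() or 1):
        try:
            pin_current_process(proxy_core)
            print("now running on cores:", sorted(os.sched_getaffinity(0)))
        except OSError as exc:
            # The core may not be allowed for this process (cgroups, etc.).
            print("could not pin:", exc)
```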

The best performing CPU has the highest frequency and the largest shared
cache between the two cores in use. For instance, an i7 3770 at 3.4 GHz
with 8M of shared L3 cache should be nice. And such a CPU can be pushed
to 3.9 GHz if you limit it to two cores only.

BTW, a Core i7 930 has 4 cores, not 8 (the other 4 are hyperthreads),
so never run more processes than you have cores available, otherwise
the system will constantly context-switch and the performance will be
even lower.

> What other parameters should be set to ensure that Haproxy should not
> become the bottleneck?

Every detail counts, you absolutely need to run a benchmark. Things as
stupid as network interrupt latency have a huge impact, because depending
on the process latency, you can see the NIC driver switch to polling mode
and have one CPU core completely dedicated to ksoftirqd. If the IRQ was
not correctly pinned to its own core, it means the load will be shared
with haproxy! You also need to tune your socket buffers and system backlog
for the average transfer size. Another (stupid) example: some people
install graphical environments on their servers (very bad idea) and are
surprised to see low performance. Often this is caused by the GPU using
shared memory, which introduces significant memory access latencies. On my
laptop for example, I get 10% more network performance by killing X and
disabling the frame buffer. The GPU then switches to real text mode, where
the memory bandwidth it needs is ridiculously low (100kB/s) and the text
buffer fits in the cache (4kB).

There is no one-size-fits-all recipe, you need to run a benchmark.

Regards,
Willy

