Hi Vikash,

On Sun, Oct 21, 2012 at 11:20:32PM +0530, freak 62 wrote:
> What should be the min. configuration of a m/c such that Haproxy running on
> it can hold up to 30K ~ 50K conn/sec for a total of 500000 connections?

You're aiming at the high end of the range here, so you absolutely need to run a benchmark on your machine.

First, you need to know the average object size so that you can convert the 50k conn/s into bandwidth. You can achieve 50k conn/s on a gig link if all you're returning is HTTP 304, or if you're dealing with massive attacks and just close these connections. But if you're transferring more than 1.4 kB of response headers + data, the response will be composed of two TCP segments and the gig link will be too tight, so you'll need 10G.

Second, 500k connections will eat a lot of memory. Assuming that these connections will mostly remain idle (long polling), given the ratio you're proposing, we can say that the kernel alone will require at least 16 kB per connection (4k read + 4k write buffers per socket and per side). And haproxy can be tuned to use about 17 kB with 8 kB buffers (for normal HTTP traffic), or you can go down to 4 kB buffers if you're only doing small transfers. Let's stay on the safe side: 16k for the system + 17k for haproxy = 33 kB per connection. For 500k connections this is 16.5 GB of RAM. You definitely need some RAM for the system to work, and I recommend that network buffers (kernel + haproxy) don't represent more than 2/3 of the system's memory, so you need about 25 GB of RAM. Let's go to 32 to be safe.

You need a massive amount of system tuning too, to be able to support 1 million file descriptors (each proxied connection uses two sockets, one per side).

You need to architect your site so that haproxy can spread the load over enough servers that the number of source ports does not become the limiting factor. Consider 50k usable source ports. You'll need to run on 10 servers and have haproxy manage the source ports itself using the "source" parameter on each "server" line. If you have fewer servers, then you need to have multiple source addresses on haproxy, or you need it to transparently bind to the client's IP address and then run in transparent mode and become the default gateway for your servers.
This also comes with a cost on packet rate.

> I am using Dell Desktop and configurations is:
> Model name: Intel(R) Core(TM) i7 CPU 930 @ 2.80GHz
> No.of processor : 8
> Memory : 4GB.
>
> Is setting nbproc=8 will ensure that Haproxy will run on 8 cores?

In general yes, but it's the system's scheduler which decides. However, the more cores you use, the lower the performance will be, because moving data across CPU caches is extremely inefficient. In practice, to obtain the highest connection rates, you have to pin network IRQs to one core and haproxy to another core, as close as possible to the IRQ one, ideally sharing the same L2 cache, or if that's not possible, the same L3. Don't put it on a core which does not share a cache with the first one. The best-performing CPU is the one with the highest frequency and the largest cache shared between the two cores in use. For instance, an i7 3770 at 3.4 GHz with 8M of shared L3 cache should be nice. And such a CPU can be pushed to 3.9 GHz if you limit it to two cores only.

BTW, a Core i7 930 has 4 cores, not 8 (the other 4 are hyperthreads), so never make your system run on more cores than are available; it will constantly context-switch and the performance will be even lower.

> What other parameters should be set to ensure that Haproxy should not
> become the bottleneck?

Every detail counts, you absolutely need to run a benchmark. Things as stupid as network interrupt latency have a huge impact, because depending on the process latency, you can see the NIC driver switch to polling mode and have one CPU core completely dedicated to ksoftirqd. If the IRQ was not correctly pinned to its own core, it means the load will be shared with haproxy! You also need to tune your socket buffers and system backlog for the average transfer size.

Another (stupid) example: some people install graphics environments on their servers (very bad idea) and are surprised to see low performance. Often this is caused by the GPU using shared memory and introducing significant memory access latencies.
On my laptop for example, I get 10% more network performance by killing X and disabling the frame buffer. The GPU then switches to real text mode where the memory bandwidth is ridiculous (100kB/s) and fits in the cache (4kB).

There is no one-size-fits-all recipe, you need to run a benchmark.

Regards,
Willy
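
For anyone who wants to plug in their own numbers, the sizing arithmetic from this mail (bandwidth, buffer memory, source ports) can be sketched roughly as below. This is only a back-of-the-envelope helper, not a substitute for a benchmark. All function and constant names are hypothetical. The per-connection figures are the ones quoted above, and the bandwidth estimate assumes payload only, ignoring protocol headers and handshakes.

```python
# Back-of-the-envelope sizing using the figures quoted in this mail.
# All names here are hypothetical; adjust the constants to your own setup.

KERNEL_KB_PER_CONN = 16       # 4k read + 4k write buffers, per socket and per side
HAPROXY_KB_PER_CONN = 17      # haproxy tuned with 8 kB buffers
USABLE_SOURCE_PORTS = 50_000  # usable source ports per (source address, server) pair


def buffer_memory_gb(connections):
    """RAM used by kernel + haproxy buffers, in decimal GB (1 GB = 10**6 kB)."""
    per_conn_kb = KERNEL_KB_PER_CONN + HAPROXY_KB_PER_CONN
    return connections * per_conn_kb / 1_000_000


def min_system_ram_gb(connections, max_buffer_share=2 / 3):
    """Smallest RAM size for which buffers stay within max_buffer_share of it."""
    return buffer_memory_gb(connections) / max_buffer_share


def response_bandwidth_mbps(conn_per_sec, response_bytes):
    """Payload-only bandwidth in Mbit/s (assumption: headers/handshakes ignored)."""
    return conn_per_sec * response_bytes * 8 / 1_000_000


def min_source_tuples(connections, ports=USABLE_SOURCE_PORTS):
    """Number of (source address, server) pairs needed to avoid running out of ports."""
    return -(-connections // ports)  # ceiling division


if __name__ == "__main__":
    conns, rate = 500_000, 50_000
    print(f"buffers:        {buffer_memory_gb(conns):.1f} GB")
    print(f"min system RAM: {min_system_ram_gb(conns):.2f} GB")
    print(f"1.4 kB replies: {response_bandwidth_mbps(rate, 1_400):.0f} Mbit/s")
    print(f"2.8 kB replies: {response_bandwidth_mbps(rate, 2_800):.0f} Mbit/s")
    print(f"source tuples:  {min_source_tuples(conns)}")
```

With the figures from this mail it gives 16.5 GB of buffers, 560 Mbit/s for single-segment 1.4 kB responses (which fits a gig link), more than 1 Gbit/s once responses need two segments, and 10 source tuples for 500k connections.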

