RE: nginx alone performs x2 than haproxy-nginx
Hi Pasi, Do you know if Ubuntu 12.04 has these optimized drivers or not? I think Canonical developers are going to add the drivers later in some update to the Ubuntu 12.04 packages. The drivers are not yet in 12.04. I saw some discussion from the Canonical guys on xen-devel about that. For the record, here is a benchmark from the Xen guys about this problem [1], the patchset [2], and the Ubuntu bug report [3] requesting a backport to the current Ubuntu 12.04. [1] http://blog.xen.org/index.php/2011/11/29/baremetal-vs-xen-vs-kvm-redux/ [2] http://git.kernel.org/?p=linux/kernel/git/konrad/xen.git;a=shortlog;h=refs/heads/devel/acpi-cpufreq.v4 [3] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/898112
Re: nginx alone performs x2 than haproxy-nginx
On 29/04/2012 20:01, Willy Tarreau wrote: What I could suggest would be:
- reduce /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_time_wait to 30s
- increase /proc/sys/net/netfilter/nf_conntrack_max to 524288 conns
- increase hashsize to 131072 buckets
This will help you support up to 8700 conn/s without trouble. You just need to scale the latter two settings accordingly if you plan to go higher.

You could also disable connection tracking altogether using the NOTRACK target in the raw table:

iptables -t raw -A PREROUTING -p tcp --dport 80 -j NOTRACK
iptables -t raw -A PREROUTING -p tcp --dport 443 -j NOTRACK

Note however that you will no longer be able to carry out any connection tracking logic on matched packets, including NAT, syncookie protection, etc. Jinn
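For reference, the first two settings suggested above can also be made persistent as a sysctl fragment; this is a sketch using the exact values from the thread (the file path is illustrative, and note that hashsize is a module parameter, not a sysctl):

```
# /etc/sysctl.d/20-conntrack.conf (illustrative path)
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 30
net.netfilter.nf_conntrack_max = 524288
```

hashsize itself is set when loading the module (nf_conntrack hashsize=131072) or at runtime via /sys/module/nf_conntrack/parameters/hashsize, as discussed later in the thread.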
Re: nginx alone performs x2 than haproxy-nginx
On Wed, May 02, 2012 at 03:40:58PM +0200, Lukas Tribus wrote: Note however that you will no longer be able to carry out any connection tracking logic on matched packets, including NAT, syncookie protection, etc. Are you sure syncookie protection doesn't work with -j NOTRACK? I don't believe syncookies have anything to do with conntrack at all; in fact, if syncookies were stateful, they would be totally useless. You're right Lukas, syncookies are independent of conntrack; they're applied on the socket itself, as soon as the backlog is full. Willy
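As the exchange above concludes, SYN cookies are a socket-level mechanism controlled by a kernel sysctl, independent of conntrack and unaffected by NOTRACK. A minimal sketch of enabling them persistently (the filename is illustrative):

```
# /etc/sysctl.d/10-syncookies.conf (illustrative filename)
# SYN cookies kick in when a listening socket's backlog is full;
# they work whether or not conntrack is loaded or NOTRACK is used.
net.ipv4.tcp_syncookies = 1
```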
Re: nginx alone performs x2 than haproxy-nginx
On Mon, Apr 30, 2012 at 12:19:26PM +0200, Sebastien Estienne wrote: Hi Pasi, Do you know if ubuntu 12.04 has these optimized drivers or not? I think Canonical developers are going to add the drivers later in some update to Ubuntu 12.04 packages. The drivers are not yet in 12.04. I saw some discussion from Canonical guys on xen-devel about that. -- Pasi thanx -- Sebastien E. Le 30 avr. 2012 à 11:06, Pasi Kärkkäinen pa...@iki.fi a écrit : On Sun, Apr 29, 2012 at 06:18:52PM +0200, Willy Tarreau wrote: I'm using VPS machines from Linode.com, they are quite powerful. They're based on Xen. I don't see the network card saturated. OK I see now. There's no point searching anywhere else. Once again you're a victim of the high overhead of virtualization that vendors like to pretend is almost unnoticeable :-( As for nf_conntrack, I have iptables enabled with rules as a firewall on each machine, I stopped it on all involved machines and I still get those results. nf_conntrack is compiled to the kernel (it's a kernel provided by Linode) so I don't think I can disable it completely. Just not use it (and not use any firewall between them). It's having the module loaded with default settings which is harmful, so even unloading the rules will not change anything. Anyway, now I'm pretty sure that the overhead caused by the default conntrack settings is nothing compared with the overhead of Xen. Even if 6-7K is very low (for nginx directly), why is haproxy doing half than that? That's quite simple : it has two sides so it must process twice the number of packets. Since you're virtualized, you're packet-bound. Most of the time is spent communicating with the host and with the network, so the more the packets and the less performance you get. That's why you're seeing a 2x increase even with nginx when enabling keep-alive. 
I'd say that your numbers are more or less in line with a recent benchmark we conducted at Exceliance, which is summarized below (each time the hardware was running a single VM): http://blog.exceliance.fr/2012/04/24/hypervisors-virtual-network-performance-comparison-from-a-virtualized-load-balancer-point-of-view/ (BTW you'll note that Xen was the worst performer here, with 80% loss compared to native performance). Note that the Ubuntu 11.10 kernel is lacking important drivers such as the Xen ACPI power management / cpufreq drivers, so it's not able to use the better performing CPU states. That driver has been merged into recent upstream Linux 3.4 (-rc). Also the xen-netback dom0 driver is still unoptimized in the upstream Linux kernel. Using RHEL5/CentOS5 as Xen host/dom0, or SLES11 or openSUSE, is a better idea today for benchmarking because those have the fully optimized kernel/drivers. Upstream Linux will get the optimizations in small steps (per the Linux development model). Citrix XenServer 6 is using the optimized kernel/drivers, so that explains the difference in the benchmark compared to Ubuntu Xen 4.1. I just wanted to highlight that. -- Pasi
Re: nginx alone performs x2 than haproxy-nginx
On Sun, Apr 29, 2012 at 06:18:52PM +0200, Willy Tarreau wrote: I'm using VPS machines from Linode.com, they are quite powerful. They're based on Xen. I don't see the network card saturated. OK I see now. There's no point searching anywhere else. Once again you're a victim of the high overhead of virtualization that vendors like to pretend is almost unnoticeable :-( As for nf_conntrack, I have iptables enabled with rules as a firewall on each machine, I stopped it on all involved machines and I still get those results. nf_conntrack is compiled to the kernel (it's a kernel provided by Linode) so I don't think I can disable it completely. Just not use it (and not use any firewall between them). It's having the module loaded with default settings which is harmful, so even unloading the rules will not change anything. Anyway, now I'm pretty sure that the overhead caused by the default conntrack settings is nothing compared with the overhead of Xen. Even if 6-7K is very low (for nginx directly), why is haproxy doing half than that? That's quite simple : it has two sides so it must process twice the number of packets. Since you're virtualized, you're packet-bound. Most of the time is spent communicating with the host and with the network, so the more the packets and the less performance you get. That's why you're seeing a 2x increase even with nginx when enabling keep-alive. I'd say that your numbers are more or less in line with a recent benchmark we conducted at Exceliance and which is summarized below (each time the hardware was running a single VM) : http://blog.exceliance.fr/2012/04/24/hypervisors-virtual-network-performance-comparison-from-a-virtualized-load-balancer-point-of-view/ (BTW you'll note that Xen was the worst performer here with 80% loss compared to native performance). Note that Ubuntu 11.10 kernel is lacking important drivers such as the Xen ACPI power management / cpufreq drivers so it's not able to use the better performing CPU states. 
That driver has been merged into recent upstream Linux 3.4 (-rc). Also the xen-netback dom0 driver is still unoptimized in the upstream Linux kernel. Using RHEL5/CentOS5 as Xen host/dom0, or SLES11 or openSUSE, is a better idea today for benchmarking because those have the fully optimized kernel/drivers. Upstream Linux will get the optimizations in small steps (per the Linux development model). Citrix XenServer 6 is using the optimized kernel/drivers, so that explains the difference in the benchmark compared to Ubuntu Xen 4.1. I just wanted to highlight that. -- Pasi
Re: nginx alone performs x2 than haproxy-nginx
Hi, On Mon, Apr 30, 2012 at 12:06:25PM +0300, Pasi Kärkkäinen wrote: I'd say that your numbers are more or less in line with a recent benchmark we conducted at Exceliance, which is summarized below (each time the hardware was running a single VM): http://blog.exceliance.fr/2012/04/24/hypervisors-virtual-network-performance-comparison-from-a-virtualized-load-balancer-point-of-view/ (BTW you'll note that Xen was the worst performer here, with 80% loss compared to native performance). Note that the Ubuntu 11.10 kernel is lacking important drivers such as the Xen ACPI power management / cpufreq drivers, so it's not able to use the better performing CPU states. That driver has been merged into recent upstream Linux 3.4 (-rc). Also the xen-netback dom0 driver is still unoptimized in the upstream Linux kernel. Using RHEL5/CentOS5 as Xen host/dom0, or SLES11 or openSUSE, is a better idea today for benchmarking because those have the fully optimized kernel/drivers. Upstream Linux will get the optimizations in small steps (per the Linux development model). Citrix XenServer 6 is using the optimized kernel/drivers, so that explains the difference in the benchmark compared to Ubuntu Xen 4.1. I just wanted to highlight that. Thanks for this useful information, Pasi! I'm sure Baptiste will be very interested! Cheers, Willy
Re: nginx alone performs x2 than haproxy-nginx
Hi Pasi, Do you know if ubuntu 12.04 has these optimized drivers or not? thanx -- Sebastien E. Le 30 avr. 2012 à 11:06, Pasi Kärkkäinen pa...@iki.fi a écrit : On Sun, Apr 29, 2012 at 06:18:52PM +0200, Willy Tarreau wrote: I'm using VPS machines from Linode.com, they are quite powerful. They're based on Xen. I don't see the network card saturated. OK I see now. There's no point searching anywhere else. Once again you're a victim of the high overhead of virtualization that vendors like to pretend is almost unnoticeable :-( As for nf_conntrack, I have iptables enabled with rules as a firewall on each machine, I stopped it on all involved machines and I still get those results. nf_conntrack is compiled to the kernel (it's a kernel provided by Linode) so I don't think I can disable it completely. Just not use it (and not use any firewall between them). It's having the module loaded with default settings which is harmful, so even unloading the rules will not change anything. Anyway, now I'm pretty sure that the overhead caused by the default conntrack settings is nothing compared with the overhead of Xen. Even if 6-7K is very low (for nginx directly), why is haproxy doing half than that? That's quite simple : it has two sides so it must process twice the number of packets. Since you're virtualized, you're packet-bound. Most of the time is spent communicating with the host and with the network, so the more the packets and the less performance you get. That's why you're seeing a 2x increase even with nginx when enabling keep-alive. I'd say that your numbers are more or less in line with a recent benchmark we conducted at Exceliance and which is summarized below (each time the hardware was running a single VM) : http://blog.exceliance.fr/2012/04/24/hypervisors-virtual-network-performance-comparison-from-a-virtualized-load-balancer-point-of-view/ (BTW you'll note that Xen was the worst performer here with 80% loss compared to native performance). 
Note that the Ubuntu 11.10 kernel is lacking important drivers such as the Xen ACPI power management / cpufreq drivers, so it's not able to use the better performing CPU states. That driver has been merged into recent upstream Linux 3.4 (-rc). Also the xen-netback dom0 driver is still unoptimized in the upstream Linux kernel. Using RHEL5/CentOS5 as Xen host/dom0, or SLES11 or openSUSE, is a better idea today for benchmarking because those have the fully optimized kernel/drivers. Upstream Linux will get the optimizations in small steps (per the Linux development model). Citrix XenServer 6 is using the optimized kernel/drivers, so that explains the difference in the benchmark compared to Ubuntu Xen 4.1. I just wanted to highlight that. -- Pasi
Re: nginx alone performs x2 than haproxy-nginx
Hi Bar, On Sun, Apr 29, 2012 at 02:09:42PM +0300, Bar Ziony wrote: Hi, I have 2 questions about a haproxy setup I configured. This is the setup: LB server (haproxy 1.4.20, Debian squeeze 64-bit) in http mode, forwarding requests to a single nginx web server that resides on a different machine. I'll paste the haproxy config at the end of this message. 1. Benchmarking: When doing some benchmarking with 'ab' or 'siege', for a small (2 bytes, single char) file: ab -n 10000 -c 40 http://lb/test.html VS ab -n 10000 -c 40 http://web-01/test.html web-01 directly gets 6000-6500 requests/sec. haproxy-nginx gets 3000 requests/sec. This is extremely low; it's approximately what I achieve on a sub-1-watt 500 MHz Geode LX, and I guess you're running on much larger hardware since you're saying it's 64-bit. When using ab -k to enable keepalives, nginx is getting 12,000 requests/sec, and haproxy gets around 6000-7000 requests/sec. Even this is very low. Note that the 6-7k here relates to what nginx supports above without keep-alive, so it might make sense, but all these numbers seem very low in general. I wanted to ask if the x2 difference is normal? I tried to remove the ACL for checking if the path ends with PHP; the results were not different. Is ab running on the same machine as haproxy? Do you have nf_conntrack loaded on any of the systems? Do you observe any process reaching 100% CPU somewhere? Aren't you injecting on a 100 Mbps NIC? 2. As you can see, I separate the dynamic (PHP) requests from other (static) requests. a. Is this the way to do it (path_end .php)? It looks fine. Other people like to store all their statics in a small set of directories and use path_beg with these prefixes instead. But it depends on how you classify your URLs in fact. b. I limit the number of connections to the dynamic backend server(s). I just set it according to the number of FastCGI PHP processes available on that machine. How do I check/benchmark what is the limit for the static backend?
Or is it not needed? Nginx performs quite well in general, and especially as a static file server. You may well set a high maxconn, or none at all, on the static backend; you won't harm it. Otherwise I found nothing suspect in your config. Regards, Willy
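To illustrate the advice above, here is a sketch of the two backends: maxconn matched to the PHP FastCGI process count on the dynamic side, and no limit on the static side. The server names, addresses, and the maxconn value are hypothetical, not taken from Bar's config:

```
backend dynamic
    # maxconn matched to the number of PHP FastCGI processes (hypothetical value)
    server web-01 10.0.0.10:80 maxconn 30 check

backend static
    # no maxconn: nginx handles static files well, so there is nothing to protect
    server web-01 10.0.0.10:80 check
```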
Re: nginx alone performs x2 than haproxy-nginx
Hi Willy, Thanks for your time. I really didn't know these were such low results. I ran 'ab' from a different machine than haproxy and nginx (which are different machines too). I also tried to run 'ab' from multiple machines (not haproxy or nginx) and the results are pretty much the single 'ab' result divided by 3... I'm using VPS machines from Linode.com, they are quite powerful. They're based on Xen. I don't see the network card saturated. As for nf_conntrack, I have iptables enabled with rules as a firewall on each machine. I stopped it on all involved machines and I still get those results. nf_conntrack is compiled into the kernel (it's a kernel provided by Linode) so I don't think I can disable it completely. Just not use it (and not use any firewall between them). Even if 6-7K is very low (for nginx directly), why is haproxy doing half of that? About the nginx static backend maxconn - what is a high maxconn number? Just the limit I can see with 'ab'? Thanks, Bar. On Sun, Apr 29, 2012 at 4:27 PM, Willy Tarreau w...@1wt.eu wrote: Hi Bar, On Sun, Apr 29, 2012 at 02:09:42PM +0300, Bar Ziony wrote: Hi, I have 2 questions about a haproxy setup I configured. This is the setup: LB server (haproxy 1.4.20, Debian squeeze 64-bit) in http mode, forwarding requests to a single nginx web server that resides on a different machine. I'll paste the haproxy config at the end of this message. 1. Benchmarking: When doing some benchmarking with 'ab' or 'siege', for a small (2 bytes, single char) file: ab -n 10000 -c 40 http://lb/test.html VS ab -n 10000 -c 40 http://web-01/test.html web-01 directly gets 6000-6500 requests/sec. haproxy-nginx gets 3000 requests/sec. This is extremely low; it's approximately what I achieve on a sub-1-watt 500 MHz Geode LX, and I guess you're running on much larger hardware since you're saying it's 64-bit. When using ab -k to enable keepalives, nginx is getting 12,000 requests/sec, and haproxy gets around 6000-7000 requests/sec. Even this is very low.
Note that the 6-7k here relates to what nginx supports above without keep-alive, so it might make sense, but all these numbers seem very low in general. I wanted to ask if the x2 difference is normal? I tried to remove the ACL for checking if the path ends with PHP; the results were not different. Is ab running on the same machine as haproxy? Do you have nf_conntrack loaded on any of the systems? Do you observe any process reaching 100% CPU somewhere? Aren't you injecting on a 100 Mbps NIC? 2. As you can see, I separate the dynamic (PHP) requests from other (static) requests. a. Is this the way to do it (path_end .php)? It looks fine. Other people like to store all their statics in a small set of directories and use path_beg with these prefixes instead. But it depends on how you classify your URLs in fact. b. I limit the number of connections to the dynamic backend server(s). I just set it according to the number of FastCGI PHP processes available on that machine. How do I check/benchmark what is the limit for the static backend? Or is it not needed? Nginx performs quite well in general, and especially as a static file server. You may well set a high maxconn, or none at all, on the static backend; you won't harm it. Otherwise I found nothing suspect in your config. Regards, Willy
Re: nginx alone performs x2 than haproxy-nginx
On Sun, Apr 29, 2012 at 05:25:01PM +0300, Bar Ziony wrote: Hi Willy, Thanks for your time. I really didn't know these were such low results. I ran 'ab' from a different machine than haproxy and nginx (which are different machines too). I also tried to run 'ab' from multiple machines (not haproxy or nginx) and the results are pretty much the single 'ab' result divided by 3... OK so this clearly means that the limitation comes from the tested components and not the machine running ab. I'm using VPS machines from Linode.com, they are quite powerful. They're based on Xen. I don't see the network card saturated. OK I see now. There's no point searching anywhere else. Once again you're a victim of the high overhead of virtualization that vendors like to pretend is almost unnoticeable :-( As for nf_conntrack, I have iptables enabled with rules as a firewall on each machine. I stopped it on all involved machines and I still get those results. nf_conntrack is compiled into the kernel (it's a kernel provided by Linode) so I don't think I can disable it completely. Just not use it (and not use any firewall between them). It's having the module loaded with default settings which is harmful, so even unloading the rules will not change anything. Anyway, now I'm pretty sure that the overhead caused by the default conntrack settings is nothing compared with the overhead of Xen. Even if 6-7K is very low (for nginx directly), why is haproxy doing half of that? That's quite simple: it has two sides, so it must process twice the number of packets. Since you're virtualized, you're packet-bound. Most of the time is spent communicating with the host and with the network, so the more packets there are, the less performance you get. That's why you're seeing a 2x increase even with nginx when enabling keep-alive.
I'd say that your numbers are more or less in line with a recent benchmark we conducted at Exceliance, which is summarized below (each time the hardware was running a single VM): http://blog.exceliance.fr/2012/04/24/hypervisors-virtual-network-performance-comparison-from-a-virtualized-load-balancer-point-of-view/ (BTW you'll note that Xen was the worst performer here, with 80% loss compared to native performance). In your case it's very unlikely that you'd have dedicated hardware, and since you don't have access to the host, you don't know what its settings are, so I'd say that what you managed to reach is not that bad for such an environment. You should be able to slightly increase performance by adding the following options in your defaults section: option tcp-smart-accept option tcp-smart-connect Each of them will save one packet during the TCP handshake, which may slightly compensate for the losses caused by virtualization. Note that I have also encountered a situation once where conntrack was loaded on the hypervisor and not tuned at all, resulting in extremely low performance. The effect is that the performance continuously drops as you add requests, until your source ports roll over and the performance remains stable. In your case, you run with only 10k reqs, which is not enough to measure the performance under such conditions. You should have one injector running a constant load (e.g. 1M requests in a loop) and another one running the 10k reqs several times in a row to observe whether the results are stable or not. about nginx static backend maxconn - what is a high maxconn number? Just the limit I can see with 'ab'? It depends on your load, but nginx will have no problem handling as many concurrent requests as haproxy on static files. So not having a maxconn there makes sense. Otherwise you can limit it to a few thousand if you want, but the purpose of maxconn is to protect a server, and here there is not really anything to protect.
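The two options Willy mentions go in the defaults section of the haproxy configuration; a minimal sketch (the mode line is only there to make the fragment self-contained):

```
defaults
    mode http
    option tcp-smart-accept    # saves one packet during the accept-side handshake
    option tcp-smart-connect   # saves one packet during the connect-side handshake
```

As noted in the thread, each option saves one packet per connection, which matters most in packet-bound environments such as virtualized hosts.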
Last point about virtualized environments: they're really fine if you're seeking low costs before performance. However, if you're building a high-traffic site (6k req/s might qualify as a high-traffic site), you'd be better off with real hardware. You would not want such a site to fail just to save a few dollars a month. To give you an idea, even with a 15 EUR/month dedibox consisting of a single-core Via Nano processor, and which runs nf_conntrack, I can achieve 14300 req/s. Hoping this helps, Willy
Re: nginx alone performs x2 than haproxy-nginx
Willy, Thanks as always for the very detailed and helpful answer. I'll reply in-line, like you ;-) On Sun, Apr 29, 2012 at 7:18 PM, Willy Tarreau w...@1wt.eu wrote: On Sun, Apr 29, 2012 at 05:25:01PM +0300, Bar Ziony wrote: Hi Willy, Thanks for your time. I really didn't know these were such low results. I ran 'ab' from a different machine than haproxy and nginx (which are different machines too). I also tried to run 'ab' from multiple machines (not haproxy or nginx) and the results are pretty much the single 'ab' result divided by 3... OK so this clearly means that the limitation comes from the tested components and not the machine running ab. I'm using VPS machines from Linode.com, they are quite powerful. They're based on Xen. I don't see the network card saturated. OK I see now. There's no point searching anywhere else. Once again you're a victim of the high overhead of virtualization that vendors like to pretend is almost unnoticeable :-( Is the overhead really that huge? As for nf_conntrack, I have iptables enabled with rules as a firewall on each machine. I stopped it on all involved machines and I still get those results. nf_conntrack is compiled into the kernel (it's a kernel provided by Linode) so I don't think I can disable it completely. Just not use it (and not use any firewall between them). It's having the module loaded with default settings which is harmful, so even unloading the rules will not change anything. Anyway, now I'm pretty sure that the overhead caused by the default conntrack settings is nothing compared with the overhead of Xen. Why is it harmful that it's loaded with default settings? Could it be disabled? Even if 6-7K is very low (for nginx directly), why is haproxy doing half of that? That's quite simple: it has two sides, so it must process twice the number of packets. Since you're virtualized, you're packet-bound.
Most of the time is spent communicating with the host and with the network, so the more packets there are, the less performance you get. That's why you're seeing a 2x increase even with nginx when enabling keep-alive. 1. Can you explain what it means that I'm packet-bound, and why it happens since I'm using virtualization? 2. When you say twice the number of packets, you mean: the client sends a request (as 1 or more packets) to haproxy, which intercepts it, acts upon it and sends a new request (1 or more packets) to the server, which then again sends the response; that's why it's twice the number of packets? It's not twice the bandwidth of using the web server directly, right? I'd say that your numbers are more or less in line with a recent benchmark we conducted at Exceliance, which is summarized below (each time the hardware was running a single VM): http://blog.exceliance.fr/2012/04/24/hypervisors-virtual-network-performance-comparison-from-a-virtualized-load-balancer-point-of-view/ (BTW you'll note that Xen was the worst performer here, with 80% loss compared to native performance). In your case it's very unlikely that you'd have dedicated hardware, and since you don't have access to the host, you don't know what its settings are, so I'd say that what you managed to reach is not that bad for such an environment. You should be able to slightly increase performance by adding the following options in your defaults section: option tcp-smart-accept option tcp-smart-connect Thanks! I think it did help, and now I get 3700 req/sec without -k, and almost 5000 req/sec with -k. I do have a small issue (it was there before I added these options): when doing 'ab -n 10000 -c 60 http://lb-01/test.html', 'ab' gets stuck for a second or two at the end, causing the req/sec to drop to around 2000 req/sec. If I Ctrl+C before the end, I see the numbers above. Is this happening because of 'ab' or because of something in my setup? With -k it doesn't happen.
And I also think it doesn't always happen with the second, passive LB (when I tested it). Each of them will save one packet during the TCP handshake, which may slightly compensate for the losses caused by virtualization. Note that I have also encountered a situation once where conntrack was loaded on the hypervisor and not tuned at all, resulting in extremely low performance. The effect is that the performance continuously drops as you add requests, until your source ports roll over and the performance remains stable. In your case, you run with only 10k reqs, which is not enough to measure the performance under such conditions. You should have one injector running a constant load (e.g. 1M requests in a loop) and another one running the 10k reqs several times in a row to observe whether the results are stable or not. What do you mean by "until your source ports roll over"? I'm sorry, but I didn't quite understand your proposed check. about nginx static backend maxconn - what is a high maxconn
Re: nginx alone performs x2 than haproxy-nginx
On Sun, Apr 29, 2012 at 09:05:26PM +0300, Bar Ziony wrote: I'm using VPS machines from Linode.com, they are quite powerful. They're based on Xen. I don't see the network card saturated. OK I see now. There's no point searching anywhere else. Once again you're a victim of the high overhead of virtualization that vendors like to pretend is almost unnoticeable :-( Is the overhead really that huge? Generally, yes. A packet entering the machine from a NIC has to be copied by the NIC to memory, then the NIC sends an interrupt, which interrupts the VM in its work to switch to the HV kernel; the driver reads the data from memory, does a lot of checking and decoding, and tries to determine from its MAC (or worse, IP+ports in case of NAT) which VM it's aimed at. Then it puts it in shape for delivery via the appropriate means for the VM (depending on the type of driver emulation, possibly involving splitting it into smaller chunks). Then, when the packet has been well decorated, the HV simulates an interrupt to the VM and switches back to it. The VM now enters the driver in order to read the packet. And the fun begins. Depending on the drivers, the number of exchanges between the driver and the supposedly real NIC will cause a ping-pong between the driver in the VM and the NIC emulator in the HV. For instance, reading a status flag on the NIC might cause a double context switch. If the driver regularly reads this status flag, the context might switch a lot. This is extremely expensive for each packet. You'll note that I'm not even counting the overhead of multiple memory copies for the same incoming packet. That's why HV vendors try to propose more direct drivers (vmxnet, xen-vnif, hv_netvsc). You'll also note that performance using these drivers can be up to 30-40 times better than the NIC emulation, by avoiding many ping-pong games between the two sides. Reaching such gains by saving such exchanges can give you an idea of the extreme overhead which remains even for a single bounce.
You'll often see benchmarks of outgoing traffic leaving VMs for file serving, showing somewhat correct performance, and you'll probably never see benchmarks of incoming or mixed traffic (e.g. proxies). On the network, a send is always much cheaper than a receive because memory copies and buffer parsing can be avoided, leaving just a virtual context switch as the remaining overhead. But on the receive path, clearly everything counts, including the real NIC hardware and interrupt affinity, which you have no control over. It's having the module loaded with default settings which is harmful, so even unloading the rules will not change anything. Anyway, now I'm pretty sure that the overhead caused by the default conntrack settings is nothing compared with the overhead of Xen. Why is it harmful that it's loaded with default settings? Could it be disabled? The default hashtable size is very small and depends on the available memory. With 1+GB, you have 16k buckets, and below that you have #mem/16384. When you stack a few hundred thousand connections there (which happens very quickly at 6k conn/s with the default time_wait setting of 120s), you end up with long lookups for each incoming packet. You can change the setting when loading the module using the hashsize parameter, or in your case, by setting it here: /sys/module/nf_conntrack/parameters/hashsize It's wise to have a hashsize not less than 1/4 of the max number of entries you're expecting to get in the table. Multiply your conn rate by 120s (time_wait) to get the number of entries needed. Don't forget that on a proxy, you have twice the number of connections. At 6k conn/s, you'd then have to support 6*2*120 = 1.44M conns in the table, and support a hash size of at least 360k entries (better use a power of two, since there's a divide in the kernel that many processors are able to optimize). So let's use 262144 entries for hashsize.
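Willy's sizing arithmetic can be checked with a few lines of shell; the rate and timeout values below are the ones from the thread, and the script only computes and prints the numbers (applying them still requires the sysctl/module settings discussed above):

```shell
# Conntrack table sizing per the reasoning above:
#   entries  = conn rate * 2 sides (a proxy tracks both) * time_wait seconds
#   hashsize = at least entries / 4, ideally rounded to a power of two
rate=6000   # connections per second (from the thread)
tw=120      # default nf_conntrack_tcp_timeout_time_wait, in seconds
entries=$((rate * 2 * tw))
min_hash=$((entries / 4))
echo "entries=$entries min_hash=$min_hash"
# -> entries=1440000 min_hash=360000
```

The thread then picks a nearby power of two for the actual hashsize value.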
Also, you should reduce the time_wait timeout to around 30-60s, and increase the conntrack_max setting, which defaults to 64k (!), resulting in a total blocking of new connections after 5s of a run at 6k/s on both sides. What I could suggest would be:
- reduce /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_time_wait to 30s
- increase /proc/sys/net/netfilter/nf_conntrack_max to 524288 conns
- increase hashsize to 131072 buckets
This will help you support up to 8700 conn/s without trouble. You just need to scale the latter two settings accordingly if you plan to go higher. That's quite simple: it has two sides, so it must process twice the number of packets. Since you're virtualized, you're packet-bound. Most of the time is spent communicating with the host and with the network, so the more packets there are, the less performance you get. That's why you're seeing a 2x increase even with nginx when enabling keep-alive. 1. Can you explain what it means that I'm packet-bound, and why it happens