Re: Feature Requests: march native and cwnd setting param

2010-11-28 Thread Willy Tarreau
Hi Hank,

On Sat, Nov 27, 2010 at 06:56:34AM -0800, Hank A. Paulson wrote:
 1 - With recent CPUs (Intel 5300/5400/5500/5600 and AMD 6100), the set of
 optimal compiler optimization settings :) is not something anyone can
 keep up with - not to mention different versions of gcc that understand
 none, some or all of the features of these CPUs. -march=native allows gcc
 to take on the burden of choosing the compile-time settings, so if that
 could be added as one of the options in the makefile, it would be helpful
 because then I could use the same make... line on every machine but it
 would self-adjust for that machine.
(...)

That's a good idea; I have implemented it and even ported it to 1.4.
I have also added ARCH=32 and ARCH=64 to be used in combination with
CPU=native, so that you can select whether you explicitly want a 32-
or 64-bit executable.
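
For example, on a Linux 2.6 machine the same command line should now adapt
itself to whatever CPU it is built on (the TARGET value below is just the
usual one for such systems, adjust it for your platform):

make TARGET=linux26 CPU=native ARCH=64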

 2 - Google has pushed, via both TCP-related RFCs and patches to the
 networking code of the Linux kernel, to allow the initial cwnd to be set
 as a socket option - this would be a huge help to sites that communicate
 with the same clients over and over and/or with many small requests,
 allowing a full response in one (or at least fewer) round trips. For one
 site that I work on that is over 250 ms away with a very reliable gateway
 on the other end, I burn through several round trips to deliver an
 icon/small gif/etc - an icon that could have all the necessary packets in
 flight before the first ack. It turns out the small initial cwnd creates
 more traffic across the undersea cables than an initial cwnd of 8 or 10
 or 12.
 
 http://www.amailbox.org/mailarchive/linux-netdev/2010/5/26/6278007

Indeed, it can be nice in mobile environments for instance, where the
RTT is quite high. It does not seem too hard to add; I'm adding this
to the 1.5 TODO list.
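
To illustrate what such a per-socket knob might look like once it exists,
here is a minimal, purely hypothetical sketch - the TCP_INIT_CWND name and
value below are invented for illustration and do not correspond to a merged
kernel interface:

#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

#ifndef TCP_INIT_CWND
#define TCP_INIT_CWND 99        /* hypothetical option number */
#endif

/* Ask for an initial congestion window of <segments> MSS-sized
 * segments on <fd>, before any data is sent. */
static int set_init_cwnd(int fd, int segments)
{
        return setsockopt(fd, IPPROTO_TCP, TCP_INIT_CWND,
                          &segments, sizeof(segments));
}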

 I also wanted to see if you were aware of two other recent kernel changes
 that could be helpful to haproxy performance; the first could be helpful
 for the new UNIX socket connections in recent haproxy versions:
 
 Implementation of recvmmsg:
 recvmmsg() is a new syscall that allows receiving, with a single syscall,
 multiple messages that would otherwise require multiple calls to
 recvmsg(). For high-bandwidth, small-packet applications, throughput and
 latency are improved greatly.

Unfortunately, this will have no effect here: recvmmsg()'s goal is to
receive multiple datagrams at once, but we're working with streams, not
datagrams, and segments are already combined so that each read returns
as much data as possible.

A small improvement we can work on is to use accept4() instead of accept()
to save one fcntl() call (setting the accepted socket non-blocking).
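
For reference, a minimal sketch of the difference (not haproxy's actual
code, just the pattern; accept4() needs Linux 2.6.28+ and a recent glibc):

#define _GNU_SOURCE             /* for accept4() */
#include <sys/socket.h>
#include <fcntl.h>

/* Accept a connection and make it non-blocking in a single syscall. */
static int accept_nonblock(int lfd)
{
        /* old way, two syscalls per connection:
         *     int fd = accept(lfd, NULL, NULL);
         *     if (fd >= 0)
         *             fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK);
         */
        return accept4(lfd, NULL, NULL, SOCK_NONBLOCK);
}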

 http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a2e2725541fad72416326798c2d7fa4dafb7d337
 
 The second is RPS from Google, to improve network processing performance
 with multiple CPUs - similar in effect to MSI-X, but Google found that
 both together performed even better than MSI-X alone:
 
 http://kernelnewbies.org/Linux_2_6_35#head-94daf753b96280181e79a71ca4bb7f7a423e302a
 
 http://lwn.net/Articles/362339/

Yes, I've followed that. There is nothing to do to make use of that,
you just need to upgrade your kernel :-)
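
(One note: depending on the kernel and distribution defaults, RPS may still
need to be enabled per receive queue by writing a CPU mask to
/sys/class/net/<iface>/queues/rx-<n>/rps_cpus, for example
"echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus" to spread that queue's
processing over the first four CPUs - the interface name and mask here are
only examples.)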

Cheers,
Willy




Feature Requests: march native and cwnd setting param

2010-11-27 Thread Hank A. Paulson
1 - With recent CPUs (Intel 5300/5400/5500/5600 and AMD 6100), the set of
optimal compiler optimization settings :) is not something anyone can keep up
with - not to mention different versions of gcc that understand none, some or
all of the features of these CPUs. -march=native allows gcc to take on the
burden of choosing the compile-time settings, so if that could be added as
one of the options in the makefile, it would be helpful because then I could
use the same make... line on every machine but it would self-adjust for that
machine. Obviously, this is not a setting that distros would use to spin
package binaries, but it is great for getting the optimal settings for a
given machine. Examples:


model name  : Intel(R) Xeon(R) CPU   E5520  @ 2.27GHz

# cc -march=native -E -v - < /dev/null 2>&1 | fgrep cc1

/usr/libexec/gcc/x86_64-redhat-linux/4.4.5/cc1 -E -quiet -v - -march=core2 
-mcx16 -msahf -mpopcnt -msse4.2 --param l1-cache-size=32 --param 
l1-cache-line-size=64 --param l2-cache-size=8192 -mtune=core2


model name  : AMD Opteron(tm) Processor 6172

[r...@hesj3-m41 cron.d]# cc -march=native -E -v - < /dev/null 2>&1 | fgrep cc1

/usr/libexec/gcc/x86_64-redhat-linux/4.5.1/cc1 -E -quiet -v - -march=amdfam10 
-mcx16 -msahf -mpopcnt -mabm --param l1-cache-size=64 --param 
l1-cache-line-size=64 --param l2-cache-size=512 -mtune=amdfam10



2 - Google has pushed, via both TCP-related RFCs and patches to the networking
code of the Linux kernel, to allow the initial cwnd to be set as a socket
option - this would be a huge help to sites that communicate with the same
clients over and over and/or with many small requests, allowing a full
response in one (or at least fewer) round trips. For one site that I work on
that is over 250 ms away with a very reliable gateway on the other end, I burn
through several round trips to deliver an icon/small gif/etc - an icon that
could have all the necessary packets in flight before the first ack. It turns
out the small initial cwnd creates more traffic across the undersea cables
than an initial cwnd of 8 or 10 or 12.


http://www.amailbox.org/mailarchive/linux-netdev/2010/5/26/6278007

I also wanted to see if you were aware of two other recent kernel changes that
could be helpful to haproxy performance; the first could be helpful for the
new UNIX socket connections in recent haproxy versions:


Implementation of recvmmsg:
recvmmsg() is a new syscall that allows receiving, with a single syscall,
multiple messages that would otherwise require multiple calls to recvmsg().
For high-bandwidth, small-packet applications, throughput and latency are
improved greatly.


http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a2e2725541fad72416326798c2d7fa4dafb7d337

The second is RPS from Google, to improve network processing performance with
multiple CPUs - similar in effect to MSI-X, but Google found that both
together performed even better than MSI-X alone:


http://kernelnewbies.org/Linux_2_6_35#head-94daf753b96280181e79a71ca4bb7f7a423e302a

http://lwn.net/Articles/362339/