Re: Nbproc question

2009-09-30 Thread Mariusz Gronczewski
2009/9/29 Willy Tarreau w...@1wt.eu:
 On Tue, Sep 29, 2009 at 10:41:28AM -0700, David Birdsong wrote:
 (...)
  Which translates into that for one CPU :
   10% user
   40% system
   50% soft-irq
 
  This means that 90% of the time is spent in the kernel (network stack+drivers).
  Do you have a high bit rate (multi-gigabit) ? Are you sure you aren't running
  with any ip_conntrack/nf_conntrack module loaded ? Can you show the output of

 do you recommend against these modules?  we have a stock fedora 10
 kernel that has nf_conntrack compiled in statically.

 By default I recommend against it because it's never tuned for server usage,
 and if people don't know if they are using it, then they might be using it
 with inadequate desktop tuning.

 i've increased:
 /proc/sys/net/netfilter/nf_conntrack_max but is it correct to expect
 connection tracking to add kernel networking cpu overhead due to
 netfilter?  i've speculated that it might, but fruitless searches for
 discussions that would suggest so have restrained me from bothering to
 re-compile a custom kernel for our haproxy machines.

 Yes, from my experience, using conntrack on a machine (with large enough
 hash buckets) still results in 1/3 of the CPU being usable for haproxy+system
 and 2/3 being consumed by conntrack. You must understand that when running
 conntrack on a proxy, it has to set up and tear down two connections per
 proxy connection, which explains why it ends up using that much CPU.

 Often, if you absolutely need conntrack to NAT packets, the solution consists
 in setting it up on one front machine and having the proxies on a second-level
 machine (run both in series). It will *triple* the performance because the
 number of conntrack entries will be halved and the proxy will have more CPU
 available to run.
You could also try to do something like this

# iptables -I PREROUTING -p tcp --dport 80 -j NOTRACK
# iptables -I OUTPUT -p tcp --dport 80 -j NOTRACK

It should disable connection tracking for packets to/from haproxy.

Regards,
Mariusz



Re: Nbproc question

2009-09-30 Thread Mariusz Gronczewski
2009/9/30 Mariusz Gronczewski xani...@gmail.com:
 (...)
 You could also try to do something like this

 # iptables -I PREROUTING -p tcp --dport 80 -j NOTRACK
 # iptables -I OUTPUT -p tcp --dport 80 -j NOTRACK
# iptables -t raw -I PREROUTING -p tcp  --dport 80 -j NOTRACK
# iptables -t raw -I OUTPUT -p tcp --dport 80 -j NOTRACK
Sorry for the mistake.
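
As a quick sanity check (assuming the usual netfilter proc entries are
exposed on the kernel in question), the conntrack table can be watched before
and after loading the rules; if the NOTRACK rules match, the entry count
should stop growing along with the HTTP traffic:

# cat /proc/sys/net/netfilter/nf_conntrack_count
# cat /proc/sys/net/netfilter/nf_conntrack_max
# iptables -t raw -L -nv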



RE: Nbproc question

2009-09-29 Thread Jonah Horowitz
Here's the output of top on the system:

top - 09:50:36 up 4 days, 18:50,  1 user,  load average: 1.31, 1.59, 1.55
Tasks: 117 total,   2 running, 115 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.5%us,  9.9%sy,  0.0%ni, 75.0%id,  0.0%wa,  0.5%hi, 12.1%si,  0.0%st
Mem:   8179536k total,   997748k used,  7181788k free,   139236k buffers
Swap:  9976356k total,0k used,  9976356k free,   460396k cached

    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 752741 daemon    20   0 34760  24m  860 R  100  0.3 871:15.76 haproxy

It's a quad core system, but haproxy is taking 100% of one core.

We're doing less than 5k req/sec and the box has two 2.6 GHz Opterons in it.

Do you know how much health checks affect CPU utilization of an haproxy process?

We have about 100 backend servers and we're running with inter 500 rise 2 fall 1.

I haven't tried adjusting that, although when it was set to the default our 
error rates were much higher.
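
For reference (a sketch only; the backend name, server names and addresses
below are made up), those parameters go on each server line, and with roughly
100 servers checked every 500 ms that works out to about 100 / 0.5 s = 200
checks per second:

backend app_servers
    # check every 500 ms, mark up after 2 good checks, down after 1 failure
    server web01 10.0.0.1:80 check inter 500 rise 2 fall 1
    server web02 10.0.0.2:80 check inter 500 rise 2 fall 1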

Thanks,

Jonah


-Original Message-
From: Willy Tarreau [mailto:w...@1wt.eu] 
Sent: Monday, September 28, 2009 9:50 PM
To: Jonah Horowitz
Cc: haproxy@formilux.org
Subject: Re: Nbproc question

On Mon, Sep 28, 2009 at 06:43:58PM -0700, Jonah Horowitz wrote:
 The documentation seems to discourage using the nbproc directive.
 What's the situation with this?  I'm running a server with 8 cores, so I'm
 tempted to up the nbproc.  Is the process normally multithreaded?

No, the process is not multithreaded.

 Is nbproc
 something I can use for performance tuning, or is it just for file handles?

It can bring you small performance gains at the expense of more complex
monitoring, since the stats will only reflect the process which receives
the stats request. Also, health checks will be performed by each process,
causing an increased load on your servers. And the connection limitation
will not work anymore, as no process will know that other processes are
already using a server.

It was initially designed to work around per-process file-handle
limitations on some systems, but it is true that it brings a minor
performance advantage.
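
As a rough sketch of what this looks like in a configuration (the value 4 is
arbitrary), the directive simply goes in the global section:

global
    nbproc 4
    # with several processes: stats only reflect the process that answers
    # the stats request, each process runs its own health checks, and
    # maxconn-style limits are enforced per process rather than globally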

However, considering that you can reach 40,000 connections per second
with a single process on a cheap core2duo at 2.66 GHz, and that forwarding
data at 10 Gbps on this machine consumes only 20% of a core, you can
certainly understand why I see few situations where it would make sense
to use nbproc.

Regards,
Willy




Re: Nbproc question

2009-09-29 Thread David Birdsong
On Tue, Sep 29, 2009 at 10:30 AM, Willy Tarreau w...@1wt.eu wrote:
 On Tue, Sep 29, 2009 at 09:56:51AM -0700, Jonah Horowitz wrote:
 Here's the output of top on the system:

 top - 09:50:36 up 4 days, 18:50,  1 user,  load average: 1.31, 1.59, 1.55
 Tasks: 117 total,   2 running, 115 sleeping,   0 stopped,   0 zombie
 Cpu(s):  2.5%us,  9.9%sy,  0.0%ni, 75.0%id,  0.0%wa,  0.5%hi, 12.1%si,  0.0%st
 Mem:   8179536k total,   997748k used,  7181788k free,   139236k buffers
 Swap:  9976356k total,        0k used,  9976356k free,   460396k cached

     PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  752741 daemon    20   0 34760  24m  860 R  100  0.3 871:15.76 haproxy

 It's a quad core system, but haproxy is taking 100% of one core.

 We're doing less than 5k req/sec and the box has two 2.6 GHz Opterons in it.

 huh! surely there's something unusually wrong here.

 Do you know how much health checks affect CPU utilization of an haproxy process?

 We have about 100 backend servers and we're running inter 500 rise 2 fall 1

 It means only 200 checks per second, that's not much at all. I've run tests up
 to 4,000 checks per second, so you should not even notice it.

 I haven't tried adjusting that, although when it was set to the default our 
 error rates were much higher.

 Could you send me your conf in private ?

 Also, what's your data rate ? I'm seeing the following CPU usage :
  Cpu(s):  2.5%us,  9.9%sy,  0.0%ni, 75.0%id,  0.0%wa,  0.5%hi, 12.1%si,  0.0%st

 Which translates into that for one CPU :
  10% user
  40% system
  50% soft-irq

 This means that 90% of the time is spent in the kernel (network stack+drivers).
 Do you have a high bit rate (multi-gigabit) ? Are you sure you aren't running
 with any ip_conntrack/nf_conntrack module loaded ? Can you show the output of
do you recommend against these modules?  we have a stock fedora 10
kernel that has nf_conntrack compiled in statically.  i've increased:
/proc/sys/net/netfilter/nf_conntrack_max but is it correct to expect
connection tracking to add kernel networking cpu overhead due to
netfilter?  i've speculated that it might, but fruitless searches for
discussions that would suggest so have restrained me from bothering to
re-compile a custom kernel for our haproxy machines.

 haproxy -vv ? I'd like to see if epoll support is correctly enabled. Also,
 please send the output of uname -a. Ah, please also check the clock sources:

 # cat /sys/devices/system/clocksource/clocksource0/current_clocksource
 # cat /sys/devices/system/clocksource/clocksource0/available_clocksource

 Many dual-core opterons had no synchronization for their internal timestamp
 counters, so those were often disabled and replaced with slow external clock
 sources, resulting in poor network performance.
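
If the current clocksource turns out to be one of those slow external ones
and tsc is listed as available (and is known to be stable on the machine), it
can be switched at runtime with something like:

# echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource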

 Something nice on an opteron however is that you can nearly double the
 performance by binding the network interrupts to one core and the process
 on the other one of the same socket.
this is intriguing.  can this be done with other multi-core cpus?  do
you have any documentation that i could read to learn more about this?


 Regards,
 Willy






Re: Nbproc question

2009-09-29 Thread Willy Tarreau
On Tue, Sep 29, 2009 at 10:41:28AM -0700, David Birdsong wrote:
(...)
  Which translates into that for one CPU :
   10% user
   40% system
   50% soft-irq
 
  This means that 90% of the time is spent in the kernel (network stack+drivers).
  Do you have a high bit rate (multi-gigabit) ? Are you sure you aren't running
  with any ip_conntrack/nf_conntrack module loaded ? Can you show the output of

 do you recommend against these modules?  we have a stock fedora 10
 kernel that has nf_conntrack compiled in statically.

By default I recommend against it because it's never tuned for server usage,
and if people don't know if they are using it, then they might be using it
with inadequate desktop tuning.

 i've increased:
 /proc/sys/net/netfilter/nf_conntrack_max but is it correct to expect
 connection tracking to add kernel networking cpu overhead due to
 netfilter?  i've speculated that it might, but fruitless searches for
 discussions that would suggest so have restrained me from bothering to
 re-compile a custom kernel for our haproxy machines.

Yes, from my experience, using conntrack on a machine (with large enough
hash buckets) still results in 1/3 of the CPU being usable for haproxy+system
and 2/3 being consumed by conntrack. You must understand that when running
conntrack on a proxy, it has to set up and tear down two connections per
proxy connection, which explains why it ends up using that much CPU.

Often, if you absolutely need conntrack to NAT packets, the solution consists
in setting it up on one front machine and having the proxies on a second-level
machine (run both in series). It will *triple* the performance because the
number of conntrack entries will be halved and the proxy will have more CPU
available to run.
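
For completeness, "large enough hash buckets" usually means raising the hash
table size together with the entry limit; the values below are only
illustrative, and the hashsize path assumes nf_conntrack exposes its module
parameters under /sys/module:

# sysctl -w net.netfilter.nf_conntrack_max=1048576
# echo 262144 > /sys/module/nf_conntrack/parameters/hashsize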

  Something nice on an opteron however is that you can nearly double the
  performance by binding the network interrupts to one core and the process
  on the other one of the same socket.
 this is intriguing.  can this be done with other multi-core cpus?

Yes, it can, but it's only possible/efficient when the L2 cache is shared,
which is the case on opterons. With an L3 cache it will not be as efficient,
but it will still help. When your caches are completely independent, having
packets parsed by one core and then passed to the other core through slow
memory is horribly inefficient, as the data cross the memory bus twice for
nothing!
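
A rough sketch of that binding on Linux (the IRQ number 24 and the CPU
numbers are only examples; the real IRQ comes from /proc/interrupts):

# grep eth0 /proc/interrupts                  # find the NIC's IRQ number
# echo 1 > /proc/irq/24/smp_affinity          # pin that IRQ to CPU0 (mask 0x1)
# taskset -c 1 haproxy -f /etc/haproxy/haproxy.cfg    # run haproxy on CPU1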

 do you have any documentation that i could read to learn more about this?

Not that much. I remember there are some useful tuning tricks on the Myricom
site and/or in some of their drivers' READMEs. That's where I discovered the
DCA mechanism that I was not aware of.

Willy