Hi Robert,

On Tue, May 19, 2015 at 04:10:54PM -0700, Robert Brooks wrote:
> On Mon, May 18, 2015 at 7:58 PM, Willy Tarreau <[email protected]> wrote:
> 
> > It's useless at such sizes. A rule of thumb is that splicing will not be
> > used at all for anything that completely fits in a buffer since haproxy
> > tries to read a whole response at once and needs to parse the HTTP headers
> > anyway. In general, the cost of the splice() system calls compared to
> > copying data sees a break-even around 4-16kB depending on arch, memory
> > speed, setup and many factors. I find splice very useful on moderately
> > large objects (>64kB) at high bit rates (10-60 Gbps). On most gigabit
> > NICs, it's inefficient as most such NICs have limited capabilities which
> > make them less efficient at performing multiple DMA accesses for a single
> > packet (eg: no hardware scatter-gather to enable GSO/TSO for example).
> >
> 
> I had considered doing tcp proxying at the top level haproxy such that this
> workload is passed down to individual hosts and then maybe splice would be
> effective, but I am not sure it is enough of a win. Managing http as such
> seems to be more correct and getting http stats is useful too.

Yes, and haproxy is faster in HTTP than in TCP. While it may sound surprising,
it's because there are states in HTTP which allow haproxy to know what events
to care about (eg: poll for reads only at certain moments, etc) and when small
packets may be aggregated.

Regardless of this, splicing is really useless for small packets.
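
For reference, splicing has to be enabled in the configuration anyway; here is
a minimal sketch of the relevant knobs, assuming a 1.5 build with Linux splice
support (the values below are just placeholders) :

    global
        tune.pipesize 524288        # size of the kernel pipes used by splice()

    defaults
        mode http
        option splice-response      # splice server-to-client data only
        # option splice-request     # same for the client-to-server direction
        # option splice-auto        # or let haproxy decide per direction

None of this will help for responses which entirely fit in a buffer.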

> We are running Intel I350 Gigabit NICs which appear to support most TCP
> offloading except LRO.

OK, I have one of those as well. They're quite good indeed, but don't expect to
gain anything by using splicing. Running at gig speed is very cheap even
with recv/send. With 1500-byte packets, that's only about 80000 packets per
second (1 Gbps / ~1500 bytes per packet). The Atom in my NFS server can
saturate its link with about half a core. With small HTTP responses, however,
you will clearly not saturate the gig link anyway.

> We previously had Broadcom interfaces, which again
> had good TCP offload, but appear to have interrupt coalescing issues. Are
> you still using Myricom cards?

I still have them in my home lab and use them from time to time, yes.
These cards are very interesting for testing your software because they
combine very low latency with very little CPU usage in the driver. So
you can reach a 10 Gbps forwarding rate with only 25% of one core on
an old Core 2 Duo, and you're never concerned with ksoftirqd triggering
during your tests. However, I found that I was hitting packet rate
limitations with them: it's not possible to reach 10G in TCP with
packets smaller than 1500 bytes plus their respective ACKs, which
is about 800 kpps in each direction. So in our appliances we have switched
to Intel 82599 NICs, whose driver is much heavier but which can saturate
10G at any packet size.

> Any thoughts if these would be a win with our workload? Our data rates are
> relatively small, it's all about request rates.

I know at least one site that managed to significantly increase its
request rate by switching from gig NICs to Myricom for small requests.
If you look here:

    http://www.haproxy.org/10g.html

You'll see in the old benchmark (2009) that with 2kB objects we were at
about 33k req/s on a single process on the Core 2 using haproxy 1.3.

On recent hardware, Intel NICs can go higher than this because you
easily have more CPU power to dedicate to the driver. Reaching 60-100k req/s
is not uncommon with fine tuning at such small request sizes.

> > Nothing apparently wrong here, though the load could look a bit high
> > for that traffic (a single-core Atom N270 can do that at 100% CPU).
> > You should run a benchmark to find your system's limits. The CPU load
> > in general is not linear since most operations will be aggregated when
> > the connection rate increases. Still you're running at approx 50% CPU
> > and a lot of softirq, so I suspect that conntrack is enabled on the
> > system with default settings and not tuned for performance, which may
> > explain why the numbers look a bit high. But nothing to worry about.
> >
> >
> Conntrack is not used or loaded, however, what I believe is contributing to
> higher than expected system load is non keep-alive traffic. We also serve a
> lot of http 302s, about 1000 reqs/sec, keep-alive is disabled as these are
> single requests from individual clients.

1000 HTTP 302/s is almost idle. You should barely notice haproxy in "top"
at such a rate. Here's what I'm seeing on my Core 2 Quad @ 3 GHz at exactly
1000 connections per second:

Tasks: 178 total,   1 running, 175 sleeping,   2 stopped,   0 zombie
Cpu(s):  0.2%us,  1.0%sy,  0.2%ni, 98.0%id,  0.0%wa,  0.0%hi,  0.5%si,  0.0%st
Mem:   8168416k total,  7927908k used,   240508k free,   479316k buffers
Swap:  8393956k total,    18148k used,  8375808k free,  4886752k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND           
 9253 willy     20   0  4236  512  412 S    2  0.0   0:00.22 injectl4           
 9240 willy     20   0  3096 1096  812 S    1  0.0   0:01.83 haproxy            
    1 root      20   0   828   44   20 S    0  0.0   0:50.36 init               
    2 root      20   0     0    0    0 S    0  0.0   0:00.02 kthreadd           
    3 root      20   0     0    0    0 S    0  0.0   0:23.72 ksoftirqd/0        
 
=> 1% of one core for haproxy, 2% for injectl4.

> For these I would love to have backend connection pooling, I've seen it
> discussed, but based on my reading of the conversations it may be
> problematic to implement? Is this still a likely feature?

Yes, it will have to be implemented for HTTP/2, otherwise we'd lose the
server-side keep-alive that took so long to get! It will also be useful
when haproxy is installed in front of a fast cache like Varnish,
because for small objects most of the CPU time is spent in
connection setup.
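
In the meantime, plain server-side keep-alive (without sharing connections
between clients) can already be enabled; a minimal sketch assuming haproxy 1.5,
with placeholder names and addresses :

    defaults
        mode http
        option http-keep-alive          # keep both sides of the connection alive
        option prefer-last-server       # try to reuse the same server connection
        timeout http-keep-alive 10s

    backend app
        server app1 192.0.2.10:8080 maxconn 1000

True pooling (reusing idle server connections across different clients) is
what is still missing.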

I think you should definitely run a benchmark of your setup: your
CPU usage numbers still look quite high to me for that load, and I still
suspect something might be wrong on the system and/or user side.
Even the 20% user CPU for 7k req/s should be better, as that's the level
you could expect at 50-100k conn/s. Maybe you have a complex config though,
I don't know (eg: lots of rewrite rules or so).

A benchmark would tell you how far you are from the limits you could
reach.
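
For such a test, a stripped-down config which only emits the 302s, with no
backend behind it, is enough to isolate haproxy and the network stack; a rough
sketch (the address and redirect target are just placeholders) :

    frontend bench
        bind :8080
        mode http
        maxconn 20000
        # every request is answered with a redirect, like your production traffic
        redirect location http://www.example.com/ code 302

Then inject traffic at it from another machine and watch where the CPU goes
(us, sy, si) as you increase the load.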

Regards,
Willy

