On Sun, Apr 29, 2012 at 09:05:26PM +0300, Bar Ziony wrote:
> > > I'm using VPS machines from Linode.com, they are quite powerful. They're
> > > based on Xen. I don't see the network card saturated.
> >
> > OK I see now. There's no point searching anywhere else. Once again you're
> > a victim of the high overhead of virtualization that vendors like to
> > pretend
> > is almost unnoticeable :-(
> 
> The overhead is really that huge?

Generally, yes. A packet entering the machine from a NIC has to be copied
by the NIC to memory, then the NIC raises an interrupt, which interrupts the
VM in its work and switches the CPU to the HV kernel. There the driver reads
the data from memory, does a lot of checking and decoding, and tries to
determine from the MAC address (or worse, IP+ports in the case of NAT) which
VM the packet is aimed at. Then it puts it in shape for delivery via the
appropriate means for that VM (depending on the type of driver emulation,
possibly involving splitting it into smaller chunks). Once the packet has
been well decorated, the HV simulates an interrupt to the VM and switches
back to it. The VM now enters its driver in order to read the packet. And
the fun begins. Depending on the driver, the number of exchanges with the
supposedly real NIC will cause a ping-pong between the driver in the VM and
the NIC emulator in the HV. For instance, reading a status flag on the NIC
might cause a double context switch. If the driver regularly reads this
status flag, the context might switch a lot. This is extremely expensive for
each packet. And note that I'm not even counting the overhead of the
multiple memory copies for the same incoming packet.

That's why HV vendors propose more direct, paravirtualized drivers (vmxnet,
xen-vnif, hv_netvsc). You'll also note that performance using these drivers
can be up to 30-40 times better than with NIC emulation, simply by avoiding
many ping-pong games between the two sides. Reaching such gains just by
saving these exchanges gives you an idea of the extreme overhead which
remains even for a single bounce.
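
As a side note, if you want to check which driver your VM is actually using
(eth0 below is just an example interface name), ethtool will report it:

    ethtool -i eth0

The "driver" field tells you whether you're running on the emulated NIC or
on one of the paravirtualized drivers mentioned above.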

You'll often see benchmarks of outgoing traffic leaving VMs for file serving,
showing reasonably good performance, but you'll probably never see benchmarks
of incoming or mixed traffic (eg: proxies). On the network, a send is always
much cheaper than a receive, because memory copies and buffer parsing can be
avoided, leaving just a virtual context switch as the remaining overhead.

But on the receive path, clearly everything counts, including the real NIC
hardware and interrupt affinity, which you have no control over.

> > It's having the module loaded with default settings which is harmful, so
> > even unloading the rules will not change anything. Anyway, now I'm pretty
> > sure that the overhead caused by the default conntrack settings is nothing
> > compared with the overhead of Xen.
> >
> 
> Why is it harmful that it's loaded with default settings? Could it be
> disabled?

The default hashtable size is very small and depends on the available
memory. With 1 GB or more you get 16k buckets; below that, the bucket count
is your RAM size divided by 16384. When you stack a few hundred thousand
connections there (which happens very quickly at 6k conn/s with the default
time_wait setting of 120s), you end up with long lookups for each incoming
packet. You can change the setting when loading the module using the
hashsize parameter, or, in your case, by setting it here:

    /sys/module/nf_conntrack/parameters/hashsize

It's wise to have a hashsize no smaller than 1/4 of the maximum number of
entries you're expecting to get in the table. Multiply your conn rate by
120s (time_wait) to get the number of entries needed. Don't forget that on
a proxy, you have twice the number of connections. At 6k conn/s, you'd then
have to support 6k*2*120 = 1.44M conns in the table, and use a hash size of
at least 360k entries (better use a power of two, since there's a divide in
the kernel that many processors are able to optimize). So let's use 262144
entries for hashsize. Also, you should reduce the time_wait timeout to
around 30-60s, and increase the conntrack_max setting, which defaults to
64k (!) and would result in a total blocking of new connections after about
5s of a run at 6k conn/s on both sides.

What I would suggest is:
   - reduce /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_time_wait to 30s
   - increase /proc/sys/net/netfilter/nf_conntrack_max to 524288 conns.
   - increase hashsize to 131072 buckets.

This will help you support up to 8700 conn/s without trouble. You just
need to scale the latter two settings accordingly if you plan to go higher.
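
As a quick illustration, these can be applied at runtime like this (values
as discussed above, to be scaled to your expected rate):

    echo 30     > /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_time_wait
    echo 524288 > /proc/sys/net/netfilter/nf_conntrack_max
    echo 131072 > /sys/module/nf_conntrack/parameters/hashsize

To make the hash size persistent across module loads, you can also pass it
as a module parameter instead (eg: "options nf_conntrack hashsize=131072"
in your modprobe configuration).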

> > That's quite simple: it has two sides so it must process twice the number
> > of packets. Since you're virtualized, you're packet-bound. Most of the time
> > is spent communicating with the host and with the network, so the more
> > packets there are, the less performance you get. That's why you're seeing
> > a 2x increase even with nginx when enabling keep-alive.
> >
> 
> 1. Can you explain what it means that I'm packet-bound, and why it's
> happening since I'm using virtualization?

That's what I explained above :-)

> 2. When you say twice the number of packets, you mean: Client sends request
> (as 1 or more packets) to haproxy which intercepts it, acts upon it and
> sends a new request (1 or more packets) to the server, which then again
> sends the response, that's why it's twice the number of packets? It's not
> twice the bandwidth of using the web-server directly right?

To be exact, it's twice the number of requests and responses, since whatever
goes in has to go out again, which effectively doubles the number of
packets. The total bandwidth is not really doubled, because in web
environments responses are generally larger than requests, but at least
you'll see the sum of your web servers' in+out traffic in each direction on
the proxy.

> > You should be able to slightly increase performance by adding the following
> > options in your defaults section :
> >
> >   option tcp-smart-accept
> >   option tcp-smart-connect
> >
> 
> Thanks! I think it did help and now I get 3700 req/sec without -k, and
> almost 5000 req/sec with -k.

So this is proof that it's the packet processing cost which is high.

> I do have a small issue (it was there before I added these options): when
> doing 'ab -n 10000 -c 60 http://lb-01/test.html', 'ab' gets stuck for a
> second or two at the end, causing the req/sec to drop to around 2000
> req/sec. If I Ctrl+c before the end, I see the numbers above. Is this
> happening because of 'ab' or because of something with my setup? With -k it
> doesn't happen. And I also think it doesn't always happen with the second,
> passive LB (when I tested it).

You're not using enough requests. ab collects many statistics, and if some
requests experience hiccups, network delays or anything else, they lower its
average rate. I like to make it run for at least 10s to get approximately
accurate numbers, so in your case this would mean something like "-n 60000".
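
For instance, keeping the same concurrency and URL as your earlier run, that
would be something like:

    ab -n 60000 -c 60 http://lb-01/test.html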

> > Each of them will save one packet during the TCP handshake, which may
> > slightly compensate for the losses caused by virtualization. Note that
> > I have also encountered a situation once where conntrack was loaded
> > on the hypervisor and not tuned at all, resulting in extremely low
> > performance. The effect is that the performance continuously drops as
> > you add requests, until your source ports roll over and the performance
> > remains stable. In your case, you run with only 10k reqs, which is not
> > enough to measure the performance under such conditions. You should have
> > one injector running a constant load (eg: 1M requests in a loop) and
> > another one running the 10k reqs several times in a row to observe if
> > the results are stable or not.
> >
> 
> What do you mean by "until your source ports roll over"? I'm sorry, but I
> didn't quite understand the meaning of your proposed check.

You have 64k possible source ports in TCP. The system allocates them in
turn when you establish outgoing connections. Once you reach the last
one, it starts reusing the oldest released one. For conntrack, this means
that you're refreshing an older connection which was still in the table,
so it does not add to the conntrack count once you've reached the limit.
For instance, let's suppose your systems are not tuned and run with the
default 32768..61000 range. Once you've used these 28232 ports, you have
the same number of entries in TIME_WAIT state in your conntrack table.
But as soon as you start reusing a port for which a session still exists
in the table, that session is refreshed, so you never have more than
28232 sessions there. In production this is very different, because your
visitors don't click at these rates and don't run out of ports, but there
are many of them. So their ports multiplied by their IPs add up to many
more sessions. These high session counts emphasize the effect of the small
default netfilter hash table, which you don't observe much during your
tests. That said, it is possible that you'd already notice a minor
difference if you reduced ip_local_port_range to 10000 ports or increased
it to 64k ports.
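
If you want to experiment with that, the range is exposed here; the values
below are only an example of widening it to roughly 64k ports:

    cat /proc/sys/net/ipv4/ip_local_port_range
    echo "1024 65535" > /proc/sys/net/ipv4/ip_local_port_range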

> > Last point about virtualized environments: they're really fine if
> > you're seeking low cost before performance. However, if you're building
> > a high traffic site (>6k req/s might qualify as a high traffic site),
> > you'd be better off with real hardware. You would not want to fail such
> > a site just to save a few dollars a month. To give you an idea, even
> > with a 15EUR/month dedibox consisting of a single-core VIA Nano
> > processor and which runs nf_conntrack, I can achieve 14300 req/s.
> >
> 
> It's very unlikely that we'll move to dedicated boxes. It's not the money
> (we have some to spare :)), but the maintainability and scalability of the
> setup.
> Everything is scalable in our setup besides the LB, which just has a
> passive failover machine with keepalived. We're also already deeply
> "invested" in this setup and it would be a big pain to migrate. We don't
> have the manpower for that.

In my opinion, virtualization is not all black or all white. It's good
for some things (eg: easily adding new application servers when needed),
and bad for others (high I/O rates). Also, when you get into trouble,
it's a nightmare to debug. Believe me, I know a big site I won't publicly
disclose which experienced a major failure for a whole week for exactly
this reason. Everything appeared OK (CPU usage, memory, etc.), but when you
see that under load a ping can take 30s between two VMs, you know that
something is going wrong. They finally managed an emergency migration on
the last day of the event they were covering. All they knew was that it was
not their fault and that something was wrong at the hosting company, without
any way to prove anything since they didn't have access to the underlying
hardware.

> I'm now afraid that it won't fit us in the long run, since we're now
> peaking at 700 req/sec and averaging 350-400 req/sec.

Then you still have some margin.

> It's still a 4.5x difference from our LB's max, but it's spooky. Do you
> think that having a bigger VM (more RAM, more CPU) will help?

It can, but there is no guarantee at all. The worst thing that can happen
to low latency components such as proxies, firewalls, routers etc. is
to be installed on a machine where other VMs are running. And guess how
your provider covers hardware expenses? By installing tens of VMs on a
single piece of hardware. It's not uncommon to observe no difference at all
between small and large VMs, yet it's common to observe very different
performance between morning and afternoon, for instance.

> Can you think of other options/tuning stuff, even not only to haproxy but
> to the kernel , that could help out in this case?

From your side, once you've tuned the kernel and haproxy, there are not
many more tunables. You need an up to date network driver, but that's
about all you can play with. Installing a cache on this machine might
help, since it could significantly reduce the number of packets going
to the web servers.

> How is it possible that such a low-performance dedicated box can handle
> almost 5 times the req/sec rate as a powerful VM?

It's not only possible, it's common. From all the tests I have run to date,
the typical performance ratio between native and virtualized is between 4
and 6. If you look at the numbers published on the Exceliance blog, you'll
note that Xen only does 14k while native was at 70k. That's exactly a
factor of 5.

In my opinion, you should stay hosted there as long as possible since
you have invested a lot in this setup, but start thinking about alternative
hosting solutions for critical components such as the load balancer, which
is supposed to bring you scalability. Crippling this component is not
the smartest idea for long term growth, I think. And if you decide to
install a cache, be sure to do the same for it (and if possible, to
install it on the same machine as the LB).

Regards,
Willy
