[if you want to look at the numbers, skip a few paragraphs]
So with the pktgenif tool I started looking at how long a
single-threaded UDP sendto() takes, and how to bring that figure down.
I'm documenting the steps in this mail so that everyone can repeat them
and/or suggest further improvements.
Note that this is only step 1: single-core send, which is mostly about
getting the base cost of processing a packet low while ignoring
scheduling. Part 2 will be receiving, which, unlike send, includes at
least two threads: one pulling packets off of the "NIC" and stuffing
them into the right place, and one actually reading them in the
application. Part 3 is most likely something about multithreaded
performance, though I still maintain that it's easier and more
performant to structure things so that you just run multiple disjoint
networking stacks.
One thing to note is that since pktgenif is a "dummy" interface, there
is no actual I/O happening. That can be visited later, though I don't
think it will make much of a difference, since e.g. netmap is supposed
to be able to transmit a packet in ~60 cycles, and I'm sure other high
performance userspace packet shovellers that rump kernels already
integrate with, such as DPDK, sport similar figures. It takes _a lot_
more CPU cycles to prepare the packets than to actually send them, as we
will see shortly. Logically, this makes sense too, as most of the
shovelling work is done off the CPU, while the preparation is done on
the CPU (at least when running the rump kernel on the CPU, like I am
doing here).
The measurements were done on my laptop, which has a 2.53GHz Core2Duo
T9400. Yes, that CPU was released >5 years ago. Newer CPUs should have
better performance (or at least I'd hope so). The measurements were
done by running the tool in the rump-pktgenif repository as "./testtool
-r config.sh -c 1000000 send", where config.sh is config.sh.example with
the directory appropriately edited. This results in a million 22-byte
UDP payloads being sent as 64-byte ethernet frames (22 payload + 8 UDP
header + 20 IP header + 14 ethernet header = 64).
The variable RUMP_NCPU was set to "1" (since we don't need more in this
case).
I present both the sends/second numbers output by testtool, as well as
the cycle count as obtained from the "rdtsc" instruction executed on
both sides of sendto. The math between them works out (more or less
...), so use whichever figure you perceive to be nicer. Counting cycles
affects the send/s figure a bit, since the cycle count is taken in
every loop iteration while send/s comes from an aggregate timestamp
calculation, but they're close enough for this rough purpose.
The following numbers are cumulative, so improvement n includes
everything between improvement [1,n-1] (unless otherwise stated, usually
because the optimization ended up being slower).
0: baseline
This is just a normal build with the current "./buildrump.sh".
cycles: 4753
send/s: 517827
1: compiler optimizations
buildrump.sh with BUILDRUMP_DBG='-O2 -march=native -mtune=native -g'.
cycles: 4994
send/s: 496492
Yes, "better" compiler optimizations make it 5% slower on my system with
gcc 4.8.1. We'll not use this "optimization" in the next step. I did
try this occasionally with the later steps, but it was always slower.
Before everyone goes screaming "cache footprint" as the reason, I'd like
to note that testing with -Os resulted in way less performance.
2: static linking (!PIC)
Dynamically linked code is slower than statically linked code (due to the
extra calculations position independence requires), so we use static
rump kernel components. The easiest way to accomplish this is to run
buildrump.sh with -V MKPIC=no which prevents PIC libs from being created
(some day I'll fix "rumpmake LDSTATIC=-static" to work ...)
cycles: 3209
send/s: 763705
Usually the quoted performance difference between PIC and !PIC is ~20%,
but here it's much greater, not completely sure why. One explanation is
the next optimization:
3: curlwp
The simple profiling method of running "perf top" while the test is
running shows that a good deal of time is spent resolving "curlwp"
(i.e. the currently running rump kernel thread). In fact, curlwp is
called ~40 times per sendto. Since the current default implementation
does three function calls per curlwp (x86_curlwp -> rumpuser_curlwp ->
tls), that's 120 unnecessary function calls per a few thousand cycles.
Well, you do the math on how cheap you'd expect that to be. So per
Justin's suggestion, let's use an implementation that conditionally
inlines curlwp within a rump kernel using __thread.
cycles: 2737
send/s: 898800
I have this hacked up in a private tree. I plan to eventually commit a
version where you can conditionally enable the optimization. It won't
be on by default because 1) it's technically not legal, though it works
just fine in userspace, and 2) it breaks native NetBSD x86 ABI
compatibility, though that's almost never an issue.
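To illustrate the shape of the difference (a sketch with simplified
bodies, not the actual rump kernel code): the default path reaches the
thread-local slot through a chain of function calls, while the __thread
variant lets the compiler inline the TLS load at every curlwp call site.

```c
/* sketch of an lwp; the real struct lwp is of course much bigger */
struct lwp {
	int l_dummy;
};

/* thread-local slot holding the currently running rump kernel thread */
static __thread struct lwp *cur;

void curlwp_set(struct lwp *l) { cur = l; }

/* default-style path: each curlwp resolution goes through a call chain
 * (x86_curlwp -> rumpuser_curlwp -> TLS read), so ~40 resolutions per
 * sendto means ~120 function calls */
struct lwp *rumpuser_curlwp(void) { return cur; }
struct lwp *x86_curlwp(void) { return rumpuser_curlwp(); }

/* __thread-style path: the TLS load inlines into the caller, so the
 * same ~40 resolutions cost ~40 plain loads and zero calls */
static inline struct lwp *curlwp_inline(void) { return cur; }
```

Both paths return the same pointer; only the per-call overhead differs.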
4: locks_up
An advantage of using a rump kernel is that you can run multiple
networking stacks, one per core. If you allow only one thread inside a
rump kernel (RUMP_NCPU=1), atomic memory accesses and barriers for lock
fastpaths can be optimized away. This is done by compiling librump with
locks_up.c instead of locks.c (see sys/rump/librump/rumpkern/Makefile.rump)
cycles: 2517
send/s: 972478
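A sketch of the idea (hypothetical names, heavily simplified from the
real locks_up.c): with at most one thread inside the rump kernel at a
time, the mutex fastpath degenerates to plain loads and stores, with no
compare-and-swap and no memory barriers.

```c
#include <assert.h>

/* uniprocessor-style mutex sketch: with RUMP_NCPU=1 only one thread can
 * be inside the rump kernel, so the fastpath cannot race.  the real
 * locks_up.c must additionally handle blocking/waking lwps when the
 * lock is held across a sleep. */
struct upmtx {
	int um_locked;
};

static void
upmtx_enter(struct upmtx *um)
{

	/* plain load+store suffices; no atomic ops or barriers needed.
	 * this sketch omits the blocking path entirely. */
	assert(um->um_locked == 0);
	um->um_locked = 1;
}

static void
upmtx_exit(struct upmtx *um)
{

	um->um_locked = 0;
}
```

Compare with the multiprocessor fastpath, which needs an atomic
compare-and-swap plus acquire/release barriers on every enter/exit.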
5: ip checksum
Looking at "perf top" again, IP checksumming is starting to take a long
time. Let's make if_virt say that it can do checksum offloading (which
in the case of it being backed by e.g. netmap is a more than reasonable
assumption). We need to add the following to the config script to
enable the capabilities:
./rumpremote ifconfig pg0 ip4csum udp4csum
cycles: 2357
send/s: 1041834
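For a sense of why checksumming shows up in the profile: the internet
checksum (RFC 1071) has to touch every halfword of the header on the
CPU, which is exactly the kind of per-packet work that offloading pushes
off the host. A generic, unoptimized version looks roughly like this
(real stacks use carefully unrolled, often assembly, variants):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ones' complement sum over buf, folded to 16 bits (RFC 1071 style) */
uint16_t
in_cksum(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint32_t sum = 0;

	while (len > 1) {
		sum += (uint32_t)p[0] << 8 | p[1];
		p += 2;
		len -= 2;
	}
	if (len)				/* odd trailing byte */
		sum += (uint32_t)p[0] << 8;
	while (sum >> 16)			/* fold carries back in */
		sum = (sum & 0xffff) + (sum >> 16);
	return ~sum & 0xffff;
}
```

A handy property for testing: recomputing the sum over a header with its
checksum field filled in yields 0.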
6: clang
And for final fun, let's see what kind of results we get with clang:
CC=clang ./buildrump.sh ...
cycles: 2399
send/s: 1020794
clang 3.2 is almost the same as gcc, just a little slower.
That was it for the simple optimizations. There's probably room to
squeeze more fat out, but those will have to wait for a later day. In
summary, seems like static linking and changing curlwp are the big
winners. Compiler "optimization" is the big loser. With large sendto()
payloads even my dated laptop CPU can saturate 10GigE.
- antti
_______________________________________________
rumpkernel-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rumpkernel-users