[if you want to look at the numbers, skip a few paragraphs]

So with the pktgenif tool I started looking at how long a 
single-threaded UDP sendto() takes, and how to bring that figure down. 
I'm documenting the steps in this mail so that everyone can repeat them 
and/or suggest further improvements.

Note that this is only step 1: single-core send, which is mostly about 
getting the base cost of processing a packet low while ignoring 
scheduling.  Part 2 will be receiving, which, unlike send, involves at 
least two threads: one pulling packets off of the "NIC" and stuffing 
them into the right place, and one actually reading them in the 
application.  Part 3 is most likely something about multithreaded 
performance, though I still maintain that it's easier and more 
performant to structure things so that you just run multiple disjoint 
networking stacks.

One thing to note is that since pktgenif is a "dummy" interface, there 
is no actual I/O happening.  That can be revisited later, though I don't 
think it will make much of a difference, since e.g. netmap is supposed 
to be able to transmit a packet in ~60 cycles, and I'm sure other high 
performance userspace packet shovellers that rump kernels already 
integrate with, such as DPDK, sport similar figures.  It takes _a lot_ 
more CPU cycles to prepare the packets than to actually send them, as we 
will see shortly.  Logically, this makes sense too, as most of the 
shovelling work is done off the CPU, while the preparation is done on 
the CPU (at least when running the rump kernel on the CPU, like I am 
doing here).

The measurements were done on my laptop, which has a 2.53GHz Core2Duo 
T9400.  Yes, that CPU was released >5 years ago.  Newer CPUs should have 
better performance (or at least I'd hope so).  The measurements were 
done by running the tool in the rump-pktgenif repository as "./testtool 
-r config.sh -c 1000000 send", where config.sh is config.sh.example with 
the directory appropriately edited.  This results in a million 22-byte 
UDP payloads being sent as 64-byte ethernet frames (22+8+20+14 = 64). 
The variable RUMP_NCPU was set to "1" (since we don't need more in this 
case).

I present both the sends/second numbers output by testtool and the 
cycle count as obtained from the "rdtsc" instruction executed on both 
sides of sendto().  The math between them works out (more or less ...), 
so use whichever figure you perceive to be nicer.  Counting cycles 
affects the send/s figure a bit, since the former is done in every loop 
iteration while the latter is an aggregate timestamp calculation, but 
they're close enough for this rough purpose.

The following numbers are cumulative, so improvement n includes 
everything between improvement [1,n-1] (unless otherwise stated, usually 
because the optimization ended up being slower).


        0: baseline

This is just a normal build with the current "./buildrump.sh".

cycles: 4753
send/s: 517827


        1: compiler optimizations

buildrump.sh with BUILDRUMP_DBG='-O2 -march=native -mtune=native -g'.

cycles: 4994
send/s: 496492

Yes, "better" compiler optimizations make it 5% slower on my system with 
gcc 4.8.1.  We'll not use this "optimization" in the next step.  I did 
try this occasionally with the later steps, but it was always slower. 
Before everyone goes screaming "cache footprint" as the reason, I'd like 
to note that testing with -Os resulted in way less performance.


        2: static linking (!PIC)

Dynamically linked code is slower than statically linked code (due to 
the extra calculations position independence requires), so we use static 
rump kernel components.  The easiest way to accomplish this is to run 
buildrump.sh with -V MKPIC=no, which prevents PIC libs from being 
created (some day I'll fix "rumpmake LDSTATIC=-static" to work ...).

cycles: 3209
send/s: 763705

Usually the quoted performance difference between PIC and !PIC is ~20%, 
but here it's much greater; I'm not completely sure why.  One 
explanation is the next optimization:


        3: curlwp

The simple profiling method of running "perf top" while the test is 
running shows that a good deal of time is spent in "curlwp" (which 
resolves the currently running rump kernel thread).  In fact, curlwp is 
called ~40 times per sendto().  Since the current default implementation 
does three function calls per curlwp (x86_curlwp -> rumpuser_curlwp -> 
tls), that's 120 unnecessary function calls per a few thousand cycles. 
Well, you do the math on how cheap you'd expect that to be.  So per 
Justin's suggestion, let's use a curlwp implementation that 
conditionally inlines the lookup within a rump kernel using __thread.

cycles: 2737
send/s: 898800

I have this hacked up in a private tree.  I plan to eventually commit a 
version where you can conditionally enable the optimization.  It won't 
be on by default because 1) it's technically not legal, though it works 
just fine in userspace 2) it breaks native NetBSD x86 ABI compatibility, 
though that's almost never an issue.


        4: locks_up

An advantage of using a rump kernel is that you can run multiple 
networking stacks, one per core.  If you allow only one thread inside a 
rump kernel (RUMP_NCPU=1), atomic memory accesses and barriers for lock 
fastpaths can be optimized away.  This is done by compiling librump with 
locks_up.c instead of locks.c (see 
sys/rump/librump/rumpkern/Makefile.rump).

cycles: 2517
send/s: 972478


        5: ip checksum

Looking at "perf top" again, IP checksumming is starting to take a long 
time.  Let's make if_virt say that it can do checksum offloading (which, 
in the case of it being backed by e.g. netmap, is a more than reasonable 
assumption).  We need to add the following to the config script to 
enable the capabilities:
./rumpremote ifconfig pg0 ip4csum udp4csum

cycles: 2357
send/s: 1041834


        6: clang

And for final fun, let's see what kind of results we get with clang:
CC=clang ./buildrump.sh ...

cycles: 2399
send/s: 1020794

clang 3.2 is almost the same as gcc, just a little slower.


That was it for the simple optimizations.  There's probably room to 
squeeze more fat out, but those will have to wait for a later day.  In 
summary, seems like static linking and changing curlwp are the big 
winners.  Compiler "optimization" is the big loser.  With large sendto() 
payloads even my dated laptop CPU can saturate 10GigE.

   - antti

_______________________________________________
rumpkernel-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rumpkernel-users