On Tue, Mar 11, 2014 at 3:05 PM, Antti Kantee <[email protected]> wrote:
> [if you want to look at the numbers, skip a few paragraphs]
>
> So with the pktgenif tool I started looking at how long a
> single-threaded UDP sendto() takes, and how to bring that figure down.
> I'm documenting the steps in this mail so that everyone can repeat them
> and/or suggest further improvements.
>
> Note that this is only step 1: single-core send, which is mostly about
> getting the base cost of processing a packet low while ignoring
> scheduling.  Part 2 will be receiving, which, unlike send, includes at
> least two threads: one pulling packets off of the "NIC" and stuffing
> them into the right place, and one actually reading them in the
> application capacity.  Part 3 is most likely something about
> multithreaded performance, though I still maintain that it's easier and
> more performant to structure things so that you just run multiple
> disjoint networking stacks.
>
> One thing to note is that since pktgenif is a "dummy" interface, there
> is no actual I/O happening.  That can be visited later, though I don't
> think it will make much of a difference, since e.g. netmap is supposed
> to be able to transmit a packet in ~60 cycles, and I'm sure other high
> performance userspace packet shovellers that rump kernels already
> integrate with, such as DPDK, sport similar figures.  It takes _a lot_
> more CPU cycles to prepare the packets than to actually send them, as we
> will see shortly.  Logically, this makes sense too, as most of the
> shovelling work is done off the CPU, while the preparation is done on
> the CPU (at least when running the rump kernel on the CPU, like I am
> doing here).
>
> The measurements were done on my laptop, which has a 2.53GHz Core2Duo
> T9400.  Yes, that CPU was released >5 years ago.  Newer CPUs should have
> better performance (or at least I'd hope so).  The measurements were
> done by running the tool in the rump-pktgenif repository as "./testtool
> -r config.sh -c 1000000 send", where config.sh is config.sh.example with
> the directory appropriately edited.  This results in a million 22-byte
> UDP payloads being sent as 64-byte ethernet frames (22+8+20+14 = 64).
> The variable RUMP_NCPU was set to "1" (since we don't need more in this
> case).
>
> I present both the sends/second numbers output by testtool, as well as
> the cycle count as obtained from the "rdtsc" instruction executed on
> both sides of sendto.  The math between them works out (more or less
> ...), so use whichever figure you perceive to be nicer.  Counting cycles
> affects the send/s figure a bit, since cycle counting is done in every
> loop iteration while send/s is an aggregate timestamp calculation, but
> they're close enough for this rough purpose.
>
> The following numbers are cumulative, so improvement n includes
> everything between improvement [1,n-1] (unless otherwise stated, usually
> because the optimization ended up being slower).
>
>
>         0: baseline
>
> This is just a normal build with the current "./buildrump.sh".
>
> cycles: 4753
> send/s: 517827
>
>
>         1: compiler optimizations
>
> buildrump.sh with BUILDRUMP_DBG='-O2 -march=native -mtune=native -g'.
>
> cycles: 4994
> send/s: 496492
>
> Yes, "better" compiler optimizations make it 5% slower on my system with
> gcc 4.8.1.  We'll not use this "optimization" in the next step.  I did
> try this occasionally with the later steps, but it was always slower.
> Before everyone goes screaming "cache footprint" as the reason, I'd like
> to note that testing with -Os resulted in way less performance.
>
>
>         2: static linking (!PIC)
>
> Dynamically linked code is slower than statically linked code (due to
> the extra calculations position independence requires), so we use static
> rump kernel components.  The easiest way to accomplish this is to run
> buildrump.sh with -V MKPIC=no, which prevents PIC libs from being
> created (some day I'll fix "rumpmake LDSTATIC=-static" to work ...).
>
> cycles: 3209
> send/s: 763705
>
> Usually the quoted performance difference between PIC and !PIC is ~20%,
> but here it's much greater; I'm not completely sure why.  One explanation
> is the next optimization:
>
>
>         3: curlwp
>
> The simple profiling method of running "perf top" while the test is
> running shows that a good deal of time is spent in "curlwp" (which
> resolves the currently running rump kernel thread).  In fact, curlwp is
> called ~40 times per sendto().  Since the current default implementation
> does three function calls per curlwp lookup (x86_curlwp ->
> rumpuser_curlwp -> TLS), that's ~120 unnecessary function calls per a
> few thousand cycles.  You do the math on how cheap you'd expect that to
> be.  So, per Justin's suggestion, let's use an implementation that
> conditionally inlines curlwp within the rump kernel using __thread.
>
> cycles: 2737
> send/s: 898800
>
> I have this hacked up in a private tree.  I plan to eventually commit a
> version where you can conditionally enable the optimization.  It won't
> be on by default because 1) it's technically not legal, though it works
> just fine in userspace 2) it breaks native NetBSD x86 ABI compatibility,
> though that's almost never an issue.
>
>
>         4: locks_up
>
> An advantage of using a rump kernel is that you can run multiple
> networking stacks, one per core.  If you allow only one thread inside a
> rump kernel (RUMP_NCPU=1), atomic memory accesses and barriers for lock
> fastpaths can be optimized away.  This is done by compiling librump with
> locks_up.c instead of locks.c (see sys/rump/librump/rumpkern/Makefile.rump)
>
> cycles: 2517
> send/s: 972478
>
>
>         5: ip checksum
>
> Looking at "perf top" again, IP checksumming is starting to take a long
> time.  Let's make if_virt claim that it can do checksum offloading
> (a more than reasonable assumption when it is backed by e.g. netmap).
> We need to add the following to the config script to enable the
> capabilities:
> ./rumpremote ifconfig pg0 ip4csum udp4csum
>
> cycles: 2357
> send/s: 1041834
>
>
>         6: clang
>
> And for final fun, let's see what kind of results we get with clang:
> CC=clang ./buildrump.sh ...
>
> cycles: 2399
> send/s: 1020794
>
> clang 3.2 is almost the same as gcc, just a little slower.
>
>
> That was it for the simple optimizations.  There's probably room to
> squeeze more fat out, but those will have to wait for a later day.  In
> summary, seems like static linking and changing curlwp are the big
> winners.  Compiler "optimization" is the big loser.  With large sendto()
> payloads even my dated laptop CPU can saturate 10GigE.


Cool, will test on some other machines and see how they compare.

Is there a way to run the driver (or something similar) under native
NetBSD?  I'm still curious how it compares.

A buildrump.sh option to compile locks_up would be useful, I guess, or a
runtime switch.

Justin

_______________________________________________
rumpkernel-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rumpkernel-users
