On Thu, Aug 8, 2013 at 9:35 PM, John Jasen <jja...@realityfailure.org> wrote:
> You may want to test jumbo frames, just to see what would happen. I
> would expect you to see closer to 10 Gb/s with the same number of
> interrupts.

Results for jumbo frames are below (spoiler: 10 Gbps, same number of
interrupts, 40% CPU0 usage).

> On 08/08/2013 08:26 PM, Maxim Khitrov wrote:
>> Active Processor Cores: All
>
> I would turn that off, or at least make it only dual core.

No effect; results are also below.

>> That's... a bit faster. The CPU in the desktops is Intel i7-3770,
>> which is very similar to the Xeon E3-1275v2. Is this a FreeBSD vs
>> OpenBSD difference?
>
> Could be. It might be worth testing FreeBSD on your packet forwarding
> boxes, just to see if you get similar results.

I installed FreeBSD on a USB flash drive, booted the backup firewall
from that, and ran iperf -c 127.0.0.1 -t 60:

[  3]  0.0-60.0 sec   373 GBytes  53.4 Gbits/sec

Almost the same as the desktops, so this performance boost is due to
FreeBSD (which keeps all cores at 70% load) and not the hardware.
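
(The per-core load figure comes from watching something like the
following while the test runs:)

# FreeBSD box: top -SP    (per-CPU usage, including interrupt/system time)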

Now for jumbo frames:

# s1: iperf -s
# c1: iperf -c s1 -t 60 -m
[  3]  0.0-60.0 sec  69.1 GBytes  9.89 Gbits/sec
[  3] MSS size 8192 bytes (MTU 8232 bytes, unknown interface)

With the MTU set to 9000 along the entire path, a single client can
max out the 10-gigabit link through the firewall. This also settles
the question of PCIe bandwidth: it's not an issue. I just had to
double kern.ipc.nmbjumbo9 to 12800 on all FreeBSD hosts before I
could enable jumbo frames (otherwise I got "ix0: Could not setup
receive structures").
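
For reference, the jumbo frame setup on the FreeBSD hosts amounted to
roughly the following (ix0 and the 12800 value are from my setup;
adjust for your driver and defaults):

# double the 9k jumbo cluster limit, then raise the MTU
sysctl kern.ipc.nmbjumbo9=12800
ifconfig ix0 mtu 9000
# persistent equivalents: kern.ipc.nmbjumbo9 in /etc/sysctl.conf,
# "mtu 9000" appended to the ifconfig_ix0 line in /etc/rc.conf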

Both clients together:

# s1: iperf -s
# s2: iperf -s
# c1: nc gw 1234 ; iperf -c s1 -t 60
# c2: nc gw 1234 ; iperf -c s2 -t 60
[  3]  0.0-60.0 sec  34.6 GBytes  4.95 Gbits/sec
[  3]  0.0-60.0 sec  34.5 GBytes  4.94 Gbits/sec

During all of these tests, systat shows 8k interrupts on each
interface, and CPU0 usage is 40% interrupt, 60% idle.
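
(For anyone reproducing this, the interrupt and CPU numbers come from
watching something like the following on the OpenBSD firewall:)

# fw: systat vmstat 1    (per-device interrupt rates, CPU int/idle split)
# fw: vmstat -i          (cumulative interrupt counts per device)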

Going back to 1500 MTU, disabling Hardware Prefetcher and Adjacent
Cache Line Prefetch in BIOS has no effect:

# c1->s1
[  3]  0.0-60.0 sec  29.5 GBytes  4.22 Gbits/sec

# c1->s1, c2->s2
[  3]  0.0-60.0 sec  14.8 GBytes  2.12 Gbits/sec
[  3]  0.0-60.0 sec  15.7 GBytes  2.25 Gbits/sec

Same goes for disabling two of the cores:

# c1->s1
[  3]  0.0-60.0 sec  30.7 GBytes  4.39 Gbits/sec

# c1->s1, c2->s2
[  3]  0.0-60.0 sec  15.2 GBytes  2.18 Gbits/sec
[  3]  0.0-60.0 sec  15.2 GBytes  2.17 Gbits/sec

Same with bsd.sp kernel and all but one of the cores disabled:

# c1->s1
[  3]  0.0-60.0 sec  31.3 GBytes  4.48 Gbits/sec

# c1->s1, c2->s2
[  3]  0.0-60.0 sec  15.0 GBytes  2.15 Gbits/sec
[  3]  0.0-60.0 sec  16.1 GBytes  2.30 Gbits/sec

Finally, I went back to all cores enabled, bsd.mp kernel, Hardware
Prefetcher and Adjacent Cache Line Prefetch enabled:

# c1->s1
[  3]  0.0-60.0 sec  30.9 GBytes  4.43 Gbits/sec

# c1->s1, c2->s2
[  3]  0.0-60.0 sec  16.8 GBytes  2.40 Gbits/sec
[  3]  0.0-60.0 sec  14.0 GBytes  2.00 Gbits/sec

As you can see, none of these tweaks had any measurable impact. The
firewall can only handle so many packets per second. To push more
packets through, I need to reduce the per-packet processing overhead.
Here's a simple illustration of this fact using just the c1->s1 test:

# pf disabled (set skip on {ix0, ix1}):
[  3]  0.0-60.0 sec  37.4 GBytes  5.35 Gbits/sec

# pf enabled, no state on ix0:
[  3]  0.0-60.1 sec  8.28 GBytes  1.18 Gbits/sec

# pf enabled, keep state:
[  3]  0.0-60.0 sec  30.8 GBytes  4.41 Gbits/sec

# pf enabled, keep state (sloppy):
[  3]  0.0-60.0 sec  31.2 GBytes  4.46 Gbits/sec

# pf enabled, modulate state:
[  3]  0.0-60.0 sec  28.3 GBytes  4.05 Gbits/sec

# pf enabled, modulate state scrub (random-id reassemble tcp):
[  3]  0.0-60.0 sec  25.8 GBytes  3.69 Gbits/sec

The interesting thing about the last test is that systat shows double
the number of interrupts (32k total, 16k per interface), and CPU0 is
about 5% idle instead of the usual 10%. The rest is self-evident:
more work per packet means lower throughput. This is also further
confirmation that the sloppy state tracker offers no performance
benefit.
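
For completeness, the rule variants behind those pf tests look
roughly like this (a simplified sketch, not the full ruleset):

# pf effectively disabled on the test interfaces:
set skip on { ix0, ix1 }

# stateless on ix0:
pass in on ix0 no state

# stateful variants, one per test:
pass in on ix0 keep state
pass in on ix0 keep state (sloppy)
pass in on ix0 modulate state

# the last test, with scrub options added (4.6+ inline scrub syntax):
pass in on ix0 modulate state scrub (random-id reassemble tcp)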

Unless someone has other ideas on how to reduce the per-packet
processing time, I think ~4.5 Gbps is the most that my hardware can
handle at the default MTU. A bit disappointing, but it's the fastest
CPU I could get from Lanner, and it's also my first step beyond 1
gigabit.

If OpenBSD starts using multiple cores for interrupt processing in the
future, 10+ Gbps should be easy to achieve. FreeBSD is an option if
performance is critical, but for now I'd rather have all the 4.6+ pf
improvements.
