On Thu, Aug 8, 2013 at 9:35 PM, John Jasen <jja...@realityfailure.org> wrote:
> You may want to test jumbo frames, just to see what would happen. I
> would expect you to see closer to 10 Gb/s with the same number of
> interrupts.
Results for jumbo frames are below (spoiler: 10 Gbps, the same number
of interrupts, 40% CPU0 usage).

> On 08/08/2013 08:26 PM, Maxim Khitrov wrote:
>> Active Processor Cores: All
>
> I would turn that off, or at least make it only dual core.

No effect; those results are also below.

>> That's... a bit faster. The CPU in the desktops is Intel i7-3770,
>> which is very similar to the Xeon E3-1275v2. Is this a FreeBSD vs
>> OpenBSD difference?
>
> Could be. It might be worth testing FreeBSD on your packet forwarding
> boxes, just to see if you get similar results.

I installed FreeBSD on a USB flash drive, booted the backup firewall
from it, and ran iperf -c 127.0.0.1 -t 60:

[ 3] 0.0-60.0 sec 373 GBytes 53.4 Gbits/sec

That is almost the same as the desktops, so the performance boost comes
from FreeBSD (which keeps all cores at about 70% load) and not from the
hardware.

Now for jumbo frames:

# s1: iperf -s
# c1: iperf -c s1 -t 60 -m
[ 3] 0.0-60.0 sec 69.1 GBytes 9.89 Gbits/sec
[ 3] MSS size 8192 bytes (MTU 8232 bytes, unknown interface)

With the MTU set to 9000 along the entire path, a single client can max
out the 10-gigabit link through the firewall. This also answers the
PCIe bandwidth question: not an issue. The only catch was that I had to
double kern.ipc.nmbjumbo9 to 12800 on all FreeBSD hosts before I could
enable jumbo frames (otherwise I got "ix0: Could not setup receive
structures"); a rough command sketch is further down.

Both clients together:

# s1: iperf -s
# s2: iperf -s
# c1: nc gw 1234 ; iperf -c s1 -t 60
# c2: nc gw 1234 ; iperf -c s2 -t 60
[ 3] 0.0-60.0 sec 34.6 GBytes 4.95 Gbits/sec
[ 3] 0.0-60.0 sec 34.5 GBytes 4.94 Gbits/sec

During all of these tests, systat shows 8k interrupts on each
interface, and CPU0 sits at 40% interrupt, 60% idle.

Going back to a 1500 MTU, disabling Hardware Prefetcher and Adjacent
Cache Line Prefetch in the BIOS has no effect:

# c1->s1
[ 3] 0.0-60.0 sec 29.5 GBytes 4.22 Gbits/sec
# c1->s1, c2->s2
[ 3] 0.0-60.0 sec 14.8 GBytes 2.12 Gbits/sec
[ 3] 0.0-60.0 sec 15.7 GBytes 2.25 Gbits/sec

Same goes for disabling two of the cores:

# c1->s1
[ 3] 0.0-60.0 sec 30.7 GBytes 4.39 Gbits/sec
# c1->s1, c2->s2
[ 3] 0.0-60.0 sec 15.2 GBytes 2.18 Gbits/sec
[ 3] 0.0-60.0 sec 15.2 GBytes 2.17 Gbits/sec

Same with the bsd.sp kernel and all but one of the cores disabled:

# c1->s1
[ 3] 0.0-60.0 sec 31.3 GBytes 4.48 Gbits/sec
# c1->s1, c2->s2
[ 3] 0.0-60.0 sec 15.0 GBytes 2.15 Gbits/sec
[ 3] 0.0-60.0 sec 16.1 GBytes 2.30 Gbits/sec

Finally, I went back to all cores enabled, the bsd.mp kernel, and
Hardware Prefetcher and Adjacent Cache Line Prefetch enabled:

# c1->s1
[ 3] 0.0-60.0 sec 30.9 GBytes 4.43 Gbits/sec
# c1->s1, c2->s2
[ 3] 0.0-60.0 sec 16.8 GBytes 2.40 Gbits/sec
[ 3] 0.0-60.0 sec 14.0 GBytes 2.00 Gbits/sec

As you can see, none of these tweaks had a measurable impact. The
firewall can only handle so many packets per second, so to push more
traffic through I need to reduce the per-packet processing overhead.
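For reference, since I mentioned the nmbjumbo9 change above: the
jumbo-frame setup boiled down to roughly the commands below. I'm
reconstructing this from memory, so treat it as a sketch rather than
exact configs (the persistent versions of these settings live in
/boot/loader.conf and rc.conf on FreeBSD and in /etc/hostname.ix* on
OpenBSD):

# FreeBSD hosts (s1, s2, c1, c2): allow enough 9k mbuf clusters first,
# otherwise ix(4) fails with "ix0: Could not setup receive structures"
sysctl kern.ipc.nmbjumbo9=12800    # or set it as a loader tunable
ifconfig ix0 mtu 9000

# OpenBSD firewall: raise the MTU on both 10G interfaces
ifconfig ix0 mtu 9000
ifconfig ix1 mtu 9000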
Here's a simple illustration of that per-packet overhead, using just
the c1->s1 test:

# pf disabled (set skip on {ix0, ix1}):
[ 3] 0.0-60.0 sec 37.4 GBytes 5.35 Gbits/sec
# pf enabled, no state on ix0:
[ 3] 0.0-60.1 sec 8.28 GBytes 1.18 Gbits/sec
# pf enabled, keep state:
[ 3] 0.0-60.0 sec 30.8 GBytes 4.41 Gbits/sec
# pf enabled, keep state (sloppy):
[ 3] 0.0-60.0 sec 31.2 GBytes 4.46 Gbits/sec
# pf enabled, modulate state:
[ 3] 0.0-60.0 sec 28.3 GBytes 4.05 Gbits/sec
# pf enabled, modulate state scrub (random-id reassemble tcp):
[ 3] 0.0-60.0 sec 25.8 GBytes 3.69 Gbits/sec

The interesting thing about the last test is that systat shows double
the number of interrupts (32k total, 16k per interface), and CPU0 is
about 5% idle instead of the usual 10%. The rest is self-evident: more
work per packet means lower throughput (and without a state entry,
every packet has to be run through the ruleset instead of matching on
a single state-table lookup). This is also further confirmation that
the sloppy state tracker has no performance benefit here.

Unless someone has other ideas for reducing the per-packet processing
time, I think ~4.5 Gbps is the most my hardware can handle at the
default MTU. A bit disappointing, but this is the fastest CPU I could
get from Lanner and also my first step beyond 1 gigabit. If OpenBSD
starts using multiple cores for interrupt processing in the future,
10+ Gbps should be easy to achieve. FreeBSD is an option if performance
is critical, but for now I'd rather have all the 4.6+ pf improvements.
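For anyone who wants to reproduce the pf comparison: the variants only
differ in the state/scrub options on the pass rules covering the 10G
interfaces. Paraphrased (this is not my actual ruleset; directions,
protocols and addresses are omitted), they look roughly like this:

# "pf disabled":
set skip on { ix0, ix1 }

# the filtering variants, swapped in one at a time:
pass on ix0 no state
pass on ix0 keep state
pass on ix0 keep state (sloppy)
pass on ix0 modulate state
pass on ix0 modulate state scrub (random-id reassemble tcp)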