Hi,
When testing single-stream 10GbE netperf receive performance on both
S10U4 and S11b84, I found an odd performance quirk. As I increased
the receiver's socket buffer size, performance declined as the buffer
size approached 512KB, and dropped sharply above 512KB. I also
noticed that mpstat reported an increasing number of xcalls as the
socket buffer size grew.
For example, the output below is taken from b84 on an ancient
dual-core athlon64. The 10GbE driver is interrupting on CPU0, and the
netserver is bound to CPU1. The sender is a linux host, and the MSS
on the connection is 1460. Notice how the throughput drops from
5.7 Gb/s to 2.2 Gb/s as I increase the socket buffer size, and how
the number of xcalls climbs along with it.
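Each run looked something like the following, varying the receive
socket buffer size via -S (-m sets the send message size, and -c/-C
request the local/remote CPU utilization and service demand that make
up the last four columns of the netperf output; the throughput column
is 10^6 bits/s):

$ netperf -t TCP_STREAM -H <receiver> -l 10 -c -C -- -S 128K -m 64K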
128K netperf:
131320 65536 65536 10.01 4954.47 12.56 89.41 0.415 2.957
128K mpstat (typical 'mpstat 1' lines):
CPU minf mjf xcal  intr ithr   csw icsw migr smtx srw syscl usr sys wt idl
  0    0   0    0 11742  142 22886   24   16 1595   0 31942   2  78  0  20
  1    0   0    0  6218 6144   147    6   15 5035   0    22   0  88  0  12
192K netperf:
196712 65536 65536 10.01 5726.38 13.22 95.46 0.378 2.731
192K mpstat:
CPU minf mjf xcal  intr ithr   csw icsw migr smtx srw syscl usr sys wt idl
  0    0   0   31 11880  132 23314   66    8  449   0 28280   2  89  0   9
  1    0   0    0    68   18    33    4    7  852   0     2   0 100  0   0
384K netperf:
393424 65536 65536 10.01 3683.07 8.02 97.61 0.357 4.342
384K mpstat:
CPU minf mjf xcal  intr ithr   csw icsw migr smtx srw syscl usr sys wt idl
  0    0   0 1653   702  133  1024   40   14 1144   0  6693   1  99  0   0
  1    0   0    0  1825   98   167    2   14 1191   0    89   0  79  0  21
768K netperf:
786848 65536 65536 10.03 2279.85 5.21 93.07 0.374 6.689
768K mpstat:
CPU minf mjf xcal  intr ithr   csw icsw migr smtx srw syscl usr sys wt idl
  0    0   0 3658   367  136    74   23    7  841   0  2420   0 100  0   0
  1    0   0    0  3913  139   208    2    9  884   0    46   0  64  0  36
If I use dtrace to track the xcalls, I see:
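(Tracked with something like this sysinfo-provider one-liner, which
prints a kernel stack each time the xcalls counter is bumped:)

# dtrace -n 'sysinfo:::xcalls { stack(); }'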
  0  51311                 xc_common:xcalls
              unix`xc_do_call+0xfd
              unix`xc_call+0x2b
              unix`hat_tlb_inval+0x1c2
              unix`x86pte_inval+0xa9
              unix`hat_pte_unmap+0xfc
              unix`hat_unload_callback+0x148
              unix`hat_unload+0x41
              unix`segkmem_free_vn+0x73
              unix`segkmem_free+0x23
              genunix`vmem_xfree+0x10c
              genunix`vmem_free+0x25
              genunix`kmem_free+0x47
              genunix`dblk_lastfree_oversize+0x65
              genunix`freeb+0x80
              genunix`struiocopyout+0x7d
              genunix`kstrgetmsg+0x8ed
              sockfs`sotpi_recvmsg+0x25a
              sockfs`recvit+0xbc
              sockfs`recv+0x7c
              sockfs`recv32+0x22
So the xcalls are coming from freeing oversized dblks.
DTrace says the oversize dblks are being allocated here:
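(Captured with something like this fbt one-liner:)

# dtrace -n 'fbt::allocb_oversize:entry { stack(); }'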
  0   9309             allocb_oversize:entry
              genunix`allocb+0x46
              genunix`allocb_tmpl+0x1e
              genunix`pullupmsg+0xa7
              genunix`kstrgetmsg+0x8dd
              sockfs`sotpi_recvmsg+0x25a
              sockfs`recvit+0xbc
              sockfs`recv+0x7c
              sockfs`recv32+0x22
              unix`sys_syscall32+0x101
If I set a breakpoint in kmdb (I could probably do it with dtrace,
but I'm not that smart) and look at the mblk chain, I see this:
[1]> allocb_oversize::bp
[1]> ::cont
kmdb: stop at allocb_oversize
kmdb: target stopped at:
allocb_oversize:        pushq  %rbp
[0]> ::stack
allocb_oversize(726d4, 0)
allocb_tmpl+0x1e(726d4, ffffff008a432840)
pullupmsg+0xa7(ffffff008a432840, ffffffffffffffff)
kstrgetmsg+0x8dd(ffffff008b97e5c0, ffffff000157cc80, ffffff000157ce10,
ffffff000157ccf0, ffffff000157ccec, ffffffffffffffff)
sockfs`sotpi_recvmsg+0x25a(ffffff0087dad030, ffffff000157ce40, ffffff000157ce10)
sockfs`recvit+0xbc(4, ffffff000157ce40, ffffff000157ce10, 0, 0, 0)
sockfs`recv+0x7c(4, 8144170, 20000, 0)
sockfs`recv32+0x22(4, 8144170, 20000, 0)
sys_syscall32+0x101()
[0]> ffffff008a432840::walk b_cont | ::mblk
ADDR FL TYPE LEN BLEN RPTR DBLK
ffffff008a432840 0 data 32 1520 ffffff0087ae05cc ffffff011b0d3400
ffffff008afb0f00 0 data 1460 1520 ffffff0089cca038 ffffff011b0d3480
<... 320 mblks deleted for brevity ...>
>
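For what it's worth, the allocation sizes are also available without
stopping the box; a dtrace sketch that quantizes allocb_oversize's
size argument:

# dtrace -n 'fbt::allocb_oversize:entry { @["oversize bytes"] = quantize(arg0); }'

The 0x726d4 (~460KB) first argument to allocb_oversize in the ::stack
output above is exactly that size.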
Looking at the source of kstrgetmsg(), there is a heuristic which
does a pullupmsg(mp, -1) if the first mblk in the chain is shorter
than mblk_pull_len (64 bytes by default). In this case, it is pulling
up about 460KB.
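The heuristic is easy to watch in action; a sketch that quantizes the
length of the leading mblk on every pullupmsg() call (this catches
all callers, not just kstrgetmsg, and assumes the kernel CTF for
mblk_t is available, which it normally is):

# dtrace -n 'fbt::pullupmsg:entry
{
        /* length of the leading fragment tested against mblk_pull_len */
        this->mp = (mblk_t *)arg0;
        @["first mblk bytes"] = quantize(this->mp->b_wptr - this->mp->b_rptr);
}'

On this workload the leading mblk is 32 bytes (see the ::walk output
above), which trips the 64-byte threshold and pulls up the whole
~460KB chain.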
If I disable the heuristic by setting mblk_pull_len to 0, the
performance drop with increasing socket buffer size goes away on
S10U4 (I haven't tried this on b84 yet).
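For reference, mblk_pull_len can be patched live with mdb -kw, or set
persistently with an /etc/system entry along these lines (takes
effect at the next reboot):

* disable the kstrgetmsg() pullup heuristic
set mblk_pull_len = 0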
Can somebody explain the "performance degradation from copyout
overhead" mentioned in the comment near the declaration of
mblk_pull_len in common/os/streamio.c? I thought copyout was just the
normal copy routine plus a trap handler, so it should cost about the
same as the copy done by the pullup.
Thanks,
Drew