Hi,

While testing single-stream 10GbE netperf receive performance on both
S10U4 and S11 b84, I ran into an odd performance quirk.  As I
increased the receiver's socket buffer size, throughput fell off as
the buffer size approached 512KB and dropped sharply above 512KB.
mpstat also reported an increasing number of xcalls as the socket
buffer size grew.
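
In case anyone wants to reproduce the setup, the runs can be driven
roughly like this (the receiver hostname and netserver pid are
placeholders; the receive buffer is what gets varied per run, via
netperf's test-specific -S option):

# on the Solaris receiver: start netserver and bind it to CPU 1
netserver
pbind -b 1 <netserver-pid>

# on the Linux sender: 10 second TCP_STREAM run, measuring CPU at both
# ends (-c/-C); -S sets the remote (receive) socket buffer and -m the
# 65536-byte send size
netperf -H <receiver> -t TCP_STREAM -c -C -l 10 -- -m 65536 -S 131072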

For example, the output below is taken from b84 on an ancient
dual-core Athlon 64.  The 10GbE driver is interrupting on CPU0, and
the netserver is bound to CPU1.  The sender is a Linux host.  The MSS
on the connection is 1460.  Notice how the throughput drops from
5.7 Gb/s to 2.2 Gb/s as I increase the socket buffer size, and also
how the number of xcalls climbs:

netperf output with rx socket buffer size 128K:
131320  65536  65536    10.01      4954.47   12.56    89.41    0.415   2.957

Typical mpstat 1 output with rx buffer size 128K:
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0 11742  142 22886   24   16 1595    0 31942    2  78   0  20
  1    0   0    0  6218 6144  147    6   15 5035    0    22    0  88   0  12


192K netperf:
196712  65536  65536    10.01      5726.38   13.22    95.46    0.378   2.731 

192K mpstat:
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0   31 11880  132 23314   66    8  449    0 28280    2  89   0   9
  1    0   0    0    68   18   33    4    7  852    0     2    0 100   0   0


384K netperf:
393424  65536  65536    10.01      3683.07   8.02     97.61    0.357   4.342 

384K mpstat:
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0 1653   702  133 1024   40   14 1144    0  6693    1  99   0   0
  1    0   0    0  1825   98  167    2   14 1191    0    89    0  79   0  21

768K netperf:
786848  65536  65536    10.03      2279.85   5.21     93.07    0.374   6.689 

768K mpstat:
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0 3658   367  136   74   23    7  841    0  2420    0 100   0   0
  1    0   0    0  3913  139  208    2    9  884    0    46    0  64   0  36


If I use dtrace to track the xcalls, I see:

  0  51311                 xc_common:xcalls 
              unix`xc_do_call+0xfd
              unix`xc_call+0x2b
              unix`hat_tlb_inval+0x1c2
              unix`x86pte_inval+0xa9
              unix`hat_pte_unmap+0xfc
              unix`hat_unload_callback+0x148
              unix`hat_unload+0x41
              unix`segkmem_free_vn+0x73
              unix`segkmem_free+0x23
              genunix`vmem_xfree+0x10c
              genunix`vmem_free+0x25
              genunix`kmem_free+0x47
              genunix`dblk_lastfree_oversize+0x65
              genunix`freeb+0x80
              genunix`struiocopyout+0x7d
              genunix`kstrgetmsg+0x8ed
              sockfs`sotpi_recvmsg+0x25a
              sockfs`recvit+0xbc
              sockfs`recv+0x7c
              sockfs`recv32+0x22
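
For the record, stacks like the one above can be collected with
something along these lines (this assumes the sysinfo provider's
xcalls probe; the predicate just restricts it to the interrupting
CPU):

dtrace -n 'sysinfo:::xcalls /cpu == 0/ { @[stack()] = count(); }'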


So the xcalls are coming from freeing oversized dblks.
DTrace says the oversized dblks are being allocated here:

  0   9309            allocb_oversize:entry 
              genunix`allocb+0x46
              genunix`allocb_tmpl+0x1e
              genunix`pullupmsg+0xa7
              genunix`kstrgetmsg+0x8dd
              sockfs`sotpi_recvmsg+0x25a
              sockfs`recvit+0xbc
              sockfs`recv+0x7c
              sockfs`recv32+0x22
              unix`sys_syscall32+0x101

If I set a breakpoint (I could probably do it with DTrace, but I'm not
that smart) and look at the mblk chain, I see this:

[1]> allocb_oversize::bp
[1]> ::cont
kmdb: stop at allocb_oversize
kmdb: target stopped at:
allocb_oversize:pushq  %rbp
[0]> ::stack
allocb_oversize(726d4, 0)
allocb_tmpl+0x1e(726d4, ffffff008a432840)
pullupmsg+0xa7(ffffff008a432840, ffffffffffffffff)
kstrgetmsg+0x8dd(ffffff008b97e5c0, ffffff000157cc80, ffffff000157ce10, ffffff000157ccf0, ffffff000157ccec, ffffffffffffffff)
sockfs`sotpi_recvmsg+0x25a(ffffff0087dad030, ffffff000157ce40, ffffff000157ce10)
sockfs`recvit+0xbc(4, ffffff000157ce40, ffffff000157ce10, 0, 0, 0)
sockfs`recv+0x7c(4, 8144170, 20000, 0)
sockfs`recv32+0x22(4, 8144170, 20000, 0)
sys_syscall32+0x101()
[0]> ffffff008a432840::walk b_cont | ::mblk
            ADDR FL TYPE    LEN   BLEN              RPTR             DBLK
ffffff008a432840 0  data    32    1520  ffffff0087ae05cc ffffff011b0d3400
ffffff008afb0f00 0  data    1460  1520  ffffff0089cca038 ffffff011b0d3480
<... 320 mblks deleted for brevity ...>
>
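
In hindsight the breakpoint wasn't strictly necessary; since the
first argument to allocb_oversize() looks to be the requested size
(0x726d4 above is the ~460KB mentioned below), a one-liner along
these lines would show the allocation sizes directly:

dtrace -n 'fbt::allocb_oversize:entry { @ = quantize(arg0); }'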

Looking at the source of kstrgetmsg(), there is a heuristic which
does a pullupmsg(mp, -1) if the first mblk's length is less than
mblk_pull_len (64 bytes by default).  In this case it is pulling up
about 460KB into a single oversized dblk.

If I disable the pullup by setting mblk_pull_len to 0, the drop in
performance with larger socket buffer sizes goes away on S10U4 (I
haven't tried this on b84 yet).
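
For anyone who wants to try the same thing, the tunable can be
checked and zeroed roughly like this (I'm assuming it's a 32-bit int,
hence /D and /W; use /Z to write 8 bytes if it turns out to be a
long):

# check the current value (should print 64 here)
echo 'mblk_pull_len/D' | mdb -k

# zero it on the running system
echo 'mblk_pull_len/W 0' | mdb -kw

# or persistently, by adding this line to /etc/system and rebooting:
set mblk_pull_len = 0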

Can somebody explain the "performance degradation from copyout
overhead" mentioned in the comment near the declaration of
mblk_pull_len in common/os/streamio.c?  I thought copyout() was just
the normal copy routine plus a fault handler, so it should cost about
the same as the copy done by the pullup.

Thanks,

Drew