> >>So, after finding out that nc has a stupidly small buffer size (2k
> >>even though there is space for 16k), I was still not getting as good
> >>performance using nc between machines, so I decided to generate some
> >>flame graphs to try to identify issues...  (Thanks to whoever included a
> >>full set of modules, including dtraceall on memstick!)
> >>
> >>So, the first one is:
> >>
> >>
> >>As I was browsing around, the em_handle_que was consuming quite a bit
> >>of cpu usage for only doing ~50MB/sec over gige..  Running top -SH shows
> >>me that the taskqueue for em was consuming about 50% cpu...  Also pretty
> >>high for only 50MB/sec...  Looking closer, you'll see that bpf_mtap is
> >>consuming ~3.18% (under ether_nh_input)..  I know I'm not running tcpdump
> >>or anything, but I think dhclient uses bpf to be able to inject packets
> >>and listen in on them, so I kill off dhclient, and instantly, the
> >>taskqueue thread for em drops down to 40% CPU... (transfer rate only
> >>marginally improves, if it does)
> >>
> >>I decide to run another flame graph w/o dhclient running:
> >>
> >>
> >>and now _rxeof drops from 17.22% to 11.94%, pretty significant...
> >>
> >>So, if you care about performance, don't run dhclient...
> >>
> >Yes, I've noticed the same issue. It can absolutely kill performance
> >in a VM guest. It is much more pronounced on only some of my systems,
> >and I hadn't tracked it down yet. I wonder if this is fallout from
> >the callout work, or if there was some bpf change.
> >
> >I've been using the kludgey workaround patch below.
> Hm, pretty interesting.
> dhclient should set up a proper filter (and it looks like it does so:
> 13:10 [0] m@ptichko s netstat -B
>   Pid  Netif   Flags      Recv      Drop     Match Sblen Hblen Command
>  1224    em0 -ifs--l  41225922         0        11     0     0 dhclient
> )
> see "match" count.
> And BPF itself adds the cost of a read rwlock (+ bpf_filter() calls for
> each consumer on interface).
> It should not introduce significant performance penalties.

Don't forget that it has to process the returning ACKs... So, you're
looking at around 10k+ pps that you have to handle and pass through the
filter...  That's a lot of packets to process...
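
As a rough sanity check on that packet rate, here is a back-of-the-envelope
sketch. The 1448-byte segment size and the one-ACK-per-two-segments (delayed
ACK) ratio are my assumptions, not something I measured, and I'm treating
50MB/sec as 50e6 bytes/sec:

```python
# Rough estimate of the ACK rate for a steady bulk TCP transfer.
# Assumptions (not measured): 1448-byte segments (1500 MTU with TCP
# timestamps) and one delayed ACK for every two data segments.

def ack_rate(bytes_per_sec, mss=1448, segments_per_ack=2):
    """Return (data segments/sec, ACKs/sec) for the given throughput."""
    segments = bytes_per_sec / mss
    return segments, segments / segments_per_ack

segments, acks = ack_rate(50e6)
print(f"{segments:.0f} data segments/s, {acks:.0f} ACKs/s")
```

Under those assumptions the ACK stream alone is on the order of 17k pps,
which is in line with the "10k+" figure, and every one of those packets is
run through the filter.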

Just for a bit more "double check", instead of using the HD as a
source, I used /dev/zero...   I ran a netstat -w 1 -I em0 when
running the test, and I was getting ~50.7MiB/s w/ dhclient running and
then I killed dhclient and it instantly jumped up to ~57.1MiB/s.. So I
launched dhclient again, and it dropped back to ~50MiB/s...

And some of this slowness is due to nc using small buffers, which I will
fix shortly..

And with witness disabled it goes from 58MiB/s to 65.7MiB/s..  In
both cases, that's a 13% performance improvement by running w/o
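
The ~13% figure can be checked from the numbers quoted above with a few
lines (the helper name pct_gain is just something I made up for this):

```python
# Relative throughput gain, computed from the MiB/s figures quoted above.

def pct_gain(before, after):
    """Percentage improvement going from `before` to `after`."""
    return (after - before) / before * 100.0

dhclient_gain = pct_gain(50.7, 57.1)  # with dhclient -> without dhclient
witness_gain = pct_gain(58.0, 65.7)   # with witness  -> without witness

print(f"dhclient: {dhclient_gain:.1f}%  witness: {witness_gain:.1f}%")
```

Both come out between 12.5% and 13.5%, so "13%" is a fair rounding for
either case.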

This is using the latest memstick image, r266655 on a (Lenovo T61):
FreeBSD 11.0-CURRENT #0 r266655: Sun May 25 18:55:02 UTC 2014 amd64
FreeBSD clang version 3.4.1 (tags/RELEASE_34/dot1-final 208032) 20140512
WARNING: WITNESS option enabled, expect reduced performance.
CPU: Intel(R) Core(TM)2 Duo CPU     T7300  @ 2.00GHz (1995.05-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x6fb  Family=0x6  Model=0xf  Stepping=11
  AMD Features=0x20100800<SYSCALL,NX,LM>
  AMD Features2=0x1<LAHF>
  TSC: P-state invariant, performance statistics
real memory  = 2147483648 (2048 MB)
avail memory = 2014019584 (1920 MB)

