So, after finding out that nc has a stupidly small buffer size (2k
even though there is space for 16k), I was still not getting good
performance using nc between machines, so I decided to generate some
flame graphs to try to identify issues...  (Thanks to whoever included a
full set of modules, including dtraceall, on memstick!)

So, the first one is:

As I was browsing around, em_handle_que was consuming quite a bit
of CPU for only doing ~50MB/sec over gige..  Running top -SH showed
me that the taskqueue for em was consuming about 50% CPU...  Also pretty
high for only 50MB/sec...  Looking closer, you'll see that bpf_mtap is
consuming ~3.18% (under ether_nh_input)..  I know I'm not running tcpdump
or anything, but I think dhclient uses bpf to be able to inject packets
and listen in on them, so I killed off dhclient, and instantly the
thread for em dropped down to 40% CPU... (transfer rate only marginally
improved, if at all)

I decided to run another flame graph w/o dhclient running:

and now _rxeof drops from 17.22% to 11.94%, pretty significant...

So, if you care about performance, don't run dhclient...
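
To put the two before/after figures above on the same footing, here is a rough sketch in Python that turns them into relative reductions. The function name is made up for illustration, and the comparison is only approximate: the flame graph percentages come from two separate sampling runs, and the top -SH figures are eyeballed.

```python
def relative_drop(before_pct, after_pct):
    """Fractional reduction when going from before_pct to after_pct."""
    return (before_pct - after_pct) / before_pct

# _rxeof share of samples with and without dhclient (figures from the post).
rxeof = relative_drop(17.22, 11.94)
# em taskqueue thread CPU from top -SH (approximate figures from the post).
taskq = relative_drop(50.0, 40.0)

print(f"_rxeof:       {rxeof:.1%} fewer samples")
print(f"em taskqueue: {taskq:.1%} less CPU")
```

That works out to roughly a 31% reduction in _rxeof samples and a 20% reduction in taskqueue CPU, for a transfer rate that barely moved.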

Yes, I've noticed the same issue. It can absolutely kill performance
in a VM guest. It is much more pronounced on only some of my systems,
and I hadn't tracked it down yet. I wonder if this is fallout from
the callout work, or if there was some bpf change.

I've been using the kludgey workaround patch below.

Hm, pretty interesting.
dhclient should set up a proper filter (and it looks like it does):
13:10 [0] m@ptichko s netstat -B
    Pid  Netif   Flags      Recv      Drop     Match Sblen Hblen Command
   1224    em0 -ifs--l  41225922         0        11     0     0 dhclient
See the "Match" count.
And BPF itself adds the cost of a read rwlock (+ a bpf_filter() call for
each consumer on the interface).
It should not introduce significant performance penalties.
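
As a rough illustration of what the "Match" count says, the sketch below parses the netstat -B row shown above and computes the fraction of received packets that the filter accepted. The helper names are hypothetical, and the parser assumes the nine-column layout seen in this output; other netstat versions may differ.

```python
# Output copied from the netstat -B run above.
NETSTAT_B = """\
    Pid  Netif   Flags      Recv      Drop     Match Sblen Hblen Command
   1224    em0 -ifs--l  41225922         0        11     0     0 dhclient
"""

def parse_netstat_b(text):
    """Parse `netstat -B` data rows into dicts, skipping the header."""
    rows = []
    for line in text.splitlines():
        fields = line.split(None, 8)
        # Data rows have nine columns and start with a numeric pid.
        if len(fields) != 9 or not fields[0].isdigit():
            continue
        rows.append({
            "pid": int(fields[0]),
            "netif": fields[1],
            "flags": fields[2],
            "recv": int(fields[3]),
            "drop": int(fields[4]),
            "match": int(fields[5]),
            "command": fields[8].strip(),
        })
    return rows

def match_ratio(row):
    """Fraction of packets seen by the descriptor that passed its filter."""
    return row["match"] / row["recv"] if row["recv"] else 0.0

rows = parse_netstat_b(NETSTAT_B)
for row in rows:
    print(f'{row["command"]} on {row["netif"]}: '
          f'{row["match"]} of {row["recv"]} packets matched '
          f'({match_ratio(row):.2e})')
```

With 11 matches out of about 41 million packets, nearly all of the per-packet cost for this consumer is the lock plus the filter run, not the copy into the capture buffer.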

It will be a bit before I'm able to capture that. Here's a Flamegraph from
earlier in the year showing an absurd amount of time spent in bpf_mtap():

Can you briefly describe the test setup?
(Actually I'm interested in the overall pps rate, the bpf filter used, and the match ratio.)

For example, for some random box at $work:
22:17 [0] m@sas1-fw1 netstat -I vlan802 -w1
            input      (vlan802)           output
   packets  errs idrops      bytes    packets  errs      bytes colls
    430418     0     0  337712454     396282     0  333207773     0
CPU:  0.4% user,  0.0% nice,  1.2% system, 15.9% interrupt, 82.5% idle

2:17 [0] sas1-fw1# tcpdump -i vlan802 -lnps0 icmp and host X.X.X.X
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vlan802, link-type EN10MB (Ethernet), capture size 65535 bytes
22:17:14.866085 IP X.X.X.X > Y.Y.Y.Y: ICMP echo request, id 6730, seq 1, length 64

22:17 [0] m@sas1-fw1 s netstat -B 2>/dev/null | grep tcpdump
98520 vlan802 ---s---  27979422         0        40     0     0 tcpdump

CPU:  0.9% user,  0.0% nice,  2.7% system, 17.6% interrupt, 78.8% idle
(Actually the load fluctuates between 14% and 20% due to bursty traffic, but I can't see much difference with tcpdump turned on or off.)
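
For comparison with other setups, the numbers above can be reduced to a packet rate, an average packet size, and a match ratio. This is only a sketch: the function name is made up, and it assumes the netstat -I row covers a one-second interval, as -w1 implies.

```python
def summarize(packets, nbytes, interval_s=1.0):
    """Rate and size summary for one `netstat -I <if> -w1` sample."""
    return {
        "pps": packets / interval_s,
        "avg_bytes": nbytes / packets,
        "gbit_per_s": nbytes * 8 / interval_s / 1e9,
    }

# Input and output columns from the vlan802 sample above.
inbound = summarize(430418, 337712454)
outbound = summarize(396282, 333207773)

# Match / Recv for the tcpdump descriptor from netstat -B above.
tcpdump_ratio = 40 / 27979422

for name, s in (("in", inbound), ("out", outbound)):
    print(f'{name:>3}: {s["pps"]:.0f} pps, '
          f'{s["avg_bytes"]:.0f} bytes/pkt, {s["gbit_per_s"]:.2f} Gbit/s')
print(f"tcpdump match ratio: {tcpdump_ratio:.2e}")
```

So this box is taking in roughly 430 kpps at about 2.7 Gbit/s with a filter that matches on the order of one packet per million.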


diff --git a/sys/net/bpf.c b/sys/net/bpf.c
index cb3ed27..9751986 100644
--- a/sys/net/bpf.c
+++ b/sys/net/bpf.c
@@ -2013,9 +2013,11 @@ bpf_gettime(struct bintime *bt, int tstype, struct mbuf *m)
                        return (BPF_TSTAMP_EXTERN);
+#if 0
        if (quality == BPF_TSTAMP_NORMAL)
bpf_gettime() is called IFF the packet filter matches some traffic.
Can you show your "netstat -B" output?
        return (quality);
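
The point about the timestamp only being taken on a match can be illustrated with a toy model. This is plain Python, not the kernel code: the Listener class, the tap() function and the packet dicts are all invented for the example, and it only mimics the general shape (every listener's filter runs on every packet; the clock is read at most once per packet, and only if some filter matches).

```python
import time

class Listener:
    """Toy stand-in for one bpf consumer attached to an interface."""
    def __init__(self, name, predicate):
        self.name = name
        self.predicate = predicate  # toy filter: packet -> bool
        self.recv = 0               # packets offered to the filter
        self.match = 0              # packets the filter accepted
        self.captured = []

def tap(listeners, packet, clock=time.monotonic):
    """Offer one packet to every listener; return the number of clock reads."""
    stamp = None
    clock_reads = 0
    for listener in listeners:
        listener.recv += 1
        if not listener.predicate(packet):
            continue                # no match: no timestamp, no copy
        listener.match += 1
        if stamp is None:           # read the clock lazily, once per packet
            stamp = clock()
            clock_reads += 1
        listener.captured.append((stamp, packet))
    return clock_reads

# A dhclient-like listener that only cares about UDP to port 68.
dhclient = Listener("dhclient",
                    lambda p: p["proto"] == "udp" and p["dport"] == 68)

traffic = [{"proto": "tcp", "dport": 5001}] * 1000
traffic.append({"proto": "udp", "dport": 68})

total_clock_reads = sum(tap([dhclient], pkt) for pkt in traffic)
print(dhclient.recv, dhclient.match, total_clock_reads)
```

In this model a bulk TCP transfer pays for the filter run on every packet but almost never for the timestamp, which is why the Recv and Match counters from netstat -B matter when deciding whether the timestamp path can be the bottleneck.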

