On 10.06.2014 20:24, John-Mark Gurney wrote:
Alexander V. Chernikov wrote this message on Tue, Jun 10, 2014 at 13:17 +0400:
On 10.06.2014 07:03, Bryan Venteicher wrote:

----- Original Message -----
So, after finding out that nc has a stupidly small buffer size (2k
even though there is space for 16k), I was still not getting as good
as performance using nc between machines, so I decided to generate some
flame graphs to try to identify issues...  (Thanks to whoever included a
full set of modules, including dtraceall, on the memstick!)

So, the first one is:

As I was browsing around, the em_handle_que was consuming quite a bit
of cpu usage for only doing ~50MB/sec over gige..  Running top -SH shows
me that the taskqueue for em was consuming about 50% cpu...  Also pretty
high for only 50MB/sec...  Looking closer, you'll see that bpf_mtap is
consuming ~3.18% (under ether_nh_input)..  I know I'm not running tcpdump
or anything, but I think dhclient uses bpf to be able to inject packets
and listen in on them, so I kill off dhclient, and instantly, the
thread for em drops down to 40% CPU... (transfer rate only marginally
improves, if it does)

I decide to run another flame graph w/o dhclient running:

and now _rxeof drops from 17.22% to 11.94%, pretty significant...

So, if you care about performance, don't run dhclient...

Yes, I've noticed the same issue. It can absolutely kill performance
in a VM guest. It is much more pronounced on only some of my systems,
and I hadn't tracked it down yet. I wonder if this is fallout from
the callout work, or if there was some bpf change.

I've been using the kludgey workaround patch below.
Hm, pretty interesting.
dhclient is supposed to set up a proper filter (and it looks like it does):
13:10 [0] m@ptichko s netstat -B
   Pid  Netif   Flags      Recv      Drop     Match Sblen Hblen Command
  1224    em0 -ifs--l  41225922         0        11     0     0 dhclient
see "match" count.
And BPF itself adds the cost of a read rwlock (plus bpf_filter() calls for
each consumer on the interface).
It should not introduce significant performance penalties.
Don't forget that it has to process the returning ACKs... So, you're
looking at around 10k+ pps that you have to handle and pass through the
filter...  That's a lot of packets to process...

Well, it can still be captured with a proper filter like "ip && udp && port 67 or port 68". We're using tcpdump at high packet rates (>1Mpps) and it does not influence the process _much_. We should probably convert its rwlock to an rmlock and use per-CPU counters for statistics, but that's a different story.

Just for a bit more "double check", instead of using the HD as a
source, I used /dev/zero...   I ran a netstat -w 1 -I em0 when
running the test, and I was getting ~50.7MiB/s w/ dhclient running and
then I killed dhclient and it instantly jumped up to ~57.1MiB/s.. So I
launched dhclient again, and it dropped back to ~50MiB/s...
dhclient uses different BPF sockets for reading and writing (and it moves the write socket to a privileged child process via fork()). The problem we're facing is that dhclient does not set _any_ read filter on the write socket:
21:27 [0] zfscurr0# netstat -B
  Pid  Netif   Flags      Recv      Drop     Match Sblen Hblen Command
 1529    em0 --fs--l     86774     86769     86784  4044  3180 dhclient
--------------------------------------- ^^^^^ --------------------------
 1526    em0 -ifs--l     86789         0         1     0     0 dhclient

so all interface traffic is matched and queued on that descriptor, introducing contention on the BPF descriptor mutex.

(That's why I've asked for netstat -B output.)

Please try the attached patch to fix this. This is not the right way to fix it; we'd better change BPF behavior so that write-only consumers are not attached to interface readers at all. This has been partially implemented as the net.bpf.optimize_writers hack, but it does not work for all direct BPF consumers (those not using the pcap(3) API).

And some of this slowness is due to nc using small buffers, which I will
fix shortly..

And with witness disabled it goes from 58MiB/s to 65.7MiB/s..  In
both cases, that's a 13% performance improvement by running w/o dhclient...

This is using the latest memstick image, r266655 on a (Lenovo T61):
FreeBSD 11.0-CURRENT #0 r266655: Sun May 25 18:55:02 UTC 2014
     r...@grind.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64
FreeBSD clang version 3.4.1 (tags/RELEASE_34/dot1-final 208032) 20140512
WARNING: WITNESS option enabled, expect reduced performance.
CPU: Intel(R) Core(TM)2 Duo CPU     T7300  @ 2.00GHz (1995.05-MHz K8-class CPU)
   Origin="GenuineIntel"  Id=0x6fb  Family=0x6  Model=0xf  Stepping=11
   AMD Features=0x20100800<SYSCALL,NX,LM>
   AMD Features2=0x1<LAHF>
   TSC: P-state invariant, performance statistics
real memory  = 2147483648 (2048 MB)
avail memory = 2014019584 (1920 MB)

Index: sbin/dhclient/bpf.c
===================================================================
--- sbin/dhclient/bpf.c	(revision 266306)
+++ sbin/dhclient/bpf.c	(working copy)
@@ -131,6 +131,11 @@ struct bpf_insn dhcp_bpf_wfilter[] = {
 int dhcp_bpf_wfilter_len = sizeof(dhcp_bpf_wfilter) / sizeof(struct bpf_insn);
+
+struct bpf_insn dhcp_bpf_dfilter[] = {
+	BPF_STMT(BPF_RET+BPF_K, 0),	/* deny all: accept zero bytes */
+};
+int dhcp_bpf_dfilter_len = sizeof(dhcp_bpf_dfilter) / sizeof(struct bpf_insn);
@@ -160,6 +165,12 @@ if_register_send(struct interface_info *info)
 	if (ioctl(info->wfdesc, BIOCSETWF, &p) < 0)
 		error("Can't install write filter program: %m");
+
+	/* Set deny-all read filter for write socket */
+	p.bf_len = dhcp_bpf_dfilter_len;
+	p.bf_insns = dhcp_bpf_dfilter;
+	if (ioctl(info->wfdesc, BIOCSETFNR, &p) < 0)
+		error("Can't install read filter program: %m");
 	if (ioctl(info->wfdesc, BIOCLOCK, NULL) < 0)
 		error("Cannot lock bpf");
freebsd-current@freebsd.org mailing list
