On 22 Jul 2014, at 20:30, Adrian Chadd wrote:

hi!

You can use 'pmcstat -S CPU_CLK_UNHALTED_CORE -O pmc.out' (then Ctrl-C it
after, say, 5 seconds), which will log the sample data to pmc.out; then run
'pmcannotate -k /boot/kernel pmc.out /boot/kernel/kernel' to find out where
the most CPU cycles are being spent.
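
As a minimal sketch, the same workflow scripted end to end (the 5-second
window and the annotated.txt output file are illustrative choices, not part
of the instructions above):

kldload hwpmc                     # fails harmlessly if already loaded
pmcstat -S CPU_CLK_UNHALTED_CORE -O pmc.out &
sleep 5
kill -INT $!                      # same effect as the Ctrl-C above
pmcannotate -k /boot/kernel pmc.out /boot/kernel/kernel > annotated.txt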


Chiming in late, but don't you mean instruction-retired instead of CPU_CLK_UNHALTED_CORE?

Best,
George


It should give us the location(s) inside the top CPU users.

Hopefully it'll then be much more obvious!

I'm glad you're digging into it!

-a



On 22 July 2014 12:21, John Jasen <jja...@gmail.com> wrote:
Navdeep;

I was struck by spending so much time in transmit as well.

Adrian's suggestion on mining the lock profiling data gave me an excuse to up the tx queues in /boot/loader.conf. Our prior conversations indicated that
up to 64 should be acceptable?





On 07/22/2014 03:10 PM, Adrian Chadd wrote:
Hi

Right. Time to figure out why you're spending so much time in
cxgbe_transmit() and t4_eth_tx(). Maybe ask Navdeep for some ideas?


-a

On 22 July 2014 12:07, John Jasen <jja...@gmail.com> wrote:
The first is pretty easy to turn around. Reading up on dtrace now. Thanks
for the pointers and help!

PMC: [CPU_CLK_UNHALTED_CORE] Samples: 142654 (100.0%), 124560 unresolved

%SAMP IMAGE      FUNCTION             CALLERS
 34.0 if_cxgbe.k t4_eth_tx            cxgbe_transmit:24.0 t4_tx_task:10.0
 28.8 if_cxgbe.k cxgbe_transmit
  7.6 if_cxgbe.k service_iq           t4_intr
  6.4 if_cxgbe.k get_scatter_segment  service_iq
  4.9 if_cxgbe.k reclaim_tx_descs     t4_eth_tx
  3.2 if_cxgbe.k write_sgl_to_txd     t4_eth_tx
  2.8 hwpmc.ko   pmclog_process_callc pmc_process_samples
  2.1 libc.so.7  bcopy                pmclog_read
  1.9 if_cxgbe.k t4_eth_rx            service_iq
  1.7 hwpmc.ko   pmclog_reserve       pmclog_process_callchain
  1.2 libpmc.so. pmclog_read
  1.0 if_cxgbe.k write_txpkts_wr      t4_eth_tx
  0.8 kernel     e1000_read_i2c_byte_ e1000_set_i2c_bb
  0.6 libc.so.7  memset
  0.5 if_cxgbe.k refill_fl            service_iq




On 07/22/2014 02:45 PM, Adrian Chadd wrote:
Hi,

Well, start with CPU profiling with pmc:

kldload hwpmc
pmcstat -S CPU_CLK_UNHALTED_CORE -T -w 1

.. look at the FreeBSD dtrace one-liners (google that) for lock
contention and CPU usage.
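
For example (hedged sketches of two commonly used one-liners, not
necessarily the exact ones meant above): kernel CPU profiling and
adaptive-mutex block time, each self-terminating after 10 seconds:

dtrace -x stackframes=100 -n 'profile-997 /arg0/ { @[stack()] = count(); } tick-10s { exit(0); }'
dtrace -n 'lockstat:::adaptive-block { @[stack()] = sum(arg1); } tick-10s { exit(0); }'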

You could compile with LOCK_PROFILING (which will slow things down a little
even when not in use), then enable it for a few seconds (which will
definitely slow things down) to gather fine-grained lock contention data.

(sysctl debug.lock.prof when you have it compiled in. sysctl
debug.lock.prof.enable=1; wait 10 seconds; sysctl
debug.lock.prof.enable=0; sysctl debug.lock.prof.stats)
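
Spelled out as a sketch (assuming the kernel was rebuilt with "options
LOCK_PROFILING"; the 10-second window and the output file are arbitrary):

sysctl debug.lock.prof.reset=1       # clear any previous counters
sysctl debug.lock.prof.enable=1
sleep 10
sysctl debug.lock.prof.enable=0
sysctl debug.lock.prof.stats > lockprof.txt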


-a


On 22 July 2014 11:42, John Jasen <jja...@gmail.com> wrote:
If you have ideas on what to instrument, I'll be happy to do it.

I'm faintly familiar with dtrace, et al, so it might take me a few tries
to get it right -- or bludgeoning with the documentation.

Thanks!




On 07/22/2014 02:07 PM, Adrian Chadd wrote:
Hi!

Well, what's missing is some dtrace/pmc/lock-debugging investigation
into the system to see where it's currently maxing out.

I wonder if you're seeing contention on the transmit paths as drivers
queue frames from one set of driver threads/queues to another
potentially completely different set of driver transmit
threads/queues.




-a


On 22 July 2014 08:18, John Jasen <jja...@gmail.com> wrote:
Feedback and/or tips and tricks more than welcome.

Outstanding questions:

Would increasing the number of processor cores help?

Would a system where both processor QPI ports connect to each other
mitigate QPI bottlenecks?

Are there further performance optimizations I am missing?

Server Description:

The system in question is a Dell Poweredge R820, 16GB of RAM, and two
Intel(R) Xeon(R) CPU E5-4610 0 @ 2.40GHz.

Onboard, in a 16x PCIe slot, I have one Chelsio T-580-CR two-port 40GbE
NIC, and in an 8x slot, another T-580-CR dual port.

I am running FreeBSD 10.0-STABLE.

BIOS tweaks:

Hyperthreading (or Logical Processors) is turned off.
Memory Node Interleaving is turned off, though toggling it did not appear to
impact performance either way.

/boot/loader.conf contents:
#for CARP+PF testing
carp_load="YES"
#load cxgbe drivers.
cxgbe_load="YES"
#maxthreads appears not to exceed the number of CPUs.
net.isr.maxthreads=12
#bindthreads may be indicated when using cpuset(1) on interrupts
net.isr.bindthreads=1
#random guess based on googling
net.isr.maxqlimit=60480
net.link.ifqmaxlen=90000
#discussions with cxgbe maintainer and list led me to try this. Allows more
#interrupts to be fixed to CPUs, which in some cases improves interrupt balancing.
hw.cxgbe.ntxq10g=16
hw.cxgbe.nrxq10g=16

/etc/sysctl.conf contents:

#the following is also enabled by rc.conf gateway_enable.
net.inet.ip.fastforwarding=1
#recommendations from BSD router project
kern.random.sys.harvest.ethernet=0
kern.random.sys.harvest.point_to_point=0
kern.random.sys.harvest.interrupt=0
#probably should be removed, as cxgbe does not seem to affect/be
#affected by irq storm settings
hw.intr_storm_threshold=25000000
#based on Calomel.org performance suggestions. 4x40GbE, so it seemed
#reasonable to use 100GbE settings
kern.ipc.maxsockbuf=1258291200
net.inet.tcp.recvbuf_max=1258291200
net.inet.tcp.sendbuf_max=1258291200
#attempting to tune the ULE scheduler to favor serving packets over
#userland processes such as netstat
kern.sched.slice=1
kern.sched.interact=1

/etc/rc.conf contains:

hostname="fbge1"
#should remove, especially given below duplicate entry
ifconfig_igb0="DHCP"
sshd_enable="YES"
#ntpd_enable="YES"
# Set dumpdev to "AUTO" to enable crash dumps, "NO" to disable
dumpdev="AUTO"
# OpenBSD PF options to play with later. very bad for raw packet rates.
#pf_enable="YES"
#pflog_enable="YES"
# enable packet forwarding
# these enable forwarding and fastforwarding sysctls. inet6 does not
# have fastforward
gateway_enable="YES"
ipv6_gateway_enable="YES"
# enable OpenBSD ftp-proxy
# should comment out until actively playing with PF
ftpproxy_enable="YES"
#left in place, commented out from prior testing
#ifconfig_mlxen1="inet 172.16.2.1 netmask 255.255.255.0 mtu 9000"
#ifconfig_mlxen0="inet 172.16.1.1 netmask 255.255.255.0 mtu 9000"
#ifconfig_mlxen3="inet 172.16.7.1 netmask 255.255.255.0 mtu 9000"
#ifconfig_mlxen2="inet 172.16.8.1 netmask 255.255.255.0 mtu 9000"
# -lro and -tso options added per mailing list suggestion from Bjoern A.
# Zeeb (bzeeb-lists at lists.zabbadoz.net)
ifconfig_cxl0="inet 172.16.3.1 netmask 255.255.255.0 mtu 9000 -lro -tso up"
ifconfig_cxl1="inet 172.16.4.1 netmask 255.255.255.0 mtu 9000 -lro -tso up"
ifconfig_cxl2="inet 172.16.5.1 netmask 255.255.255.0 mtu 9000 -lro -tso up"
ifconfig_cxl3="inet 172.16.6.1 netmask 255.255.255.0 mtu 9000 -lro -tso up"
# aliases instead of reconfiguring test clients. See the commented-out
# entries above
ifconfig_cxl0_alias0="172.16.7.1 netmask 255.255.255.0"
ifconfig_cxl1_alias0="172.16.8.1 netmask 255.255.255.0"
ifconfig_cxl2_alias0="172.16.1.1 netmask 255.255.255.0"
ifconfig_cxl3_alias0="172.16.2.1 netmask 255.255.255.0"
# for remote monitoring/admin of the test device
ifconfig_igb0="inet 172.30.60.60 netmask 255.255.0.0"

Additional configurations:
cpuset-chelsio-6cpu-high
#!/usr/local/bin/bash
# Original provided by Navdeep Parhar <npar...@gmail.com>
# Takes the vmstat -ai output, builds a list of the Chelsio interrupts, and
# assigns them in order to the available CPU cores.
# Modified: to assign only to the 'high' CPUs, i.e. CPUs 6-11.
# See: http://lists.freebsd.org/pipermail/freebsd-net/2014-July/039317.html
ncpu=12
# "irqNNN: t5nex0:..."  ->  "NNN"
irqlist=$(vmstat -ia | egrep 't4nex|t5nex|cxgbc' | cut -f1 -d: | cut -c4-)
i=6
for irq in $irqlist; do
     cpuset -l $i -x $irq
     i=$((i+1))
     [ $i -ge $ncpu ] && i=6
done
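
A hedged sanity check after running the script: each Chelsio IRQ should now
report a single-CPU mask, which cpuset -g can confirm:

for irq in $(vmstat -ia | egrep 't4nex|t5nex|cxgbc' | cut -f1 -d: | cut -c4-); do
     cpuset -g -x $irq
done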

Client Description:

Two Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz processors
64 GB ram
Mellanox Technologies MT27500 Family [ConnectX-3]
Centos 6.4 with updates
iperf3 installed from yum repositories: iperf3-3.0.3-3.el6.x86_64

Test setup:

I've found that about 3 streams per CentOS client is the best way to get the
most out of them.
Above a certain point, the -b flag does not change the results.
-N is an artifact left over from the TCP tests.
-l is needed, as -M doesn't work for UDP.

I usually use launch scripts similar to the following:

for i in `seq 41 60`; do
    ssh loader$i "export TIME=120; export STREAMS=1; export PORT=52$i; export PKT=64; export RATE=2000m; /root/iperf-test-8port-udp" &
done

The scripts execute the following on each host.

#!/bin/bash
PORT1=$PORT
PORT2=$(($PORT+1000))
PORT3=$(($PORT+2000))
iperf3 -c loader41-40gbe -u -b 10000m -i 0 -N -l $PKT -t$TIME -P$STREAMS -p$PORT1 &
iperf3 -c loader42-40gbe -u -b 10000m -i 0 -N -l $PKT -t$TIME -P$STREAMS -p$PORT1 &
iperf3 -c loader43-40gbe -u -b 10000m -i 0 -N -l $PKT -t$TIME -P$STREAMS -p$PORT1 &
... (through all clients and all three ports) ...
iperf3 -c loader60-40gbe -u -b 10000m -i 0 -N -l $PKT -t$TIME -P$STREAMS -p$PORT3 &


Results:

Summarized: output of 'netstat -w 1 -q 240 -bd', run through:

cat test4-tuning | egrep -v {'packets | input '} | \
    awk '{ipackets+=$1} {idrops+=$3} {opackets+=$5} {odrops+=$9}
         END {print "input " ipackets/NR, "idrops " idrops/NR,
              "opackets " opackets/NR, "odrops " odrops/NR}'

input 1.10662e+07 idrops 8.01783e+06 opackets 3.04516e+06 odrops 3152.4
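
That works out to roughly 72% of the input packets being dropped on input
(idrops/ipackets) versus about 0.1% on output. A hedged variant of the same
pipeline that prints those percentages directly (the file name and the
simplified egrep pattern are assumptions):

cat test4-tuning | egrep -v 'packets|input' | \
    awk '{ipackets+=$1; idrops+=$3; opackets+=$5; odrops+=$9}
         END {printf "input %.3g idrops %.3g (%.1f%%) opackets %.3g odrops %.1f (%.2f%%)\n",
              ipackets/NR, idrops/NR, 100*idrops/ipackets,
              opackets/NR, odrops/NR, 100*odrops/opackets}'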

Snapshot of raw output:

            input        (Total)           output
   packets  errs   idrops      bytes  packets  errs      bytes colls drops
  11189148     0  7462453 1230805216  3725006     0  409750710     0   799
  10527505     0  6746901 1158024978  3779096     0  415700708     0   127
  10606163     0  6850760 1166676673  3751780     0  412695761     0  1535
  10749324     0  7132014 1182425799  3617558     0  397930956     0  5972
  10695667     0  7022717 1176521907  3669342     0  403627236     0  1461
  10441173     0  6762134 1148528662  3675048     0  404255540     0  6021
  10683773     0  7005635 1175215014  3676962     0  404465671     0  2606
  10869859     0  7208696 1195683372  3658432     0  402427698     0   979
  11948989     0  8310926 1314387881  3633773     0  399714986     0   725
  12426195     0  8864415 1366877194  3562311     0  391853156     0  2762
  13006059     0  9432389 1430661751  3570067     0  392706552     0  5158
  12822243     0  9098871 1410443600  3715177     0  408668500     0  4064
  13317864     0  9683602 1464961374  3632156     0  399536131     0  3684
  13701905     0 10182562 1507207982  3523101     0  387540859     0  8690
  13820227     0 10244870 1520221820  3562038     0  391823322     0  2426
  14437060     0 10955483 1588073033  3480105     0  382810557     0  2619
  14518471     0 11119573 1597028105  3397439     0  373717355     0  5691
  14890287     0 11675003 1637926521  3199812     0  351978304     0 11007
  14923610     0 11749091 1641594441  3171436     0  348857468     0  7389
  14738704     0 11609730 1621254991  3117715     0  342948394     0  2597
  14753975     0 11549735 1622935026  3207393     0  352812846     0  4798





_______________________________________________
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
