On 22 Jul 2014, at 20:30, Adrian Chadd wrote:

hi!

You can use 'pmcstat -S CPU_CLK_UNHALTED_CORE -O pmc.out' (then Ctrl-C it
after, say, 5 seconds), which will log the sample data to pmc.out; then run
'pmcannotate -k /boot/kernel pmc.out /boot/kernel/kernel' to find out where
the most CPU cycles are being spent.
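
As a minimal sketch, the same workflow scripted end to end (the 5-second
window and the annotated.txt output file are illustrative choices, not part
of the instructions above):

kldload hwpmc                     # fails harmlessly if already loaded
pmcstat -S CPU_CLK_UNHALTED_CORE -O pmc.out &
sleep 5
kill -INT $!                      # same effect as the Ctrl-C above
pmcannotate -k /boot/kernel pmc.out /boot/kernel/kernel > annotated.txt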


Chiming in late, but don't you mean instruction-retired instead of CPU_CLK_UNHALTED_CORE?

Best,
George


It should give us the location(s) inside the top CPU users.

Hopefully it'll then be much more obvious!

I'm glad you're digging into it!

-a



On 22 July 2014 12:21, John Jasen <jja...@gmail.com> wrote:
Navdeep;

I was struck by spending so much time in transmit as well.

Adrian's suggestion on mining the lock profiling data gave me an excuse to up the tx queues in /boot/loader.conf. Our prior conversations indicated that
up to 64 should be acceptable?





On 07/22/2014 03:10 PM, Adrian Chadd wrote:
Hi

Right. Time to figure out why you're spending so much time in
cxgbe_transmit() and t4_eth_tx(). Maybe ask Navdeep for some ideas?


-a

On 22 July 2014 12:07, John Jasen <jja...@gmail.com> wrote:
The first is pretty easy to turn around. Reading up on dtrace now. Thanks
for the pointers and help!

PMC: [CPU_CLK_UNHALTED_CORE] Samples: 142654 (100.0%), 124560 unresolved

%SAMP IMAGE      FUNCTION             CALLERS
 34.0 if_cxgbe.k t4_eth_tx            cxgbe_transmit:24.0 t4_tx_task:10.0
 28.8 if_cxgbe.k cxgbe_transmit
  7.6 if_cxgbe.k service_iq           t4_intr
  6.4 if_cxgbe.k get_scatter_segment  service_iq
  4.9 if_cxgbe.k reclaim_tx_descs     t4_eth_tx
  3.2 if_cxgbe.k write_sgl_to_txd     t4_eth_tx
  2.8 hwpmc.ko   pmclog_process_callc pmc_process_samples
  2.1 libc.so.7  bcopy                pmclog_read
  1.9 if_cxgbe.k t4_eth_rx            service_iq
  1.7 hwpmc.ko   pmclog_reserve       pmclog_process_callchain
  1.2 libpmc.so. pmclog_read
  1.0 if_cxgbe.k write_txpkts_wr      t4_eth_tx
  0.8 kernel     e1000_read_i2c_byte_ e1000_set_i2c_bb
  0.6 libc.so.7  memset
  0.5 if_cxgbe.k refill_fl            service_iq




On 07/22/2014 02:45 PM, Adrian Chadd wrote:
Hi,

Well, start with CPU profiling with pmc:

kldload hwpmc
pmcstat -S CPU_CLK_UNHALTED_CORE -T -w 1

.. look at the FreeBSD dtrace one-liners (google that) for lock
contention and CPU usage.
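
For example (hedged sketches of two commonly used one-liners, not
necessarily the exact ones meant above): kernel CPU profiling and
adaptive-mutex block time, each self-terminating after 10 seconds:

dtrace -x stackframes=100 -n 'profile-997 /arg0/ { @[stack()] = count(); } tick-10s { exit(0); }'
dtrace -n 'lockstat:::adaptive-block { @[stack()] = sum(arg1); } tick-10s { exit(0); }'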

You could compile with LOCK_PROFILING (which will slow things down a little
even when not in use), then enable it for a few seconds (which will
definitely slow things down) to gather fine-grained lock contention data.

(sysctl debug.lock.prof when you have it compiled in. sysctl
debug.lock.prof.enable=1; wait 10 seconds; sysctl
debug.lock.prof.enable=0; sysctl debug.lock.prof.stats)
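
Spelled out as a sketch (assuming the kernel was rebuilt with "options
LOCK_PROFILING"; the 10-second window and the output file are arbitrary):

sysctl debug.lock.prof.reset=1       # clear any previous counters
sysctl debug.lock.prof.enable=1
sleep 10
sysctl debug.lock.prof.enable=0
sysctl debug.lock.prof.stats > lockprof.txt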


-a


On 22 July 2014 11:42, John Jasen <jja...@gmail.com> wrote:
If you have ideas on what to instrument, I'll be happy to do it.

I'm faintly familiar with dtrace, et al, so it might take me a few tries
to get it right -- or bludgeoning with the documentation.

Thanks!




On 07/22/2014 02:07 PM, Adrian Chadd wrote:
Hi!

Well, what's missing is some dtrace/pmc/lock-debugging investigation
into the system to see where it's currently maxing out.

I wonder if you're seeing contention on the transmit paths as drivers
queue frames from one set of driver threads/queues to another
potentially completely different set of driver transmit
threads/queues.




-a


On 22 July 2014 08:18, John Jasen <jja...@gmail.com> wrote:
Feedback and/or tips and tricks more than welcome.

Outstanding questions:

Would increasing the number of processor cores help?

Would a system where both processor QPI ports connect to each other
mitigate QPI bottlenecks?

Are there further performance optimizations I am missing?

Server Description:

The system in question is a Dell Poweredge R820, 16GB of RAM, and two
Intel(R) Xeon(R) CPU E5-4610 0 @ 2.40GHz.

Onboard, in a 16x PCIe slot, I have one Chelsio T-580-CR two-port 40GbE
NIC, and in an 8x slot, another T-580-CR dual port.

I am running FreeBSD 10.0-STABLE.

BIOS tweaks:

Hyperthreading (or Logical Processors) is turned off.
Memory Node Interleaving is turned off, though toggling it did not appear to
impact performance either way.

/boot/loader.conf contents:
#for CARP+PF testing
carp_load="YES"
#load cxgbe drivers.
cxgbe_load="YES"
#maxthreads appears not to exceed the number of CPUs.
net.isr.maxthreads=12
#bindthreads may be indicated when using cpuset(1) on interrupts
net.isr.bindthreads=1
#random guess based on googling
net.isr.maxqlimit=60480
net.link.ifqmaxlen=90000
#discussions with cxgbe maintainer and list led me to try this. Allows more
#interrupts to be fixed to CPUs, which in some cases improves interrupt balancing.
hw.cxgbe.ntxq10g=16
hw.cxgbe.nrxq10g=16

/etc/sysctl.conf contents:

#the following is also enabled by rc.conf gateway_enable.
net.inet.ip.fastforwarding=1
#recommendations from BSD router project
kern.random.sys.harvest.ethernet=0
kern.random.sys.harvest.point_to_point=0
kern.random.sys.harvest.interrupt=0
#probably should be removed, as cxgbe does not seem to affect/be
#affected by irq storm settings
hw.intr_storm_threshold=25000000
#based on Calomel.org performance suggestions. 4x40GbE, so it seemed
#reasonable to use 100GbE settings
kern.ipc.maxsockbuf=1258291200
net.inet.tcp.recvbuf_max=1258291200
net.inet.tcp.sendbuf_max=1258291200
#attempting to tune the ULE scheduler to favor serving packets over
#userland processes such as netstat
kern.sched.slice=1
kern.sched.interact=1

/etc/rc.conf contains:

hostname="fbge1"
#should remove, especially given below duplicate entry
ifconfig_igb0="DHCP"
sshd_enable="YES"
#ntpd_enable="YES"
# Set dumpdev to "AUTO" to enable crash dumps, "NO" to disable
dumpdev="AUTO"
# OpenBSD PF options to play with later. very bad for raw packet rates.
#pf_enable="YES"
#pflog_enable="YES"
# enable packet forwarding
# these enable forwarding and fastforwarding sysctls. inet6 does not
# have fastforward
gateway_enable="YES"
ipv6_gateway_enable="YES"
# enable OpenBSD ftp-proxy
# should comment out until actively playing with PF
ftpproxy_enable="YES"
#left in place, commented out from prior testing
#ifconfig_mlxen1="inet 172.16.2.1 netmask 255.255.255.0 mtu 9000"
#ifconfig_mlxen0="inet 172.16.1.1 netmask 255.255.255.0 mtu 9000"
#ifconfig_mlxen3="inet 172.16.7.1 netmask 255.255.255.0 mtu 9000"
#ifconfig_mlxen2="inet 172.16.8.1 netmask 255.255.255.0 mtu 9000"
# -lro and -tso options added per mailing list suggestion from Bjoern A.
# Zeeb (bzeeb-lists at lists.zabbadoz.net)
ifconfig_cxl0="inet 172.16.3.1 netmask 255.255.255.0 mtu 9000 -lro -tso up"
ifconfig_cxl1="inet 172.16.4.1 netmask 255.255.255.0 mtu 9000 -lro -tso up"
ifconfig_cxl2="inet 172.16.5.1 netmask 255.255.255.0 mtu 9000 -lro -tso up"
ifconfig_cxl3="inet 172.16.6.1 netmask 255.255.255.0 mtu 9000 -lro -tso up"
# aliases instead of reconfiguring test clients. See the commented-out
# entries above
ifconfig_cxl0_alias0="172.16.7.1 netmask 255.255.255.0"
ifconfig_cxl1_alias0="172.16.8.1 netmask 255.255.255.0"
ifconfig_cxl2_alias0="172.16.1.1 netmask 255.255.255.0"
ifconfig_cxl3_alias0="172.16.2.1 netmask 255.255.255.0"
# for remote monitoring/admin of the test device
ifconfig_igb0="inet 172.30.60.60 netmask 255.255.0.0"

Additional configurations:
cpuset-chelsio-6cpu-high
#!/usr/local/bin/bash
# Original provided by Navdeep Parhar <npar...@gmail.com>
# Takes the vmstat -ai output, builds a list of the Chelsio interrupts, and
# assigns them in order to the available CPU cores.
# Modified: to assign only to the 'high' CPUs, i.e. CPUs 6-11.
# See: http://lists.freebsd.org/pipermail/freebsd-net/2014-July/039317.html
ncpu=12
# "irqNNN: t5nex0:..."  ->  "NNN"
irqlist=$(vmstat -ia | egrep 't4nex|t5nex|cxgbc' | cut -f1 -d: | cut -c4-)
i=6
for irq in $irqlist; do
     cpuset -l $i -x $irq
     i=$((i+1))
     [ $i -ge $ncpu ] && i=6
done
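
A hedged sanity check after running the script: each Chelsio IRQ should now
report a single-CPU mask, which cpuset -g can confirm:

for irq in $(vmstat -ia | egrep 't4nex|t5nex|cxgbc' | cut -f1 -d: | cut -c4-); do
     cpuset -g -x $irq
done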

Client Description:

Two Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz processors
64 GB ram
Mellanox Technologies MT27500 Family [ConnectX-3]
Centos 6.4 with updates
iperf3 installed from yum repositories: iperf3-3.0.3-3.el6.x86_64

Test setup:

I've found that about 3 streams per CentOS client is the best way to get the
most out of them.
Above a certain point, the -b flag does not change the results.
-N is an artifact left over from the TCP tests.
-l is needed, as -M doesn't work for UDP.

I usually use launch scripts similar to the following:

for i in `seq 41 60`; do
    ssh loader$i "export TIME=120; export STREAMS=1; export PORT=52$i; export PKT=64; export RATE=2000m; /root/iperf-test-8port-udp" &
done

The scripts execute the following on each host.

#!/bin/bash
PORT1=$PORT
PORT2=$(($PORT+1000))
PORT3=$(($PORT+2000))
iperf3 -c loader41-40gbe -u -b 10000m -i 0 -N -l $PKT -t$TIME -P$STREAMS -p$PORT1 &
iperf3 -c loader42-40gbe -u -b 10000m -i 0 -N -l $PKT -t$TIME -P$STREAMS -p$PORT1 &
iperf3 -c loader43-40gbe -u -b 10000m -i 0 -N -l $PKT -t$TIME -P$STREAMS -p$PORT1 &
... (through all clients and all three ports) ...
iperf3 -c loader60-40gbe -u -b 10000m -i 0 -N -l $PKT -t$TIME -P$STREAMS -p$PORT3 &


Results:

Summarized: output of 'netstat -w 1 -q 240 -bd', run through:

cat test4-tuning | egrep -v {'packets | input '} | \
    awk '{ipackets+=$1} {idrops+=$3} {opackets+=$5} {odrops+=$9}
         END {print "input " ipackets/NR, "idrops " idrops/NR,
              "opackets " opackets/NR, "odrops " odrops/NR}'

input 1.10662e+07 idrops 8.01783e+06 opackets 3.04516e+06 odrops 3152.4
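
That works out to roughly 72% of the input packets being dropped on input
(idrops/ipackets) versus about 0.1% on output. A hedged variant of the same
pipeline that prints those percentages directly (the file name and the
simplified egrep pattern are assumptions):

cat test4-tuning | egrep -v 'packets|input' | \
    awk '{ipackets+=$1; idrops+=$3; opackets+=$5; odrops+=$9}
         END {printf "input %.3g idrops %.3g (%.1f%%) opackets %.3g odrops %.1f (%.2f%%)\n",
              ipackets/NR, idrops/NR, 100*idrops/ipackets,
              opackets/NR, odrops/NR, 100*odrops/opackets}'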

Snapshot of raw output:

            input        (Total)           output
   packets  errs   idrops      bytes  packets  errs      bytes colls drops
  11189148     0  7462453 1230805216  3725006     0  409750710     0   799
  10527505     0  6746901 1158024978  3779096     0  415700708     0   127
  10606163     0  6850760 1166676673  3751780     0  412695761     0  1535
  10749324     0  7132014 1182425799  3617558     0  397930956     0  5972
  10695667     0  7022717 1176521907  3669342     0  403627236     0  1461
  10441173     0  6762134 1148528662  3675048     0  404255540     0  6021
  10683773     0  7005635 1175215014  3676962     0  404465671     0  2606
  10869859     0  7208696 1195683372  3658432     0  402427698     0   979
  11948989     0  8310926 1314387881  3633773     0  399714986     0   725
  12426195     0  8864415 1366877194  3562311     0  391853156     0  2762
  13006059     0  9432389 1430661751  3570067     0  392706552     0  5158
  12822243     0  9098871 1410443600  3715177     0  408668500     0  4064
  13317864     0  9683602 1464961374  3632156     0  399536131     0  3684
  13701905     0 10182562 1507207982  3523101     0  387540859     0  8690
  13820227     0 10244870 1520221820  3562038     0  391823322     0  2426
  14437060     0 10955483 1588073033  3480105     0  382810557     0  2619
  14518471     0 11119573 1597028105  3397439     0  373717355     0  5691
  14890287     0 11675003 1637926521  3199812     0  351978304     0 11007
  14923610     0 11749091 1641594441  3171436     0  348857468     0  7389
  14738704     0 11609730 1621254991  3117715     0  342948394     0  2597
  14753975     0 11549735 1622935026  3207393     0  352812846     0  4798





_______________________________________________
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
