Re: ixl 40G bad performance?
On Thu, Dec 10, 2015 at 2:18 PM, Eggert, Lars wrote: > On 2015-10-26, at 18:40, Eggert, Lars wrote: > > On 2015-10-26, at 17:08, Pieper, Jeffrey E > wrote: > >> As a caveat, this was using default netperf message sizes. > > > > I get the same ~3 Gb/s with the default netperf sizes and driver 1.4.5. > > Now there is version 1.4.8 on the Intel website, but it doesn't change > things for me. > I had the opportunity to see similar numbers and behavior while using XL710 driver 1.4.3 as of FreeBSD r291085 in DPDK poll mode, but driver 1.2.8 as of r292035 was providing the expected numbers. While removing rxcsum/txcsum made no difference, fully removing RSS + disabling rx/txcsum support provided better numbers. However, now with driver 1.4.8 and the same hardware setup, except for a different transceiver, I can get 36Gbps/24Mpps with no further tweaks, so if you can replace your transceiver, that would be a useful starting point for a different test. > > > When you tcpdump during the run, do you see TSO/LRO in effect, i.e., do > you see "segments" > 32K in the trace? 
> > I still see no TSO/LRO in effect when tcpdump'ing on the receiver; note > how all the packets are 1448 bytes: > > tcpdump: verbose output suppressed, use -v or -vv for full protocol decode > listening on ixl0, link-type EN10MB (Ethernet), capture size 262144 bytes > 17:02:42.328782 IP 10.0.4.1.21507 > 10.0.4.2.12865: Flags [S], seq > 15244366, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 478099 > ecr 0], length 0 > 17:02:42.328808 IP 10.0.4.2.12865 > 10.0.4.1.21507: Flags [S.], seq > 1819579546, ack 15244367, win 65535, options [mss 1460,nop,wscale > 6,sackOK,TS val 3553932482 ecr 478099], length 0 > 17:02:42.328842 IP 10.0.4.1.21507 > 10.0.4.2.12865: Flags [.], ack 1, win > 1040, options [nop,nop,TS val 478099 ecr 3553932482], length 0 > 17:02:42.329804 IP 10.0.4.1.21507 > 10.0.4.2.12865: Flags [P.], seq 1:657, > ack 1, win 1040, options [nop,nop,TS val 478100 ecr 3553932482], length 656 > 17:02:42.331671 IP 10.0.4.2.12865 > 10.0.4.1.21507: Flags [P.], seq 1:657, > ack 657, win 1040, options [nop,nop,TS val 3553932485 ecr 478100], length > 656 > 17:02:42.331717 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [S], seq > 1387798477, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 478102 > ecr 0], length 0 > 17:02:42.331729 IP 10.0.4.2.30216 > 10.0.4.1.10449: Flags [S.], seq > 4085135109, ack 1387798478, win 65535, options [mss 1460,nop,wscale > 6,sackOK,TS val 282922 ecr 478102], length 0 > 17:02:42.331781 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [.], ack 1, win > 1040, options [nop,nop,TS val 478102 ecr 282922], length 0 > 17:02:42.331796 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [.], seq 1:1449, > ack 1, win 1040, options [nop,nop,TS val 478102 ecr 282922], length 1448 > 17:02:42.331800 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [.], seq > 1449:2897, ack 1, win 1040, options [nop,nop,TS val 478102 ecr 282922], > length 1448 > 17:02:42.331807 IP 10.0.4.2.30216 > 10.0.4.1.10449: Flags [.], ack 2897, > win 1018, options [nop,nop,TS val 282923 ecr 
478102], length 0 > 17:02:42.331809 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [.], seq > 2897:4345, ack 1, win 1040, options [nop,nop,TS val 478102 ecr 282922], > length 1448 > 17:02:42.331813 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [.], seq > 4345:5793, ack 1, win 1040, options [nop,nop,TS val 478102 ecr 282922], > length 1448 > 17:02:42.331817 IP 10.0.4.2.30216 > 10.0.4.1.10449: Flags [.], ack 5793, > win 1018, options [nop,nop,TS val 282923 ecr 478102], length 0 > 17:02:42.331818 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [.], seq > 5793:7241, ack 1, win 1040, options [nop,nop,TS val 478102 ecr 282922], > length 1448 > 17:02:42.331821 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [.], seq > 7241:8689, ack 1, win 1040, options [nop,nop,TS val 478102 ecr 282922], > length 1448 > 17:02:42.331825 IP 10.0.4.2.30216 > 10.0.4.1.10449: Flags [.], ack 8689, > win 1018, options [nop,nop,TS val 282923 ecr 478102], length 0 > 17:02:42.331826 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [.], seq > 8689:10137, ack 1, win 1040, options [nop,nop,TS val 478102 ecr > 282922], length 1448 > 17:02:42.331829 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [.], seq > 10137:11585, ack 1, win 1040, options [nop,nop,TS val 478102 ecr > 282922], length 1448 > ... > > Doing the same trace over 10G ix interfaces shows most segments in the > 8-32K range, indicating that TSO/LRO are in use (and results in 9.9G > throughput). > > Lars > ___ freebsd-net@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: ixl 40G bad performance?
On 10 December 2015 at 10:29, Denis Pearson wrote: > On Thu, Dec 10, 2015 at 2:18 PM, Eggert, Lars wrote: > >> On 2015-10-26, at 18:40, Eggert, Lars wrote: >> > On 2015-10-26, at 17:08, Pieper, Jeffrey E >> wrote: >> >> As a caveat, this was using default netperf message sizes. >> > >> > I get the same ~3 Gb/s with the default netperf sizes and driver 1.4.5. >> >> Now there is version 1.4.8 on the Intel website, but it doesn't change >> things for me. >> > > I had the opportunity to see similar numbers and behavior while using XL710 > driver 1.4.3 as of FreeBSD r291085 in DPDK poll mode, but driver 1.2.8 as of > r292035 was providing the expected numbers. While removing rxcsum/txcsum made > no difference, fully removing RSS + disabling rx/txcsum support > provided better numbers. Can someone debug this a bit more? (My kit with ixl NICs in it is still not up and available. :( ) Device RSS, even without kernel RSS enabled, shouldn't cause a massive performance drop. If it is, then something else odd is going on. Do you have a diff where you removed things? -adrian > However, now with driver 1.4.8 and the same hardware setup, except > for a different transceiver, I can get 36Gbps/24Mpps with no further > tweaks, so if you can replace your transceiver, that would be a useful > starting point for a different test.
Re: ixl 40G bad performance?
On Thu, Dec 10, 2015 at 4:40 PM, Adrian Chadd wrote: > On 10 December 2015 at 10:29, Denis Pearson > wrote: > > On Thu, Dec 10, 2015 at 2:18 PM, Eggert, Lars wrote: > > > >> On 2015-10-26, at 18:40, Eggert, Lars wrote: > >> > On 2015-10-26, at 17:08, Pieper, Jeffrey E < > jeffrey.e.pie...@intel.com> > >> wrote: > >> >> As a caveat, this was using default netperf message sizes. > >> > > >> > I get the same ~3 Gb/s with the default netperf sizes and driver > 1.4.5. > >> > >> Now there is version 1.4.8 on the Intel website, but it doesn't change > >> things for me. > >> > > > > I had the opportunity to see similar numbers and behavior while using > XL710 > > 1.4.3 as of FreeBSD r291085 while in DPDK poll mode, but driver 1.2.8 as > of > > r292035 was providing the expected numbers. While removing rxcsum/txcsum > > made no difference, fully removing RSS + disabling rx/txcsum support > > provided better numbers. > > Can someone debug this a bit more? (My kit with ixl NICs in it is > still not up and available. :( ) > > Device RSS, even without kernel RSS enabled, shouldn't cause a massive > performance drop. If it is then something else odd is going on. > Do you have a diff where you removed things? > I can probably find a snapshot with the code at the time and extract a diff, yes. I just don't know whether it's worth spending the time when the problem is not reproducible on the current 1.4.8 driver, which will hopefully get into -CURRENT (if it's not already there?). Also, the problem is quite specific: the performance drop happened in DPDK poll mode, not in normal kernel operation, so a simple diff only showing the changes needed for the driver to build and run without RSS would still require a test lab and different ways to generate traffic. This is why I suggested a transceiver change or replug first. 
Anyway, the RSS performance-drop problem is far from FreeBSD-specific; while researching I could find the exact same complaints from Windows users starting with Windows 8, whether running RSS@4 or RSS@16 or with RSS completely disabled, sometimes with acceptable results only when it was disabled (despite the fact that MiniportInterruptDPC was using a whole CPU when RSS was off, results were still better). So I guess this is just a case where it's good to have NIC features turned off. As for the reason, I'm not an engineer qualified to answer, but I would guess it's related to other NIC features also touching the packet, or some sort of errors that netstat or the driver status may not report. I was able to see the problem even with low pps rates and big packet sizes, as well as with an average packet size of 768 bytes, so I don't think it's any sort of card resource starvation. I can manage to have the whole lab up and running by the weekend if you want to investigate and compare; just ping me off list. > > -adrian > > > However, now with driver 1.4.8 and the same hardware setup, except > > for a different transceiver, I can get 36Gbps/24Mpps with no further > > tweaks, so if you can replace your transceiver, that would be a useful > > starting point for a different test.
Re: ixl 40G bad performance?
On Thu, Dec 10, 2015 at 10:40 AM, Adrian Chadd wrote: > On 10 December 2015 at 10:29, Denis Pearson wrote: >> On Thu, Dec 10, 2015 at 2:18 PM, Eggert, Lars wrote: >> >>> On 2015-10-26, at 18:40, Eggert, Lars wrote: >>> > On 2015-10-26, at 17:08, Pieper, Jeffrey E >>> wrote: >>> >> As a caveat, this was using default netperf message sizes. >>> > >>> > I get the same ~3 Gb/s with the default netperf sizes and driver 1.4.5. >>> >>> Now there is version 1.4.8 on the Intel website, but it doesn't change >>> things for me. >>> >> >> I had the opportunity to see similar numbers and behavior while using XL710 >> 1.4.3 as of FreeBSD r291085 while in DPDK poll mode, but driver 1.2.8 as of >> r292035 was providing the expected numbers. While removing rxcsum/txcsum >> made no difference, fully removing RSS + disabling rx/txcsum support >> provided better numbers. > > Can someone debug this a bit more? (My kit with ixl NICs in it is > still not up and available. :( ) > > Device RSS, even without kernel RSS enabled, shouldn't cause a massive > performance drop. If it is then something else odd is going on. I am not sure whether we are digressing (Lars' complaint was about poor bulk throughput; now I see DPDK and high packet rates mentioned, so I feel obliged to pitch in!), but here is a related piece of info: last spring, with netmap and i40e on Linux (I don't remember which driver/firmware), we saw that enabling FlowDirector killed the pps throughput (from 32 down to 18 Mpps). FlowDirector is a device feature which was probably affecting ordinary processing on the NIC, either because of bugs or because it consumed controller resources. The same may possibly be happening with other device features. cheers luigi > > Do you have a diff where you removed things? 
> > > -adrian > >> However, now with driver 1.4.8 and the same hardware setup, except >> for a different transceiver, I can get 36Gbps/24Mpps with no further >> tweaks, so if you can replace your transceiver, that would be a useful >> starting point for a different test. -- -+--- Prof. Luigi RIZZO, ri...@iet.unipi.it . Dip. di Ing. dell'Informazione http://www.iet.unipi.it/~luigi/. Universita` di Pisa TEL +39-050-2217533 . via Diotisalvi 2 Mobile +39-338-6809875 . 56122 PISA (Italy) -+---
Re: ixl 40G bad performance?
[snip] If RSS works fine on the latest driver then great. This was with single queue netperf, right? -a
Re: ixl 40G bad performance?
Hi, On 2015-12-10, at 20:42, Denis Pearson wrote: > I can probably find a snapshot with the code at the time and extract a > diff, yes. I just don't know whether it's worth spending the time when the problem > is not reproducible on the current 1.4.8 driver, which will hopefully get into > -CURRENT (if it's not already there?). per my last email, I do see the same issues with 1.4.8. This is with a single netperf TCP flow, no NIC parameter tuning, and no RSS or PCBGROUP in the kernel. > This is why I suggested a transceiver change or replug first. I will test this next week. (However, the same testbed booted into Linux doesn't see these low netperf numbers.) It really smells like a TSO/LRO (= packet rate) issue. If I configure jumbograms, performance jumps up as expected. Lars
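[Editorial aside: Lars's "TSO/LRO (= packet rate)" diagnosis can be made concrete with some back-of-the-envelope arithmetic. This is illustrative only, not from the thread: at the same goodput, plain wire-MTU segments cost the host vastly more per-packet work than 32K TSO/LRO aggregates.]

```python
# Rough packets-per-second a host must process at a given goodput, for
# plain 1448-byte TCP segments vs. 32 KB TSO/LRO aggregates. Header
# overhead and ACK traffic are ignored; this is illustrative arithmetic.
def pps(goodput_bits_per_s, payload_bytes):
    return goodput_bits_per_s / (payload_bytes * 8)

print(round(pps(3e9, 1448)))       # ~259k pkt/s at the observed 3 Gb/s
print(round(pps(3e9, 32 * 1024)))  # ~11k aggregates/s for the same goodput
```

So a per-packet bottleneck near 260 kpps would show up exactly as a few Gb/s of TCP goodput whenever TSO/LRO is not working.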
Re: ixl 40G bad performance?
On 2015-10-26, at 18:40, Eggert, Lars wrote: > On 2015-10-26, at 17:08, Pieper, Jeffrey E wrote: >> As a caveat, this was using default netperf message sizes. > > I get the same ~3 Gb/s with the default netperf sizes and driver 1.4.5. Now there is version 1.4.8 on the Intel website, but it doesn't change things for me. > When you tcpdump during the run, do you see TSO/LRO in effect, i.e., do you > see "segments" > 32K in the trace? I still see no TSO/LRO in effect when tcpdump'ing on the receiver; note how all the packets are 1448 bytes:

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ixl0, link-type EN10MB (Ethernet), capture size 262144 bytes
17:02:42.328782 IP 10.0.4.1.21507 > 10.0.4.2.12865: Flags [S], seq 15244366, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 478099 ecr 0], length 0
17:02:42.328808 IP 10.0.4.2.12865 > 10.0.4.1.21507: Flags [S.], seq 1819579546, ack 15244367, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 3553932482 ecr 478099], length 0
17:02:42.328842 IP 10.0.4.1.21507 > 10.0.4.2.12865: Flags [.], ack 1, win 1040, options [nop,nop,TS val 478099 ecr 3553932482], length 0
17:02:42.329804 IP 10.0.4.1.21507 > 10.0.4.2.12865: Flags [P.], seq 1:657, ack 1, win 1040, options [nop,nop,TS val 478100 ecr 3553932482], length 656
17:02:42.331671 IP 10.0.4.2.12865 > 10.0.4.1.21507: Flags [P.], seq 1:657, ack 657, win 1040, options [nop,nop,TS val 3553932485 ecr 478100], length 656
17:02:42.331717 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [S], seq 1387798477, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 478102 ecr 0], length 0
17:02:42.331729 IP 10.0.4.2.30216 > 10.0.4.1.10449: Flags [S.], seq 4085135109, ack 1387798478, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 282922 ecr 478102], length 0
17:02:42.331781 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [.], ack 1, win 1040, options [nop,nop,TS val 478102 ecr 282922], length 0
17:02:42.331796 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [.], seq 1:1449, ack 1, win 1040, options [nop,nop,TS val 478102 ecr 282922], length 1448
17:02:42.331800 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [.], seq 1449:2897, ack 1, win 1040, options [nop,nop,TS val 478102 ecr 282922], length 1448
17:02:42.331807 IP 10.0.4.2.30216 > 10.0.4.1.10449: Flags [.], ack 2897, win 1018, options [nop,nop,TS val 282923 ecr 478102], length 0
17:02:42.331809 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [.], seq 2897:4345, ack 1, win 1040, options [nop,nop,TS val 478102 ecr 282922], length 1448
17:02:42.331813 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [.], seq 4345:5793, ack 1, win 1040, options [nop,nop,TS val 478102 ecr 282922], length 1448
17:02:42.331817 IP 10.0.4.2.30216 > 10.0.4.1.10449: Flags [.], ack 5793, win 1018, options [nop,nop,TS val 282923 ecr 478102], length 0
17:02:42.331818 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [.], seq 5793:7241, ack 1, win 1040, options [nop,nop,TS val 478102 ecr 282922], length 1448
17:02:42.331821 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [.], seq 7241:8689, ack 1, win 1040, options [nop,nop,TS val 478102 ecr 282922], length 1448
17:02:42.331825 IP 10.0.4.2.30216 > 10.0.4.1.10449: Flags [.], ack 8689, win 1018, options [nop,nop,TS val 282923 ecr 478102], length 0
17:02:42.331826 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [.], seq 8689:10137, ack 1, win 1040, options [nop,nop,TS val 478102 ecr 282922], length 1448
17:02:42.331829 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [.], seq 10137:11585, ack 1, win 1040, options [nop,nop,TS val 478102 ecr 282922], length 1448
...

Doing the same trace over 10G ix interfaces shows most segments in the 8-32K range, indicating that TSO/LRO are in use (and results in 9.9G throughput). Lars
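[Editorial aside: the uniform 1448-byte segments in the trace are exactly what a non-coalesced 1500-byte MTU predicts, which supports the "no LRO" reading. A quick sanity check of the arithmetic, illustrative only:]

```python
# With a 1500-byte MTU, the TCP payload per packet is the MTU minus the
# IPv4 header, the TCP header, and the 12-byte timestamp option
# ([nop,nop,TS val ... ecr ...]) visible in the tcpdump output above.
MTU = 1500
IPV4_HDR = 20
TCP_HDR = 20
TS_OPT = 12  # 10-byte option padded to 12 with two NOPs

mss_on_wire = MTU - IPV4_HDR - TCP_HDR - TS_OPT
print(mss_on_wire)  # 1448, matching every data segment in the trace
```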
Re: ixl 40G bad performance?
On 2015-10-26, at 4:38, Kevin Oberman wrote: > On Sun, Oct 25, 2015 at 12:10 AM, Daniel Engberg < > daniel.engberg.li...@pyret.net> wrote: > >> One thing I've noticed that probably affects your performance benchmarks >> somewhat is that you're using iperf(2) instead of the newer iperf3 but I >> could be wrong... > > iperf3 is not a newer version of iperf. It is a total re-write and a rather > different tool. It has significant improvements in many areas and new > capabilities that might be of use. That said, there is no reason to think > that the results of tests using iperf2 are in any way inaccurate. However, > it is entirely possible to get misleading results if options are not properly > selected. FWIW, I've been using netperf and tried various options. I don't think the issue is the benchmarking tool. I think the issue is TSO/LRO (per my earlier email). Lars
Re: ixl 40G bad performance?
On 2015-10-26, at 15:38, Pieper, Jeffrey E wrote: > With the latest ixl component from: > https://downloadcenter.intel.com/download/25160/Network-Adapter-Driver-for-PCI-E-40-Gigabit-Network-Connections-under-FreeBSD- > > running on 10.2 amd64, I easily get 9.6 Gb/s with one netperf stream, either > b2b or through a switch. This is with no driver/kernel tuning. Running 4 > streams easily gets me 36 Gb/s. Thanks, will test! If the newer driver makes a difference, any chance we'll see it in -HEAD soon? Lars
RE: ixl 40G bad performance?
-Original Message- From: owner-freebsd-...@freebsd.org [mailto:owner-freebsd-...@freebsd.org] On Behalf Of Eggert, Lars Sent: Monday, October 26, 2015 2:28 AM To: Kevin Oberman <rkober...@gmail.com> Cc: freebsd-net@freebsd.org; Daniel Engberg <daniel.engberg.li...@pyret.net> Subject: Re: ixl 40G bad performance? On 2015-10-26, at 4:38, Kevin Oberman <rkober...@gmail.com> wrote: > On Sun, Oct 25, 2015 at 12:10 AM, Daniel Engberg < > daniel.engberg.li...@pyret.net> wrote: > >> One thing I've noticed that probably affects your performance benchmarks >> somewhat is that you're using iperf(2) instead of the newer iperf3 but I >> could be wrong... > > iperf3 is not a newer version of iperf. It is a total re-write and a rather > different tool. It has significant improvements in many areas and new > capabilities that might be of use. That said, there is no reason to think > that the results of tests using iperf2 are in any way inaccurate. However, > it is entirely possible to get misleading results if options are not properly > selected. > >FWIW, I've been using netperf and tried various options. > >I don't think the issue is the benchmarking tool. I think the issue is >TSO/LRO (per my earlier email). > >Lars With the latest ixl component from: https://downloadcenter.intel.com/download/25160/Network-Adapter-Driver-for-PCI-E-40-Gigabit-Network-Connections-under-FreeBSD- running on 10.2 amd64, I easily get 9.6 Gb/s with one netperf stream, either b2b or through a switch. This is with no driver/kernel tuning. Running 4 streams easily gets me 36 Gb/s. Jeff
RE: ixl 40G bad performance?
-Original Message- From: Eggert, Lars [mailto:l...@netapp.com] Sent: Monday, October 26, 2015 8:08 AM To: Pieper, Jeffrey E <jeffrey.e.pie...@intel.com> Cc: Kevin Oberman <rkober...@gmail.com>; freebsd-net@freebsd.org; Daniel Engberg <daniel.engberg.li...@pyret.net> Subject: Re: ixl 40G bad performance? On 2015-10-26, at 15:38, Pieper, Jeffrey E <jeffrey.e.pie...@intel.com> wrote: > With the latest ixl component from: > https://downloadcenter.intel.com/download/25160/Network-Adapter-Driver-for-PCI-E-40-Gigabit-Network-Connections-under-FreeBSD- > > running on 10.2 amd64, I easily get 9.6 Gb/s with one netperf stream, either > b2b or through a switch. This is with no driver/kernel tuning. Running 4 > streams easily gets me 36 Gb/s. > >Thanks, will test! > >If the newer driver makes a difference, any chance we'll see it in -HEAD soon? > >Lars As a caveat, this was using default netperf message sizes. Jeff
Re: ixl 40G bad performance?
On 2015-10-26, at 17:08, Pieper, Jeffrey E wrote: > As a caveat, this was using default netperf message sizes. I get the same ~3 Gb/s with the default netperf sizes and driver 1.4.5. When you tcpdump during the run, do you see TSO/LRO in effect, i.e., do you see "segments" > 32K in the trace? Lars
Re: ixl 40G bad performance?
On Sun, Oct 25, 2015 at 12:10 AM, Daniel Engberg < daniel.engberg.li...@pyret.net> wrote: > One thing I've noticed that probably affects your performance benchmarks > somewhat is that you're using iperf(2) instead of the newer iperf3 but I > could be wrong... > > Best regards, > Daniel > iperf3 is not a newer version of iperf. It is a total re-write and a rather different tool. It has significant improvements in many areas and new capabilities that might be of use. That said, there is no reason to think that the results of tests using iperf2 are in any way inaccurate. However, it is entirely possible to get misleading results if options are not properly selected. -- Kevin Oberman, Part time kid herder and retired Network Engineer E-mail: rkober...@gmail.com PGP Fingerprint: D03FB98AFA78E3B78C1694B318AB39EF1B055683
Re: ixl 40G bad performance?
On 2015-10-23, at 23:36, Eric Joyner wrote: > I see that the sysctl does clobber the global value, but have you tried > lowering the interval / raising the rate? You could try something like > 10usecs, and see if that helps. We'll do some more investigation here -- > 3Gb/s on a 40Gb/s adapter using default settings is terrible, and we shouldn't let > that be happening. I played with different settings, but I've never been able to get more than 4Gb/s, whereas under Linux 4.2 without any special settings I get 13. See my other email on TSO/LRO not appearing to be effective; that would certainly explain it. Plausible? Anything to try here? Lars
Re: ixl 40G bad performance?
13 on a 40G interface?? I don't think that's very good for Linux either; is this a 4x10 adapter? Maybe elaborate on the details of the hardware; are you sure you don't have a bad PCI slot somewhere that might be throttling everything? Cheers, Jack On Sat, Oct 24, 2015 at 12:43 AM, Eggert, Lars wrote: > On 2015-10-23, at 23:36, Eric Joyner wrote: > > I see that the sysctl does clobber the global value, but have you tried > lowering the interval / raising the rate? You could try something like > 10usecs, and see if that helps. We'll do some more investigation here -- > 3Gb/s on a 40Gb/s adapter using default settings is terrible, and we shouldn't let > that be happening. > > > I played with different settings, but I've never been able to get more > than 4Gb/s, whereas under Linux 4.2 without any special settings I get 13. > > See my other email on TSO/LRO not appearing to be effective; that would > certainly explain it. Plausible? Anything to try here? > > Lars > >
Re: ixl 40G bad performance?
On 2015-10-24, at 10:32, Jack Vogel wrote: > 13 on a 40G interface?? I don't think that's very good for Linux either, is > this a 4x10 adapter? No, it's a 2x40. And I can get it into the high 30s with tuning. I just mentioned the value to illustrate that something seems to be seriously broken under FreeBSD. Lars > Maybe elaborating on the details of the hardware, you sure you don't have a > bad PCI slot > somewhere that might be throttling everything? > > Cheers, > > Jack > > > On Sat, Oct 24, 2015 at 12:43 AM, Eggert, Lars wrote: > >> On 2015-10-23, at 23:36, Eric Joyner wrote: >> >> I see that the sysctl does clobber the global value, but have you tried >> lowering the interval / raising the rate? You could try something like >> 10usecs, and see if that helps. We'll do some more investigation here -- >> 3Gb/s on a 40Gb/s using default settings is terrible, and we shouldn't let >> that be happening. >> >> >> I played with different settings, but I've never been able to get more >> than 4Gb/s, whereas under Linux 4.2 without any special settings I get 13. >> >> See my other email on TSO/LRO not looking to be effective; that would >> certainly explain it. Plausible? Anything to try here? >> >> Lars >> >>
Re: ixl 40G bad performance?
Bruce mostly has it right -- ITR is the minimum latency between interrupts; it does actually guarantee a minimum period between interrupts. Fortville is actually a little unique in that there is another ITR setting that can ensure a certain average number of interrupts per second (called Interrupt Rate Limiting), but I don't think this is used in the current version of the driver. I see that the sysctl does clobber the global value, but have you tried lowering the interval / raising the rate? You could try something like 10usecs, and see if that helps. We'll do some more investigation here -- 3Gb/s on a 40Gb/s adapter using default settings is terrible, and we shouldn't let that be happening. - Eric On Thu, Oct 22, 2015 at 10:36 PM Bruce Evans wrote: > On Wed, 21 Oct 2015, Bruce Evans wrote: > > > Fix for em: > >
> > X diff -u2 if_em.c~ if_em.c
> > X --- if_em.c~ 2015-09-28 06:29:35.0 +
> > X +++ if_em.c 2015-10-18 18:49:36.876699000 +
> > X @@ -609,8 +609,8 @@
> > X em_tx_abs_int_delay_dflt);
> > X em_add_int_delay_sysctl(adapter, "itr",
> > X - "interrupt delay limit in usecs/4",
> > X + "interrupt delay limit in usecs",
> > X >tx_itr,
> > X E1000_REGISTER(hw, E1000_ITR),
> > X - DEFAULT_ITR);
> > X + 100 / MAX_INTS_PER_SEC);
> > X
> > X /* Sysctl for limiting the amount of work done in the taskqueue */
> > > > "delay limit" is fairly good wording. Other parameters tend to give long > > delays, but itr limits the longest delay due to interrupt moderation to > > whatever the itr represents. > > Everything in the last paragraph is backwards (inverted). Other > parameters tend to give short delays. They should be set to small > values to minimise latency. Then under load, itr limits the interrupt > _rate_ from above. The interrupt delay is the inverse of the interrupt > rate, so it is limited from below. So "delay limit" is fairly bad > wording. Normally, limits are from above, but the inversion makes > the itr limit from below. 
> > This is most easily understood by converting itr to a rate: itr = 125 > means a rate limit of 8000 Hz. It doesn't quite mean that the latency > is at least 125 usec. No one wants to ensure large latencies, and the > itr setting only ensures a minimal average latency under load. > > Bruce >
Re: ixl 40G bad performance?
Hi, for those of you following along, I did try jumbograms and throughput increased roughly 5x. So it looks like I'm hitting a packet-rate limit somewhere. Lars
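[Editorial aside: the ~5x jump lines up with a packet-rate cap, since at a fixed packet rate throughput scales with the TCP payload per packet. A sketch of that estimate; the email doesn't say which jumbo size was used, so a 9000-byte MTU is assumed here:]

```python
# If throughput is capped by packets-per-second, the expected gain from
# jumbograms is the ratio of TCP payloads per packet. 9000 is an assumed
# jumbo MTU; 52 bytes = IPv4 + TCP headers + timestamp option.
OVERHEAD = 20 + 20 + 12
payload_1500 = 1500 - OVERHEAD   # 1448 bytes
payload_9000 = 9000 - OVERHEAD   # 8948 bytes
print(payload_9000 / payload_1500)  # ~6.2x ceiling vs. the ~5x observed
```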
Re: ixl 40G bad performance?
On Wed, 21 Oct 2015, Bruce Evans wrote: Fix for em:

X diff -u2 if_em.c~ if_em.c
X --- if_em.c~ 2015-09-28 06:29:35.0 +
X +++ if_em.c 2015-10-18 18:49:36.876699000 +
X @@ -609,8 +609,8 @@
X em_tx_abs_int_delay_dflt);
X em_add_int_delay_sysctl(adapter, "itr",
X - "interrupt delay limit in usecs/4",
X + "interrupt delay limit in usecs",
X >tx_itr,
X E1000_REGISTER(hw, E1000_ITR),
X - DEFAULT_ITR);
X + 100 / MAX_INTS_PER_SEC);
X
X /* Sysctl for limiting the amount of work done in the taskqueue */

"delay limit" is fairly good wording. Other parameters tend to give long delays, but itr limits the longest delay due to interrupt moderation to whatever the itr represents. Everything in the last paragraph is backwards (inverted). Other parameters tend to give short delays. They should be set to small values to minimise latency. Then under load, itr limits the interrupt _rate_ from above. The interrupt delay is the inverse of the interrupt rate, so it is limited from below. So "delay limit" is fairly bad wording. Normally, limits are from above, but the inversion makes the itr limit from below. This is most easily understood by converting itr to a rate: itr = 125 means a rate limit of 8000 Hz. It doesn't quite mean that the latency is at least 125 usec. No one wants to ensure large latencies, and the itr setting only ensures a minimal average latency under load. Bruce
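[Editorial aside: Bruce's rate/latency inversion is easy to verify numerically, since an ITR interval in microseconds and a maximum interrupt rate are reciprocals; itr = 125 usec corresponds to his 8000 Hz example. Illustrative sketch; it ignores the hardware's usecs/4 register granularity mentioned in the diff:]

```python
# ITR as described above: a minimum interval between interrupts caps the
# interrupt *rate* from above and bounds the moderation *delay* from below.
def itr_usecs_to_max_rate(itr_usecs):
    return 1_000_000 / itr_usecs          # interrupts per second

def max_rate_to_itr_usecs(ints_per_sec):
    return 1_000_000 / ints_per_sec       # microseconds between interrupts

print(itr_usecs_to_max_rate(125))   # 8000.0, matching Bruce's example
print(max_rate_to_itr_usecs(8000))  # 125.0
```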
Re: ixl 40G bad performance?
On 2015-10-22, at 9:38, Eggert, Lars wrote: > for those of you following along, I did try jumbograms and throughput > increased roughly 5x. So it looks like I'm hitting a packet-rate limit > somewhere. Does the ixl driver have an issue with TSO/LRO? If I tcpdump on the receiver when testing the 10G ix interfaces, I see that most "packets" are up to 64KB in the traces on both sender and receiver, which is expected with TSO/LRO. When I look at the traffic over the ixl interfaces, I see that most "packets" on the sender are much smaller (~2896 bytes, i.e., 2 segments; although a few are >40K). On the receiver, I only see 1448-byte packets. Lars
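[Editorial aside: the capture sizes Lars reports translate directly into an aggregation factor, i.e., how many wire segments each "packet" tcpdump sees actually represents. Illustrative arithmetic, not from the thread:]

```python
# Segments per tcpdump "packet" for the sizes mentioned in the thread:
# 1448 (receiver, no LRO), 2896 (sender, TSO coalescing only 2 segments),
# and the 32K/64K aggregates seen on the healthy 10G ix path.
SEG = 1448
for size in (1448, 2896, 32 * 1024, 64 * 1024):
    print(size, "bytes ->", size // SEG, "segment(s)")
```

So the ixl sender is coalescing only about two segments per TSO burst, versus roughly 22-45 segments per aggregate on the 10G path, which quantifies how ineffective TSO/LRO is here.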
Re: ixl 40G bad performance?
The 40G hardware is absolutely dependent on firmware; if you have a mismatch, for instance, it can totally bork things. So I would work with your Intel rep and be sure you have the correct version for your specific hardware. Good luck, Jack On Wed, Oct 21, 2015 at 5:25 AM, Eggert, Lars wrote: > Hi Bruce, > > thanks for the very detailed analysis of the ixl sysctls! > > On 2015-10-20, at 16:51, Bruce Evans wrote: > > > > Lowering (improving) latency always lowers (unimproves) throughput by > > increasing load. > > That, I also understand. But even when I back off the itr values to > something more reasonable, throughput still remains low. > > With all the tweaking I have tried, I have yet to top 3 Gb/s with ixl > cards, whereas they do ~13 Gb/s on Linux straight out of the box. > > Lars
Re: ixl 40G bad performance?
Hi Jack,

On 2015-10-21, at 16:14, Jack Vogel wrote:
> The 40G hardware is absolutely dependent on firmware, if you have a mismatch
> for instance, it can totally bork things. So, I would work with your Intel
> rep and be sure you have the correct version for your specific hardware.

I got these tester cards from Amazon, so I don't have a rep.

I flashed the latest NVM (1.2.5), because previously the FreeBSD driver was complaining about the firmware being too old. But I did that before the experiments.

If there is anything else I should be doing, I'd appreciate being put in contact with someone at Intel who can help.

Thanks,
Lars
Re: ixl 40G bad performance?
+ Eric from Intel

(Also trimming the CC list as it wouldn't let me send the message otherwise.)

On 10/21/15 at 02:59P, Eggert, Lars wrote:
> Hi Jack,
>
> On 2015-10-21, at 16:14, Jack Vogel wrote:
> > The 40G hardware is absolutely dependent on firmware, if you have a mismatch
> > for instance, it can totally bork things. So, I would work with your Intel
> > rep and be sure you have the correct version for your specific hardware.
>
> I got these tester cards from Amazon, so I don't have a rep.
>
> I flashed the latest NVM (1.2.5), because previously the FreeBSD driver was
> complaining about the firmware being too old. But I did that before the
> experiments.
>
> If there is anything else I should be doing, I'd appreciate being put in
> contact with someone at Intel who can help.

Eric,

Can you think of anything else that could explain this low performance?

Cheers,
Hiren
Re: ixl 40G bad performance?
Hi Bruce,

thanks for the very detailed analysis of the ixl sysctls!

On 2015-10-20, at 16:51, Bruce Evans wrote:
>
> Lowering (improving) latency always lowers (unimproves) throughput by
> increasing load.

That, I also understand. But even when I back off the itr values to something more reasonable, throughput still remains low.

With all the tweaking I have tried, I have yet to top 3 Gb/s with ixl cards, whereas they do ~13 Gb/s on Linux straight out of the box.

Lars
Re: ixl 40G bad performance?
Hi,

On 2015-10-20, at 10:24, Ian Smith wrote:
> Actually, you want to set hw.acpi.cpu.cx_lowest=C1 instead.

Done.

On 2015-10-19, at 17:55, Luigi Rizzo wrote:
> On Mon, Oct 19, 2015 at 8:34 AM, Eggert, Lars wrote:
>> The only other sysctls in ixl(4) that look relevant are:
>>
>> hw.ixl.rx_itr
>> The RX interrupt rate value, set to 8K by default.
>>
>> hw.ixl.tx_itr
>> The TX interrupt rate value, set to 4K by default.
>
> yes those. raise to 20-50k and see what you get in
> terms of ping latency.

While ixl(4) talks about 8K and 4K, the defaults actually seem to be:

hw.ixl.tx_itr: 122
hw.ixl.rx_itr: 62

Doubling those values *increases* flood ping latency to ~200 usec (from ~116 usec). Halving them to 62/31 decreases flood ping latency to ~50 usec, but still doesn't increase iperf throughput (still 2.8 Gb/s). Going to 31/16 further drops latency to 24 usec, with no change in throughput.

(Looking at the "interrupt Moderation parameters" #defines in sys/dev/ixl/ixl.h it seems that ixl likes to have its irq rates specified with some weird divider scheme.)

With 5/5 (which corresponds to IXL_ITR_100K), I get down to 16 usec. Unfortunately, throughput is then also down to about 2 Gb/s.
One thing I noticed in top is that one queue irq is using quite a bit of CPU when I run iperf:

11 0 -92- 0K 1152K CPU22 0:19 50.98% intr{irq293: ixl1:q2}
11 0 -92- 0K 1152K WAIT3 0:02 5.18% intr{irq294: ixl1:q3}
0 0 -920 0K 8944K - 25 0:01 1.07% kernel{ixl1 que}
11 0 -92- 0K 1152K WAIT1 0:01 0.00% intr{irq292: ixl1:q1}
11 0 -92- 0K 1152K WAIT0 0:00 0.00% intr{irq291: ixl1:q0}
0 0 -920 0K 8944K - 22 0:00 0.00% kernel{ixl1 adminq}
0 0 -920 0K 8944K - 31 0:00 0.00% kernel{ixl1 que}
0 0 -920 0K 8944K - 31 0:00 0.00% kernel{ixl1 que}
0 0 -920 0K 8944K - 31 0:00 0.00% kernel{ixl1 que}
11 0 -92- 0K 1152K WAIT -1 0:00 0.00% intr{irq290: ixl1:aq}

With 10G ix interfaces and a throughput of ~9Gb/s, the CPU load is much lower:

11 0 -92- 0K 1152K WAIT0 0:05 7.67% intr{irq274: ix0:que }
0 0 -920 0K 8944K - 27 0:00 0.29% kernel{ix0 que}
0 0 -920 0K 8944K - 10 0:00 0.00% kernel{ix0 linkq}
11 0 -92- 0K 1152K WAIT1 0:00 0.00% intr{irq275: ix0:que }
11 0 -92- 0K 1152K WAIT3 0:00 0.00% intr{irq277: ix0:que }
11 0 -92- 0K 1152K WAIT2 0:00 0.00% intr{irq276: ix0:que }
11 0 -92- 0K 1152K WAIT 18 0:00 0.00% intr{irq278: ix0:link}
0 0 -920 0K 8944K - 0 0:00 0.00% kernel{ix0 que}
0 0 -920 0K 8944K - 0 0:00 0.00% kernel{ix0 que}
0 0 -920 0K 8944K - 0 0:00 0.00% kernel{ix0 que}

Lars
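For anyone retracing these experiments: the knobs being adjusted in this exchange are boot-time tunables. A hypothetical /boot/loader.conf fragment collecting them (values are ones tried in the thread, not recommendations; stock defaults are tx_itr=122, rx_itr=62, max_queues=32, ringsz=1024):

```
# /boot/loader.conf -- ixl tunables mentioned in this thread
hw.ixl.tx_itr=62       # half the default; halved flood-ping latency above
hw.ixl.rx_itr=31
hw.ixl.max_queues=4    # Luigi: 32 queues is a lot, 4-8 is plenty
hw.ixl.ringsz=256      # smaller ring; 2048 was also suggested
```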
Re: ixl 40G bad performance?
On Mon, 19 Oct 2015 21:47:36 -0700, Kevin Oberman wrote:
> > I suspect it might not touch the c states, but better check. The safest is
> > disable them in the bios.
> >
> To disable C-States:
> sysctl dev.cpu.0.cx_lowest=C1

Actually, you want to set hw.acpi.cpu.cx_lowest=C1 instead. Otherwise you've only changed cpu.0; if you try it you should see that other CPUs will have retained their previous C-state setting - up to 9.3 at least.

Setting performance_cx_lowest=C1 in rc.conf (and economy_cx_lowest=C1 on laptops) performs that by setting hw.acpi.cpu.cx_lowest on boot (and on every change to/from battery power) in power_profile via devd notifies.

cheers, Ian
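Ian's correction, condensed into the literal settings (a sketch of the thread's advice, not verified against any particular FreeBSD release):

```
# One-off, for all CPUs (unlike dev.cpu.0.cx_lowest, which is per-CPU):
#   sysctl hw.acpi.cpu.cx_lowest=C1

# Persistent, in /etc/rc.conf; power_profile applies these on boot and
# on every AC/battery transition:
performance_cx_lowest="C1"
economy_cx_lowest="C1"   # only relevant on battery-powered machines
```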
Re: ixl 40G bad performance?
On Tue, 20 Oct 2015, Eggert, Lars wrote:
> Hi,
>
> On 2015-10-20, at 10:24, Ian Smith wrote:
>> Actually, you want to set hw.acpi.cpu.cx_lowest=C1 instead.
>
> Done.
>
> On 2015-10-19, at 17:55, Luigi Rizzo wrote:
>> On Mon, Oct 19, 2015 at 8:34 AM, Eggert, Lars wrote:
>>> The only other sysctls in ixl(4) that look relevant are:
>>>
>>> hw.ixl.rx_itr
>>> The RX interrupt rate value, set to 8K by default.
>>>
>>> hw.ixl.tx_itr
>>> The TX interrupt rate value, set to 4K by default.
>>
>> yes those. raise to 20-50k and see what you get in
>> terms of ping latency.
>
> While ixl(4) talks about 8K and 4K, the defaults actually seem to be:
>
> hw.ixl.tx_itr: 122
> hw.ixl.rx_itr: 62

ixl seems to have a different set of itr sysctl bugs than em. In em, 122 for the itr means 125 initially, but it is documented (only by sysctl -d, not by the man page) as having units usecs/4. The units are actually usecs*4 except initially, and these units take effect if you write the initial value back -- writing back 122 changes the active period from 125 to 488. 122 instead of 125 is the result of confusion between powers of 2 and powers of 10.

The first obvious bug in ixl is that the above sysctls are read-only global tunables (not documented as sysctls of course), but you can write them using per-device sysctls (dev.ixl.[0-N].*itr?). Writing them for 1 device clobbers the globals and probably the settings for all ixl devices.

sysctl -d doesn't say anything useful about ixl's itrs. It misdocuments the units for all of them as being rates. Actually, the units for 2 of them are boolean and the units for the other 2 are periods. ixl(4) uses better wording for the booleans but even worse wording for the periods ("rate value"). em uses better wording for its itr sysctl but em(4) has no documentation for any sysctl or its itr tunable. igb is more like em than ixl here.

122 seems to be the result of mis-scaling 125, and 62 from correctly scaling 62.5, but these numbers are also off by a factor of 2.
Either there is a scaling bug or the undocumented units are usecs/2 where em's documented units are usecs/4. In em, the default itr rate is 8 kHz (power of 10), but in ixl it is unclear if 4K and 8K are actually 4000 and 8000, since they are scaled more in hardware (IXL_ITR_4K is hard-coded as 122; the scale is linear but there aren't enough bits to preserve linearity; it is unclear if the hard-coded values are defined by the hardware or are the result of precomputing the values (using hard-coded 0x7A (122) where em uses 100 / SCALE (10 being user-friendly microseconds and SCALE a hardware clock frequency))).

I think 122 really does mean a period that approximates the period for a frequency of 4 kHz. The period for this frequency is 250 usecs, and 122 is 250 with units of usec*2, with an approximate error of 3 units. Or 122 is the period for the documented frequency of 4K (binary power of 2 with undocumented units which I assume are Hz), with the weird usec*2 units and a tiny error. Similarly for 62 and 8K, except there is a rounding error of almost 1.

> Doubling those values *increases* flood ping latency to ~200 usec (from
> ~116 usec).

Since they are periods and not frequencies, doubling them should double the latency. Since their units are weird and undocumented, it is hard to predict what the latency actually is. But I predict that if the units are usecs*2, then the unscaled values give average latencies from interrupt moderation. This gives 122 + 62 = 184 plus maybe another 20 for other delays. Since the observed average latency is less than half that, the units seem to be usecs*1 and it is the documented frequencies that are off by a power of 2.

> Halving them to 62/31 decreases flood ping latency to ~50 usec, but still
> doesn't increase iperf throughput (still 2.8 Gb/s). Going to 31/16 further
> drops latency to 24 usec, with no change in throughput.

For em and lem, I use itr = 0 or 1 when optimizing for latency.
This reduces the latency to 50 for lem but only to 73 for em (where the connection goes through a slow switch to not so slow bge). 24 seems quite good, and the lowest I have seen for 1 Gbps is 26, but this requires kludges like a direct connection and polling, and I would hope for 40 times lower at 40 Gbps.

> (Looking at the "interrupt Moderation parameters" #defines in
> sys/dev/ixl/ixl.h it seems that ixl likes to have its irq rates specified
> with some weird divider scheme.)
>
> With 5/5 (which corresponds to IXL_ITR_100K), I get down to 16 usec.
> Unfortunately, throughput is then also down to about 2 Gb/s.

Lowering (improving) latency always lowers (unimproves) throughput by increasing load. itr = 8 kHz is reasonable for 1 Gbps (it gives higher latency than I like), but scaling that to 40 Gbps gives itr = 320 kHz and it is impossible to scale up the speed of a single CPU to reasonably keep up with that.

Fix for em:

X diff -u2 if_em.c~ if_em.c
X
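Bruce's reading of the hard-coded ITR constants can be sanity-checked with trivial arithmetic; the usec*2 unit is his hypothesis (not documented anywhere), and the comparison values 122 and 62 are the constants from sys/dev/ixl/ixl.h quoted above:

```shell
# Period of the documented rates, rescaled into hypothetical usec*2
# units, compared against the hard-coded register values 122 and 62.
p4k=$(( 1000000 / 4000 ))    # 250 usec period at 4 kHz
p8k=$(( 1000000 / 8000 ))    # 125 usec period at 8 kHz
echo $(( p4k / 2 ))          # prints 125, vs. IXL_ITR_4K = 122 (error ~3)
echo $(( p8k / 2 ))          # prints 62, vs. the 8K value 62 (error < 1)
```

Both hard-coded values land within a few counts of the usec*2 prediction, which is what makes the hypothesis plausible.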
Re: ixl 40G bad performance?
i would look at the following:

- c states and clock speed - make sure you never go below C1, and fix the clock speed to max. Sure these parameters also affect the 10G card, but there may be strange interaction that trigger the power saving modes in different ways
- interrupt moderation (may affect ping latency, do not remember how it is set in ixl but probably a sysctl)
- number of queues (32 is a lot, i wouldn't use more than 4-8), may affect cpu-socket affinity
- tso and flow director - i have seen bad effects of accelerations so i would run the iperf test with all of these features disabled on both sides, and then enable them one at a time
- queue sizes - the driver seems to use 1024 slots which is about 1.5 MB queued, which in turn means you have 300us (and possibly half of that) to drain the queue at 40Gbit/s. 150-300us may seem an eternity, but if a couple of cores fall into c7 your budget is gone and the loss will trigger a retransmission and window halving etc.

cheers
luigi

On Mon, Oct 19, 2015 at 6:52 AM, Eggert, Lars wrote:
> Hi,
>
> I'm running a few simple tests on -CURRENT with a pair of dual-port Intel
> XL710 boards, which are seen by the kernel as:
>
> ixl0: mem 0xdc80-0xdcff,0xdd808000-0xdd80 irq 32 at device 0.0 on pci3
> ixl0: Using MSIX interrupts with 33 vectors
> ixl0: f4.40 a1.4 n04.53 e80001dca
> ixl0: Using defaults for TSO: 65518/35/2048
> ixl0: Ethernet address: 68:05:ca:32:0b:98
> ixl0: PCI Express Bus: Speed 8.0GT/s Width x8
> ixl0: netmap queues/slots: TX 32/1024, RX 32/1024
> ixl1: mem 0xdc00-0xdc7f,0xdd80-0xdd807fff irq 32 at device 0.1 on pci3
> ixl1: Using MSIX interrupts with 33 vectors
> ixl1: f4.40 a1.4 n04.53 e80001dca
> ixl1: Using defaults for TSO: 65518/35/2048
> ixl1: Ethernet address: 68:05:ca:32:0b:99
> ixl1: PCI Express Bus: Speed 8.0GT/s Width x8
> ixl1: netmap queues/slots: TX 32/1024, RX 32/1024
> ixl0: link state changed to UP
> ixl1: link state changed to UP
>
> I have two identical machines connected with patch cables
(no switch). iperf > performance is bad: > > # iperf -c 10.0.1.2 > > Client connecting to 10.0.1.2, TCP port 5001 > TCP window size: 32.5 KByte (default) > > [ 3] local 10.0.1.1 port 19238 connected with 10.0.1.2 port 5001 > [ ID] Interval Transfer Bandwidth > [ 3] 0.0-10.0 sec 3.91 GBytes 3.36 Gbits/sec > > As is flood ping latency: > > # sudo ping -f 10.0.1.2 > PING 10.0.1.2 (10.0.1.2): 56 data bytes > .^C > --- 10.0.1.2 ping statistics --- > 41927 packets transmitted, 41926 packets received, 0.0% packet loss > round-trip min/avg/max/stddev = 0.084/0.116/0.145/0.002 ms > > Any ideas on what's going on here? Testing 10G ix interfaces between the same > two machines results in 9.39 Gbits/sec and flood ping latencies of 17 usec. > > Thanks, > Lars > > PS: Full dmesg attached. > > Copyright (c) 1992-2015 The FreeBSD Project. > Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994 > The Regents of the University of California. All rights reserved. > FreeBSD is a registered trademark of The FreeBSD Foundation. 
> FreeBSD 11.0-CURRENT #2 483de3c(muclab)-dirty: Mon Oct 19 11:01:16 CEST 2015 > > el...@laurel.muccbc.hq.netapp.com:/usr/home/elars/obj/usr/home/elars/src/sys/MUCLAB > amd64 > FreeBSD clang version 3.7.0 (tags/RELEASE_370/final 246257) 20150906 > VT(vga): resolution 640x480 > CPU: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz (2000.05-MHz K8-class CPU) > Origin="GenuineIntel" Id=0x206d7 Family=0x6 Model=0x2d Stepping=7 > > Features=0xbfebfbff > > Features2=0x1fbee3ff > AMD Features=0x2c100800 > AMD Features2=0x1 > XSAVE Features=0x1 > VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID > TSC: P-state invariant, performance statistics > real memory = 137438953472 (131072 MB) > avail memory = 133484290048 (127300 MB) > Event timer "LAPIC" quality 600 > ACPI APIC Table: < > > FreeBSD/SMP: Multiprocessor System Detected: 32 CPUs > FreeBSD/SMP: 2 package(s) x 8 core(s) x 2 SMT threads > cpu0 (BSP): APIC ID: 0 > cpu1 (AP): APIC ID: 1 > cpu2 (AP): APIC ID: 2 > cpu3 (AP): APIC ID: 3 > cpu4 (AP): APIC ID: 4 > cpu5 (AP): APIC ID: 5 > cpu6 (AP): APIC ID: 6 > cpu7 (AP): APIC ID: 7 > cpu8 (AP): APIC ID: 8 > cpu9 (AP): APIC ID: 9 > cpu10 (AP): APIC ID: 10 > cpu11 (AP): APIC ID: 11 > cpu12 (AP): APIC ID: 12 > cpu13 (AP): APIC ID: 13 > cpu14 (AP): APIC ID: 14 > cpu15 (AP): APIC
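Luigi's ~300 us queue-drain estimate from the list above works out as follows (1500-byte frames assumed, as in a standard-MTU iperf test; the 1024-slot figure is from the netmap queues/slots line in the dmesg):

```shell
# How long a full 1024-slot tx ring takes to drain at 40 Gbit/s.
slots=1024; frame=1500                 # descriptors * assumed frame size
bytes=$(( slots * frame ))
echo $bytes                            # prints 1536000, i.e. ~1.5 MB queued
echo $(( bytes * 8 / 40000 ))          # prints 307: drain time in usec
                                       # (40 Gbit/s = 40000 bits per usec)
```

So a stall of only a few hundred microseconds (e.g. a deep C-state wakeup) is enough to run the ring dry or overflow it, consistent with the "budget is gone" remark.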
Re: ixl 40G bad performance?
Hi,

On 2015-10-19, at 16:20, Luigi Rizzo wrote:
> i would look at the following:
> - c states and clock speed - make sure you never go below C1,
> and fix the clock speed to max.
> Sure these parameters also affect the 10G card, but there
> may be strange interaction that trigger the power saving
> modes in different ways

I already have powerd_flags="-a max -b max -n max" in rc.conf, which I hope should be enough.

> - interrupt moderation (may affect ping latency,
> do not remember how it is set in ixl but probably a sysctl

ixl(4) describes two sysctls that sound like they control AIM, and they default to off:

hw.ixl.dynamic_tx_itr: 0
hw.ixl.dynamic_rx_itr: 0

> - number of queues (32 is a lot i wouldn't use more than 4-8),
> may affect cpu-socket affinity

With hw.ixl.max_queues=4 in loader.conf, performance is still unchanged.

> - tso and flow director - i have seen bad effects of
> accelerations so i would run the iperf test with
> of these features disabled on both sides, and then enable
> them one at a time

No change with "ifconfig -tso4 -tso6 -rxcsum -txcsum -lro".

How do I turn off flow director?

> - queue sizes - the driver seems to use 1024 slots which is
> about 1.5 MB queued, which in turn means you have 300us
> (and possibly half of that) to drain the queue at 40Gbit/s.
> 150-300us may seem an eternity, but if a couple of cores fall
> into c7 your budget is gone and the loss will trigger a
> retransmission and window halving etc.

Also no change with "hw.ixl.ringsz=256" in loader.conf.

This is really weird.

Lars
Re: ixl 40G bad performance?
On Monday, October 19, 2015, Eggert, Lars wrote:
> Hi,
>
> On 2015-10-19, at 16:20, Luigi Rizzo wrote:
> > i would look at the following:
> > - c states and clock speed - make sure you never go below C1,
> > and fix the clock speed to max.
> > Sure these parameters also affect the 10G card, but there
> > may be strange interaction that trigger the power saving
> > modes in different ways
>
> I already have powerd_flags="-a max -b max -n max" in rc.conf, which I
> hope should be enough.

I suspect it might not touch the c states, but better check. The safest is to disable them in the bios.

> > - interrupt moderation (may affect ping latency,
> > do not remember how it is set in ixl but probably a sysctl
>
> ixl(4) describes two sysctls that sound like they control AIM, and they
> default to off:
>
> hw.ixl.dynamic_tx_itr: 0
> hw.ixl.dynamic_rx_itr: 0

There must be some other control for the actual (fixed, not dynamic) moderation.

> > - number of queues (32 is a lot i wouldn't use more than 4-8),
> > may affect cpu-socket affinity
>
> With hw.ixl.max_queues=4 in loader.conf, performance is still unchanged.
>
> > - tso and flow director - i have seen bad effects of
> > accelerations so i would run the iperf test with
> > of these features disabled on both sides, and then enable
> > them one at a time
>
> No change with "ifconfig -tso4 -tso6 -rxcsum -txcsum -lro".
>
> How do I turn off flow director?

I am not sure if it is enabled in FreeBSD. It is in linux and almost halves the pkt rate with netmap (from 35 down to 19mpps). Maybe it is not too bad for bulk TCP.

> > - queue sizes - the driver seems to use 1024 slots which is
> > about 1.5 MB queued, which in turn means you have 300us
> > (and possibly half of that) to drain the queue at 40Gbit/s.
> > 150-300us may seem an eternity, but if a couple of cores fall
> > into c7 your budget is gone and the loss will trigger a
> > retransmission and window halving etc.
> Also no change with "hw.ixl.ringsz=256" in loader.conf.

Any better success with 2048 slots? 3.5 gbit is what I used to see on the ixgbe with tso disabled, probably hitting a CPU bound.

Cheers
Luigi

> This is really weird.
>
> Lars

-- 
-+---
Prof. Luigi RIZZO, ri...@iet.unipi.it . Dip. di Ing. dell'Informazione
http://www.iet.unipi.it/~luigi/ . Universita` di Pisa
TEL +39-050-2217533 . via Diotisalvi 2
Mobile +39-338-6809875 . 56122 PISA (Italy)
-+---
Re: ixl 40G bad performance?
Hi,

in order to eliminate network or hardware weirdness, I've rerun the test with Linux 4.3rc6, where I get 13.1 Gbits/sec throughput and 52 usec flood ping latency. Not great either, but in line with earlier experiments with Mellanox NICs and an untuned Linux system.

On 2015-10-19, at 17:11, Luigi Rizzo wrote:
> I suspect it might not touch the c states, but better check. The safest is
> disable them in the bios.

I'll try that.

>> hw.ixl.dynamic_tx_itr: 0
>> hw.ixl.dynamic_rx_itr: 0
>
> There must be some other control for the actual (fixed, not dynamic)
> moderation.

The only other sysctls in ixl(4) that look relevant are:

hw.ixl.rx_itr
The RX interrupt rate value, set to 8K by default.

hw.ixl.tx_itr
The TX interrupt rate value, set to 4K by default.

I'll play with those.

>> Also no change with "hw.ixl.ringsz=256" in loader.conf.
>
> Any better success with 2048 slots?
> 3.5 gbit is what I used to see on the ixgbe with tso disabled, probably
> hitting a CPU bound.

Will try.

Thanks!

Lars
Re: ixl 40G bad performance?
On Mon, Oct 19, 2015 at 8:34 AM, Eggert, Lars wrote:
> Hi,
>
> in order to eliminate network or hardware weirdness, I've rerun the test with
> Linux 4.3rc6, where I get 13.1 Gbits/sec throughput and 52 usec flood ping
> latency. Not great either, but in line with earlier experiments with Mellanox
> NICs and an untuned Linux system.
> ...
>> There must be some other control for the actual (fixed, not dynamic)
>> moderation.
>
> The only other sysctls in ixl(4) that look relevant are:
>
> hw.ixl.rx_itr
> The RX interrupt rate value, set to 8K by default.
>
> hw.ixl.tx_itr
> The TX interrupt rate value, set to 4K by default.

yes those. raise to 20-50k and see what you get in terms of ping latency.

Note that 4k on tx means you get to reclaim buffers in the tx queue (unless it is done opportunistically) every 250us which is dangerously close to the 300us capacity of the queue itself.

cheers
luigi

> I'll play with those.
>
>>> Also no change with "hw.ixl.ringsz=256" in loader.conf.
>>
>> Any better success with 2048 slots?
>> 3.5 gbit is what I used to see on the ixgbe with tso disabled, probably
>> hitting a CPU bound.
>
> Will try.
>
> Thanks!
>
> Lars
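Luigi's "dangerously close" remark, in numbers (same 1024-slot/1500-byte assumptions as his earlier queue-capacity estimate):

```shell
# Compare the interval between tx-reclaim interrupts at a 4 kHz itr
# with the time a full 1024-slot ring takes to drain at 40 Gbit/s.
reclaim_us=$(( 1000000 / 4000 ))           # prints as 250: usec between reclaims
drain_us=$(( 1024 * 1500 * 8 / 40000 ))    # prints as 307: usec of ring capacity
echo "$reclaim_us $drain_us"               # prints "250 307"
```

Only ~57 us of headroom separates the two, so one delayed reclaim interrupt can stall the transmit path, which is why raising the tx interrupt rate (or shrinking the ring) was worth trying.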
Re: ixl 40G bad performance?
On Mon, Oct 19, 2015 at 8:11 AM, Luigi Rizzo wrote:
> On Monday, October 19, 2015, Eggert, Lars wrote:
>
> > Hi,
> >
> > On 2015-10-19, at 16:20, Luigi Rizzo wrote:
> > >
> > > i would look at the following:
> > > - c states and clock speed - make sure you never go below C1,
> > > and fix the clock speed to max.
> > > Sure these parameters also affect the 10G card, but there
> > > may be strange interaction that trigger the power saving
> > > modes in different ways
> >
> > I already have powerd_flags="-a max -b max -n max" in rc.conf, which I
> > hope should be enough.
>
> I suspect it might not touch the c states, but better check. The safest is
> disable them in the bios.

To disable C-States:

sysctl dev.cpu.0.cx_lowest=C1

-- 
Kevin Oberman, Part time kid herder and retired Network Engineer
Re: ixl 40G bad performance?
On 10/19/15 at 08:11P, Luigi Rizzo wrote:
> On Monday, October 19, 2015, Eggert, Lars wrote:
>
> > How do I turn off flow director?
>
> I am not sure if it is enabled in FreeBSD. It is in linux and almost
> halves the pkt rate with netmap (from 35 down to 19mpps).
> Maybe it is not too bad for bulk TCP.

Flow director support is incomplete on FreeBSD and that's why it is disabled by default.

Cheers,
Hiren