Re: ixl 40G bad performance?

2015-12-10 Thread Denis Pearson
On Thu, Dec 10, 2015 at 2:18 PM, Eggert, Lars  wrote:

> On 2015-10-26, at 18:40, Eggert, Lars  wrote:
> > On 2015-10-26, at 17:08, Pieper, Jeffrey E 
> wrote:
> >> As a caveat, this was using default netperf message sizes.
> >
> > I get the same ~3 Gb/s with the default netperf sizes and driver 1.4.5.
>
> Now there is version 1.4.8 on the Intel website, but it doesn't change
> things for me.
>

I had the opportunity to see similar numbers and behavior with XL710 driver
1.4.3 as of FreeBSD r291085 in DPDK poll mode, while driver 1.2.8 as of
r292035 delivered the expected numbers. Removing rxcsum/txcsum alone made no
difference, but fully removing RSS and disabling rx/tx checksum support gave
better numbers.

However, now with driver 1.4.8 and the same hardware setup, except for a
different transceiver, I can get 36Gbps/24Mpps with no further tweaks. So if
you can replace your transceiver, that would be a useful different test to
start from.


>
> > When you tcpdump during the run, do you see TSO/LRO in effect, i.e., do
> you see "segments" > 32K in the trace?
>
> I still see no TSO/LRO in effect when tcpdump'ing on the receiver; note
> how all the packets are 1448 bytes:
>
> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
> listening on ixl0, link-type EN10MB (Ethernet), capture size 262144 bytes
> 17:02:42.328782 IP 10.0.4.1.21507 > 10.0.4.2.12865: Flags [S], seq
> 15244366, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 478099
> ecr 0], length 0
> 17:02:42.328808 IP 10.0.4.2.12865 > 10.0.4.1.21507: Flags [S.], seq
> 1819579546, ack 15244367, win 65535, options [mss 1460,nop,wscale
> 6,sackOK,TS val 3553932482 ecr 478099], length 0
> 17:02:42.328842 IP 10.0.4.1.21507 > 10.0.4.2.12865: Flags [.], ack 1, win
> 1040, options [nop,nop,TS val 478099 ecr 3553932482], length 0
> 17:02:42.329804 IP 10.0.4.1.21507 > 10.0.4.2.12865: Flags [P.], seq 1:657,
> ack 1, win 1040, options [nop,nop,TS val 478100 ecr 3553932482], length 656
> 17:02:42.331671 IP 10.0.4.2.12865 > 10.0.4.1.21507: Flags [P.], seq 1:657,
> ack 657, win 1040, options [nop,nop,TS val 3553932485 ecr 478100], length
> 656
> 17:02:42.331717 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [S], seq
> 1387798477, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 478102
> ecr 0], length 0
> 17:02:42.331729 IP 10.0.4.2.30216 > 10.0.4.1.10449: Flags [S.], seq
> 4085135109, ack 1387798478, win 65535, options [mss 1460,nop,wscale
> 6,sackOK,TS val 282922 ecr 478102], length 0
> 17:02:42.331781 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [.], ack 1, win
> 1040, options [nop,nop,TS val 478102 ecr 282922], length 0
> 17:02:42.331796 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [.], seq 1:1449,
> ack 1, win 1040, options [nop,nop,TS val 478102 ecr 282922], length 1448
> 17:02:42.331800 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [.], seq
> 1449:2897, ack 1, win 1040, options [nop,nop,TS val 478102 ecr 282922],
> length 1448
> 17:02:42.331807 IP 10.0.4.2.30216 > 10.0.4.1.10449: Flags [.], ack 2897,
> win 1018, options [nop,nop,TS val 282923 ecr 478102], length 0
> 17:02:42.331809 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [.], seq
> 2897:4345, ack 1, win 1040, options [nop,nop,TS val 478102 ecr 282922],
> length 1448
> 17:02:42.331813 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [.], seq
> 4345:5793, ack 1, win 1040, options [nop,nop,TS val 478102 ecr 282922],
> length 1448
> 17:02:42.331817 IP 10.0.4.2.30216 > 10.0.4.1.10449: Flags [.], ack 5793,
> win 1018, options [nop,nop,TS val 282923 ecr 478102], length 0
> 17:02:42.331818 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [.], seq
> 5793:7241, ack 1, win 1040, options [nop,nop,TS val 478102 ecr 282922],
> length 1448
> 17:02:42.331821 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [.], seq
> 7241:8689, ack 1, win 1040, options [nop,nop,TS val 478102 ecr 282922],
> length 1448
> 17:02:42.331825 IP 10.0.4.2.30216 > 10.0.4.1.10449: Flags [.], ack 8689,
> win 1018, options [nop,nop,TS val 282923 ecr 478102], length 0
> 17:02:42.331826 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [.], seq
> 8689:10137, ack 1, win 1040, options [nop,nop,TS val 478102 ecr
> 282922], length 1448
> 17:02:42.331829 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [.], seq
> 10137:11585, ack 1, win 1040, options [nop,nop,TS val 478102 ecr
> 282922], length 1448
> ...
>
> Doing the same trace over 10G ix interfaces shows most segments in the
> 8-32K range, indicating that TSO/LRO are in use (and results in 9.9G
> throughput).
>
> Lars
>


Re: ixl 40G bad performance?

2015-12-10 Thread Adrian Chadd
On 10 December 2015 at 10:29, Denis Pearson  wrote:
> On Thu, Dec 10, 2015 at 2:18 PM, Eggert, Lars  wrote:
>
>> On 2015-10-26, at 18:40, Eggert, Lars  wrote:
>> > On 2015-10-26, at 17:08, Pieper, Jeffrey E 
>> wrote:
>> >> As a caveat, this was using default netperf message sizes.
>> >
>> > I get the same ~3 Gb/s with the default netperf sizes and driver 1.4.5.
>>
>> Now there is version 1.4.8 on the Intel website, but it doesn't change
>> things for me.
>>
>
> I had the opportunity to see similar numbers and behavior while using XL710
> 1.4.3 as of FreeBSD r291085 while in DPDK poll mode, but driver 1.2.8 as of
> r292035 was providing expected numbers. While removing rxcsum/txcsum did
> not provide differences, fully removing RSS + disabling rx/cxsum support
> provided better numbers.

Can someone debug this a bit more? (My kit with ixl NICs in it is
still not up and available. :( )

Device RSS, even without kernel RSS enabled, shouldn't cause a massive
performance drop. If it is then something else odd is going on.

Do you have a diff where you removed things?


-adrian

> However now with driver 1.4.8 and the same set of hardware setup, except
> for a different transceiver, I can get 36Gbps/24Mpps with no further
> tweaks, so if you can replace your transceiver, shall be a different test
> as a starting point.


Re: ixl 40G bad performance?

2015-12-10 Thread Denis Pearson
On Thu, Dec 10, 2015 at 4:40 PM, Adrian Chadd 
wrote:

> On 10 December 2015 at 10:29, Denis Pearson 
> wrote:
> > On Thu, Dec 10, 2015 at 2:18 PM, Eggert, Lars  wrote:
> >
> >> On 2015-10-26, at 18:40, Eggert, Lars  wrote:
> >> > On 2015-10-26, at 17:08, Pieper, Jeffrey E <
> jeffrey.e.pie...@intel.com>
> >> wrote:
> >> >> As a caveat, this was using default netperf message sizes.
> >> >
> >> > I get the same ~3 Gb/s with the default netperf sizes and driver
> 1.4.5.
> >>
> >> Now there is version 1.4.8 on the Intel website, but it doesn't change
> >> things for me.
> >>
> >
> > I had the opportunity to see similar numbers and behavior while using
> XL710
> > 1.4.3 as of FreeBSD r291085 while in DPDK poll mode, but driver 1.2.8 as
> of
> > r292035 was providing expected numbers. While removing rxcsum/txcsum did
> > not provide differences, fully removing RSS + disabling rx/cxsum support
> > provided better numbers.
>
> Can someone debug this a bit more? (My kit with ixl NICs in it is
> still not up and available. :( )
>
> Device RSS, even without kernel RSS enabled, shouldn't cause a massive
> performance drop. If it is then something else odd is going on.


> Do you have a diff where you removed things?
>

I can probably dig up a snapshot of the code from that time and extract a
diff, yes. I'm just not sure it's worth the time, since the problem is not
reproducible with the current 1.4.8 driver, which will hopefully get into
-CURRENT (if it's not already there?). It's also quite specific: the
performance drop happened in DPDK poll mode, not in normal kernel operation,
so a simple diff showing only the changes needed for the driver to build and
run without RSS would still require a test lab and different ways to generate
traffic.

This is why I suggested a transceiver change or replug first.

Anyway, the RSS performance drop is far from a FreeBSD-specific problem.
While researching I found the exact same complaints from Windows users
starting with Windows 8, with RSS@4, RSS@16, or RSS completely disabled,
sometimes with acceptable results only when it was disabled (and despite
MiniportInterruptDPC using a whole CPU when RSS was off, results were still
better). So I guess this is just one of those cases where it's simply better
to have a NIC feature turned off. As for the reason, I'm not the engineer to
answer that, but I would guess it's related to other NIC features also
touching the packet, or to some kind of errors that netstat or the driver
status may not report.

I was able to see the problem even with low pps rates and big packet sizes,
as well as with an average packet size of 768 bytes, so I don't think it's
any sort of card resource starvation. I can have the whole lab up and running
by the weekend if you want to investigate and compare; just ping me off-list.
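
(For the kernel-path side of such an A/B test, a minimal sketch -- the
interface name ixl0 is only an example -- is to toggle the offloads and
re-run the benchmark:)

  # baseline run with the offloads disabled
  ifconfig ixl0 -tso4 -tso6 -lro -rxcsum -txcsum
  # ... run netperf/iperf, record the numbers ...
  # then re-enable them one at a time and compare
  ifconfig ixl0 tso4 tso6 lro rxcsum txcsum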


>
> -adrian
>
> > However now with driver 1.4.8 and the same set of hardware setup, except
> > for a different transceiver, I can get 36Gbps/24Mpps with no further
> > tweaks, so if you can replace your transceiver, shall be a different test
> > as a starting point.
>


Re: ixl 40G bad performance?

2015-12-10 Thread Luigi Rizzo
On Thu, Dec 10, 2015 at 10:40 AM, Adrian Chadd  wrote:
> On 10 December 2015 at 10:29, Denis Pearson  wrote:
>> On Thu, Dec 10, 2015 at 2:18 PM, Eggert, Lars  wrote:
>>
>>> On 2015-10-26, at 18:40, Eggert, Lars  wrote:
>>> > On 2015-10-26, at 17:08, Pieper, Jeffrey E 
>>> wrote:
>>> >> As a caveat, this was using default netperf message sizes.
>>> >
>>> > I get the same ~3 Gb/s with the default netperf sizes and driver 1.4.5.
>>>
>>> Now there is version 1.4.8 on the Intel website, but it doesn't change
>>> things for me.
>>>
>>
>> I had the opportunity to see similar numbers and behavior while using XL710
>> 1.4.3 as of FreeBSD r291085 while in DPDK poll mode, but driver 1.2.8 as of
>> r292035 was providing expected numbers. While removing rxcsum/txcsum did
>> not provide differences, fully removing RSS + disabling rx/cxsum support
>> provided better numbers.
>
> Can someone debug this a bit more? (My kit with ixl NICs in it is
> still not up and available. :( )
>
> Device RSS, even without kernel RSS enabled, shouldn't cause a massive
> performance drop. If it is then something else odd is going on.

I am not sure whether we are digressing (Lars' complaint was about
poor bulk throughput, and now I see DPDK and high packet rates mentioned,
so I feel obliged to pitch in!), but here is a related piece of info:

last spring, with netmap and i40e on Linux (I don't remember
which driver/firmware), we saw that enabling FlowDirector
killed the pps throughput (from 32 down to 18 Mpps).

FlowDirector is a device feature which was probably affecting
ordinary processing on the NIC, either because of bugs or because
it consumed controller resources.
The same may well be happening with other device features.

cheers
luigi

>
> Do you have a diff where you removed things?
>
>
> -adrian
>
>> However now with driver 1.4.8 and the same set of hardware setup, except
>> for a different transceiver, I can get 36Gbps/24Mpps with no further
>> tweaks, so if you can replace your transceiver, shall be a different test
>> as a starting point.
> ___
> freebsd-net@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"



-- 
-+---
 Prof. Luigi RIZZO, ri...@iet.unipi.it  . Dip. di Ing. dell'Informazione
 http://www.iet.unipi.it/~luigi/. Universita` di Pisa
 TEL  +39-050-2217533   . via Diotisalvi 2
 Mobile   +39-338-6809875   . 56122 PISA (Italy)
-+---


Re: ixl 40G bad performance?

2015-12-10 Thread Adrian Chadd
[snip]

If RSS works fine on the latest driver then great.

This was with single queue netperf, right?


-a


Re: ixl 40G bad performance?

2015-12-10 Thread Eggert, Lars
Hi,

On 2015-12-10, at 20:42, Denis Pearson  wrote:
> I can probably find out a snapshot with the code at the time and extract a 
> diff, yes. I just don't know how it worths wasting the time when the problem 
> is not reproducible on the current 1.4.8 driver which will hopefully get into 
> -CURRENT (if it's not already there?).

Per my last email, I do see the same issues with 1.4.8.

This is with a single netperf TCP flow, no NIC parameter tuning and no RSS or 
PCBGROUP in the kernel.

> This is why I suggested a transceiver change or replug first.

I will test this next week. (However, the same testbed booted into Linux 
doesn't see these low netperf numbers.)

It really smells like a TSO/LRO (= packet rate) issue. If I configure 
jumbograms, performance jumps up as expected.
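
(A minimal way to reproduce that check, assuming an interface named ixl0 on
both ends:)

  # switch to jumbo frames
  ifconfig ixl0 mtu 9000
  # confirm that TSO4/TSO6 and LRO are still listed as enabled
  ifconfig ixl0 | grep -i options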

Lars




Re: ixl 40G bad performance?

2015-12-10 Thread Eggert, Lars
On 2015-10-26, at 18:40, Eggert, Lars  wrote:
> On 2015-10-26, at 17:08, Pieper, Jeffrey E  wrote:
>> As a caveat, this was using default netperf message sizes.
> 
> I get the same ~3 Gb/s with the default netperf sizes and driver 1.4.5.

Now there is version 1.4.8 on the Intel website, but it doesn't change things 
for me.

> When you tcpdump during the run, do you see TSO/LRO in effect, i.e., do you 
> see "segments" > 32K in the trace?

I still see no TSO/LRO in effect when tcpdump'ing on the receiver; note how all 
the packets are 1448 bytes:

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ixl0, link-type EN10MB (Ethernet), capture size 262144 bytes
17:02:42.328782 IP 10.0.4.1.21507 > 10.0.4.2.12865: Flags [S], seq 15244366, 
win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 478099 ecr 0], length 0
17:02:42.328808 IP 10.0.4.2.12865 > 10.0.4.1.21507: Flags [S.], seq 1819579546, 
ack 15244367, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 
3553932482 ecr 478099], length 0
17:02:42.328842 IP 10.0.4.1.21507 > 10.0.4.2.12865: Flags [.], ack 1, win 1040, 
options [nop,nop,TS val 478099 ecr 3553932482], length 0
17:02:42.329804 IP 10.0.4.1.21507 > 10.0.4.2.12865: Flags [P.], seq 1:657, ack 
1, win 1040, options [nop,nop,TS val 478100 ecr 3553932482], length 656
17:02:42.331671 IP 10.0.4.2.12865 > 10.0.4.1.21507: Flags [P.], seq 1:657, ack 
657, win 1040, options [nop,nop,TS val 3553932485 ecr 478100], length 656
17:02:42.331717 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [S], seq 1387798477, 
win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 478102 ecr 0], length 0
17:02:42.331729 IP 10.0.4.2.30216 > 10.0.4.1.10449: Flags [S.], seq 4085135109, 
ack 1387798478, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 
282922 ecr 478102], length 0
17:02:42.331781 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [.], ack 1, win 1040, 
options [nop,nop,TS val 478102 ecr 282922], length 0
17:02:42.331796 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [.], seq 1:1449, ack 
1, win 1040, options [nop,nop,TS val 478102 ecr 282922], length 1448
17:02:42.331800 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [.], seq 1449:2897, 
ack 1, win 1040, options [nop,nop,TS val 478102 ecr 282922], length 1448
17:02:42.331807 IP 10.0.4.2.30216 > 10.0.4.1.10449: Flags [.], ack 2897, win 
1018, options [nop,nop,TS val 282923 ecr 478102], length 0
17:02:42.331809 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [.], seq 2897:4345, 
ack 1, win 1040, options [nop,nop,TS val 478102 ecr 282922], length 1448
17:02:42.331813 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [.], seq 4345:5793, 
ack 1, win 1040, options [nop,nop,TS val 478102 ecr 282922], length 1448
17:02:42.331817 IP 10.0.4.2.30216 > 10.0.4.1.10449: Flags [.], ack 5793, win 
1018, options [nop,nop,TS val 282923 ecr 478102], length 0
17:02:42.331818 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [.], seq 5793:7241, 
ack 1, win 1040, options [nop,nop,TS val 478102 ecr 282922], length 1448
17:02:42.331821 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [.], seq 7241:8689, 
ack 1, win 1040, options [nop,nop,TS val 478102 ecr 282922], length 1448
17:02:42.331825 IP 10.0.4.2.30216 > 10.0.4.1.10449: Flags [.], ack 8689, win 
1018, options [nop,nop,TS val 282923 ecr 478102], length 0
17:02:42.331826 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [.], seq 8689:10137, 
ack 1, win 1040, options [nop,nop,TS val 478102 ecr 282922], length 1448
17:02:42.331829 IP 10.0.4.1.10449 > 10.0.4.2.30216: Flags [.], seq 10137:11585, 
ack 1, win 1040, options [nop,nop,TS val 478102 ecr 282922], length 1448
...

Doing the same trace over 10G ix interfaces shows most segments in the 8-32K 
range, indicating that TSO/LRO are in use (and results in 9.9G throughput).
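
(A quick way to summarize the segment sizes in such a trace, using only
tcpdump and awk -- interface name assumed:)

  # histogram of TCP segment sizes seen on the wire
  tcpdump -nni ixl0 -c 2000 tcp 2>/dev/null | \
      awk '/ length /{print $NF}' | sort -n | uniq -c | tail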

Lars




Re: ixl 40G bad performance?

2015-10-26 Thread Eggert, Lars
On 2015-10-26, at 4:38, Kevin Oberman  wrote:
> On Sun, Oct 25, 2015 at 12:10 AM, Daniel Engberg <
> daniel.engberg.li...@pyret.net> wrote:
> 
>> One thing I've noticed that probably affects your performance benchmarks
>> somewhat is that you're using iperf(2) instead of the newer iperf3 but I
>> could be wrong...
> 
> iperf3 is not a newer version of iperf. It is a total re-write and a rather
> different tool. It has significant improvements in many areas and new
> capabilities that might be of use. That said, there is no reason to think
> that the results of tests using iperf2 are in any way inaccurate. However,
> it is entirely possible to get misleading results if options not properly
> selected.

FWIW, I've been using netperf and tried various options.

I don't think the issue is the benchmarking tool. I think the issue is TSO/LRO
(per my earlier email).

Lars





Re: ixl 40G bad performance?

2015-10-26 Thread Eggert, Lars
On 2015-10-26, at 15:38, Pieper, Jeffrey E  wrote:
> With the latest ixl component from: 
> https://downloadcenter.intel.com/download/25160/Network-Adapter-Driver-for-PCI-E-40-Gigabit-Network-Connections-under-FreeBSD-
> 
> running on 10.2 amd64, I easily get 9.6 Gb/s with one netperf stream, either 
> b2b or through a switch. This is with no driver/kernel tuning. Running 4 
> streams easily gets me 36 GB/s.

Thanks, will test!

If the newer driver makes a difference, any chance we'll see it in -HEAD soon?

Lars




RE: ixl 40G bad performance?

2015-10-26 Thread Pieper, Jeffrey E


-Original Message-
From: owner-freebsd-...@freebsd.org [mailto:owner-freebsd-...@freebsd.org] On 
Behalf Of Eggert, Lars
Sent: Monday, October 26, 2015 2:28 AM
To: Kevin Oberman <rkober...@gmail.com>
Cc: freebsd-net@freebsd.org; Daniel Engberg <daniel.engberg.li...@pyret.net>
Subject: Re: ixl 40G bad performance?

On 2015-10-26, at 4:38, Kevin Oberman <rkober...@gmail.com> wrote:
> On Sun, Oct 25, 2015 at 12:10 AM, Daniel Engberg <
> daniel.engberg.li...@pyret.net> wrote:
> 
>> One thing I've noticed that probably affects your performance benchmarks
>> somewhat is that you're using iperf(2) instead of the newer iperf3 but I
>> could be wrong...
> 
> iperf3 is not a newer version of iperf. It is a total re-write and a rather
> different tool. It has significant improvements in many areas and new
> capabilities that might be of use. That said, there is no reason to think
> that the results of tests using iperf2 are in any way inaccurate. However,
> it is entirely possible to get misleading results if options not properly
> selected.
>
>FWIW, I've been using netperf and tried various options.
>
>I don't think the issues is the benchmarking tool. I think the issue is 
>TSO/LRO issues (per my earlier email.)
>
>Lars

With the latest ixl component from: 
https://downloadcenter.intel.com/download/25160/Network-Adapter-Driver-for-PCI-E-40-Gigabit-Network-Connections-under-FreeBSD-

running on 10.2 amd64, I easily get 9.6 Gb/s with one netperf stream, either 
b2b or through a switch. This is with no driver/kernel tuning. Running 4 
streams easily gets me 36 Gb/s.
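
(For the multi-stream case, something along these lines works -- peer
address, duration and stream count are just examples:)

  # four concurrent netperf TCP streams against one peer
  for i in 1 2 3 4; do
      netperf -H 10.0.4.2 -t TCP_STREAM -l 30 -P 0 &
  done
  wait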

Jeff




RE: ixl 40G bad performance?

2015-10-26 Thread Pieper, Jeffrey E


-Original Message-
From: Eggert, Lars [mailto:l...@netapp.com] 
Sent: Monday, October 26, 2015 8:08 AM
To: Pieper, Jeffrey E <jeffrey.e.pie...@intel.com>
Cc: Kevin Oberman <rkober...@gmail.com>; freebsd-net@freebsd.org; Daniel 
Engberg <daniel.engberg.li...@pyret.net>
Subject: Re: ixl 40G bad performance?

On 2015-10-26, at 15:38, Pieper, Jeffrey E <jeffrey.e.pie...@intel.com> wrote:
> With the latest ixl component from: 
> https://downloadcenter.intel.com/download/25160/Network-Adapter-Driver-for-PCI-E-40-Gigabit-Network-Connections-under-FreeBSD-
> 
> running on 10.2 amd64, I easily get 9.6 Gb/s with one netperf stream, either 
> b2b or through a switch. This is with no driver/kernel tuning. Running 4 
> streams easily gets me 36 GB/s.
>
>Thanks, will test!
>
>If the newer driver makes a difference, any chance we'll see it in -HEAD soon?
>
>Lars

As a caveat, this was using default netperf message sizes. 

Jeff




Re: ixl 40G bad performance?

2015-10-26 Thread Eggert, Lars
On 2015-10-26, at 17:08, Pieper, Jeffrey E  wrote:
> As a caveat, this was using default netperf message sizes.

I get the same ~3 Gb/s with the default netperf sizes and driver 1.4.5.

When you tcpdump during the run, do you see TSO/LRO in effect, i.e., do you see 
"segments" > 32K in the trace?

Lars




Re: ixl 40G bad performance?

2015-10-25 Thread Kevin Oberman
On Sun, Oct 25, 2015 at 12:10 AM, Daniel Engberg <
daniel.engberg.li...@pyret.net> wrote:

> One thing I've noticed that probably affects your performance benchmarks
> somewhat is that you're using iperf(2) instead of the newer iperf3 but I
> could be wrong...
>
> Best regards,
> Daniel
>

iperf3 is not a newer version of iperf. It is a total re-write and a rather
different tool. It has significant improvements in many areas and new
capabilities that might be of use. That said, there is no reason to think
that the results of tests using iperf2 are in any way inaccurate. However,
it is entirely possible to get misleading results if options are not properly
selected.
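
(As an illustration only -- peer address assumed -- a typical multi-stream
iperf3 run looks like this:)

  # 4 parallel streams, 30 s test, report every 5 s
  iperf3 -c 10.0.1.2 -P 4 -t 30 -i 5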
--
Kevin Oberman, Part time kid herder and retired Network Engineer
E-mail: rkober...@gmail.com
PGP Fingerprint: D03FB98AFA78E3B78C1694B318AB39EF1B055683


Re: ixl 40G bad performance?

2015-10-24 Thread Eggert, Lars
On 2015-10-23, at 23:36, Eric Joyner  wrote:
> I see that the sysctl does clobber the global value, but have you tried 
> lowering the interval / raising the rate? You could try something like 
> 10usecs, and see if that helps. We'll do some more investigation here -- 
> 3Gb/s on a 40Gb/s using default settings is terrible, and we shouldn't let 
> that be happening.

I played with different settings, but I've never been able to get more than 
4Gb/s, whereas under Linux 4.2 without any special settings I get 13.

See my other email on TSO/LRO not looking to be effective; that would certainly 
explain it. Plausible? Anything to try here?

Lars





Re: ixl 40G bad performance?

2015-10-24 Thread Jack Vogel
13 on a 40G interface?? I don't think that's very good for Linux either, is
this a 4x10 adapter?
Maybe elaborate on the details of the hardware -- are you sure you don't have
a bad PCI slot somewhere that might be throttling everything?

Cheers,

Jack


On Sat, Oct 24, 2015 at 12:43 AM, Eggert, Lars  wrote:

> On 2015-10-23, at 23:36, Eric Joyner  wrote:
>
> I see that the sysctl does clobber the global value, but have you tried
> lowering the interval / raising the rate? You could try something like
> 10usecs, and see if that helps. We'll do some more investigation here --
> 3Gb/s on a 40Gb/s using default settings is terrible, and we shouldn't let
> that be happening.
>
>
> I played with different settings, but I've never been able to get more
> than 4Gb/s, whereas under Linux 4.2 without any special settings I get 13.
>
> See my other email on TSO/LRO not looking to be effective; that would
> certainly explain it. Plausible? Anything to try here?
>
> Lars
>
>


Re: ixl 40G bad performance?

2015-10-24 Thread Eggert, Lars
On 2015-10-24, at 10:32, Jack Vogel  wrote:
> 13 on a 40G interface?? I don't think that's very good for Linux either, is
> this a 4x10 adapter?

No, it's a 2x40. And I can get it into the high 30s with tuning. I just
mentioned the value to illustrate that something seems to be seriously broken 
under FreeBSD.

Lars

> Maybe elaborating on the details of the hardware, you sure you don't have a
> bad PCI slot
> somewhere that might be throttling everything?
> 
> Cheers,
> 
> Jack
> 
> 
> On Sat, Oct 24, 2015 at 12:43 AM, Eggert, Lars  wrote:
> 
>> On 2015-10-23, at 23:36, Eric Joyner  wrote:
>> 
>> I see that the sysctl does clobber the global value, but have you tried
>> lowering the interval / raising the rate? You could try something like
>> 10usecs, and see if that helps. We'll do some more investigation here --
>> 3Gb/s on a 40Gb/s using default settings is terrible, and we shouldn't let
>> that be happening.
>> 
>> 
>> I played with different settings, but I've never been able to get more
>> than 4Gb/s, whereas under Linux 4.2 without any special settings I get 13.
>> 
>> See my other email on TSO/LRO not looking to be effective; that would
>> certainly explain it. Plausible? Anything to try here?
>> 
>> Lars
>> 
>> 
> ___
> freebsd-net@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"





Re: ixl 40G bad performance?

2015-10-23 Thread Eric Joyner
Bruce mostly has it right -- ITR is the minimum latency between interrupts,
and it does actually guarantee a minimum period between interrupts.
Fortville is a little unique, though, in that there is another ITR setting
that can ensure a certain average number of interrupts per second (called
Interrupt Rate Limiting), but I don't think this is used in the current
version of the driver.

I see that the sysctl does clobber the global value, but have you tried
lowering the interval / raising the rate? You could try something like
10 usecs and see if that helps. We'll do some more investigation here --
3 Gb/s on a 40 Gb/s link with default settings is terrible, and we shouldn't
let that be happening.
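
(On the FreeBSD side that would presumably look like the following; the
exact per-device sysctl names and units are an assumption, based on the
dev.ixl.[0-N] sysctls discussed later in this thread:)

  # assumed per-device ITR sysctls; names/units may differ per driver version
  sysctl dev.ixl.0.rx_itr=10
  sysctl dev.ixl.0.tx_itr=10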

- Eric

On Thu, Oct 22, 2015 at 10:36 PM Bruce Evans  wrote:

> On Wed, 21 Oct 2015, Bruce Evans wrote:
>
> > Fix for em:
> >
> > X diff -u2 if_em.c~ if_em.c
> > X --- if_em.c~2015-09-28 06:29:35.0 +
> > X +++ if_em.c 2015-10-18 18:49:36.876699000 +
> > X @@ -609,8 +609,8 @@
> > X em_tx_abs_int_delay_dflt);
> > X em_add_int_delay_sysctl(adapter, "itr",
> > X -   "interrupt delay limit in usecs/4",
> > X +   "interrupt delay limit in usecs",
> > X >tx_itr,
> > X E1000_REGISTER(hw, E1000_ITR),
> > X -   DEFAULT_ITR);
> > X +   100 / MAX_INTS_PER_SEC);
> > X X   /* Sysctl for limiting the amount of work done in the taskqueue */
> >
> > "delay limit" is fairly good wording.  Other parameters tend to give long
> > delays, but itr limits the longest delay due to interrupt moderation to
> > whatever the itr respresents.
>
> Everything in the last paragraph is backwards (inverted).  Other
> parameters tend to give short delays.  They should be set to small
> values to minimise latency.  Then under load, itr limits the interrupt
> _rate_ from above.  The interrupt delay is the inverse of the interrupt
> rate, so it is limited from below.  So "delay limit" is fairly bad
> wording.  Normally, limits are from above, but the inversion makes
> the itr limit from below.
>
> This is most easily understood by converting itr to a rate: itr = 125
> means a rate limit of 8000 Hz.  It doesn't quite mean that the latency
> is at least 125 usec.  No one wants to ensure large latencies, and the
> itr setting only ensures a minimal average latency them under load.
>
> Bruce
>


Re: ixl 40G bad performance?

2015-10-22 Thread Eggert, Lars
Hi,

for those of you following along, I did try jumbograms and throughput increased
roughly 5x. So it looks like I'm hitting a packet-rate limit somewhere.

Lars




Re: ixl 40G bad performance?

2015-10-22 Thread Bruce Evans

On Wed, 21 Oct 2015, Bruce Evans wrote:


Fix for em:

X diff -u2 if_em.c~ if_em.c
X --- if_em.c~  2015-09-28 06:29:35.0 +
X +++ if_em.c   2015-10-18 18:49:36.876699000 +
X @@ -609,8 +609,8 @@
X   em_tx_abs_int_delay_dflt);
X   em_add_int_delay_sysctl(adapter, "itr",
X - "interrupt delay limit in usecs/4",
X + "interrupt delay limit in usecs",
X   >tx_itr,
X   E1000_REGISTER(hw, E1000_ITR),
X - DEFAULT_ITR);
X + 100 / MAX_INTS_PER_SEC);
X
X   /* Sysctl for limiting the amount of work done in the taskqueue */

"delay limit" is fairly good wording.  Other parameters tend to give long
delays, but itr limits the longest delay due to interrupt moderation to
whatever the itr respresents.


Everything in the last paragraph is backwards (inverted).  Other
parameters tend to give short delays.  They should be set to small
values to minimise latency.  Then under load, itr limits the interrupt
_rate_ from above.  The interrupt delay is the inverse of the interrupt
rate, so it is limited from below.  So "delay limit" is fairly bad
wording.  Normally, limits are from above, but the inversion makes
the itr limit from below.

This is most easily understood by converting itr to a rate: itr = 125
means a rate limit of 8000 Hz.  It doesn't quite mean that the latency
is at least 125 usec.  No one wants to ensure large latencies, and the
itr setting only ensures a minimal average latency under load.
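
(The conversion is just the reciprocal of the period, e.g.:)

  # 125 us between interrupts -> interrupt rate in Hz
  awk 'BEGIN { printf "%.0f Hz\n", 1e6 / 125 }'   # prints 8000 Hz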

Bruce


Re: ixl 40G bad performance?

2015-10-22 Thread Eggert, Lars
On 2015-10-22, at 9:38, Eggert, Lars  wrote:
> for those of you following along, I did try jumbograms and throughput 
> increases roughly 5x. So it looks like I'm hitting a packet-rate limit 
> somewhere.

Does the ixl driver have an issue with TSO/LRO?

If I tcpdump on the receiver when testing the 10G ix interfaces, I see that 
most "packets" are up to 64KB in the traces on both sender and receiver, which 
is expected with TSO/LRO.

When I look at the traffic over the ixl interfaces, I see that most "packets" 
on the sender are much smaller (~2896 bytes, i.e. 2 segments; although a few are
>40K). On the receiver, I only see 1448 byte packets.

Lars




Re: ixl 40G bad performance?

2015-10-21 Thread Jack Vogel
The 40G hardware is absolutely dependent on firmware; a mismatch, for
instance, can totally bork things. So I would work with your Intel rep and
make sure you have the correct version for your specific hardware.

Good luck,

Jack


On Wed, Oct 21, 2015 at 5:25 AM, Eggert, Lars  wrote:

> Hi Bruce,
>
> thanks for the very detailed analysis of the ixl sysctls!
>
> On 2015-10-20, at 16:51, Bruce Evans  wrote:
> >
> > Lowering (improving) latency always lowers (unimproves) throughput by
> > increasing load.
>
> That, I also understand. But even when I back off the itr values to
> something more reasonable, throughput still remains low.
>
> With all the tweaking I have tried, I have yet to top 3 Gb/s with ixl
> cards, whereas they do ~13 Gb/s on Linux straight out of the box.
>
> Lars
>


Re: ixl 40G bad performance?

2015-10-21 Thread Eggert, Lars
Hi Jack,

On 2015-10-21, at 16:14, Jack Vogel  wrote:
> The 40G hardware is absolutely dependent on firmware, if you have a mismatch
> for instance, it can totally bork things. So, I would work with your Intel
> rep and be sure you have the correct version for your specific hardware.

I got these tester cards from Amazon, so I don't have a rep.

I flashed the latest NVM (1.2.5), because previously the FreeBSD driver was 
complaining about the firmware being too old. But I did that before the 
experiments.

If there is anything else I should be doing, I'd appreciate being put in 
contact with someone at Intel who can help.

Thanks,
Lars




Re: ixl 40G bad performance?

2015-10-21 Thread hiren panchasara
+ Eric from Intel
(Also trimming the CC list as it wouldn't let me send the message
otherwise.)

On 10/21/15 at 02:59P, Eggert, Lars wrote:
> Hi Jack,
> 
> On 2015-10-21, at 16:14, Jack Vogel  wrote:
> > The 40G hardware is absolutely dependent on firmware, if you have a mismatch
> > for instance, it can totally bork things. So, I would work with your Intel
> > rep and be sure you have the correct version for your specific hardware.
> 
> I got these tester cards from Amazon, so I don't have a rep.
> 
> I flashed the latest NVM (1.2.5), because previously the FreeBSD driver was 
> complaining about the firmware being too old. But I did that before the 
> experiments.
> 
> If there is anything else I should be doing, I'd appreciate being put in 
> contact with someone at Intel who can help.

Eric,

Can you think of anything else that could explain this low performance?

Cheers,
Hiren




Re: ixl 40G bad performance?

2015-10-21 Thread Eggert, Lars
Hi Bruce,

thanks for the very detailed analysis of the ixl sysctls!

On 2015-10-20, at 16:51, Bruce Evans  wrote:
> 
> Lowering (improving) latency always lowers (unimproves) throughput by
> increasing load.

That, I also understand. But even when I back off the itr values to something 
more reasonable, throughput still remains low.

With all the tweaking I have tried, I have yet to top 3 Gb/s with ixl cards, 
whereas they do ~13 Gb/s on Linux straight out of the box.

Lars




Re: ixl 40G bad performance?

2015-10-20 Thread Eggert, Lars
Hi,

On 2015-10-20, at 10:24, Ian Smith  wrote:
> Actually, you want to set hw.acpi.cpu.cx_lowest=C1 instead.

Done.

On 2015-10-19, at 17:55, Luigi Rizzo  wrote:
> On Mon, Oct 19, 2015 at 8:34 AM, Eggert, Lars  wrote:
>> The only other sysctls in ixl(4) that look relevant are:
>> 
>> hw.ixl.rx_itr
>> The RX interrupt rate value, set to 8K by default.
>> 
>> hw.ixl.tx_itr
>> The TX interrupt rate value, set to 4K by default.
>> 
> 
> yes those. raise to 20-50k and see what you get in
> terms of ping latency.

While ixl(4) talks about 8K and 4K, the defaults actually seem to be:

hw.ixl.tx_itr: 122
hw.ixl.rx_itr: 62

Doubling those values *increases* flood ping latency to ~200 usec (from ~116 
usec).

Halving them to 62/31 decreases flood ping latency to ~50 usec, but still 
doesn't increase iperf throughput (still 2.8 Gb/s). Going to 31/16 further 
drops latency to 24 usec, with no change in throughput.

(Looking at the "interrupt Moderation parameters" #defines in sys/dev/ixl/ixl.h 
it seems that ixl likes to have its irq rates specified with some weird divider 
scheme.)

With 5/5 (which corresponds to IXL_ITR_100K), I get down to 16 usec. 
Unfortunately, throughput is then also down to about 2 Gb/s.

One thing I noticed in top is that one queue irq is using quite a bit of CPU 
when I run iperf:

   11  0   -92- 0K  1152K CPU22   0:19  50.98% intr{irq293: ixl1:q2}
   11  0   -92- 0K  1152K WAIT3   0:02   5.18% intr{irq294: ixl1:q3}
 0  0   -920 0K  8944K -  25   0:01   1.07% kernel{ixl1 que}
   11  0   -92- 0K  1152K WAIT1   0:01   0.00% intr{irq292: ixl1:q1}
   11  0   -92- 0K  1152K WAIT0   0:00   0.00% intr{irq291: ixl1:q0}
 0  0   -920 0K  8944K -  22   0:00   0.00% kernel{ixl1 adminq}
 0  0   -920 0K  8944K -  31   0:00   0.00% kernel{ixl1 que}
 0  0   -920 0K  8944K -  31   0:00   0.00% kernel{ixl1 que}
 0  0   -920 0K  8944K -  31   0:00   0.00% kernel{ixl1 que}
   11  0   -92- 0K  1152K WAIT   -1   0:00   0.00% intr{irq290: ixl1:aq}

With 10G ix interfaces and a throughput of ~9Gb/s, the CPU load is much lower:

   11  0   -92- 0K  1152K WAIT0   0:05   7.67% intr{irq274: ix0:que }
 0  0   -920 0K  8944K -  27   0:00   0.29% kernel{ix0 que}
 0  0   -920 0K  8944K -  10   0:00   0.00% kernel{ix0 linkq}
   11  0   -92- 0K  1152K WAIT1   0:00   0.00% intr{irq275: ix0:que }
   11  0   -92- 0K  1152K WAIT3   0:00   0.00% intr{irq277: ix0:que }
   11  0   -92- 0K  1152K WAIT2   0:00   0.00% intr{irq276: ix0:que }
   11  0   -92- 0K  1152K WAIT   18   0:00   0.00% intr{irq278: ix0:link}
0  0   -920 0K  8944K -   0   0:00   0.00% kernel{ix0 que}
0  0   -920 0K  8944K -   0   0:00   0.00% kernel{ix0 que}
0  0   -920 0K  8944K -   0   0:00   0.00% kernel{ix0 que}

Lars




Re: ixl 40G bad performance?

2015-10-20 Thread Ian Smith
On Mon, 19 Oct 2015 21:47:36 -0700, Kevin Oberman wrote:
 > > I suspect it might not touch the c states, but better check. The safest is
 > > disable them in the bios.
 > >
 > 
 > To disable C-States:
 > sysctl dev.cpu.0.cx_lowest=C1

Actually, you want to set hw.acpi.cpu.cx_lowest=C1 instead.  Otherwise 
you've only changed cpu.0; if you try it you should see that other CPUs 
will have retained their previous C-state setting - up to 9.3 at least.

Setting performance_cx_lowest=C1 in rc.conf (and economy_cx_lowest=C1 on 
laptops) performs that by setting hw.acpi.cpu.cx_lowest on boot (and on 
every change to/from battery power) in power_profile via devd notifies.
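
(In other words, roughly:)

  # one-off, applies to all CPUs
  sysctl hw.acpi.cpu.cx_lowest=C1

  # persistent, via /etc/rc.conf:
  #   performance_cx_lowest="C1"
  #   economy_cx_lowest="C1"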

cheers, Ian


Re: ixl 40G bad performance?

2015-10-20 Thread Bruce Evans

On Tue, 20 Oct 2015, Eggert, Lars wrote:


Hi,

On 2015-10-20, at 10:24, Ian Smith  wrote:

Actually, you want to set hw.acpi.cpu.cx_lowest=C1 instead.


Done.

On 2015-10-19, at 17:55, Luigi Rizzo  wrote:

On Mon, Oct 19, 2015 at 8:34 AM, Eggert, Lars  wrote:

The only other sysctls in ixl(4) that look relevant are:

hw.ixl.rx_itr
The RX interrupt rate value, set to 8K by default.

hw.ixl.tx_itr
The TX interrupt rate value, set to 4K by default.



yes those. raise to 20-50k and see what you get in
terms of ping latency.


While ixl(4) talks about 8K and 4K, the defaults actually seem to be:

hw.ixl.tx_itr: 122
hw.ixl.rx_itr: 62


ixl seems to have a different set of itr sysctl bugs than em.  In em,
122 for the itr means 125 initially, but it is documented (only by
sysctl -d, not by the man page) as having units usecs/4.  The units
are actually usecs*4 except initially, and these units take effect if
you write the initial value back -- writing back 122 changes the active
period from 125 to 488.  122 instead of 125 is the result of confusion
between powers of 2 and powers of 10.

The first obvious bug in ixl is that the above sysctls are read-only
global tunables (not documented as sysctls of course), but you can
write them using per-device sysctls (dev.ixl.[0-N].*itr?).  Writing
them for 1 device clobbers the globals and probably the settings for
all ixl devices.

sysctl -d doesn't say anything useful about ixl's itrs.  It misdocuments
the units for all of them as being rates.  Actually, the units for 2
of them are boolean and the units for the other 2 are periods.  ixl(4)
uses better wording for the booleans but even worse wording for the
periods ("rate value").  em uses better wording for its itr sysctl but
em(4) has no documentation for any sysctl or its itr tunable.  igb is
more like em than ixl here.

122 seems to be the result of mis-scaling 125, and 62 from correctly
scaling 62.5, but these numbers are also off by a factor of 2.  Either
there is a scaling bug or the undocumented units are usecs/2 where
em's documented units are usecs/4.  In em, the default itr rate is
8 kHz (power of 10), but in ixl it is unclear if 4K and 8K are actually
4000 and 8000, since they are scaled more in hardware (IXL_ITR_4K is
hard-coded as 122; the scale is linear but there aren't enough bits
to preserve linearity; it is unclear if the hard-coded values are
defined by the hardware or are the result of precomputing the values
(using hard-coded 0x7A (122) where em uses 100 / SCALE (10
being user-friendly microseconds and SCALE a hardware clock frequency))).

I think 122 really does mean a period that approximates the period for
a frequency of 4 khz.  The period for this frequency is 250 usecs,
and 122 is 250 with units of usec*2, with an approximate error of
3 units.  Or 122 is the period for the documented frequency of 4K
(binary power of 2 with undocumented units which I assume are Hz),
with the weird usec*2 units and a tiny error.  Similarly for 62 and
8K, except there is a rounding error of almost 1.


Doubling those values *increases* flood ping latency to ~200 usec (from ~116 
usec).


Since they are periods and not frequencies, doubling them should double
the latency.  Since their units are weird and undocumented, it is hard to
predict what the latency actually is.  But I predict that if the units are
usecs*2, then the unscaled values give average latencies from interrupt
moderation.  This gives 122 + 62 = 184 plus maybe another 20 for other
delays.  Since the observed average latency is less than half that, the
units seem to be usecs*1 and it is the documented frequencies that are off
by a power of 2.


Halving them to 62/31 decreases flood ping latency to ~50 usec, but still 
doesn't increase iperf throughput (still 2.8 Gb/s). Going to 31/16 further 
drops latency to 24 usec, with no change in throughput.


For em and lem, I use itr = 0 or 1 when optimizing for latency.  This
reduces the latency to 50 for lem but only to 73 for em (where the
connection goes through a slow switch to not so slow bge).  24 seems
quite good, and the lowest I have seen for 1 Gbps is 26, but this
requires kludges like a direct connection and polling, and I would
hope for 40 times lower at 40 Gbps.


(Looking at the "interrupt Moderation parameters" #defines in sys/dev/ixl/ixl.h 
it seems that ixl likes to have its irq rates specified with some weird divider scheme.)

With 5/5 (which corresponds to IXL_ITR_100K), I get down to 16 usec. 
Unfortunately, throughput is then also down to about 2 Gb/s.


Lowering (improving) latency always lowers (unimproves) throughput by
increasing load.  itr = 8 kHz is reasonable for 1 Gbps (it gives higher
latency than I like), but scaling that to 40 Gbps gives itr = 320 kHz
and it is impossible to scale up the speed of a single CPU to reasonably
keep up with that.

Fix for em:

X diff -u2 if_em.c~ if_em.c
X 

Re: ixl 40G bad performance?

2015-10-19 Thread Luigi Rizzo
i would look at the following:
- c states and clock speed - make sure you never go below C1,
  and fix the clock speed to max.
  Sure these parameters also affect the 10G card, but there
  may be strange interaction that trigger the power saving
  modes in different ways

- interrupt moderation (may affect ping latency;
  I do not remember how it is set in ixl, but probably a sysctl)

- number of queues (32 is a lot, I wouldn't use more than 4-8),
  may affect cpu-socket affinity

- tso and flow director - I have seen bad effects from these
  accelerations, so I would run the iperf test with both of
  these features disabled on both sides, and then enable
  them one at a time

- queue sizes - the driver seems to use 1024 slots which is
  about 1.5 MB queued, which in turn means you have 300us
  (and possibly half of that) to drain the queue at 40Gbit/s.
  150-300us may seem an eternity, but if a couple of cores fall
  into c7 your budget is gone and the loss will trigger a
  retransmission and window halving etc.
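
(On the FreeBSD side, the knobs above map to roughly the following; the
values are only illustrative and the tunables are the ones used later in
this thread:)

  # /boot/loader.conf (boot-time tunables)
  #   hw.ixl.max_queues=4       # fewer queues
  #   hw.ixl.ringsz=2048        # descriptor ring size
  # at runtime, per interface (ixl0 assumed):
  ifconfig ixl0 -tso4 -tso6 -lro -rxcsum -txcsum
  sysctl hw.acpi.cpu.cx_lowest=C1    # keep cores out of deep C-states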

cheers
luigi

On Mon, Oct 19, 2015 at 6:52 AM, Eggert, Lars  wrote:
> Hi,
>
> I'm running a few simple tests on -CURRENT with a pair of dual-port Intel 
> XL710 boards, which are seen by the kernel as:
>
> ixl0:  mem 
> 0xdc80-0xdcff,0xdd808000-0xdd80 irq 32 at device 0.0 on pci3
> ixl0: Using MSIX interrupts with 33 vectors
> ixl0: f4.40 a1.4 n04.53 e80001dca
> ixl0: Using defaults for TSO: 65518/35/2048
> ixl0: Ethernet address: 68:05:ca:32:0b:98
> ixl0: PCI Express Bus: Speed 8.0GT/s Width x8
> ixl0: netmap queues/slots: TX 32/1024, RX 32/1024
> ixl1:  mem 
> 0xdc00-0xdc7f,0xdd80-0xdd807fff irq 32 at device 0.1 on pci3
> ixl1: Using MSIX interrupts with 33 vectors
> ixl1: f4.40 a1.4 n04.53 e80001dca
> ixl1: Using defaults for TSO: 65518/35/2048
> ixl1: Ethernet address: 68:05:ca:32:0b:99
> ixl1: PCI Express Bus: Speed 8.0GT/s Width x8
> ixl1: netmap queues/slots: TX 32/1024, RX 32/1024
> ixl0: link state changed to UP
> ixl1: link state changed to UP
>
> I have two identical machines connected with patch cables (no switch). iperf 
> performance is bad:
>
> # iperf -c 10.0.1.2
> 
> Client connecting to 10.0.1.2, TCP port 5001
> TCP window size: 32.5 KByte (default)
> 
> [  3] local 10.0.1.1 port 19238 connected with 10.0.1.2 port 5001
> [ ID] Interval   Transfer Bandwidth
> [  3]  0.0-10.0 sec  3.91 GBytes  3.36 Gbits/sec
>
> As is flood ping latency:
>
> # sudo ping -f 10.0.1.2
> PING 10.0.1.2 (10.0.1.2): 56 data bytes
> .^C
> --- 10.0.1.2 ping statistics ---
> 41927 packets transmitted, 41926 packets received, 0.0% packet loss
> round-trip min/avg/max/stddev = 0.084/0.116/0.145/0.002 ms
>
> Any ideas on what's going on here? Testing 10G ix interfaces between the same 
> two machines results in 9.39 Gbits/sec and flood ping latencies of 17 usec.
>
> Thanks,
> Lars
>
> PS: Full dmesg attached.
>
> Copyright (c) 1992-2015 The FreeBSD Project.
> Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
> The Regents of the University of California. All rights reserved.
> FreeBSD is a registered trademark of The FreeBSD Foundation.
> FreeBSD 11.0-CURRENT #2 483de3c(muclab)-dirty: Mon Oct 19 11:01:16 CEST 2015
> 
> el...@laurel.muccbc.hq.netapp.com:/usr/home/elars/obj/usr/home/elars/src/sys/MUCLAB
>  amd64
> FreeBSD clang version 3.7.0 (tags/RELEASE_370/final 246257) 20150906
> VT(vga): resolution 640x480
> CPU: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz (2000.05-MHz K8-class CPU)
>   Origin="GenuineIntel"  Id=0x206d7  Family=0x6  Model=0x2d  Stepping=7
>   
> Features=0xbfebfbff
>   
> Features2=0x1fbee3ff
>   AMD Features=0x2c100800
>   AMD Features2=0x1
>   XSAVE Features=0x1
>   VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID
>   TSC: P-state invariant, performance statistics
> real memory  = 137438953472 (131072 MB)
> avail memory = 133484290048 (127300 MB)
> Event timer "LAPIC" quality 600
> ACPI APIC Table: < >
> FreeBSD/SMP: Multiprocessor System Detected: 32 CPUs
> FreeBSD/SMP: 2 package(s) x 8 core(s) x 2 SMT threads
>  cpu0 (BSP): APIC ID:  0
>  cpu1 (AP): APIC ID:  1
>  cpu2 (AP): APIC ID:  2
>  cpu3 (AP): APIC ID:  3
>  cpu4 (AP): APIC ID:  4
>  cpu5 (AP): APIC ID:  5
>  cpu6 (AP): APIC ID:  6
>  cpu7 (AP): APIC ID:  7
>  cpu8 (AP): APIC ID:  8
>  cpu9 (AP): APIC ID:  9
>  cpu10 (AP): APIC ID: 10
>  cpu11 (AP): APIC ID: 11
>  cpu12 (AP): APIC ID: 12
>  cpu13 (AP): APIC ID: 13
>  cpu14 (AP): APIC ID: 14
>  cpu15 (AP): APIC 

Re: ixl 40G bad performance?

2015-10-19 Thread Eggert, Lars
Hi,

On 2015-10-19, at 16:20, Luigi Rizzo  wrote:
> 
> i would look at the following:
> - c states and clock speed - make sure you never go below C1,
>  and fix the clock speed to max.
>  Sure these parameters also affect the 10G card, but there
>  may be strange interaction that trigger the power saving
>  modes in different ways

I already have powerd_flags="-a max -b max -n max" in rc.conf, which I hope 
should be enough.

> - interrupt moderation (may affect ping latency,
>  do not remember how it is set in ixl but probably a sysctl

ixl(4) describes two sysctls that sound like they control AIM, and they default 
to off:

hw.ixl.dynamic_tx_itr: 0
hw.ixl.dynamic_rx_itr: 0

> - number of queues (32 is a lot i wouldn't use more than 4-8),
>  may affect cpu-socket affinity

With hw.ixl.max_queues=4 in loader.conf, performance is still unchanged.

> - tso and flow director - i have seen bad effects of
>  accelerations so i would run the iperf test with
>  of these features disabled on both sides, and then enable
>  them one at a time

No change with "ifconfig -tso4 -tso6 -rxcsum -txcsum -lro".

How do I turn off flow director?

> - queue sizes - the driver seems to use 1024 slots which is
>  about 1.5 MB queued, which in turn means you have 300us
>  (and possibly half of that) to drain the queue at 40Gbit/s.
>  150-300us may seem an eternity, but if a couple of cores fall
>  into c7 your budget is gone and the loss will trigger a
>  retransmission and window halving etc.

Also no change with "hw.ixl.ringsz=256" in loader.conf.

This is really weird.

Lars




Re: ixl 40G bad performance?

2015-10-19 Thread Luigi Rizzo
On Monday, October 19, 2015, Eggert, Lars  wrote:

> Hi,
>
> On 2015-10-19, at 16:20, Luigi Rizzo >
> wrote:
> >
> > i would look at the following:
> > - c states and clock speed - make sure you never go below C1,
> >  and fix the clock speed to max.
> >  Sure these parameters also affect the 10G card, but there
> >  may be strange interaction that trigger the power saving
> >  modes in different ways
>
> I already have powerd_flags="-a max -b max -n max" in rc.conf, which I
> hope should be enough.


I suspect it might not touch the C-states, but better check. The safest is
to disable them in the BIOS.


>
> > - interrupt moderation (may affect ping latency,
> >  do not remember how it is set in ixl but probably a sysctl
>
> ixl(4) describes two sysctls that sound like they control AIM, and they
> default to off:
>
> hw.ixl.dynamic_tx_itr: 0
> hw.ixl.dynamic_rx_itr: 0
>
>
There must be some other control for the actual (fixed, not dynamic)
moderation.


> > - number of queues (32 is a lot i wouldn't use more than 4-8),
> >  may affect cpu-socket affinity
>
> With hw.ixl.max_queues=4 in loader.conf, performance is still unchanged.
>
> > - tso and flow director - i have seen bad effects of
> >  accelerations so i would run the iperf test with
> >  of these features disabled on both sides, and then enable
> >  them one at a time
>
> No change with "ifconfig -tso4 -tso6 -rxcsum -txcsum -lro".
>
> How do I turn off flow director?


I am not sure if it is enabled in FreeBSD. It is in Linux, and it almost
halves the pkt rate with netmap (from 35 down to 19 Mpps).
Maybe it is not too bad for bulk TCP.


>
> > - queue sizes - the driver seems to use 1024 slots which is
> >  about 1.5 MB queued, which in turn means you have 300us
> >  (and possibly half of that) to drain the queue at 40Gbit/s.
> >  150-300us may seem an eternity, but if a couple of cores fall
> >  into c7 your budget is gone and the loss will trigger a
> >  retransmission and window halving etc.
>
> Also no change with "hw.ixl.ringsz=256" in loader.conf.


Any better success with 2048 slots?
3.5 Gbit/s is what I used to see on ixgbe with TSO disabled, probably
hitting a CPU limit.

Cheers
Luigi


> This is really weird.
>
> Lars
>


-- 
-+---
 Prof. Luigi RIZZO, ri...@iet.unipi.it  . Dip. di Ing. dell'Informazione
 http://www.iet.unipi.it/~luigi/. Universita` di Pisa
 TEL  +39-050-2217533   . via Diotisalvi 2
 Mobile   +39-338-6809875   . 56122 PISA (Italy)
-+---


Re: ixl 40G bad performance?

2015-10-19 Thread Eggert, Lars
Hi,

in order to eliminate network or hardware weirdness, I've rerun the test with 
Linux 4.3rc6, where I get 13.1 Gbits/sec throughput and 52 usec flood ping 
latency. Not great either, but in line with earlier experiments with Mellanox 
NICs and an untuned Linux system.

On 2015-10-19, at 17:11, Luigi Rizzo  wrote:
> I suspect it might not touch the c states, but better check. The safest is
> disable them in the bios.

I'll try that.

>> hw.ixl.dynamic_tx_itr: 0
>> hw.ixl.dynamic_rx_itr: 0
>> 
>> 
> There must be some other control for the actual (fixed, not dynamic)
> moderation.

The only other sysctls in ixl(4) that look relevant are:

 hw.ixl.rx_itr
 The RX interrupt rate value, set to 8K by default.

 hw.ixl.tx_itr
 The TX interrupt rate value, set to 4K by default.

I'll play with those.

>> Also no change with "hw.ixl.ringsz=256" in loader.conf.
> 
> Any better success with 2048 slots?
> 3.5 gbit  is what I used to see on the ixgbe with tso disabled, probably
> hitting a CPU bound.

Will try.

Thanks!

Lars




Re: ixl 40G bad performance?

2015-10-19 Thread Luigi Rizzo
On Mon, Oct 19, 2015 at 8:34 AM, Eggert, Lars  wrote:
> Hi,
>
> in order to eliminate network or hardware weirdness, I've rerun the test with 
> Linux 4.3rc6, where I get 13.1 Gbits/sec throughput and 52 usec flood ping 
> latency. Not great either, but in line with earlier experiments with Mellanox 
> NICs and an untuned Linux system.
>
...

>> There must be some other control for the actual (fixed, not dynamic)
>> moderation.
>
> The only other sysctls in ixl(4) that look relevant are:
>
>  hw.ixl.rx_itr
>  The RX interrupt rate value, set to 8K by default.
>
>  hw.ixl.tx_itr
>  The TX interrupt rate value, set to 4K by default.
>

Yes, those. Raise them to 20-50k and see what you get in
terms of ping latency.
Note that 4k on tx means you only get to reclaim buffers
in the tx queue (unless it is done opportunistically)
every 250us, which is dangerously close to the 300us
capacity of the queue itself.
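
(That capacity figure is just ring size times frame size over line rate; a
back-of-the-envelope check, assuming ~1500-byte frames:)

  # 1024 descriptors * ~1500 B/frame, drained at 40 Gbit/s
  awk 'BEGIN { printf "%.0f us\n", 1024 * 1500 * 8 / 40e9 * 1e6 }'   # ~307 us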

cheers
luigi

> I'll play with those.
>
>>> Also no change with "hw.ixl.ringsz=256" in loader.conf.
>>
>> Any better success with 2048 slots?
>> 3.5 gbit  is what I used to see on the ixgbe with tso disabled, probably
>> hitting a CPU bound.
>
> Will try.
>
> Thanks!
>
> Lars



-- 
-+---
 Prof. Luigi RIZZO, ri...@iet.unipi.it  . Dip. di Ing. dell'Informazione
 http://www.iet.unipi.it/~luigi/. Universita` di Pisa
 TEL  +39-050-2217533   . via Diotisalvi 2
 Mobile   +39-338-6809875   . 56122 PISA (Italy)
-+---


Re: ixl 40G bad performance?

2015-10-19 Thread Kevin Oberman
On Mon, Oct 19, 2015 at 8:11 AM, Luigi Rizzo  wrote:

> On Monday, October 19, 2015, Eggert, Lars  wrote:
>
> > Hi,
> >
> > On 2015-10-19, at 16:20, Luigi Rizzo >
> > wrote:
> > >
> > > i would look at the following:
> > > - c states and clock speed - make sure you never go below C1,
> > >  and fix the clock speed to max.
> > >  Sure these parameters also affect the 10G card, but there
> > >  may be strange interaction that trigger the power saving
> > >  modes in different ways
> >
> > I already have powerd_flags="-a max -b max -n max" in rc.conf, which I
> > hope should be enough.
>
>
> I suspect it might not touch the c states, but better check. The safest is
> disable them in the bios.
>

To disable C-States:
sysctl dev.cpu.0.cx_lowest=C1
--
Kevin Oberman, Part time kid herder and retired Network Engineer


Re: ixl 40G bad performance?

2015-10-19 Thread hiren panchasara
On 10/19/15 at 08:11P, Luigi Rizzo wrote:
> On Monday, October 19, 2015, Eggert, Lars  wrote:
> 
> >
> > How do I turn off flow director?
> 
> 
> I am not sure if it is enabled I'm FreeBSD. It is in linux and almost
> halves the pkt rate with netmap (from 35 down to 19mpps).
> Maybe it is not too bad for bulk TCP.
>

Flow director support is incomplete on FreeBSD and that's why it is
disabled by default.

Cheers,
Hiren

