[dpdk-dev] [dpdk-users] RSS Hash not working for XL710/X710 NICs for some RX mbuf sizes

2016-07-26 Thread Take Ceara
Hi Beilei,

On Tue, Jul 26, 2016 at 10:47 AM, Zhang, Helin  wrote:
> Hi Ceara
>
> For testpmd command line, txqflags = 0xf01 should be set for receiving 
> packets which needs more than one mbufs.
> I am not sure if it is helpful for you here. Please have a try!
>

Just tried, and it doesn't really help:
testpmd -c 1 -n 4 -w :01:00.3 -w :81:00.3 -- -i
--coremask=0x0 --rxq=2 --txq=2 --mbuf-size 1152 --txpkts 1024
--enable-rx-cksum --rss-udp --txqflags 0xf01

  src=68:05:CA:38:6D:63 - dst=02:00:00:00:00:01 - type=0x0800 -
length=1024 - nb_segs=1 - RSS hash=0x0 - RSS queue=0x0 - (outer) L2
type: ETHER - (outer) L3 type: IPV4_EXT_UNKNOWN SIP=C0A80001
DIP=C0A8000A - (outer) L4 type: UDP - Tunnel type: Unknown - Inner L2
type: Unknown - Inner L3 type: Unknown - Inner L4 type: Unknown
 - Receive queue=0x0
  PKT_RX_RSS_HASH

As I was saying in my previous email, the problem is that the RSS hash is
set in the last mbuf of the chain instead of the first:

http://dpdk.org/browse/dpdk/tree/drivers/net/i40e/i40e_rxtx.c#n1438

Even worse, the last rxm mbuf was already freed if it only contained
the CRC which had to be stripped:

http://dpdk.org/browse/dpdk/tree/drivers/net/i40e/i40e_rxtx.c#n1419
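
In other words, something along the following lines would be needed in the
scattered receive path. This is only a sketch to illustrate the point, not a
tested patch against i40e_rxtx.c: with a 1024-byte data room (1152 minus the
128-byte headroom), a 1024-byte frame plus the 4-byte CRC presumably spills
into a second segment, which is how this path gets exercised.

#include <rte_mbuf.h>

/*
 * Illustration only -- not the actual i40e code.  For a packet that
 * spans several RX descriptors, the RSS hash reported with the last
 * descriptor has to land in the *first* mbuf of the chain, because that
 * is where applications read mbuf->hash.rss; it must be copied before a
 * CRC-only tail segment is trimmed or freed.
 */
static inline void
set_rss_on_first_seg(struct rte_mbuf *first_seg, uint32_t rss_hash)
{
    first_seg->ol_flags |= PKT_RX_RSS_HASH;
    first_seg->hash.rss = rss_hash;
}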

Regards,
Dumitru


> Regards,
> Helin
>
>> -----Original Message-----
>> From: Take Ceara [mailto:dumitru.ceara at gmail.com]
>> Sent: Tuesday, July 26, 2016 4:38 PM
>> To: Xing, Beilei 
>> Cc: Zhang, Helin ; Wu, Jingjing > intel.com>;
>> dev at dpdk.org
>> Subject: Re: [dpdk-dev] [dpdk-users] RSS Hash not working for XL710/X710 NICs
>> for some RX mbuf sizes
>>
>> Hi Beilei,
>>
>> On Mon, Jul 25, 2016 at 12:04 PM, Take Ceara 
>> wrote:
>> > Hi Beilei,
>> >
>> > On Mon, Jul 25, 2016 at 5:24 AM, Xing, Beilei  
>> > wrote:
>> >> Hi,
>> >>
>> >>> -----Original Message-----
>> >>> From: Take Ceara [mailto:dumitru.ceara at gmail.com]
>> >>> Sent: Friday, July 22, 2016 8:32 PM
>> >>> To: Xing, Beilei 
>> >>> Cc: Zhang, Helin ; Wu, Jingjing
>> >>> ; dev at dpdk.org
>> >>> Subject: Re: [dpdk-dev] [dpdk-users] RSS Hash not working for
>> >>> XL710/X710 NICs for some RX mbuf sizes
>> >>>
>> >>> I was using the test-pmd "txonly" implementation which sends fixed
>> >>> UDP packets from 192.168.0.1:1024 -> 192.168.0.2:1024.
>> >>>
>> >>> I changed the test-pmd tx_only code so that it sends traffic with
>> >>> incremental destination IP: 192.168.0.1:1024 -> [192.168.0.2,
>> >>> 192.168.0.12]:1024
>> >>> I also dumped the source and destination IPs in the "rxonly"
>> >>> pkt_burst_receive function.
>> >>> Then I see that packets are indeed sent to different queues but the
>> >>> mbuf->hash.rss value is still 0.
>> >>>
>> >>> ./testpmd -c 1 -n 4 -w :01:00.3 -w :81:00.3 -- -i
>> >>> --coremask=0x0 --rxq=16 --txq=16 --mbuf-size 1152 --txpkts 1024
>> >>> --enable-rx-cksum --rss-udp
>> >>>
>> >>> [...]
>> >>>
>> >>>  - Receive queue=0xf
>> >>>   PKT_RX_RSS_HASH
>> >>>   src=68:05:CA:38:6D:63 - dst=02:00:00:00:00:01 - type=0x0800 -
>> >>> length=1024 - nb_segs=1 - RSS queue=0xa - (outer) L2 type: ETHER -
>> >>> (outer) L3 type: IPV4_EXT_UNKNOWN SIP=C0A80001 DIP=C0A80006 -
>> >>> (outer)
>> >>> L4 type: UDP - Tunnel type: Unknown - RSS hash=0x0 - Inner L2 type:
>> >>> Unknown - RSS queue=0xf - RSS queue=0x7 - (outer) L2 type: ETHER -
>> >>> (outer) L3 type: IPV4_EXT_UNKNOWN SIP=C0A80001 DIP=C0A80007 -
>> >>> (outer)
>> >>> L4 type: UDP - Tunnel type: Unknown - Inner L2 type: Unknown - Inner
>> >>> L3 type: Unknown - Inner L4 type: Unknown
>> >>>  - Receive queue=0x7
>> >>>   PKT_RX_RSS_HASH
>> >>>   src=68:05:CA:38:6D:63 - dst=02:00:00:00:00:01 - (outer) L2 type:
>> >>> ETHER - (outer) L3 type: IPV4_EXT_UNKNOWN SIP=C0A80001 DIP=C0A80009
>> >>> -
>> >>> type=0x0800 - length=1024 - nb_segs=1 - Inner L3 type: Unknown -
>> >>> Inner
>> >>> L4 type: Unknown - RSS hash=0x0 - (outer) L4 type: UDP - Tunnel type:
>> >>> Unknown - Inner L2 type: Unknown - Inner L3 type: Unknown - RSS
>> >>> queue=0x7 - Inner L4 type: Unknown
>> >>>
>> >>> [...]
>> >>>
>> >>> testpmd> stop
>

[dpdk-dev] [dpdk-users] RSS Hash not working for XL710/X710 NICs for some RX mbuf sizes

2016-07-26 Thread Take Ceara
Hi Beilei,

On Mon, Jul 25, 2016 at 12:04 PM, Take Ceara  wrote:
> Hi Beilei,
>
> On Mon, Jul 25, 2016 at 5:24 AM, Xing, Beilei  
> wrote:
>> Hi,
>>
>>> -----Original Message-----
>>> From: Take Ceara [mailto:dumitru.ceara at gmail.com]
>>> Sent: Friday, July 22, 2016 8:32 PM
>>> To: Xing, Beilei 
>>> Cc: Zhang, Helin ; Wu, Jingjing >> intel.com>;
>>> dev at dpdk.org
>>> Subject: Re: [dpdk-dev] [dpdk-users] RSS Hash not working for XL710/X710 
>>> NICs
>>> for some RX mbuf sizes
>>>
>>> I was using the test-pmd "txonly" implementation which sends fixed UDP
>>> packets from 192.168.0.1:1024 -> 192.168.0.2:1024.
>>>
>>> I changed the test-pmd tx_only code so that it sends traffic with 
>>> incremental
>>> destination IP: 192.168.0.1:1024 -> [192.168.0.2,
>>> 192.168.0.12]:1024
>>> I also dumped the source and destination IPs in the "rxonly"
>>> pkt_burst_receive function.
>>> Then I see that packets are indeed sent to different queues but the
>>> mbuf->hash.rss value is still 0.
>>>
>>> ./testpmd -c 1 -n 4 -w :01:00.3 -w :81:00.3 -- -i
>>> --coremask=0x0 --rxq=16 --txq=16 --mbuf-size 1152 --txpkts 1024
>>> --enable-rx-cksum --rss-udp
>>>
>>> [...]
>>>
>>>  - Receive queue=0xf
>>>   PKT_RX_RSS_HASH
>>>   src=68:05:CA:38:6D:63 - dst=02:00:00:00:00:01 - type=0x0800 -
>>> length=1024 - nb_segs=1 - RSS queue=0xa - (outer) L2 type: ETHER -
>>> (outer) L3 type: IPV4_EXT_UNKNOWN SIP=C0A80001 DIP=C0A80006 - (outer)
>>> L4 type: UDP - Tunnel type: Unknown - RSS hash=0x0 - Inner L2 type:
>>> Unknown - RSS queue=0xf - RSS queue=0x7 - (outer) L2 type: ETHER -
>>> (outer) L3 type: IPV4_EXT_UNKNOWN SIP=C0A80001 DIP=C0A80007 - (outer)
>>> L4 type: UDP - Tunnel type: Unknown - Inner L2 type: Unknown - Inner
>>> L3 type: Unknown - Inner L4 type: Unknown
>>>  - Receive queue=0x7
>>>   PKT_RX_RSS_HASH
>>>   src=68:05:CA:38:6D:63 - dst=02:00:00:00:00:01 - (outer) L2 type:
>>> ETHER - (outer) L3 type: IPV4_EXT_UNKNOWN SIP=C0A80001 DIP=C0A80009 -
>>> type=0x0800 - length=1024 - nb_segs=1 - Inner L3 type: Unknown - Inner
>>> L4 type: Unknown - RSS hash=0x0 - (outer) L4 type: UDP - Tunnel type:
>>> Unknown - Inner L2 type: Unknown - Inner L3 type: Unknown - RSS
>>> queue=0x7 - Inner L4 type: Unknown
>>>
>>> [...]
>>>
>>> testpmd> stop
>>>   --- Forward Stats for RX Port= 0/Queue= 0 -> TX Port= 1/Queue= 0 
>>> ---
>>>   RX-packets: 0  TX-packets: 32 TX-dropped: 0
>>>   --- Forward Stats for RX Port= 0/Queue= 1 -> TX Port= 1/Queue= 1 
>>> ---
>>>   RX-packets: 59 TX-packets: 32 TX-dropped: 0
>>>   --- Forward Stats for RX Port= 0/Queue= 2 -> TX Port= 1/Queue= 2 
>>> ---
>>>   RX-packets: 48 TX-packets: 32 TX-dropped: 0
>>>   --- Forward Stats for RX Port= 0/Queue= 3 -> TX Port= 1/Queue= 3 
>>> ---
>>>   RX-packets: 0  TX-packets: 32 TX-dropped: 0
>>>   --- Forward Stats for RX Port= 0/Queue= 4 -> TX Port= 1/Queue= 4 
>>> ---
>>>   RX-packets: 59 TX-packets: 32 TX-dropped: 0
>>>   --- Forward Stats for RX Port= 0/Queue= 5 -> TX Port= 1/Queue= 5 
>>> ---
>>>   RX-packets: 0  TX-packets: 32 TX-dropped: 0
>>>   --- Forward Stats for RX Port= 0/Queue= 6 -> TX Port= 1/Queue= 6 
>>> ---
>>>   RX-packets: 0  TX-packets: 32 TX-dropped: 0
>>>   --- Forward Stats for RX Port= 0/Queue= 7 -> TX Port= 1/Queue= 7 
>>> ---
>>>   RX-packets: 48 TX-packets: 32 TX-dropped: 0
>>>   --- Forward Stats for RX Port= 0/Queue= 8 -> TX Port= 1/Queue= 8 
>>> ---
>>>   RX-packets: 0  TX-packets: 32 TX-dropped: 0
>>>   --- Forward Stats for RX Port= 0/Queue= 9 -> TX Port= 1/Queue= 9 
>>> ---
>>>   RX-packets: 59 TX-packets: 32 TX-dropped: 0
>>>   --- Forward Stats for RX Port= 0/Queue=10 -> TX Port= 1/Queue=10 
>>> ---
>>>   RX-packets: 48 TX-packets: 32 TX-dropped: 0
>>>   --- Forward Stats for RX Port= 0/Queue=11 -> TX Port= 1/Queue=11 
>>>

[dpdk-dev] [dpdk-users] RSS Hash not working for XL710/X710 NICs for some RX mbuf sizes

2016-07-25 Thread Take Ceara
Hi Beilei,

On Mon, Jul 25, 2016 at 5:24 AM, Xing, Beilei  wrote:
> Hi,
>
>> -----Original Message-----
>> From: Take Ceara [mailto:dumitru.ceara at gmail.com]
>> Sent: Friday, July 22, 2016 8:32 PM
>> To: Xing, Beilei 
>> Cc: Zhang, Helin ; Wu, Jingjing > intel.com>;
>> dev at dpdk.org
>> Subject: Re: [dpdk-dev] [dpdk-users] RSS Hash not working for XL710/X710 NICs
>> for some RX mbuf sizes
>>
>> I was using the test-pmd "txonly" implementation which sends fixed UDP
>> packets from 192.168.0.1:1024 -> 192.168.0.2:1024.
>>
>> I changed the test-pmd tx_only code so that it sends traffic with incremental
>> destination IP: 192.168.0.1:1024 -> [192.168.0.2,
>> 192.168.0.12]:1024
>> I also dumped the source and destination IPs in the "rxonly"
>> pkt_burst_receive function.
>> Then I see that packets are indeed sent to different queues but the
>> mbuf->hash.rss value is still 0.
>>
>> ./testpmd -c 1 -n 4 -w :01:00.3 -w :81:00.3 -- -i
>> --coremask=0x0 --rxq=16 --txq=16 --mbuf-size 1152 --txpkts 1024
>> --enable-rx-cksum --rss-udp
>>
>> [...]
>>
>>  - Receive queue=0xf
>>   PKT_RX_RSS_HASH
>>   src=68:05:CA:38:6D:63 - dst=02:00:00:00:00:01 - type=0x0800 -
>> length=1024 - nb_segs=1 - RSS queue=0xa - (outer) L2 type: ETHER -
>> (outer) L3 type: IPV4_EXT_UNKNOWN SIP=C0A80001 DIP=C0A80006 - (outer)
>> L4 type: UDP - Tunnel type: Unknown - RSS hash=0x0 - Inner L2 type:
>> Unknown - RSS queue=0xf - RSS queue=0x7 - (outer) L2 type: ETHER -
>> (outer) L3 type: IPV4_EXT_UNKNOWN SIP=C0A80001 DIP=C0A80007 - (outer)
>> L4 type: UDP - Tunnel type: Unknown - Inner L2 type: Unknown - Inner
>> L3 type: Unknown - Inner L4 type: Unknown
>>  - Receive queue=0x7
>>   PKT_RX_RSS_HASH
>>   src=68:05:CA:38:6D:63 - dst=02:00:00:00:00:01 - (outer) L2 type:
>> ETHER - (outer) L3 type: IPV4_EXT_UNKNOWN SIP=C0A80001 DIP=C0A80009 -
>> type=0x0800 - length=1024 - nb_segs=1 - Inner L3 type: Unknown - Inner
>> L4 type: Unknown - RSS hash=0x0 - (outer) L4 type: UDP - Tunnel type:
>> Unknown - Inner L2 type: Unknown - Inner L3 type: Unknown - RSS
>> queue=0x7 - Inner L4 type: Unknown
>>
>> [...]
>>
>> testpmd> stop
>>   --- Forward Stats for RX Port= 0/Queue= 0 -> TX Port= 1/Queue= 0 
>> ---
>>   RX-packets: 0  TX-packets: 32 TX-dropped: 0
>>   --- Forward Stats for RX Port= 0/Queue= 1 -> TX Port= 1/Queue= 1 
>> ---
>>   RX-packets: 59 TX-packets: 32 TX-dropped: 0
>>   --- Forward Stats for RX Port= 0/Queue= 2 -> TX Port= 1/Queue= 2 
>> ---
>>   RX-packets: 48 TX-packets: 32 TX-dropped: 0
>>   --- Forward Stats for RX Port= 0/Queue= 3 -> TX Port= 1/Queue= 3 
>> ---
>>   RX-packets: 0  TX-packets: 32 TX-dropped: 0
>>   --- Forward Stats for RX Port= 0/Queue= 4 -> TX Port= 1/Queue= 4 
>> ---
>>   RX-packets: 59 TX-packets: 32 TX-dropped: 0
>>   --- Forward Stats for RX Port= 0/Queue= 5 -> TX Port= 1/Queue= 5 
>> ---
>>   RX-packets: 0  TX-packets: 32 TX-dropped: 0
>>   --- Forward Stats for RX Port= 0/Queue= 6 -> TX Port= 1/Queue= 6 
>> ---
>>   RX-packets: 0  TX-packets: 32 TX-dropped: 0
>>   --- Forward Stats for RX Port= 0/Queue= 7 -> TX Port= 1/Queue= 7 
>> ---
>>   RX-packets: 48 TX-packets: 32 TX-dropped: 0
>>   --- Forward Stats for RX Port= 0/Queue= 8 -> TX Port= 1/Queue= 8 
>> ---
>>   RX-packets: 0  TX-packets: 32 TX-dropped: 0
>>   --- Forward Stats for RX Port= 0/Queue= 9 -> TX Port= 1/Queue= 9 
>> ---
>>   RX-packets: 59 TX-packets: 32 TX-dropped: 0
>>   --- Forward Stats for RX Port= 0/Queue=10 -> TX Port= 1/Queue=10 
>> ---
>>   RX-packets: 48 TX-packets: 32 TX-dropped: 0
>>   --- Forward Stats for RX Port= 0/Queue=11 -> TX Port= 1/Queue=11 
>> ---
>>   RX-packets: 0  TX-packets: 32 TX-dropped: 0
>>   --- Forward Stats for RX Port= 0/Queue=12 -> TX Port= 1/Queue=12 
>> ---
>>   RX-packets: 59 TX-packets: 32 TX-dropped: 0
>>   --- Forward Stats for RX Port= 0/Queue=13 -> TX Port= 1/Queue=13 
>> --

[dpdk-dev] [dpdk-users] RSS Hash not working for XL710/X710 NICs for some RX mbuf sizes

2016-07-22 Thread Take Ceara
On Fri, Jul 22, 2016 at 2:31 PM, Take Ceara  wrote:
> Hi Beilei,
>
> On Fri, Jul 22, 2016 at 11:04 AM, Xing, Beilei  
> wrote:
>> Hi Ceara,
>>
>>> -----Original Message-----
>>> From: Take Ceara [mailto:dumitru.ceara at gmail.com]
>>> Sent: Thursday, July 21, 2016 6:58 PM
>>> To: Xing, Beilei 
>>> Cc: Zhang, Helin ; Wu, Jingjing
>>> ; dev at dpdk.org
>>> Subject: Re: [dpdk-dev] [dpdk-users] RSS Hash not working for XL710/X710
>>> NICs for some RX mbuf sizes
>>>
>>>
>>> Following your testpmd example run I managed to replicate the problem on
>>> my dpdk 16.04 setup like this:
>>>
>>> I have two X710 adapters connected back to back:
>>> $ ./tools/dpdk_nic_bind.py -s
>>>
>>> Network devices using DPDK-compatible driver
>>> 
>>> :01:00.3 'Ethernet Controller X710 for 10GbE SFP+' drv=igb_uio unused=
>>> :81:00.3 'Ethernet Controller X710 for 10GbE SFP+' drv=igb_uio unused=
>>>
>>> The firmware of the two adapters is up to date with the latest
>>> version: 5.04 (f5.0.40043 a1.5 n5.04 e24cd)
>>>
>>> I run testpmd with mbuf-size 1152 and txpktsize 1024 such that upon receival
>>> the whole mbuf (except headroom) is filled.
>>> I enabled RX IP checksum in hw and RX RSS hashing for UDP.
>>> With test-pmd forward mode "rxonly" and verbose 1 I see that incoming
>>> packets have PKT_RX_RSS_HASH set but the hash value is 0.
>>>
>>> ./testpmd -c 1 -n 4 -w :01:00.3 -w :81:00.3 -- -i
>>> --coremask=0x0 --rxq=16 --txq=16 --mbuf-size 1152 --txpkts 1024 --
>>> enable-rx-cksum --rss-udp [...]
>>> testpmd> set verbose 1
>>> Change verbose level from 0 to 1
>>> testpmd> set fwd rxonly
>>> Set rxonly packet forwarding mode
>>> testpmd> start tx_first
>>>   rxonly packet forwarding - CRC stripping disabled - packets/burst=32
>>>   nb forwarding cores=16 - nb forwarding ports=2
>>>   RX queues=16 - RX desc=128 - RX free threshold=32
>>>   RX threshold registers: pthresh=8 hthresh=8 wthresh=0
>>>   TX queues=16 - TX desc=512 - TX free threshold=32
>>>   TX threshold registers: pthresh=32 hthresh=0 wthresh=0
>>>   TX RS bit threshold=32 - TXQ flags=0xf01 port 0/queue 1: received 32
>>> packets
>>>   src=68:05:CA:38:6D:63 - dst=02:00:00:00:00:01 - type=0x0800 -
>>> length=1024 - nb_segs=1 - RSS hash=0x0 - RSS queue=0x1 - (outer) L2
>>> type: ETHER - (outer) L3 type: IPV4_EXT_UNKNOWN - (outer) L4 type: UDP
>>> - Tunnel type: Unknown - Inner L2 type: Unknown - Inner L3 type:
>>> Unknown - Inner L4 type: Unknown
>>>  - Receive queue=0x1
>>>   PKT_RX_RSS_HASH
>>>   src=68:05:CA:38:6D:63 - dst=02:00:00:00:00:01 - type=0x0800 -
>>> length=1024 - nb_segs=1 - RSS hash=0x0 - RSS queue=0x1 - (outer) L2
>>> type: ETHER - (outer) L3 type: IPV4_EXT_UNKNOWN - (outer) L4 type: UDP
>>> - Tunnel type: Unknown - Inner L2 type: Unknown - Inner L3 type:
>>> Unknown - Inner L4 type: Unknown
>>>  - Receive queue=0x1
>>>   PKT_RX_RSS_HASH
>>
>> What's the source ip address and destination ip address of the packet you 
>> sent to port 0? Could you try to change ip address or port number to observe 
>> if hash value changes? I remember I saw hash value was 0 before, but with 
>> different ip address, there'll be different hash values.
>
> I was using the test-pmd "txonly" implementation which sends fixed UDP
> packets from 192.168.0.1:1024 -> 192.168.0.2:1024.
>
> I changed the test-pmd tx_only code so that it sends traffic with
> incremental destination IP: 192.168.0.1:1024 -> [192.168.0.2,
> 192.168.0.12]:1024
> I also dumped the source and destination IPs in the "rxonly"
> pkt_burst_receive function.
> Then I see that packets are indeed sent to different queues but the
> mbuf->hash.rss value is still 0.
>
> ./testpmd -c 1 -n 4 -w :01:00.3 -w :81:00.3 -- -i
> --coremask=0x0 --rxq=16 --txq=16 --mbuf-size 1152 --txpkts 1024
> --enable-rx-cksum --rss-udp
>
> [...]
>
>  - Receive queue=0xf
>   PKT_RX_RSS_HASH
>   src=68:05:CA:38:6D:63 - dst=02:00:00:00:00:01 - type=0x0800 -
> length=1024 - nb_segs=1 - RSS queue=0xa - (outer) L2 type: ETHER -
> (outer) L3 type: IPV4_EXT_UNKNOWN SIP=C0A80001 DIP=C0A80006 - (outer)
> L4 type: UDP - Tunnel type: Unknown - RSS hash=0x0 - Inner L2 type:
> Unknown - RSS queue=0xf - RSS queue=0x7 - (outer) L2 type: ETH

[dpdk-dev] [dpdk-users] RSS Hash not working for XL710/X710 NICs for some RX mbuf sizes

2016-07-22 Thread Take Ceara
Hi Beilei,

On Fri, Jul 22, 2016 at 11:04 AM, Xing, Beilei  wrote:
> Hi Ceara,
>
>> -----Original Message-----
>> From: Take Ceara [mailto:dumitru.ceara at gmail.com]
>> Sent: Thursday, July 21, 2016 6:58 PM
>> To: Xing, Beilei 
>> Cc: Zhang, Helin ; Wu, Jingjing
>> ; dev at dpdk.org
>> Subject: Re: [dpdk-dev] [dpdk-users] RSS Hash not working for XL710/X710
>> NICs for some RX mbuf sizes
>>
>>
>> Following your testpmd example run I managed to replicate the problem on
>> my dpdk 16.04 setup like this:
>>
>> I have two X710 adapters connected back to back:
>> $ ./tools/dpdk_nic_bind.py -s
>>
>> Network devices using DPDK-compatible driver
>> 
>> :01:00.3 'Ethernet Controller X710 for 10GbE SFP+' drv=igb_uio unused=
>> :81:00.3 'Ethernet Controller X710 for 10GbE SFP+' drv=igb_uio unused=
>>
>> The firmware of the two adapters is up to date with the latest
>> version: 5.04 (f5.0.40043 a1.5 n5.04 e24cd)
>>
>> I run testpmd with mbuf-size 1152 and txpktsize 1024 such that upon receival
>> the whole mbuf (except headroom) is filled.
>> I enabled RX IP checksum in hw and RX RSS hashing for UDP.
>> With test-pmd forward mode "rxonly" and verbose 1 I see that incoming
>> packets have PKT_RX_RSS_HASH set but the hash value is 0.
>>
>> ./testpmd -c 1 -n 4 -w :01:00.3 -w :81:00.3 -- -i
>> --coremask=0x0 --rxq=16 --txq=16 --mbuf-size 1152 --txpkts 1024 --
>> enable-rx-cksum --rss-udp [...]
>> testpmd> set verbose 1
>> Change verbose level from 0 to 1
>> testpmd> set fwd rxonly
>> Set rxonly packet forwarding mode
>> testpmd> start tx_first
>>   rxonly packet forwarding - CRC stripping disabled - packets/burst=32
>>   nb forwarding cores=16 - nb forwarding ports=2
>>   RX queues=16 - RX desc=128 - RX free threshold=32
>>   RX threshold registers: pthresh=8 hthresh=8 wthresh=0
>>   TX queues=16 - TX desc=512 - TX free threshold=32
>>   TX threshold registers: pthresh=32 hthresh=0 wthresh=0
>>   TX RS bit threshold=32 - TXQ flags=0xf01 port 0/queue 1: received 32
>> packets
>>   src=68:05:CA:38:6D:63 - dst=02:00:00:00:00:01 - type=0x0800 -
>> length=1024 - nb_segs=1 - RSS hash=0x0 - RSS queue=0x1 - (outer) L2
>> type: ETHER - (outer) L3 type: IPV4_EXT_UNKNOWN - (outer) L4 type: UDP
>> - Tunnel type: Unknown - Inner L2 type: Unknown - Inner L3 type:
>> Unknown - Inner L4 type: Unknown
>>  - Receive queue=0x1
>>   PKT_RX_RSS_HASH
>>   src=68:05:CA:38:6D:63 - dst=02:00:00:00:00:01 - type=0x0800 -
>> length=1024 - nb_segs=1 - RSS hash=0x0 - RSS queue=0x1 - (outer) L2
>> type: ETHER - (outer) L3 type: IPV4_EXT_UNKNOWN - (outer) L4 type: UDP
>> - Tunnel type: Unknown - Inner L2 type: Unknown - Inner L3 type:
>> Unknown - Inner L4 type: Unknown
>>  - Receive queue=0x1
>>   PKT_RX_RSS_HASH
>
> What's the source ip address and destination ip address of the packet you 
> sent to port 0? Could you try to change ip address or port number to observe 
> if hash value changes? I remember I saw hash value was 0 before, but with 
> different ip address, there'll be different hash values.

I was using the test-pmd "txonly" implementation which sends fixed UDP
packets from 192.168.0.1:1024 -> 192.168.0.2:1024.

I changed the test-pmd tx_only code so that it sends traffic with
incremental destination IP: 192.168.0.1:1024 -> [192.168.0.2,
192.168.0.12]:1024
I also dumped the source and destination IPs in the "rxonly"
pkt_burst_receive function.
Then I see that packets are indeed sent to different queues but the
mbuf->hash.rss value is still 0.

./testpmd -c 1 -n 4 -w :01:00.3 -w :81:00.3 -- -i
--coremask=0x0 --rxq=16 --txq=16 --mbuf-size 1152 --txpkts 1024
--enable-rx-cksum --rss-udp

[...]

 - Receive queue=0xf
  PKT_RX_RSS_HASH
  src=68:05:CA:38:6D:63 - dst=02:00:00:00:00:01 - type=0x0800 -
length=1024 - nb_segs=1 - RSS queue=0xa - (outer) L2 type: ETHER -
(outer) L3 type: IPV4_EXT_UNKNOWN SIP=C0A80001 DIP=C0A80006 - (outer)
L4 type: UDP - Tunnel type: Unknown - RSS hash=0x0 - Inner L2 type:
Unknown - RSS queue=0xf - RSS queue=0x7 - (outer) L2 type: ETHER -
(outer) L3 type: IPV4_EXT_UNKNOWN SIP=C0A80001 DIP=C0A80007 - (outer)
L4 type: UDP - Tunnel type: Unknown - Inner L2 type: Unknown - Inner
L3 type: Unknown - Inner L4 type: Unknown
 - Receive queue=0x7
  PKT_RX_RSS_HASH
  src=68:05:CA:38:6D:63 - dst=02:00:00:00:00:01 - (outer) L2 type:
ETHER - (outer) L3 type: IPV4_EXT_UNKNOWN SIP=C0A80001 DIP=C0A80009 -
type=0x0800 - length=1024 - nb_segs=1 - Inner L3 type: Unknown - Inner
L4 type: Unknown - RSS has

[dpdk-dev] [dpdk-users] RSS Hash not working for XL710/X710 NICs for some RX mbuf sizes

2016-07-21 Thread Take Ceara
Hi Beilei,

On Wed, Jul 20, 2016 at 3:59 AM, Xing, Beilei  wrote:
> Hi Ceara,
>
>> -----Original Message-----
>> From: Take Ceara [mailto:dumitru.ceara at gmail.com]
>> Sent: Tuesday, July 19, 2016 10:59 PM
>> To: Xing, Beilei 
>> Cc: Zhang, Helin ; Wu, Jingjing
>> ; dev at dpdk.org
>> Subject: Re: [dpdk-dev] [dpdk-users] RSS Hash not working for XL710/X710
>> NICs for some RX mbuf sizes
>>
>> Hi Beilei,
>>
>> I changed the way I run testmpd to:
>>
>> testpmd -c 0x331 -w :82:00.0 -w :83:00.0 -- --mbuf-size 1152 
>> --rss-ip -
>> -rxq=2 --txpkts 1024 -i
>>
>> As far as I understand this will allocate mbufs with the same size I was 
>> using
>> in my test (--mbuf-size seems to include the mbuf headroom therefore 1152
>> = 1024 + 128 headroom).
>>
>> testpmd> start tx_first
>>   io packet forwarding - CRC stripping disabled - packets/burst=32
>>   nb forwarding cores=1 - nb forwarding ports=2
>>   RX queues=2 - RX desc=128 - RX free threshold=32
>>   RX threshold registers: pthresh=8 hthresh=8 wthresh=0
>>   TX queues=1 - TX desc=512 - TX free threshold=32
>>   TX threshold registers: pthresh=32 hthresh=0 wthresh=0
>>   TX RS bit threshold=32 - TXQ flags=0xf01
>> testpmd> show port stats all
>>
>>    NIC statistics for port 0
>> 
>>   RX-packets: 18817613   RX-missed: 5  RX-bytes:  19269115888
>>   RX-errors: 0
>>   RX-nombuf:  0
>>   TX-packets: 18818064   TX-errors: 0  TX-bytes:  19269567464
>>
>> ##
>> ##
>>
>>    NIC statistics for port 1
>> 
>>   RX-packets: 18818392   RX-missed: 5  RX-bytes:  19269903360
>>   RX-errors: 0
>>   RX-nombuf:  0
>>   TX-packets: 18817979   TX-errors: 0  TX-bytes:  19269479424
>>
>> ##
>> ##
>>
>> Traffic is sent/received. However, I couldn't find any way to verify that 
>> the
>> incoming mbufs actually have the mbuf->hash.rss field set except for starting
>> test-pmd with gdb and setting a breakpoint in the io fwd engine. After doing
>> that I noticed that none of the incoming packets has the PKT_RX_RSS_HASH
>> flag set in ol_flags... I guess for some reason test-pmd doesn't actually
>> configure RSS in this case but I fail to see where.
>>
>
> Actually there's a way to check mbuf->hash.rss, you need set forward mode to 
> "rxonly", and set verbose to 1.
> I run testpmd with the configuration you used, and found i40e RSS works well.
> With the following steps, you can see RSS hash value and receive queue, and 
> PKT_RX_RSS_HASH is set too.
> I think you can use the same way to check what you want.
>
> ./testpmd -c f -n 4 -- -i --coremask=0xe --rxq=16 --txq=16 
> --mbuf-size 1152 --rss-ip --txpkts 1024
> testpmd> set verbose 1
> testpmd> set fwd rxonly
> testpmd> start
> testpmd> port 0/queue 1: received 1 packets
>   src=00:00:01:00:0F:00 - dst=68:05:CA:32:03:4C - type=0x0800 - length=1020 - 
> nb
>  - Receive queue=0x1
>   PKT_RX_RSS_HASH
> port 0/queue 0: received 1 packets
>   src=00:00:01:00:0F:00 - dst=68:05:CA:32:03:4C - type=0x0800 - length=1020 - 
> nb_segs=1 - RSS hash=0x4e949f23 - RSS queue=0x0Unknown packet type
>  - Receive queue=0x0
>   PKT_RX_RSS_HASH
> port 0/queue 8: received 1 packets
>   src=00:00:01:00:0F:00 - dst=68:05:CA:32:03:4C - type=0x0800 - length=1020 - 
> nb_segs=1 - RSS hash=0xa3c78b2b - RSS queue=0x8Unknown packet type
>  - Receive queue=0x8
>   PKT_RX_RSS_HASH
> port 0/queue 5: received 1 packets
>   src=00:00:01:00:0F:00 - dst=68:05:CA:32:03:4C - type=0x0800 - length=1020 - 
> nb_segs=1 - RSS hash=0xe29b3d36 - RSS queue=0x5Unknown packet type
>  - Receive queue=0x5
>   PKT_RX_RSS_HASH
>

Following your testpmd example run I managed to replicate the problem
on my dpdk 16.04 setup like this:

I have two X710 adapters connected back to back:
$ ./tools/dpdk_nic_bind.py -s

Network devices using DPDK-compatible driver

:01:00.3 'Ethernet Controller X710 for 10GbE SFP+' drv=igb_uio unused=
:81:00.3 'Ethernet Controller X710 for 10GbE SFP+' drv=igb_uio unused=

The firmware of the two adapters is up to date with the latest
version: 5.04 (f5.0.40043 a1.5 n5.04 e24cd)

I run testpmd with mbuf-size 1152 and txpktsize 1024 such that upon
reception the whole mbuf (except headroom) is filled.
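
For reference, roughly the equivalent port setup outside testpmd (scattered RX
so frames may span several mbufs, RX IP checksum in hardware, UDP RSS), in
DPDK 16.04 terms. This is a sketch under those assumptions, not the exact
configuration from the application where I first hit the issue:

#include <rte_ethdev.h>
#include <rte_ether.h>

/* Sketch of a port configuration matching the testpmd flags used here:
 * scattered RX, HW IP checksum on RX and UDP RSS. */
static const struct rte_eth_conf port_conf = {
    .rxmode = {
        .mq_mode        = ETH_MQ_RX_RSS,
        .max_rx_pkt_len = ETHER_MAX_LEN,
        .hw_ip_checksum = 1,
        .enable_scatter = 1,
    },
    .rx_adv_conf = {
        .rss_conf = {
            .rss_key = NULL,                        /* default RSS key */
            .rss_hf  = ETH_RSS_NONFRAG_IPV4_UDP |
                       ETH_RSS_NONFRAG_IPV6_UDP,
        },
    },
};

/* e.g. rte_eth_dev_configure(port_id, 16, 16, &port_conf); */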

[dpdk-dev] [dpdk-users] RSS Hash not working for XL710/X710 NICs for some RX mbuf sizes

2016-07-19 Thread Take Ceara
Hi Beilei,

On Tue, Jul 19, 2016 at 11:31 AM, Xing, Beilei  wrote:
> Hi Ceara,
>
>> -----Original Message-----
>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Take Ceara
>> Sent: Tuesday, July 19, 2016 12:14 AM
>> To: Zhang, Helin 
>> Cc: Wu, Jingjing ; dev at dpdk.org
>> Subject: Re: [dpdk-dev] [dpdk-users] RSS Hash not working for XL710/X710
>> NICs for some RX mbuf sizes
>>
>> Hi Helin,
>>
>> On Mon, Jul 18, 2016 at 5:15 PM, Zhang, Helin 
>> wrote:
>> > Hi Ceara
>> >
>> > Could you help to let me know your firmware version?
>>
>> # ethtool -i p7p1 | grep firmware
>> firmware-version: f4.40.35115 a1.4 n4.53 e2021
>>
>> > And could you help to try with the standard DPDK example application,
>> such as testpmd, to see if there is the same issue?
>> > Basically we always set the same size for both rx and tx buffer, like the
>> default one of 2048 for a lot of applications.
>>
>> I'm a bit lost in the testpmd CLI. I enabled RSS, configured 2 RX queues per
>> port and started sending traffic with single segmnet packets of size 2K but I
>> didn't figure out how to actually verify that the RSS hash is correctly set..
>> Please let me know if I should do it in a different way.
>>
>> testpmd -c 0x331 -w :82:00.0 -w :83:00.0 -- --mbuf-size 2048 -i [...]
>>
>> testpmd> port stop all
>> Stopping ports...
>> Checking link statuses...
>> Port 0 Link Up - speed 4 Mbps - full-duplex Port 1 Link Up - speed 4
>> Mbps - full-duplex Done
>>
>> testpmd> port config all txq 2
>>
>> testpmd> port config all rss all
>>
>> testpmd> port config all max-pkt-len 2048 port start all
>> Configuring Port 0 (socket 0)
>> PMD: i40e_set_tx_function_flag(): Vector tx can be enabled on this txq.
>> PMD: i40e_set_tx_function_flag(): Vector tx can be enabled on this txq.
>> PMD: i40e_dev_rx_queue_setup(): Rx Burst Bulk Alloc Preconditions are
>> satisfied. Rx Burst Bulk Alloc function will be used on port=0, queue=0.
>> PMD: i40e_set_tx_function(): Vector tx finally be used.
>> PMD: i40e_set_rx_function(): Using Vector Scattered Rx callback (port=0).
>> Port 0: 3C:FD:FE:9D:BE:F0
>> Configuring Port 1 (socket 0)
>> PMD: i40e_set_tx_function_flag(): Vector tx can be enabled on this txq.
>> PMD: i40e_set_tx_function_flag(): Vector tx can be enabled on this txq.
>> PMD: i40e_dev_rx_queue_setup(): Rx Burst Bulk Alloc Preconditions are
>> satisfied. Rx Burst Bulk Alloc function will be used on port=1, queue=0.
>> PMD: i40e_set_tx_function(): Vector tx finally be used.
>> PMD: i40e_set_rx_function(): Using Vector Scattered Rx callback (port=1).
>> Port 1: 3C:FD:FE:9D:BF:30
>> Checking link statuses...
>> Port 0 Link Up - speed 4 Mbps - full-duplex Port 1 Link Up - speed 4
>> Mbps - full-duplex Done
>>
>> testpmd> set txpkts 2048
>> testpmd> show config txpkts
>> Number of segments: 1
>> Segment sizes: 2048
>> Split packet: off
>>
>>
>> testpmd> start tx_first
>>   io packet forwarding - CRC stripping disabled - packets/burst=32
>>   nb forwarding cores=1 - nb forwarding ports=2
>>   RX queues=1 - RX desc=128 - RX free threshold=32
>
> In testpmd, when RX queues=1, RSS will be disabled, so could you re-configure 
> rx queue(>1) and try again with testpmd?

I changed the way I run testpmd to:

testpmd -c 0x331 -w :82:00.0 -w :83:00.0 -- --mbuf-size 1152
--rss-ip --rxq=2 --txpkts 1024 -i

As far as I understand this will allocate mbufs with the same size I
was using in my test (--mbuf-size seems to include the mbuf headroom
therefore 1152 = 1024 + 128 headroom).
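
For reference, creating an equivalent pool directly would be something like
the following, under that same assumption about --mbuf-size; the pool name and
counts are made up:

#include <rte_mbuf.h>

/* Hypothetical equivalent of testpmd's "--mbuf-size 1152": the data room
 * holds the 128-byte headroom plus 1024 bytes of packet data, so a
 * 1024-byte frame fits in exactly one segment. */
static struct rte_mempool *
create_rx_pool_1152(int socket_id)
{
    return rte_pktmbuf_pool_create("rx_pool_1152",
                                   8192,   /* number of mbufs */
                                   256,    /* per-lcore cache size */
                                   0,      /* application private area */
                                   1024 + RTE_PKTMBUF_HEADROOM, /* = 1152 */
                                   socket_id);
}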

testpmd> start tx_first
  io packet forwarding - CRC stripping disabled - packets/burst=32
  nb forwarding cores=1 - nb forwarding ports=2
  RX queues=2 - RX desc=128 - RX free threshold=32
  RX threshold registers: pthresh=8 hthresh=8 wthresh=0
  TX queues=1 - TX desc=512 - TX free threshold=32
  TX threshold registers: pthresh=32 hthresh=0 wthresh=0
  TX RS bit threshold=32 - TXQ flags=0xf01
testpmd> show port stats all

   NIC statistics for port 0  
  RX-packets: 18817613   RX-missed: 5  RX-bytes:  19269115888
  RX-errors: 0
  RX-nombuf:  0
  TX-packets: 18818064   TX-errors: 0  TX-bytes:  19269567464
  

   NIC statistics for port 1  
  RX-packets: 18818392   RX-missed: 5  RX-bytes:  19269903360
  RX-errors: 0
  RX-nombuf:  0
  TX-packets: 18817979

[dpdk-dev] [dpdk-users] RSS Hash not working for XL710/X710 NICs for some RX mbuf sizes

2016-07-18 Thread Take Ceara
Hi Helin,

On Mon, Jul 18, 2016 at 5:15 PM, Zhang, Helin  wrote:
> Hi Ceara
>
> Could you help to let me know your firmware version?

# ethtool -i p7p1 | grep firmware
firmware-version: f4.40.35115 a1.4 n4.53 e2021

> And could you help to try with the standard DPDK example application, such as 
> testpmd, to see if there is the same issue?
> Basically we always set the same size for both rx and tx buffer, like the 
> default one of 2048 for a lot of applications.

I'm a bit lost in the testpmd CLI. I enabled RSS, configured 2 RX
queues per port and started sending traffic with single-segment
packets of size 2K, but I didn't figure out how to actually verify that
the RSS hash is correctly set. Please let me know if I should do it
in a different way.
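
Outside testpmd, the same check can be done directly on the receive path.
A minimal sketch, with the port/queue ids and burst size as placeholders:

#include <stdio.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Poll one RX queue and print the RSS hash of every packet that carries
 * one.  Assumes the port is already configured and started with RSS. */
static void
dump_rss_hashes(uint8_t port_id, uint16_t queue_id)
{
    struct rte_mbuf *pkts[32];
    uint16_t nb = rte_eth_rx_burst(port_id, queue_id, pkts, 32);
    uint16_t i;

    for (i = 0; i < nb; i++) {
        if (pkts[i]->ol_flags & PKT_RX_RSS_HASH)
            printf("queue %u: RSS hash 0x%08x\n",
                   (unsigned)queue_id, (unsigned)pkts[i]->hash.rss);
        else
            printf("queue %u: no RSS hash\n", (unsigned)queue_id);
        rte_pktmbuf_free(pkts[i]);
    }
}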

testpmd -c 0x331 -w :82:00.0 -w :83:00.0 -- --mbuf-size 2048 -i
[...]

testpmd> port stop all
Stopping ports...
Checking link statuses...
Port 0 Link Up - speed 4 Mbps - full-duplex
Port 1 Link Up - speed 4 Mbps - full-duplex
Done

testpmd> port config all txq 2

testpmd> port config all rss all

testpmd> port config all max-pkt-len 2048
testpmd> port start all
Configuring Port 0 (socket 0)
PMD: i40e_set_tx_function_flag(): Vector tx can be enabled on this txq.
PMD: i40e_set_tx_function_flag(): Vector tx can be enabled on this txq.
PMD: i40e_dev_rx_queue_setup(): Rx Burst Bulk Alloc Preconditions are
satisfied. Rx Burst Bulk Alloc function will be used on port=0,
queue=0.
PMD: i40e_set_tx_function(): Vector tx finally be used.
PMD: i40e_set_rx_function(): Using Vector Scattered Rx callback (port=0).
Port 0: 3C:FD:FE:9D:BE:F0
Configuring Port 1 (socket 0)
PMD: i40e_set_tx_function_flag(): Vector tx can be enabled on this txq.
PMD: i40e_set_tx_function_flag(): Vector tx can be enabled on this txq.
PMD: i40e_dev_rx_queue_setup(): Rx Burst Bulk Alloc Preconditions are
satisfied. Rx Burst Bulk Alloc function will be used on port=1,
queue=0.
PMD: i40e_set_tx_function(): Vector tx finally be used.
PMD: i40e_set_rx_function(): Using Vector Scattered Rx callback (port=1).
Port 1: 3C:FD:FE:9D:BF:30
Checking link statuses...
Port 0 Link Up - speed 4 Mbps - full-duplex
Port 1 Link Up - speed 4 Mbps - full-duplex
Done

testpmd> set txpkts 2048
testpmd> show config txpkts
Number of segments: 1
Segment sizes: 2048
Split packet: off


testpmd> start tx_first
  io packet forwarding - CRC stripping disabled - packets/burst=32
  nb forwarding cores=1 - nb forwarding ports=2
  RX queues=1 - RX desc=128 - RX free threshold=32
  RX threshold registers: pthresh=8 hthresh=8 wthresh=0
  TX queues=2 - TX desc=512 - TX free threshold=32
  TX threshold registers: pthresh=32 hthresh=0 wthresh=0
  TX RS bit threshold=32 - TXQ flags=0xf01
testpmd> stop
Telling cores to stop...
Waiting for lcores to finish...

  -- Forward statistics for port 0  --
  RX-packets: 32 RX-dropped: 0 RX-total: 32
  TX-packets: 32 TX-dropped: 0 TX-total: 32
  

  -- Forward statistics for port 1  --
  RX-packets: 32 RX-dropped: 0 RX-total: 32
  TX-packets: 32 TX-dropped: 0 TX-total: 32
  

  +++ Accumulated forward statistics for all ports+++
  RX-packets: 64 RX-dropped: 0 RX-total: 64
  TX-packets: 64 TX-dropped: 0 TX-total: 64
  

Done.
testpmd>


>
> Definitely we will try to reproduce that issue with testpmd, with using 2K 
> mbufs. Hopefully we can find the root cause, or tell you that's not an issue.
>

I forgot to mention that in my test code the TX/RX_MBUF_SIZE macros
also include the mbuf headroom and the size of the mbuf structure.
Therefore testing with 2K mbufs in my scenario actually creates
mempools of objects of size 2K + sizeof(struct rte_mbuf) +
RTE_PKTMBUF_HEADROOM.
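
Concretely, the per-object arithmetic works out as below. This is my own
illustration of the sizing only; the actual TX/RX_MBUF_SIZE macros are not
pasted in this thread:

#include <stdio.h>
#include <rte_mbuf.h>

/* Rough per-object footprint of a pool sized for 2K of packet data. */
int main(void)
{
    size_t data     = 2048;                    /* requested packet bytes  */
    size_t headroom = RTE_PKTMBUF_HEADROOM;    /* 128 by default          */
    size_t metadata = sizeof(struct rte_mbuf); /* two cache lines (128 B) */

    printf("per-mbuf object size = %zu bytes\n", data + headroom + metadata);
    return 0;
}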

> Thank you very much for your reporting!
>
> BTW, dev at dpdk.org should be the right one to replace users at dpdk.org, 
> for sending questions/issues like this.

Thanks, I'll keep that in mind.

>
> Regards,
> Helin

Regards,
Dumitru

>
>> -----Original Message-----
>> From: Take Ceara [mailto:dumitru.ceara at gmail.com]
>> Sent: Monday, July 18, 2016 4:03 PM
>> To: users at dpdk.org
>> Cc: Zhang, Helin ; Wu, Jingjing > intel.com>
>> Subject: [dpdk-users] RSS Hash not working for XL710/X710 NICs for some RX
>> mbuf sizes
>>
>> Hi,
>>
>> Is there any known issue regarding the i40e DPDK driver when having RSS
>> hashing enabled in DPDK 16.04?
>> I'

[dpdk-dev] Performance hit - NICs on different CPU sockets

2016-06-16 Thread Take Ceara
On Thu, Jun 16, 2016 at 10:19 PM, Wiles, Keith  wrote:
>
> On 6/16/16, 3:16 PM, "dev on behalf of Wiles, Keith"  on behalf of keith.wiles at intel.com> wrote:
>
>>
>>On 6/16/16, 3:00 PM, "Take Ceara"  wrote:
>>
>>>On Thu, Jun 16, 2016 at 9:33 PM, Wiles, Keith  
>>>wrote:
>>>> On 6/16/16, 1:20 PM, "Take Ceara"  wrote:
>>>>
>>>>>On Thu, Jun 16, 2016 at 6:59 PM, Wiles, Keith  
>>>>>wrote:
>>>>>>
>>>>>> On 6/16/16, 11:56 AM, "dev on behalf of Wiles, Keith" >>>>> dpdk.org on behalf of keith.wiles at intel.com> wrote:
>>>>>>
>>>>>>>
>>>>>>>On 6/16/16, 11:20 AM, "Take Ceara"  wrote:
>>>>>>>
>>>>>>>>On Thu, Jun 16, 2016 at 5:29 PM, Wiles, Keith >>>>>>>intel.com> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Right now I do not know what the issue is with the system. Could be 
>>>>>>>>> too many Rx/Tx ring pairs per port and limiting the memory in the 
>>>>>>>>> NICs, which is why you get better performance when you have 8 core 
>>>>>>>>> per port. I am not really seeing the whole picture and how DPDK is 
>>>>>>>>> configured to help more. Sorry.
>>>>>>>>
>>>>>>>>I doubt that there is a limitation wrt running 16 cores per port vs 8
>>>>>>>>cores per port as I've tried with two different machines connected
>>>>>>>>back to back each with one X710 port and 16 cores on each of them
>>>>>>>>running on that port. In that case our performance doubled as
>>>>>>>>expected.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Maybe seeing the DPDK command line would help.
>>>>>>>>
>>>>>>>>The command line I use with ports 01:00.3 and 81:00.3 is:
>>>>>>>>./warp17 -c 0xF3   -m 32768 -w :81:00.3 -w :01:00.3 --
>>>>>>>>--qmap 0.0x003FF003F0 --qmap 1.0x0FC00FFC00
>>>>>>>>
>>>>>>>>Our own qmap args allow the user to control exactly how cores are
>>>>>>>>split between ports. In this case we end up with:
>>>>>>>>
>>>>>>>>warp17> show port map
>>>>>>>>Port 0[socket: 0]:
>>>>>>>>   Core 4[socket:0] (Tx: 0, Rx: 0)
>>>>>>>>   Core 5[socket:0] (Tx: 1, Rx: 1)
>>>>>>>>   Core 6[socket:0] (Tx: 2, Rx: 2)
>>>>>>>>   Core 7[socket:0] (Tx: 3, Rx: 3)
>>>>>>>>   Core 8[socket:0] (Tx: 4, Rx: 4)
>>>>>>>>   Core 9[socket:0] (Tx: 5, Rx: 5)
>>>>>>>>   Core 20[socket:0] (Tx: 6, Rx: 6)
>>>>>>>>   Core 21[socket:0] (Tx: 7, Rx: 7)
>>>>>>>>   Core 22[socket:0] (Tx: 8, Rx: 8)
>>>>>>>>   Core 23[socket:0] (Tx: 9, Rx: 9)
>>>>>>>>   Core 24[socket:0] (Tx: 10, Rx: 10)
>>>>>>>>   Core 25[socket:0] (Tx: 11, Rx: 11)
>>>>>>>>   Core 26[socket:0] (Tx: 12, Rx: 12)
>>>>>>>>   Core 27[socket:0] (Tx: 13, Rx: 13)
>>>>>>>>   Core 28[socket:0] (Tx: 14, Rx: 14)
>>>>>>>>   Core 29[socket:0] (Tx: 15, Rx: 15)
>>>>>>>>
>>>>>>>>Port 1[socket: 1]:
>>>>>>>>   Core 10[socket:1] (Tx: 0, Rx: 0)
>>>>>>>>   Core 11[socket:1] (Tx: 1, Rx: 1)
>>>>>>>>   Core 12[socket:1] (Tx: 2, Rx: 2)
>>>>>>>>   Core 13[socket:1] (Tx: 3, Rx: 3)
>>>>>>>>   Core 14[socket:1] (Tx: 4, Rx: 4)
>>>>>>>>   Core 15[socket:1] (Tx: 5, Rx: 5)
>>>>>>>>   Core 16[socket:1] (Tx: 6, Rx: 6)
>>>>>>>>   Core 17[socket:1] (Tx: 7, Rx: 7)
>>>>>>>>   Core 18[socket:1] (Tx: 8, Rx: 8)
>>>>>>>>   Core 19[socket:1] (Tx: 9, Rx: 9)
>>>>>>>>   Core 30[socket:1] (Tx: 10, Rx: 10)
>>>>>>>>   Core 31[socket:1] (Tx: 11, Rx: 11)
>>>>>>>>   Core 32[socket:1] (Tx: 12, Rx: 12)
>>>>>>>>   Core 33[socket:1] (Tx: 13, Rx: 13

[dpdk-dev] Performance hit - NICs on different CPU sockets

2016-06-16 Thread Take Ceara
On Thu, Jun 16, 2016 at 9:33 PM, Wiles, Keith  wrote:
> On 6/16/16, 1:20 PM, "Take Ceara"  wrote:
>
>>On Thu, Jun 16, 2016 at 6:59 PM, Wiles, Keith  
>>wrote:
>>>
>>> On 6/16/16, 11:56 AM, "dev on behalf of Wiles, Keith" >> dpdk.org on behalf of keith.wiles at intel.com> wrote:
>>>
>>>>
>>>>On 6/16/16, 11:20 AM, "Take Ceara"  wrote:
>>>>
>>>>>On Thu, Jun 16, 2016 at 5:29 PM, Wiles, Keith  
>>>>>wrote:
>>>>>
>>>>>>
>>>>>> Right now I do not know what the issue is with the system. Could be too 
>>>>>> many Rx/Tx ring pairs per port and limiting the memory in the NICs, 
>>>>>> which is why you get better performance when you have 8 core per port. I 
>>>>>> am not really seeing the whole picture and how DPDK is configured to 
>>>>>> help more. Sorry.
>>>>>
>>>>>I doubt that there is a limitation wrt running 16 cores per port vs 8
>>>>>cores per port as I've tried with two different machines connected
>>>>>back to back each with one X710 port and 16 cores on each of them
>>>>>running on that port. In that case our performance doubled as
>>>>>expected.
>>>>>
>>>>>>
>>>>>> Maybe seeing the DPDK command line would help.
>>>>>
>>>>>The command line I use with ports 01:00.3 and 81:00.3 is:
>>>>>./warp17 -c 0xF3   -m 32768 -w :81:00.3 -w :01:00.3 --
>>>>>--qmap 0.0x003FF003F0 --qmap 1.0x0FC00FFC00
>>>>>
>>>>>Our own qmap args allow the user to control exactly how cores are
>>>>>split between ports. In this case we end up with:
>>>>>
>>>>>warp17> show port map
>>>>>Port 0[socket: 0]:
>>>>>   Core 4[socket:0] (Tx: 0, Rx: 0)
>>>>>   Core 5[socket:0] (Tx: 1, Rx: 1)
>>>>>   Core 6[socket:0] (Tx: 2, Rx: 2)
>>>>>   Core 7[socket:0] (Tx: 3, Rx: 3)
>>>>>   Core 8[socket:0] (Tx: 4, Rx: 4)
>>>>>   Core 9[socket:0] (Tx: 5, Rx: 5)
>>>>>   Core 20[socket:0] (Tx: 6, Rx: 6)
>>>>>   Core 21[socket:0] (Tx: 7, Rx: 7)
>>>>>   Core 22[socket:0] (Tx: 8, Rx: 8)
>>>>>   Core 23[socket:0] (Tx: 9, Rx: 9)
>>>>>   Core 24[socket:0] (Tx: 10, Rx: 10)
>>>>>   Core 25[socket:0] (Tx: 11, Rx: 11)
>>>>>   Core 26[socket:0] (Tx: 12, Rx: 12)
>>>>>   Core 27[socket:0] (Tx: 13, Rx: 13)
>>>>>   Core 28[socket:0] (Tx: 14, Rx: 14)
>>>>>   Core 29[socket:0] (Tx: 15, Rx: 15)
>>>>>
>>>>>Port 1[socket: 1]:
>>>>>   Core 10[socket:1] (Tx: 0, Rx: 0)
>>>>>   Core 11[socket:1] (Tx: 1, Rx: 1)
>>>>>   Core 12[socket:1] (Tx: 2, Rx: 2)
>>>>>   Core 13[socket:1] (Tx: 3, Rx: 3)
>>>>>   Core 14[socket:1] (Tx: 4, Rx: 4)
>>>>>   Core 15[socket:1] (Tx: 5, Rx: 5)
>>>>>   Core 16[socket:1] (Tx: 6, Rx: 6)
>>>>>   Core 17[socket:1] (Tx: 7, Rx: 7)
>>>>>   Core 18[socket:1] (Tx: 8, Rx: 8)
>>>>>   Core 19[socket:1] (Tx: 9, Rx: 9)
>>>>>   Core 30[socket:1] (Tx: 10, Rx: 10)
>>>>>   Core 31[socket:1] (Tx: 11, Rx: 11)
>>>>>   Core 32[socket:1] (Tx: 12, Rx: 12)
>>>>>   Core 33[socket:1] (Tx: 13, Rx: 13)
>>>>>   Core 34[socket:1] (Tx: 14, Rx: 14)
>>>>>   Core 35[socket:1] (Tx: 15, Rx: 15)
>>>>
>>>>On each socket you have 10 physical cores or 20 lcores per socket for 40 
>>>>lcores total.
>>>>
>>>>The above is listing the LCORES (or hyper-threads) and not COREs, which I 
>>>>understand some like to think they are interchangeable. The problem is the 
>>>>hyper-threads are logically interchangeable, but not performance wise. If 
>>>>you have two run-to-completion threads on a single physical core each on a 
>>>>different hyper-thread of that core [0,1], then the second lcore or thread 
>>>>(1) on that physical core will only get at most about 30-20% of the CPU 
>>>>cycles. Normally it is much less, unless you tune the code to make sure 
>>>>each thread is not trying to share the internal execution units, but some 
>>>>internal execution units are always shared.
>>>>
>>>>To get the best performance when hyper-threading is enabl

[dpdk-dev] Performance hit - NICs on different CPU sockets

2016-06-16 Thread Take Ceara
On Thu, Jun 16, 2016 at 6:59 PM, Wiles, Keith  wrote:
>
> On 6/16/16, 11:56 AM, "dev on behalf of Wiles, Keith"  dpdk.org on behalf of keith.wiles at intel.com> wrote:
>
>>
>>On 6/16/16, 11:20 AM, "Take Ceara"  wrote:
>>
>>>On Thu, Jun 16, 2016 at 5:29 PM, Wiles, Keith  
>>>wrote:
>>>
>>>>
>>>> Right now I do not know what the issue is with the system. Could be too 
>>>> many Rx/Tx ring pairs per port and limiting the memory in the NICs, which 
>>>> is why you get better performance when you have 8 core per port. I am not 
>>>> really seeing the whole picture and how DPDK is configured to help more. 
>>>> Sorry.
>>>
>>>I doubt that there is a limitation wrt running 16 cores per port vs 8
>>>cores per port as I've tried with two different machines connected
>>>back to back each with one X710 port and 16 cores on each of them
>>>running on that port. In that case our performance doubled as
>>>expected.
>>>
>>>>
>>>> Maybe seeing the DPDK command line would help.
>>>
>>>The command line I use with ports 01:00.3 and 81:00.3 is:
>>>./warp17 -c 0xF3   -m 32768 -w :81:00.3 -w :01:00.3 --
>>>--qmap 0.0x003FF003F0 --qmap 1.0x0FC00FFC00
>>>
>>>Our own qmap args allow the user to control exactly how cores are
>>>split between ports. In this case we end up with:
>>>
>>>warp17> show port map
>>>Port 0[socket: 0]:
>>>   Core 4[socket:0] (Tx: 0, Rx: 0)
>>>   Core 5[socket:0] (Tx: 1, Rx: 1)
>>>   Core 6[socket:0] (Tx: 2, Rx: 2)
>>>   Core 7[socket:0] (Tx: 3, Rx: 3)
>>>   Core 8[socket:0] (Tx: 4, Rx: 4)
>>>   Core 9[socket:0] (Tx: 5, Rx: 5)
>>>   Core 20[socket:0] (Tx: 6, Rx: 6)
>>>   Core 21[socket:0] (Tx: 7, Rx: 7)
>>>   Core 22[socket:0] (Tx: 8, Rx: 8)
>>>   Core 23[socket:0] (Tx: 9, Rx: 9)
>>>   Core 24[socket:0] (Tx: 10, Rx: 10)
>>>   Core 25[socket:0] (Tx: 11, Rx: 11)
>>>   Core 26[socket:0] (Tx: 12, Rx: 12)
>>>   Core 27[socket:0] (Tx: 13, Rx: 13)
>>>   Core 28[socket:0] (Tx: 14, Rx: 14)
>>>   Core 29[socket:0] (Tx: 15, Rx: 15)
>>>
>>>Port 1[socket: 1]:
>>>   Core 10[socket:1] (Tx: 0, Rx: 0)
>>>   Core 11[socket:1] (Tx: 1, Rx: 1)
>>>   Core 12[socket:1] (Tx: 2, Rx: 2)
>>>   Core 13[socket:1] (Tx: 3, Rx: 3)
>>>   Core 14[socket:1] (Tx: 4, Rx: 4)
>>>   Core 15[socket:1] (Tx: 5, Rx: 5)
>>>   Core 16[socket:1] (Tx: 6, Rx: 6)
>>>   Core 17[socket:1] (Tx: 7, Rx: 7)
>>>   Core 18[socket:1] (Tx: 8, Rx: 8)
>>>   Core 19[socket:1] (Tx: 9, Rx: 9)
>>>   Core 30[socket:1] (Tx: 10, Rx: 10)
>>>   Core 31[socket:1] (Tx: 11, Rx: 11)
>>>   Core 32[socket:1] (Tx: 12, Rx: 12)
>>>   Core 33[socket:1] (Tx: 13, Rx: 13)
>>>   Core 34[socket:1] (Tx: 14, Rx: 14)
>>>   Core 35[socket:1] (Tx: 15, Rx: 15)
>>
>>On each socket you have 10 physical cores or 20 lcores per socket for 40 
>>lcores total.
>>
>>The above is listing the LCORES (or hyper-threads) and not COREs, which I 
>>understand some like to think they are interchangeable. The problem is the 
>>hyper-threads are logically interchangeable, but not performance wise. If you 
>>have two run-to-completion threads on a single physical core each on a 
>>different hyper-thread of that core [0,1], then the second lcore or thread 
>>(1) on that physical core will only get at most about 30-20% of the CPU 
>>cycles. Normally it is much less, unless you tune the code to make sure each 
>>thread is not trying to share the internal execution units, but some internal 
>>execution units are always shared.
>>
>>To get the best performance when hyper-threading is enable is to not run both 
>>threads on a single physical core, but only run one hyper-thread-0.
>>
>>In the table below the table lists the physical core id and each of the lcore 
>>ids per socket. Use the first lcore per socket for the best performance:
>>Core 1 [1, 21][11, 31]
>>Use lcore 1 or 11 depending on the socket you are on.
>>
>>The info below is most likely the best performance and utilization of your 
>>system. If I got the values right ?
>>
>>./warp17 -c 0x0FFFe0   -m 32768 -w :81:00.3 -w :01:00.3 --
>>--qmap 0.0x0003FE --qmap 1.0x0FFE00
>>
>>Port 0[socket: 0]:
>>   Core 2[socket:0] (Tx: 0, Rx: 0)
>>   Core 3[socket:0] (Tx: 1, Rx: 1)
>>   Core 4[socket:0] (Tx: 2, Rx: 2)

[dpdk-dev] Performance hit - NICs on different CPU sockets

2016-06-16 Thread Take Ceara
On Thu, Jun 16, 2016 at 5:29 PM, Wiles, Keith  wrote:

>
> Right now I do not know what the issue is with the system. Could be too many 
> Rx/Tx ring pairs per port and limiting the memory in the NICs, which is why 
> you get better performance when you have 8 core per port. I am not really 
> seeing the whole picture and how DPDK is configured to help more. Sorry.

I doubt that there is a limitation wrt running 16 cores per port vs 8
cores per port as I've tried with two different machines connected
back to back each with one X710 port and 16 cores on each of them
running on that port. In that case our performance doubled as
expected.

>
> Maybe seeing the DPDK command line would help.

The command line I use with ports 01:00.3 and 81:00.3 is:
./warp17 -c 0xF3   -m 32768 -w :81:00.3 -w :01:00.3 --
--qmap 0.0x003FF003F0 --qmap 1.0x0FC00FFC00

Our own qmap args allow the user to control exactly how cores are
split between ports (see the sketch after the port map below). In this
case we end up with:

warp17> show port map
Port 0[socket: 0]:
   Core 4[socket:0] (Tx: 0, Rx: 0)
   Core 5[socket:0] (Tx: 1, Rx: 1)
   Core 6[socket:0] (Tx: 2, Rx: 2)
   Core 7[socket:0] (Tx: 3, Rx: 3)
   Core 8[socket:0] (Tx: 4, Rx: 4)
   Core 9[socket:0] (Tx: 5, Rx: 5)
   Core 20[socket:0] (Tx: 6, Rx: 6)
   Core 21[socket:0] (Tx: 7, Rx: 7)
   Core 22[socket:0] (Tx: 8, Rx: 8)
   Core 23[socket:0] (Tx: 9, Rx: 9)
   Core 24[socket:0] (Tx: 10, Rx: 10)
   Core 25[socket:0] (Tx: 11, Rx: 11)
   Core 26[socket:0] (Tx: 12, Rx: 12)
   Core 27[socket:0] (Tx: 13, Rx: 13)
   Core 28[socket:0] (Tx: 14, Rx: 14)
   Core 29[socket:0] (Tx: 15, Rx: 15)

Port 1[socket: 1]:
   Core 10[socket:1] (Tx: 0, Rx: 0)
   Core 11[socket:1] (Tx: 1, Rx: 1)
   Core 12[socket:1] (Tx: 2, Rx: 2)
   Core 13[socket:1] (Tx: 3, Rx: 3)
   Core 14[socket:1] (Tx: 4, Rx: 4)
   Core 15[socket:1] (Tx: 5, Rx: 5)
   Core 16[socket:1] (Tx: 6, Rx: 6)
   Core 17[socket:1] (Tx: 7, Rx: 7)
   Core 18[socket:1] (Tx: 8, Rx: 8)
   Core 19[socket:1] (Tx: 9, Rx: 9)
   Core 30[socket:1] (Tx: 10, Rx: 10)
   Core 31[socket:1] (Tx: 11, Rx: 11)
   Core 32[socket:1] (Tx: 12, Rx: 12)
   Core 33[socket:1] (Tx: 13, Rx: 13)
   Core 34[socket:1] (Tx: 14, Rx: 14)
   Core 35[socket:1] (Tx: 15, Rx: 15)
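
For readers not familiar with the qmap masks above: 0x003FF003F0 selects
lcores 4-9 and 20-29, 0x0FC00FFC00 selects lcores 10-19 and 30-35, and each
selected lcore gets one Tx/Rx queue pair in ascending order. The sketch below
is only a hypothetical illustration of that mapping, not WARP17's actual
qmap parser:

#include <stdint.h>
#include <stdio.h>

/* Walk a core mask such as 0x003FF003F0 and give each selected lcore one
 * Tx/Rx queue pair, in ascending lcore order; this reproduces the
 * "show port map" output above. */
static void
print_qmap(int port, uint64_t core_mask)
{
    uint16_t queue = 0;
    unsigned int lcore;

    for (lcore = 0; lcore < 64; lcore++) {
        if (core_mask & (UINT64_C(1) << lcore)) {
            printf("Port %d: lcore %u -> Tx/Rx queue %u\n",
                   port, lcore, queue);
            queue++;
        }
    }
}

int main(void)
{
    print_qmap(0, UINT64_C(0x003FF003F0));   /* lcores 4-9 and 20-29   */
    print_qmap(1, UINT64_C(0x0FC00FFC00));   /* lcores 10-19 and 30-35 */
    return 0;
}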

Just for reference, the cpu_layout script shows:
$ $RTE_SDK/tools/cpu_layout.py

Core and Socket Information (as reported by '/proc/cpuinfo')


cores =  [0, 1, 2, 3, 4, 8, 9, 10, 11, 12]
sockets =  [0, 1]

        Socket 0        Socket 1

Core 0  [0, 20] [10, 30]
Core 1  [1, 21] [11, 31]
Core 2  [2, 22] [12, 32]
Core 3  [3, 23] [13, 33]
Core 4  [4, 24] [14, 34]
Core 8  [5, 25] [15, 35]
Core 9  [6, 26] [16, 36]
Core 10 [7, 27] [17, 37]
Core 11 [8, 28] [18, 38]
Core 12 [9, 29] [19, 39]

I know it might be complicated to figure out exactly what's happening
in our setup with our own code, so please let me know if you need
additional information.

I appreciate the help!

Thanks,
Dumitru


[dpdk-dev] Performance hit - NICs on different CPU sockets

2016-06-16 Thread Take Ceara
On Thu, Jun 16, 2016 at 4:58 PM, Wiles, Keith  wrote:
>
> From the output below it appears the x710 devices 01:00.[0-3] are on socket 0
> And the x710 devices 02:00.[0-3] sit on socket 1.
>

I assume there's a mistake here. The x710 devices on socket 0 are:
$ lspci | grep -ie "01:.*x710"
01:00.0 Ethernet controller: Intel Corporation Ethernet Controller
X710 for 10GbE SFP+ (rev 01)
01:00.1 Ethernet controller: Intel Corporation Ethernet Controller
X710 for 10GbE SFP+ (rev 01)
01:00.2 Ethernet controller: Intel Corporation Ethernet Controller
X710 for 10GbE SFP+ (rev 01)
01:00.3 Ethernet controller: Intel Corporation Ethernet Controller
X710 for 10GbE SFP+ (rev 01)

and the X710 devices on socket 1 are:
$ lspci | grep -ie "81:.*x710"
81:00.0 Ethernet controller: Intel Corporation Ethernet Controller
X710 for 10GbE SFP+ (rev 01)
81:00.1 Ethernet controller: Intel Corporation Ethernet Controller
X710 for 10GbE SFP+ (rev 01)
81:00.2 Ethernet controller: Intel Corporation Ethernet Controller
X710 for 10GbE SFP+ (rev 01)
81:00.3 Ethernet controller: Intel Corporation Ethernet Controller
X710 for 10GbE SFP+ (rev 01)

> This means the ports on 01.00.xx should be handled by socket 0 CPUs and 
> 02:00.xx should be handled by Socket 1. I can not tell if that is the case 
> for you here. The CPUs or lcores from the cpu_layout.py should help 
> understand the layout.
>

That was the first scenario I tried:
- assign 16 CPUs from socket 0 to port 0 (01:00.3)
- assign 16 CPUs from socket 1 to port 1 (81:00.3)

Our performance measurements then show a setup rate of 1.6M sess/s,
which is less than half of what I get when I install both X710 NICs on
socket 1 and use only 16 CPUs from socket 1 for both ports.

I double checked the cpu layout. We also have our own CLI and warnings
when using cores that are not on the same socket as the port they're
assigned too so the mapping should be fine.

Thanks,
Dumitru


[dpdk-dev] Performance hit - NICs on different CPU sockets

2016-06-16 Thread Take Ceara
Hi Keith,

On Tue, Jun 14, 2016 at 3:47 PM, Wiles, Keith  wrote:
>>> Normally the limitation is in the hardware, basically how the PCI bus is 
>>> connected to the CPUs (or sockets). How the PCI buses are connected to the 
>>> system depends on the Mother board design. I normally see the buses 
>>> attached to socket 0, but you could have some of the buses attached to the 
>>> other sockets or all on one socket via a PCI bridge device.
>>>
>>> No easy way around the problem if some of your PCI buses are split or all 
>>> on a single socket. Need to look at your system docs or look at lspci it 
>>> has an option to dump the PCI bus as an ASCII tree, at least on Ubuntu.
>>
>>This is the motherboard we use on our system:
>>
>>http://www.supermicro.com/products/motherboard/Xeon/C600/X10DRX.cfm
>>
>>I need to swap some NICs around (as now we moved everything on socket
>>1) before I can share the lspci output.
>
> FYI: the option for lspci is 'lspci -tv', but maybe more options too.
>

I retested with two 10G X710 ports connected back to back:
port 0: :01:00.3 - socket 0
port 1: :81:00.3 - socket 1

I ran the following scenarios:
- assign 16 threads from CPU 0 on socket 0 to port 0 and 16 threads
from CPU 1 to port 1 => setup rate of 1.6M sess/s
- assign only the 16 threads from CPU0 for both ports (so 8 threads on
socket 0 for port 0 and 8 threads on socket 0 for port 1) => setup
rate of 3M sess/s
- assign only the 16 threads from CPU1 for both ports (so 8 threads on
socket 1 for port 0 and 8 threads on socket 1 for port 1) => setup
rate of 3M sess/s

I also tried a scenario with two machines connected back to back each
of which had a NIC on socket 1. I assigned 16 threads from socket 1 on
each machine to the port and performance scaled to 6M sess/s as
expected.

I double checked all our memory allocations and, at least in the
tested scenario, we never use memory that's not on the same socket as
the core.
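
The check boils down to creating every packet pool on the socket of the port
that uses it. A simplified sketch; the names are mine, not the actual WARP17
code:

#include <stdio.h>
#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>

/* Create the RX pool for a port on that port's own NUMA node so packet
 * buffers and the cores polling them stay local. */
static struct rte_mempool *
rx_pool_for_port(uint8_t port_id)
{
    char name[RTE_MEMPOOL_NAMESIZE];
    int socket = rte_eth_dev_socket_id(port_id);

    if (socket < 0)            /* NUMA node unknown: fall back to caller's */
        socket = rte_socket_id();

    snprintf(name, sizeof(name), "rx_pool_p%u", (unsigned)port_id);
    return rte_pktmbuf_pool_create(name, 16384, 256, 0,
                                   RTE_MBUF_DEFAULT_BUF_SIZE, socket);
}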

I pasted below the output of lspci -tv. I see that :01:00.3 and
:81:00.3 are connected to different PCI bridges but on each of
those bridges there are also "Intel Corporation Xeon E7 v3/Xeon E5
v3/Core i7 DMA Channel " devices.

It would be great if you could also take a look in case I
missed/misunderstood something.

Thanks,
Dumitru

# lspci -tv
-+-[:ff]-+-08.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 0
 |   +-08.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 0
 |   +-08.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 0
 |   +-09.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 1
 |   +-09.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 1
 |   +-09.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 1
 |   +-0b.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
R3 QPI Link 0 & 1 Monitoring
 |   +-0b.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
R3 QPI Link 0 & 1 Monitoring
 |   +-0b.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
R3 QPI Link 0 & 1 Monitoring
 |   +-0c.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Unicast Registers
 |   +-0c.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Unicast Registers
 |   +-0c.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Unicast Registers
 |   +-0c.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Unicast Registers
 |   +-0c.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Unicast Registers
 |   +-0c.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Unicast Registers
 |   +-0c.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Unicast Registers
 |   +-0c.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Unicast Registers
 |   +-0d.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Unicast Registers
 |   +-0d.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Unicast Registers
 |   +-0f.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Buffered Ring Agent
 |   +-0f.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Buffered Ring Agent
 |   +-0f.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Buffered Ring Agent
 |   +-0f.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Buffered Ring Agent
 |   +-0f.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
System Address Decoder & Broadcast Registers
 |   +-0f.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
System Address Decoder & Broadcast Registers
 |   +-0f.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
System Address Decoder & Broadcast Registers
 |   +-10.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
PCIe Ring Interface
 |   +-10.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
PCIe Ring Interface
 |   +-10.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Scratchpad & Semaphore Registers
 |   +-10.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Scratchp

[dpdk-dev] Performance hit - NICs on different CPU sockets

2016-06-14 Thread Take Ceara
Hi Bruce,

On Mon, Jun 13, 2016 at 4:28 PM, Bruce Richardson
 wrote:
> On Mon, Jun 13, 2016 at 04:07:37PM +0200, Take Ceara wrote:
>> Hi,
>>
>> I'm reposting here as I didn't get any answers on the dpdk-users mailing 
>> list.
>>
>> We're working on a stateful traffic generator (www.warp17.net) using
>> DPDK and we would like to control two XL710 NICs (one on each socket)
>> to maximize CPU usage. It looks that we run into the following
>> limitation:
>>
>> http://dpdk.org/doc/guides/linux_gsg/nic_perf_intel_platform.html
>> section 7.2, point 3
>>
>> We completely split memory/cpu/NICs across the two sockets. However,
>> the performance with a single CPU and both NICs on the same socket is
>> better.
>> Why do all the NICs have to be on the same socket, is there a
>> driver/hw limitation?
>>
> Hi,
>
> so long as each thread only ever accesses the NIC on it's own local socket, 
> then
> there is no performance penalty. It's only when a thread on one socket works
> using a NIC on a remote socket that you start seeing a penalty, with all
> NIC-core communication having to go across QPI.
>
> /Bruce

Thanks for the confirmation. We'll go through our code again to double
check that no thread accesses the NIC or memory on a remote socket.
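
For reference, a check along these lines would catch a remote-socket
assignment (a sketch only, not WARP17 code):

#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_log.h>

/* Warn when a polling lcore sits on a different NUMA node than the port
 * it serves, i.e. exactly the cross-QPI access described above. */
static void
check_port_lcore_locality(uint8_t port_id, unsigned int lcore_id)
{
    int port_socket  = rte_eth_dev_socket_id(port_id);
    int lcore_socket = (int)rte_lcore_to_socket_id(lcore_id);

    if (port_socket >= 0 && port_socket != lcore_socket)
        RTE_LOG(WARNING, USER1,
                "lcore %u (socket %d) assigned to port %u (socket %d)\n",
                lcore_id, lcore_socket, (unsigned)port_id, port_socket);
}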

Regards,
Dumitru


[dpdk-dev] Performance hit - NICs on different CPU sockets

2016-06-14 Thread Take Ceara
Hi Keith,

On Mon, Jun 13, 2016 at 9:35 PM, Wiles, Keith  wrote:
>
> On 6/13/16, 9:07 AM, "dev on behalf of Take Ceara"  on behalf of dumitru.ceara at gmail.com> wrote:
>
>>Hi,
>>
>>I'm reposting here as I didn't get any answers on the dpdk-users mailing list.
>>
>>We're working on a stateful traffic generator (www.warp17.net) using
>>DPDK and we would like to control two XL710 NICs (one on each socket)
>>to maximize CPU usage. It looks that we run into the following
>>limitation:
>>
>>http://dpdk.org/doc/guides/linux_gsg/nic_perf_intel_platform.html
>>section 7.2, point 3
>>
>>We completely split memory/cpu/NICs across the two sockets. However,
>>the performance with a single CPU and both NICs on the same socket is
>>better.
>>Why do all the NICs have to be on the same socket, is there a
>>driver/hw limitation?
>
> Normally the limitation is in the hardware, basically how the PCI bus is 
> connected to the CPUs (or sockets). How the PCI buses are connected to the 
> system depends on the Mother board design. I normally see the buses attached 
> to socket 0, but you could have some of the buses attached to the other 
> sockets or all on one socket via a PCI bridge device.
>
> No easy way around the problem if some of your PCI buses are split or all on 
> a single socket. Need to look at your system docs or look at lspci it has an 
> option to dump the PCI bus as an ASCII tree, at least on Ubuntu.

This is the motherboard we use on our system:

http://www.supermicro.com/products/motherboard/Xeon/C600/X10DRX.cfm

I need to swap some NICs around (as now we moved everything on socket
1) before I can share the lspci output.

Thanks,
Dumitru


[dpdk-dev] Performance hit - NICs on different CPU sockets

2016-06-13 Thread Take Ceara
Hi,

I'm reposting here as I didn't get any answers on the dpdk-users mailing list.

We're working on a stateful traffic generator (www.warp17.net) using
DPDK and we would like to control two XL710 NICs (one on each socket)
to maximize CPU usage. It looks that we run into the following
limitation:

http://dpdk.org/doc/guides/linux_gsg/nic_perf_intel_platform.html
section 7.2, point 3

We completely split memory/cpu/NICs across the two sockets. However,
the performance with a single CPU and both NICs on the same socket is
better.
Why do all the NICs have to be on the same socket, is there a
driver/hw limitation?

Thanks,
Dumitru Ceara