Ed Ravin wrote:
> On Mon, Jul 26, 2010 at 08:52:20AM -0700, Alexander Duyck wrote:
>> My recommendations are to do the following:
>> 1.  Set the RX and TX ring sizes to 256.  This makes it so that all the
>> descriptors for each ring fit within a single 4K page.
> 
> Tried that. rx_fifo_errors went up to over 6000 per second.

The fact that the ring size seems to affect the number of packets
dropped per second implies that there may be some sort of latency issue.
One thing you might try is different values for rx-usecs via ethtool -C.
You may find that fixing the value at something fairly low, like 33
usecs per interrupt, helps reduce the number of rx_fifo_errors.
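
For example, something along these lines (33 is only a starting point,
and the exact interrupt throttling behavior depends on the driver
version):

   ethtool -C eth0 rx-usecs 33
   ethtool -C eth1 rx-usecs 33

You can check the coalescing settings before and after the change with
"ethtool -c eth0".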

>> 2.  You may want to just stack all of the matching queues on the same
>> CPU, so rx/tx 0 for both ports on CPU 0, rx/tx 1 on CPU 1, etc.  This way
>> you can keep the memory local and reduce cross-CPU and cross-node
>> allocation and freeing.
> 
> This is the default layout after loading the 2.2.9 driver:
> 
> # eth_affinity_tool show --verbose
> 16 CPUs detected
> 
> eth0:
>         25: eth0: ffff
>         26: eth0-rx-0: 0001
>         27: eth0-rx-1: 0002
>         28: eth0-rx-2: 0004
>         29: eth0-rx-3: 0008
>         30: eth0-tx-0: 0001
>         31: eth0-tx-1: 0002
>         32: eth0-tx-2: 0004
>         33: eth0-tx-3: 0008
> 
> eth1:
>         34: eth1: ffff
>         35: eth1-rx-0: 0001
>         36: eth1-rx-1: 0002
>         37: eth1-rx-2: 0004
>         38: eth1-rx-3: 0008
>         39: eth1-tx-0: 0001
>         40: eth1-tx-1: 0002
>         41: eth1-tx-2: 0004
>         42: eth1-tx-3: 0008
> 

This looks okay, but you may want to try running with QueuePairs enabled
so that you have 8 paired RX/TX queues per port and can spread the work
over more cores.
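
If you want to give that a try with the 2.2.9 driver, a reload along
these lines should do it (parameter names taken from your notes below;
as far as I recall the values are comma-separated, one entry per port,
but double check the README that ships with the driver):

   rmmod igb
   modprobe igb RSS=0,0 QueuePairs=1,1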

>> 3.  You could probably also set the RSS value to 0 and see how many
>> queues this gives you.
> 
> With the stock 2.6.34.1 igb driver, version 2.1.0-k2, which does not seem
> to be tuneable, I get 8 paired RX/TX queues.   With 2.2.9, I can get
> either 4 RX and 4 TX queues per interface, or with RSS=0 and not setting
> QueuePairs to 0, I get 8 RX/TX queues per interface.

Since you are running a pair of 8-core processors, it might be best to
set things up for 8 queues and spread them over a single physical
processor ID.  I realize that didn't give you the best performance, but
I suspect the reason is that the adapter is physically closer to one
package than the other.
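
As a rough sketch, assuming you reload with QueuePairs as above so that
each port gets 8 TxRx vectors, and assuming the local package turns out
to be CPUs 0-7 (swap in 8-15 if it is the other one), the affinity
setup would look something like this -- the IRQ entries below are
placeholders, so take the real numbers from /proc/interrupts after the
reload:

   # one vector per core; masks are hex CPU bitmaps
   echo 0001 > /proc/irq/<eth0-TxRx-0>/smp_affinity
   echo 0002 > /proc/irq/<eth0-TxRx-1>/smp_affinity
   echo 0004 > /proc/irq/<eth0-TxRx-2>/smp_affinity
   echo 0008 > /proc/irq/<eth0-TxRx-3>/smp_affinity
   echo 0010 > /proc/irq/<eth0-TxRx-4>/smp_affinity
   echo 0020 > /proc/irq/<eth0-TxRx-5>/smp_affinity
   echo 0040 > /proc/irq/<eth0-TxRx-6>/smp_affinity
   echo 0080 > /proc/irq/<eth0-TxRx-7>/smp_affinity

and the same again for eth1.  Also make sure irqbalance is not running,
or it will rewrite these masks behind your back.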

>> Depending on what hardware you have, there may be more queues
>> available, and if the CPUs each carry a stack of queues as I suggested
>> in item 2, then spreading the work out over more CPUs would be
>> advisable.
> 
> Don't see much difference when doing that.  The worst performance was
> when everything was on CPUs with the same physical ID.  The best
> performance appears to come from the 2.1.0-k2 driver in its unchangeable
> configuration, which is quite different from what I see on an Intel Xeon
> non-NUMA platform - on that one, 2.2.9 has the best performance.

After looking over your lspci dump, I am assuming you are running a
Supermicro motherboard with the AMD SR5690/SP5100 chipset.  If that is
the case, you will probably find that one physical ID works much better
than the other for network performance, because the SR5690 that the
82576 is connected to will be node-local for one of the sockets and
remote for the other.
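
A quick way to confirm which node the adapter hangs off of is to read
the numa_node attribute of the PCI device through sysfs (assuming sysfs
is mounted and populated -- the libnuma warning further down suggests
that is worth double checking):

   cat /sys/class/net/eth0/device/numa_node

A 0 or 3 tells you which package to keep the queues and ring memory
close to; -1 means the platform did not report the locality.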

Another factor to take into account is that the ring memory should be
allocated on the same node the hardware is attached to.  You should be
able to accomplish that by using taskset with the CPU mask for the
physical ID you are targeting when you run modprobe/insmod and the
ifconfig commands that bring up the interfaces.  This should help
decrease memory latency and increase the throughput available to the
adapter.
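
As a rough sketch, assuming the adapter turns out to be local to node 0
(CPUs 0-7 per the numactl output below), wrapping the reload and
interface bring-up from above in taskset would look something like:

   taskset -c 0-7 modprobe igb RSS=0,0 QueuePairs=1,1
   taskset -c 0-7 ifconfig eth0 up
   taskset -c 0-7 ifconfig eth1 up

If the adapter is on node 3 instead, use -c 8-15.  The idea is simply
that the allocations done at load and bring-up time then come out of
memory on the same node as the device.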

>> 4.  One other thing that might be useful would be to put a static entry
>> into your ARP table for the destination IPs you are routing to.  I have
>> seen instances where packets were dropped due to a delay in obtaining
>> the MAC address via ARP.
> 
> Already had static ARP for the destination IPs.
> 
> I fetched numactl and numastat from the Debian repository.
> Interestingly, "numastat" says there aren't any misses:
> 
> r...@big-tester:~# numastat
>                            node0           node3
> numa_hit                 3873922         1052324
> numa_miss                      0               0
> numa_foreign                   0               0
> interleave_hit              5601            5331
> local_node               3733844         1050500
> other_node                140078            1824
> 
> but "numactl" seems to have some issues so I don't know if I can trust
> any of these numbers:
> 
> r...@big-tester:~# numactl -H
> available: 4 nodes (0-3)
> node 0 cpus: 0 1 2 3 4 5 6 7
> node 0 size: 2047 MB
> node 0 free: 1740 MB
> libnuma: Warning: /sys not mounted or invalid. Assuming one node: No such 
> file or directory
> node 1 cpus:
> node 1 size: <not available>
> node 1 free: <not available>
> node 2 cpus:
> node 2 size: <not available>
> node 2 free: <not available>
> node 3 cpus: 8 9 10 11 12 13 14 15
> node 3 size: 2046 MB
> node 3 free: 1934 MB
> node distances:
> node   0   1   2   3
>   0:  10  20   0   0
>   1:   0   0   0  134529
>   2:   0  134529 -143519016 -143519016
>   3: -143519016 -143519016   0   0
> 
>> That is what I can think of off the top of my head.  Other than that,
>> if you could provide more information on the system it would be useful.
>> Perhaps an lspci -vvv, a dump of /proc/cpuinfo, and /proc/zoneinfo.  With
>> that I can give more detailed steps on a layout that might provide the
>> best performance.
> 
> 
> Requested dumps below - cpuinfo, zoneinfo, lspci.  Thank you!

Thanks for the information.  One other item I would be interested in
seeing is the kind of numbers we are talking about.  If you could provide
an ethtool -S dump covering 10 seconds of one of your tests, that would
help me better understand the kind of pressure the system is under.
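
Nothing fancy is needed -- two snapshots around a 10 second window
during a test run would be enough (substitute whichever port is taking
the traffic):

   ethtool -S eth0 > stats-before.txt
   sleep 10
   ethtool -S eth0 > stats-after.txt

The deltas between the two snapshots give the per-counter rates over
those 10 seconds.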

Thanks,

Alex
