On 03/22/2015 04:03 AM, Wolfgang Rosner wrote:
> On Sunday, 22 March 2015 07:38:43 you wrote:
>> On 03/21/2015 12:38 PM, Wolfgang Rosner wrote:
>>> for the first case (one thread)
>>>   PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
>>> 24561 root      20   0 96700 1492 1340 S  22.2  0.0   0:00.80 iperf
>>>
>>> for the second case (two threads)
>>>   PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
>>> 25086 root      20   0  166m 1516 1368 S  41.2  0.0   0:01.31 iperf
>>>
>>> So it's not a CPU bottleneck.
>> You might want to also check the per-CPU statistics when running your
>> test. 
> Sorry, there is always this tradeoff between information and clutter.
> Here is the whole story.
> I hope you have set your mailer to a monospace font (I'm just looking at
> mine...)
>
>
> top - 18:46:27 up 1 day, 22:25,  8 users,  load average: 0.13, 0.11, 0.14
> Tasks: 266 total,   1 running, 265 sleeping,   0 stopped,   0 zombie
> %Cpu0  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu1  :  0.0 us,  2.7 sy,  0.0 ni, 97.0 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
> %Cpu2  :  0.0 us,  6.0 sy,  0.0 ni, 93.2 id,  0.0 wa,  0.0 hi,  0.8 si,  0.0 st
> %Cpu3  :  0.0 us,  3.8 sy,  0.0 ni, 96.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu4  :  0.0 us,  0.7 sy,  0.0 ni, 99.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu5  :  0.0 us,  3.0 sy,  0.0 ni, 96.2 id,  0.8 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu6  :  0.0 us,  4.3 sy,  0.0 ni, 95.3 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
> %Cpu7  :  0.4 us,  4.3 sy,  0.0 ni, 95.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> KiB Mem:  16335308 total, 16105428 used,   229880 free,   184344 buffers
> KiB Swap: 31249404 total,    69976 used, 31179428 free, 12255816 cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
> 24561 root      20   0 96700 1492 1340 S  22.2  0.0   0:00.80 iperf
>     7 root      20   0     0    0    0 S   0.3  0.0   1:27.50 rcu_sched
>    18 root      20   0     0    0    0 S   0.3  0.0   0:07.88 ksoftirqd/2
>    23 root      20   0     0    0    0 S   0.3  0.0   0:04.94 ksoftirqd/3
>    33 root      20   0     0    0    0 S   0.3  0.0   0:04.87 ksoftirqd/5
>
>
>
> top - 18:47:24 up 1 day, 22:26,  8 users,  load average: 0.25, 0.15, 0.15
> Tasks: 269 total,   2 running, 267 sleeping,   0 stopped,   0 zombie
> %Cpu0  :  0.3 us,  0.7 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu1  :  0.4 us,  7.4 sy,  0.0 ni, 90.9 id,  0.0 wa,  0.0 hi,  1.2 si,  0.0 st
> %Cpu2  :  0.0 us,  4.5 sy,  0.0 ni, 92.6 id,  0.0 wa,  0.0 hi,  2.9 si,  0.0 st
> %Cpu3  :  0.0 us,  7.9 sy,  0.0 ni, 90.9 id,  0.0 wa,  0.0 hi,  1.2 si,  0.0 st
> %Cpu4  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu5  :  0.4 us,  7.0 sy,  0.0 ni, 91.4 id,  0.0 wa,  0.0 hi,  1.2 si,  0.0 st
> %Cpu6  :  0.0 us,  6.3 sy,  0.0 ni, 92.4 id,  0.0 wa,  0.0 hi,  1.3 si,  0.0 st
> %Cpu7  :  0.0 us, 10.0 sy,  0.0 ni, 88.8 id,  0.0 wa,  0.0 hi,  1.2 si,  0.0 st
> KiB Mem:  16335308 total, 16115900 used,   219408 free,   184376 buffers
> KiB Swap: 31249404 total,    69976 used, 31179428 free, 12256524 cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
> 25086 root      20   0  166m 1516 1368 S  41.2  0.0   0:01.31 iperf
>    43 root      20   0     0    0    0 S   0.7  0.0   0:04.54 ksoftirqd/7
>    13 root      20   0     0    0    0 S   0.3  0.0   0:04.36 ksoftirqd/1
>    23 root      20   0     0    0    0 S   0.3  0.0   0:04.96 ksoftirqd/3
>  2949 nobody    20   0 12700 2020 1672 S   0.3  0.0   3:36.12 tcpspy
> 14830 root      20   0     0    0    0 S   0.3  0.0   0:01.21 kworker/2:1
>
>
>
>> I normally use a tool such as 'sar' to collect the per-CPU utilization
> Sure.
> I actually turned collectd off again, because I didn't manage to get it to
> resolve time finer than 60 s.
> But even if I could, this puts quite some background load on everything.
> Nevertheless, such a collection daemon for the whole cluster (not just
> the single box) is still on the to-do list.
> Until then, I appreciate the flexibility of top snapshots.
>
>
>> as it could be that you have one CPU core maxed out on 
>> either the receiver or the transmitter and that may only show up as a
>> fractional CPU since other cores would show up as being unused.
> As far as I can tell, in (at least my version of) top the per-process CPU
> figures are scaled per CPU core.
> I remember quite often having figures far above 100 % there for
> multithreaded processes.
> I've cross-checked this on snapshots similar to the above in less
> ambiguous situations.
>
> But let's scrutinize the above data:
>
> Adding up all the sy figures in the single-thread case:
> 2.7 + 6.0 + 3.8 + 0.7 + 3.0 + 4.3 + 4.3 = 24.8 %
> 22.2 % of that is shown for iperf,
> plus some fractional % of nearly idling other stuff.
> For userland, there is only 0.4 % on Cpu7 in the single-thread case.
>
> Compare it to the two-thread case:
> 0.7 + 7.4 + 4.5 + 7.9 + 7.0 + 6.3 + 10.0 = 43.8 % in sy
> 41.2 % of that is for iperf
> 0.3 + 0.4 + 0.4 = 1.1 % in us
>
> Can I conclude from the sy/us ratio that most of the time the CPUs are
> busy with kernel work?
> And in a distributed way, too,
> which is how load balancing is supposed to work.
> So I'd say Linux does a really good job here, doesn't it?

Yeah, it looks like we can rule out a maxed-out CPU on this end at
least.  You should probably check the receiving end as well; either end
can cause TCP to slow the flow down.
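
For a quick per-CPU view on both ends while a test runs, mpstat (from
the same sysstat package as sar) should do, for example:

   # per-CPU utilization, 1-second samples
   mpstat -P ALL 1

   # or record with sar and compare both ends afterwards
   sar -P ALL 1 60 -o /tmp/cpu.sar
   sar -P ALL -f /tmp/cpu.sar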


>> One other possibility would be a socket buffer bottleneck.  By opening a
>> second thread you open a second socket if I am not mistaken, and that
>> could be what is allowing for the increased throughput as more packets
>> are allowed to be in flight.
> Agreed. That's basically what I suspect.
> But shouldn't this go away when I increase all those memory limits in
> /proc/sys/net
> and the TCP window, too?
>
> And why do I encounter this only when sending on the faster machine, not on
> the slower ones?
> If it were a common default setting, I would expect the slower machines
> to run into the same bottleneck at lower rates,
> wouldn't I?
>
> Or is it due to a combination of fast sender and slow receiver?

If the receiver is slower, it is possible that the receiving end requires
more CPU, in which case switching ends would give you better performance.
I would recommend collecting stats on both sides, as the receiver may be
the bottleneck.
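
If you want to test the swapped direction without rewiring anything,
classic iperf can run the reverse test from the same client invocation
(hostnames here are placeholders):

   # on the current receiver
   iperf -s -w 512k

   # on the current sender: -r reruns the test in the reverse
   # direction afterwards; -P 2 reproduces your two-stream case
   iperf -c receiver -w 512k -t 30 -r
   iperf -c receiver -w 512k -t 30 -P 2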

>
>>> If it takes a few µs to look up a soft-IRQ pointer for devices that
>>> share the same hard IRQ, could that hurt performance?
>> I assume the adapters are likely using message signaled interrupts so
>> they shouldn't be limited by the hard wired IRQs assigned to the slots.
>>
> So we can safely consider the IRQ assignment table on p. 37 of the Asus
> manual as irrelevant?

Yes, with MSI/MSI-X the adapters can ignore the hard-wired per-slot IRQ limits.
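
If you want to verify it, the interrupt type is visible from userspace;
07:00.0 below just stands in for one of your NIC functions:

   # MSI interrupts carry a PCI-MSI type tag here
   grep -i msi /proc/interrupts

   # Enable+ on the MSI/MSI-X capability means it is in use
   lspci -vv -s 07:00.0 | grep -i msi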

>>>      [IDT] PES12N3A PCI Express Switch
>> I believe there are IDT switches as a part of your quad port adapters.
>> If I am not mistaken there is a pair of dual port chips that are
>> connected to the IDT PCIe switch.  So what you should see is 3
>> functions, one on the upstream side, and two on the downstream side for
>> each adapter.
> Ah, that sheds some light on the picture.
> So may I ask you for some patience regarding the PCI bus structure issue
> until I've grasped it, please?

No problem.

>
>> This IDT switch is a part of the device listed in d:00.[01] and
>> e:00.[01] below.  The status of 0b:00.0 tells us what the link speed and
>> width is for the connection of the adapter to the motherboard.  There
>> should also be another switch taking up 06:00.0 and 07:02.0 and 07:04.0
>> for the other adapter.  There the status of 06:00.0 should tell you
> I'm still sitting here, comparing two windows with the output of
>
> lspci -tv
> lspci -vv | grep -P "[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]|LnkSta:"
>
> and trying to understand what your writing has taught me so far.
>
> So the whole branch in lspci -t to the left of the NICs is located on the
> NIC card, not on the main board.
> So when I read LnkSta x4 for some Intel 82571EB device, this refers to the
> link
> PES12N3A PCI Express Switch <-> Intel 82571EB
> which is internal to the NIC board.
>
> But hang on, what do the numbers in brackets of the lspci tree view refer to?
> First I thought they would map devices on a subtree to slots as seen on the
> main bus.
> But the longer I look at the listings, the more this seems to contradict
> that.
>
> just start here:
>
> -[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] RD890 PCI to PCI bridge (external gfx0 port B)
>            +-02.0-[01]--+-00.0  NVIDIA Corporation GF100GL [Tesla M2070]
>            |            \-00.1  NVIDIA Corporation GF100 High Definition Audio Controller
>            +-04.0-[02]----00.0  ASMedia Technology Inc. ASM1062 Serial ATA Controller
>
> I don't think that
>        +-04.0-[02]----00.0 ..  Serial ATA
> is on the pathway to
>        +-02.0-[01]--+-00.0 ... [Tesla M2070]
>
> though the matching 02 fooled me at first.

Starting at the root complex, [0000:00] represents the domain and bus
number; the root complex usually starts out at 0 for both.

The numbers after the +- represent the device and function number.  So
for example +-04.0 represents device 4, function 0.  If you want you can
essentially treat those as one thing, since all they represent is a
different function hanging off of the root bus.  So the full description
for the part can be found at 0000:00:04.0 in the lspci -vv trace, if I
am not mistaken.

The number in the brackets after that represents the next bus identifier.
So for example [01] indicates bus 1, with the device and function
numbers following that.  So in this case 0000:01:00.0 is the Tesla
M2070.  The trick to sorting out whether you have enough bus bandwidth is
to take a look at the first function on a given bus, for example
0000:01:00.0, and to compare it to the slot it is connected to,
0000:00:02.0, and see if the link capabilities for the slot are equal to
or greater than the link capabilities for the first function on that bus.
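
As a concrete check for the Tesla example, something like this should
show whether the slot and the card behind it agree:

   # capabilities vs. negotiated status for the slot ...
   lspci -vv -s 0000:00:02.0 | grep -E 'LnkCap:|LnkSta:'
   # ... and for the first function behind it
   lspci -vv -s 0000:01:00.0 | grep -E 'LnkCap:|LnkSta:'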

>
>
> Please, let us come back to the case where we started.
>
>> On Wednesday, 18 March 2015 21:22:52 you wrote:
>>> On 03/18/2015 01:04 PM, Wolfgang Rosner wrote:
>>>> root@cruncher:/cluster/etc/scripts/available# lspci -vvs 00:0a
>>>> 00:0a.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] RD890 PCI to PCI bridge (external gfx1 port A) (prog-if 00 [Normal decode])
> .....
>>>>                  LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+
>> ....
>>
>>> I think this is your problem.  Specifically the 2.5GT/s with a width of
>>> x1 can barely push 1Gb/s.  This slot needs to be at least a x4 if you
>>> want to push anything more than 1Gb/s.
>>>
>>> - Alex
>
> The assumption I made then was that the '00:0a.0' in this line
>
> 00:0a.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] RD890 PCI to PCI bridge (external gfx1 port A)
>
> refers to the '0a.0' in the tree:
>
>            +-0a.0-[05-08]----00.0-[06-08]--+-00.0-[07]--+-00.0
>            |                               |            \-00.1
>            |                               \-01.0-[08]--+-00.0
>            |                                            \-00.1
>
> Is this correct?
> Does this mean that in this case the IDT switch of the NIC was connected to
> - (external gfx1 port A), located at 00:0a.0
> - part of the north bridge, named RD890?

Yes, that should be correct.  So if you look at 05:00.0, which will be
the IDT switch upstream port, and 00:0a.0, which will be the downstream
port, they should show the same link status, as they are connected to
each other.  In addition, 00:0a.0 should have a secondary and subordinate
set of bus numbers that cover buses 05 - 08.
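
Both of those are easy to read off directly; the Bus: line of a bridge
lists its primary, secondary, and subordinate bus numbers:

   lspci -vv -s 00:0a.0 | grep -E 'Bus:|LnkSta:'
   lspci -vv -s 05:00.0 | grep 'LnkSta:'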

>
>> Looking things over it seems like you probably have about as much PCIe
>> throughput as you are going to get out of this system.  The layout you
>> currently have seems to be x16 for the Tesla M2070, x8 for the Quadro
>> 2000, x8 for one adapter, and x4 for the other with the x1 empty.  So if
>> 16/8/8/4/1 is all the board can support then at this point I think you
>> have probably squeezed all of the PCIe devices in there as best as
>> possible.  Not much left to explore in terms of hardware.
> I begin to understand.
> But this only worked by moving the video adapter (NVIDIA Quadro 2000)
> from PCIEx16_3 to PCIEx16_4.
>
> I think the core issue is that the Sabertooth board is designed as a
> high-performance gamer gadget.
> With dual graphics cards in PCIEx16_1 and PCIEx16_3,
> this is recognized as "Dual VGA/PCIe Card" mode (see the manual) and
> bandwidth preference is unconditionally assigned to the video cards.
> The NICs are starved of bandwidth.
>
> --------------------------
>
> But things are rolling on (sorry ...)
> While my slower blades perform fine on bare iperf tests, they run into a
> CPU limit when I do chained tests like this:
>       iperf -> netcat | tee | netcat -> iperf
>
> So I decided to go to eBay again and give InfiniBand a try.
> So I have to plug an InfiniBand adapter into my gateway, too.
> So I need a slot with x8 PCIe lanes.....
>
> I can only hope that, with no video cards in the preferred ports,
> the board is smart enough to assign the remaining PCIe lanes according to
> the cards' needs.
>
> If not, my next hope is Asus support.
> I rephrased my question there and have put you on CC.
> But I don't know how Linux-savvy the people at Asus are.
>
> What if not?
> I could look for a pcieport developers list.
> Or RTFS as the last resort....

I wouldn't expect the PCIe drivers to do much of anything to improve the
situation for you.  The limits of the PCIe slots are what they are.  You
can take a look at the link capabilities in the lspci dumps; they list
the maximum speed and width supported per slot.  From what I can tell
your board supports PCIe gen 2 and a 16/8/8/4/1 layout for the PCIe
slots, which if I am not mistaken is what your documentation said it can
support.  This may just be the limit of the board, and you would likely
need to replace the board if you want more PCIe bandwidth.
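
If you want to survey that in one pass, a small filter over the lspci
dump should list each device together with its link capability line
(an untested sketch; run as root so the capabilities are readable):

   # print each device header followed by its LnkCap line
   sudo lspci -vv | awk '/^[0-9a-f]/ { dev = $0 }
                         /LnkCap:/  { print dev; print $0 }'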

- Alex
