On Sunday, 22 March 2015 07:38:43, you wrote:
> On 03/21/2015 12:38 PM, Wolfgang Rosner wrote:

> > for the first case (one thread)
> >   PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
> > 24561 root      20   0 96700 1492 1340 S  22.2  0.0   0:00.80 iperf
> >
> > for the second case (two threads)
> >   PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
> > 25086 root      20   0  166m 1516 1368 S  41.2  0.0   0:01.31 iperf
> >
> > So it's not a CPU bottleneck.
>
> You might want to also check the per-CPU statistics when running your
> test. 

Sorry, there is always this tradeoff between information and clutter.
Here is the whole story.
I hope you have set your mailer to a monospace font (I'm just looking at
mine...)


top - 18:46:27 up 1 day, 22:25,  8 users,  load average: 0.13, 0.11, 0.14
Tasks: 266 total,   1 running, 265 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  2.7 sy,  0.0 ni, 97.0 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
%Cpu2  :  0.0 us,  6.0 sy,  0.0 ni, 93.2 id,  0.0 wa,  0.0 hi,  0.8 si,  0.0 st
%Cpu3  :  0.0 us,  3.8 sy,  0.0 ni, 96.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  :  0.0 us,  0.7 sy,  0.0 ni, 99.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  :  0.0 us,  3.0 sy,  0.0 ni, 96.2 id,  0.8 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu6  :  0.0 us,  4.3 sy,  0.0 ni, 95.3 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
%Cpu7  :  0.4 us,  4.3 sy,  0.0 ni, 95.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  16335308 total, 16105428 used,   229880 free,   184344 buffers
KiB Swap: 31249404 total,    69976 used, 31179428 free, 12255816 cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
24561 root      20   0 96700 1492 1340 S  22.2  0.0   0:00.80 iperf
    7 root      20   0     0    0    0 S   0.3  0.0   1:27.50 rcu_sched
   18 root      20   0     0    0    0 S   0.3  0.0   0:07.88 ksoftirqd/2
   23 root      20   0     0    0    0 S   0.3  0.0   0:04.94 ksoftirqd/3
   33 root      20   0     0    0    0 S   0.3  0.0   0:04.87 ksoftirqd/5



top - 18:47:24 up 1 day, 22:26,  8 users,  load average: 0.25, 0.15, 0.15
Tasks: 269 total,   2 running, 267 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.3 us,  0.7 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.4 us,  7.4 sy,  0.0 ni, 90.9 id,  0.0 wa,  0.0 hi,  1.2 si,  0.0 st
%Cpu2  :  0.0 us,  4.5 sy,  0.0 ni, 92.6 id,  0.0 wa,  0.0 hi,  2.9 si,  0.0 st
%Cpu3  :  0.0 us,  7.9 sy,  0.0 ni, 90.9 id,  0.0 wa,  0.0 hi,  1.2 si,  0.0 st
%Cpu4  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  :  0.4 us,  7.0 sy,  0.0 ni, 91.4 id,  0.0 wa,  0.0 hi,  1.2 si,  0.0 st
%Cpu6  :  0.0 us,  6.3 sy,  0.0 ni, 92.4 id,  0.0 wa,  0.0 hi,  1.3 si,  0.0 st
%Cpu7  :  0.0 us, 10.0 sy,  0.0 ni, 88.8 id,  0.0 wa,  0.0 hi,  1.2 si,  0.0 st
KiB Mem:  16335308 total, 16115900 used,   219408 free,   184376 buffers
KiB Swap: 31249404 total,    69976 used, 31179428 free, 12256524 cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
25086 root      20   0  166m 1516 1368 S  41.2  0.0   0:01.31 iperf
   43 root      20   0     0    0    0 S   0.7  0.0   0:04.54 ksoftirqd/7
   13 root      20   0     0    0    0 S   0.3  0.0   0:04.36 ksoftirqd/1
   23 root      20   0     0    0    0 S   0.3  0.0   0:04.96 ksoftirqd/3
 2949 nobody    20   0 12700 2020 1672 S   0.3  0.0   3:36.12 tcpspy
14830 root      20   0     0    0    0 S   0.3  0.0   0:01.21 kworker/2:1



> I normally use a tool such as 'sar' to collect the per-CPU  utilization

Sure.
I actually turned collectd off again, because I didn't manage to get it to
resolve time finer than 60 s.
But even if I could, it puts quite some background load on everything.
Nevertheless, that kind of collection daemon for the whole cluster (not just the
single box) is still on the to-do list.
Until then, I appreciate the flexibility of top snapshots.
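
In the meantime, something like this would probably do for quick per-CPU 
snapshots (untested sketch; sar and mpstat come from the sysstat package, 
which I would have to install first):

    # per-CPU utilisation, one sample per second for the length of an iperf run
    sar -P ALL 1 30 > sar-percpu.log
    # or, much the same thing:
    mpstat -P ALL 1 30 > mpstat-percpu.log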


> as it could be that you have one CPU core maxed out on 
> either the receiver or the transmitter and that may only show up as a
> fractional CPU since other cores would show up as being unused.

As far as I can tell, in (at least my version of) top the per-process CPU
figures are scaled per CPU core.
I remember quite often having seen figures far above 100 % there for
multithreaded processes.
I've cross-checked this on snapshots similar to the above in less ambiguous
situations.

But let's scrutinize the above data:

Adding up all sy figures in the single-thread case:
2.7 + 6.0 + 3.8 + 0.7 + 3.0 + 4.3 + 4.3 = 24.8 %
22.2 % of that is shown for iperf,
plus some fractional % for nearly idling other stuff.
For userland, there is only 0.4 % on Cpu7 in the single-thread case.

Compare that to the two-thread case:
0.7 + 7.4 + 4.5 + 7.9 + 7.0 + 6.3 + 10.0 = 43.8 % in sy,
of which 41.2 % is for iperf;
0.3 + 0.4 + 0.4 = 1.1 % in us.

Can I conclude from the sy/us ratio that most of the time the CPUs are busy with
kernel work?
And that the work is spread across the cores, the way load balancing is supposed
to work?
So I'd say Linux does a really good job here, doesn't it?
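
For reference, those sums can be pulled straight out of a saved top snapshot 
with something like this (rough sketch; the field positions match the top 
output above, but may differ for other top versions, and it assumes top is 
configured to show the individual %CpuN lines as in my snapshots):

    # sum the us and sy columns over all per-CPU lines of a batch-mode snapshot
    top -b -n 1 > top-snapshot.log
    grep '^%Cpu[0-9]' top-snapshot.log | \
        awk '{ us += $3; sy += $5 } END { printf "us: %.1f %%  sy: %.1f %%\n", us, sy }'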


> One other possibility would be a socket buffer bottleneck.  By opening a
> second thread you open a second socket if I am not mistaken, and that
> could be what is allowing for the increased throughput as more packets
> are allowed to be in flight.

Agreed. That's basically what I suspect.
But shouldn't this go away when I increase all those memory limits in
/proc/sys/net
and the TCP window as well?
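
Just so we are talking about the same knobs, these are the kind of settings I 
mean (example values only, not a claim that they are optimal here, and 
<receiver> is just a placeholder for the target host):

    # raise the socket buffer ceilings
    sysctl -w net.core.rmem_max=16777216
    sysctl -w net.core.wmem_max=16777216
    sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
    sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"

    # and ask iperf for a large window explicitly on both ends
    iperf -s -w 4M
    iperf -c <receiver> -w 4M -t 30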

And why do I encounter this only when sending from the faster machine, not from
the slower ones?
If it were a common default setting, I would expect the slower machines to run
into the same bottleneck at even lower rates,
wouldn't I?

Or is it due to the combination of a fast sender and a slow receiver?

> > If it were that it takes some µsec to look up some soft IRQ pointer that
> > share the same hard IRQ, this may hurt performance??
>
> I assume the adapters are likely using message signaled interrupts so
> they shouldn't be limited by the hard wired IRQs assigned to the slots.
>

So we can safely consider the IRQ assignment table on page 37 of the Asus manual
irrelevant?
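
(If I wanted to double-check that, I suppose something like the following would 
show it; the grep patterns are a guess and <nic-bdf> is a placeholder for one 
of the NIC functions:)

    # does the port expose and enable MSI / MSI-X?
    lspci -vv -s <nic-bdf> | grep -i msi
    # MSI-X vectors in use show up as separate per-queue lines here
    grep -i eth /proc/interrupts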


> >      [IDT] PES12N3A PCI Express Switch
>
> I believe there are IDT switches as a part of your quad port adapters.
> If I am not mistaken there is a pair of dual port chips that are
> connected to the IDT PCIe switch.  So what you should see is 3
> functions, one on the upstream side, and two on the downstream side for
> each adapter.

Ah, that sheds some light on the picture.
So may I ask you for some patience on the PCI bus structure issue until I have
grasped that, please?

> This IDT switch is a part of the device listed in d:00.[01] and
> e:00.[01] below.  The status of 0b:00.0 tells us what the link speed and
> width is for the connection of the adapter to the motherboard.  There
> should also be another switch taking up 06:00.0 and 07:02.0 and 07:04.0
> for the other adapter.  There the status of 06:00.0 should tell you

I'm still sitting here, comparing two windows with the output of

lspci -tv
lspci -vv | grep -P "[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]|LnkSta:"

and trying to understand what your explanations have taught me so far.

So the whole branch in lspci -t to the left of the NICs is located on the NIC
card, not on the main board.
So when I read 'LnkSta x4' for some Intel 82571EB device, this relates to the
connection
PES12N3A PCI Express Switch <-> Intel 82571EB
which is internal to the NIC board.

But hang on, what do the numbers in brackets in the lspci tree view refer to?
First I thought they would map devices in a subtree to slots as seen on the
main bus.
But the longer I look at the listings, the more contradictions I find.

just start here:

-[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] RD890 PCI to PCI bridge (external gfx0 port B)
           +-02.0-[01]--+-00.0  NVIDIA Corporation GF100GL [Tesla M2070]
           |            \-00.1  NVIDIA Corporation GF100 High Definition Audio Controller
           +-04.0-[02]----00.0  ASMedia Technology Inc. ASM1062 Serial ATA Controller

I don't think that
         +-04.0-[02]----00.0 ..  Serial ATA
is on the pathway to
         +-02.0-[01]--+-00.0 ... [Tesla M2070]

even though the matching 02 fooled me at first.
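
If my reading of the lspci man page is right, the bracketed numbers are the 
secondary bus numbers (or bus ranges) behind each bridge, not slot numbers. 
I could verify that by listing a bus directly:

    # list what really hangs off a given bus number
    lspci -s 01:          # should show the Tesla and its audio function
    lspci -s 02:          # should show the ASMedia SATA controller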


Please, let us come back to the case where we started.

> On Wednesday, 18 March 2015 21:22:52, you wrote:
> > On 03/18/2015 01:04 PM, Wolfgang Rosner wrote:
> > > root@cruncher:/cluster/etc/scripts/available# lspci -vvs 00:0a
> > > 00:0a.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] RD890 PCI to 
> > > PCI 
> > > bridge (external gfx1 port A) (prog-if 00 [Normal decode])
.....
> > >                  LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train-
> > > SlotClk+
>
> ....
>
> > I think this is your problem.  Specifically the 2.5GT/s with a width of
> > x1 can barely push 1Gb/s.  This slot needs to be at least a x4 if you
> > want to push anything more than 1Gb/s.
> >
> > - Alex


The assumption I made then was that the '00:0a.0' in this line

00:0a.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] RD890 PCI to PCI bridge (external gfx1 port A)

refers to the '0a.0' in the tree:

           +-0a.0-[05-08]----00.0-[06-08]--+-00.0-[07]--+-00.0
           |                               |            \-00.1
           |                               \-01.0-[08]--+-00.0
           |                                            \-00.1

Is this correct?
Does this mean that in this case the IDT switch of the NIC was connected to
- (external gfx1 port A), located at 00:0a.0,
- which is part of the north bridge, named RD890?
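
If that mapping is right, I suppose comparing the link capability with the 
trained link on both ends of that connection would confirm it, along these 
lines:

    # what the RD890 bridge port reports for that link
    lspci -vv -s 00:0a.0 | grep -E 'LnkCap|LnkSta'
    # and the upstream port of the IDT switch behind it
    # (bus 05, device 00.0, if I read the tree excerpt above correctly)
    lspci -vv -s 05:00.0 | grep -E 'LnkCap|LnkSta'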

>
> Looking things over it seems like you probably have about as much PCIe
> throughput as you are going to get out of this system.  The layout you
> currently have seems to be x16 for the Tesla M2070, x8 for the Quadro
> 2000, x8 for one adapter, and x4 for the other with the x1 empty.  So if
> 16/8/8/4/1 is all the board can support then at this point I think you
> have probably squeezed all of the PCIe devices in their as best as
> possible.  Not much left to explore in terms of hardware.

I begin to understand.
But this only worked by moving the video adapter (Nvidia Quadro 2000)
from slot PCIEx16_3 to PCIEx16_4.

I think the core issue is that the Sabertooth board is designed as a
high-performance gamer gadget.
With dual graphics cards in PCIEx16_1 and PCIEx16_3,
this is recognized as a "Dual VGA/PCIe Card" setup (see the manual), and
any bandwidth preference is unconditionally given to the video cards.
The NICs are starved of bandwidth.

--------------------------

But things are rolling on (sorry ...)
While my slower blades perform fine in bare iperf tests, they run into a CPU
limit when I do
chained tests like this:
        iperf -> netcat | tee | netcat -> iperf
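
Spelled out, the chain looks roughly like this (hostnames, the port and the tee 
target are placeholders, the exact options varied between runs, and the -l/-p 
syntax depends on the netcat flavour):

    # on the relay blade: receive the stream, duplicate it, forward it
    nc -l -p 9000 | tee /dev/null | nc next-blade 9000

    # on the sending blade: push a test stream into the chain
    iperf -c relay-blade -p 9000 -t 30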

So I decided to go to eBay again and give InfiniBand a try.
That means I have to plug an InfiniBand adapter into my gateway, too,
so I need a slot with x8 PCIe lanes.....

I can only hope that, without a video card in the preferred port,
the board is smart enough to assign the remaining PCIe lanes according to the
cards' needs.

If not, my next hope is Asus support.
I rephrased my question there and have put you on CC.
But I don't know how Linux-savvy the people at Asus are.

What if that fails too?
I could look for a pcieport developers' list.
Or RTFS as a last resort....

>
> - Alex

-- 
Wolfgang Rosner
