On Thu, 4 Feb 2010 15:01:32 -0500
Hal Rosenstock <[email protected]> wrote:

> On Thu, Feb 4, 2010 at 1:00 PM, Ira Weiny <[email protected]> wrote:
> > On Thu, 4 Feb 2010 09:19:39 -0500
> > Hal Rosenstock <[email protected]> wrote:
> >
> >> On Tue, Feb 2, 2010 at 7:45 PM, Ira Weiny <[email protected]> wrote:
> >> > Sasha,
> >> >

[snip]

> >> >
> >> > real    0m2.249s
> >> > user    0m1.244s
> >> > sys     0m0.936s
> >> >
> >> > 14:40:59 > time ./ibnetdiscover -o 4 --node-name-map /etc/opensm/ib-node-name-map -g > new
> >> >
> >> > real    0m2.170s
> >> > user    0m1.160s
> >> > sys     0m0.933s
> >> >
> >> > 14:41:10 > /usr/sbin/ibqueryerrors -s RcvErrors,SymbolErrors,RcvSwRelayErrors,XmtWait -r --data
> >> > Suppressing: RcvErrors SymbolErrors RcvSwRelayErrors XmtWait
> >> > Errors for 0x66a00d90006fb "SW19"
> >> >   GUID 0x66a00d90006fb port 9: [VL15Dropped == 3] [XmtData == 25187379] [RcvData == 25196688] [XmtPkts == 349861] [RcvPkts == 349954]
> >> >       Link info:    139   9[  ] ==( 4X 5.0 Gbps Active/  LinkUp)==>  0x0002c9030001d736    864    1[  ] "hyperion1" ( )
> >> >
> >> > Note that there were no additional VL15Dropped packets on the fabric.  I 
> >> > think 4 seems to be a good compromise.  I have not tested when there are 
> >> > errors on the fabric.  (Right now things seem to be good!)
> >>
> >> Is this just with the SM doing light sweeping ?
> >
> > Yes.
> 
> That's not a lot of SMP stress from the SM side. SMP consumers are SM,
> diags, and the unsolicited traps.

Agreed.  I hope to test this more next week.

> 
> >
> >>
> >> Is there a speedup with 4 rather than 2 ?
> >
> > There is a bit of a speed up (~0.5 to 1.0 sec).  But my main reason to want
> > to go to 4 is that if there are issues on the fabric, unresponsive nodes
> > etc.; 4 will give us better parallelism to get around these issues.  I have
> > not had the chance to test this condition with the new algorithm but the
> > original ibnetdiscover would slow way down when there are nodes which have
> > unresponsive SMA's.  If there are only 2 outstanding this will not give us
> > much speed up.  This was the main motivation I had for improving the library
> > in this way.
> >
> > Also, I think you are correct that we should increase OpenSM's default from
> > 4 to 8.  For the same reason as above.  Some of our clusters have worked
> > better with 8 when we are having issues.  But right now we are still running
> > with 4.
> 
> I'm concerned about just increasing ibnetdiscover to 4 rather than 2.
> I've seen a number of clusters with SMP dropping with the current
> lower defaults.

So OpenSM is seeing dropped packets?  With 4 SMP's on the wire?  I do see some
VL15Dropped errors (maybe 2-3 a day) but I did not think that would be an
issue.  What kind of rate are you seeing?

The other question is: do people regularly run the tools which use
libibnetdisc (ibqueryerrors, iblinkinfo, ibnetdiscover)?  We do.  If others
do not, then I would say this change would have less impact, as they would want
the diags to have some priority for debugging.  The other option is to change
the patch to default to 2 and allow the user to change it depending on what
they are trying to do.  If you think that is best I will change the patch.
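
To make the parallelism argument concrete, here is a toy model of a sweep with
a bounded number of outstanding queries.  This is only a sketch, not
libibnetdisc code: the function name sweep_time, the parameter
max_outstanding, and the latency numbers are all made up for illustration, and
it ignores dropped packets and retries entirely.  It only shows how a few slow
(timing out) nodes hold up a small window more than a larger one.

```python
import heapq


def sweep_time(latencies, max_outstanding):
    """Time to finish all queries with at most max_outstanding in flight.

    latencies: per-query completion time (a large value models a timeout
    on an unresponsive node).  Queries are issued in list order; a new one
    is sent as soon as a slot in the window frees up.
    """
    if max_outstanding < 1:
        raise ValueError("max_outstanding must be at least 1")

    in_flight = []  # min-heap of completion times of outstanding queries
    now = 0
    for latency in latencies:
        if len(in_flight) == max_outstanding:
            # Window is full: wait for the earliest outstanding query.
            now = heapq.heappop(in_flight)
        heapq.heappush(in_flight, now + latency)

    # The sweep ends when the last outstanding query completes.
    return max(in_flight, default=0)


# Hypothetical fabric: 4 unresponsive nodes that each cost a 100-unit
# timeout, followed by 200 responsive nodes at 1 unit each.
latencies = [100] * 4 + [1] * 200
for window in (2, 4, 8):
    print(window, sweep_time(latencies, window))
```

With a window of 2, two slow nodes are enough to stall the whole sweep until
they time out; a larger window lets the responsive nodes keep making progress
in the meantime.  The model says nothing about the extra VL15 load, which is
the other side of the trade-off being discussed.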

Ira

> 
> -- Hal
> 
> > Ira
> >
> >>
> >> -- Hal
> >>
> >> >
> >> > The first patch converts the algorithm and the second adds the 
> >> > ibnd_set_max_smps_on_wire call.
> >> >
> >> > Let me know what you think.  Because the algorithm changed so much 
> >> > testing this is a bit difficult because the order of the node discovery 
> >> > is different.  However, I have done some extensive diffing of the output 
> >> > of ibnetdiscover and things look good.
> >> >
> >> > Ira
> >> >
> >> > --
> >> > Ira Weiny
> >> > Math Programmer/Computer Scientist
> >> > Lawrence Livermore National Lab
> >> > 925-423-8008
> >> > [email protected]
> >> > --
> >> > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> >> > the body of a message to [email protected]
> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >
> >>
> >
> >
> >
> 


-- 
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
925-423-8008
[email protected]
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
