On Thu, 4 Feb 2010 15:01:32 -0500 Hal Rosenstock <[email protected]> wrote:

> On Thu, Feb 4, 2010 at 1:00 PM, Ira Weiny <[email protected]> wrote:
> > On Thu, 4 Feb 2010 09:19:39 -0500
> > Hal Rosenstock <[email protected]> wrote:
> >
> >> On Tue, Feb 2, 2010 at 7:45 PM, Ira Weiny <[email protected]> wrote:
> >> > Sasha,
> >> > [snip]
> >> >
> >> > real 0m2.249s
> >> > user 0m1.244s
> >> > sys 0m0.936s
> >> >
> >> > 14:40:59 > time ./ibnetdiscover -o 4 --node-name-map /etc/opensm/ib-node-name-map -g > new
> >> >
> >> > real 0m2.170s
> >> > user 0m1.160s
> >> > sys 0m0.933s
> >> >
> >> > 14:41:10 > /usr/sbin/ibqueryerrors -s RcvErrors,SymbolErrors,RcvSwRelayErrors,XmtWait -r --data
> >> > Suppressing: RcvErrors SymbolErrors RcvSwRelayErrors XmtWait
> >> > Errors for 0x66a00d90006fb "SW19"
> >> > GUID 0x66a00d90006fb port 9: [VL15Dropped == 3] [XmtData == 25187379] [RcvData == 25196688] [XmtPkts == 349861] [RcvPkts == 349954]
> >> > Link info: 139 9[ ] ==( 4X 5.0 Gbps Active/ LinkUp)==> 0x0002c9030001d736 864 1[ ] "hyperion1" ( )
> >> >
> >> > Note that there were no additional VL15Dropped packets on the fabric. I
> >> > think 4 seems to be a good compromise. I have not tested when there are
> >> > errors on the fabric. (Right now things seem to be good!)
> >>
> >> Is this just with the SM doing light sweeping ?
> >
> > Yes.
>
> That's not a lot of SMP stress from the SM side. SMP consumers are SM,
> diags, and the unsolicited traps.

Agreed. I hope to test this more next week.

> >
> >>
> >> Is there a speedup with 4 rather than 2 ?
> >
> > There is a bit of a speed up (~0.5 to 1.0 sec). But my main reason to
> > want to go to 4 is that if there are issues on the fabric, unresponsive
> > nodes etc.; 4 will give us better parallelism to get around these
> > issues. I have not had the chance to test this condition with the new
> > algorithm but the original ibnetdiscover would slow way down when there
> > are nodes which have unresponsive SMA's.
> > If there are only 2 outstanding this will not give us much speed up.
> > This was the main motivation I had for improving the library in this way.
> >
> > Also, I think you are correct that we should increase OpenSM's default
> > from 4 to 8. For the same reason as above. Some of our clusters have
> > worked better with 8 when we are having issues. But right now we are
> > still running with 4.
>
> I'm concerned about just increasing ibnetdiscover to 4 rather than 2.
> I've seen a number of clusters with SMP dropping with the current
> lower defaults.

So OpenSM is seeing dropped packets? With 4 SMP's on the wire? I do see
some VL15Dropped errors (maybe 2-3 a day) but I did not think that would be
an issue. What kind of rate are you seeing?

The other question is; do people regularly run the tools which are using
libibnetdisc (ibqueryerrors, iblinkinfo, ibnetdiscover)? We do. If others
are not then I would say this change would have less impact as they would
want the diags to have some priority for debugging.

The other option is to change the patch to be a default of 2 and allow user
to change it depending on what they are trying to do. If you think that is
best I will change the patch.

Ira

>
> -- Hal
>
> > Ira
> >
> >>
> >> -- Hal
> >>
> >> >
> >> > The first patch converts the algorithm and the second adds the
> >> > ibnd_set_max_smps_on_wire call.
> >> >
> >> > Let me know what you think. Because the algorithm changed so much
> >> > testing this is a bit difficult because the order of the node discovery
> >> > is different. However, I have done some extensive diffing of the output
> >> > of ibnetdiscover and things look good.
> >> >
> >> > Ira

--
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
925-423-8008
[email protected]
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html
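
The parallelism argument in this thread (more outstanding SMPs help most when some SMAs are unresponsive) can be illustrated with a toy model. This is only a sketch under stated assumptions, not libibnetdisc code: the function name `discovery_time`, the cost units, and the sample numbers are all hypothetical. It treats discovery as a flat list of independent queries, whereas real discovery is a graph walk in which later queries depend on earlier replies. It also ignores the VL15 drops that a larger window might cause, which is the concern raised on the other side of the discussion.

```python
import heapq


def discovery_time(costs, max_outstanding):
    """Time to finish all queries with at most max_outstanding in flight.

    costs: per-query completion time in arbitrary units; an unresponsive
    node is modelled as a query that costs a full timeout (assumption).
    Queries are issued in list order to whichever slot frees up first.
    """
    # Each heap entry is the time at which one in-flight slot becomes free.
    slots = [0] * max_outstanding
    heapq.heapify(slots)
    finish = 0
    for cost in costs:
        start = heapq.heappop(slots)  # earliest slot to become free
        end = start + cost
        finish = max(finish, end)
        heapq.heappush(slots, end)
    return finish


if __name__ == "__main__":
    # Hypothetical workload: two unresponsive nodes (cost 10) queried
    # first, followed by four responsive nodes (cost 1 each).
    costs = [10, 10, 1, 1, 1, 1]
    for window in (1, 2, 4):
        print(window, discovery_time(costs, window))
    # prints:
    # 1 24
    # 2 12
    # 4 10
```

In this model a window of 2 is fully occupied by the two timeouts, so the responsive nodes wait behind them, while a window of 4 lets them complete during the timeout. Whether the same holds on a real fabric depends on factors the model leaves out.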
