On Fri, 2005-10-14 at 16:22, Brett Bode wrote:
> On Oct 14, 2005, at 2:27 PM, Troy Benjegerdes wrote:
>
> > From: Hal Rosenstock <[EMAIL PROTECTED]>
> > Date: October 14, 2005 12:41:13 PM CDT
> > To: Troy Benjegerdes <[EMAIL PROTECTED]>
> > Cc: IBMEHCA DD <[EMAIL PROTECTED]>, [email protected]
> > Subject: Re: [openib-general] Re: IBM eHCA testing..
> >
> > On Fri, 2005-10-14 at 12:08, Troy Benjegerdes wrote:
> >> Hal Rosenstock wrote:
> >>
> >>> On Thu, 2005-10-13 at 18:46, Troy Benjegerdes wrote:
> >>>
> >>>> I'm also attaching part of an opensm log file.
> >>>>
> >>>> (the full copy is at http://scl.ameslab.gov/~troy/osm-ehca.log )
> >>>>
> >>>> The IBM galaxy adapters are at:
> >>>> Initial path: [0][1][16]
> >>>> Initial path: [0][1][13]
> >>>
> >>> The OpenSM is just saying that a SMP transaction it issued (in this
> >>> case, SM Get P_KeyTable) is timing out (no response made it back to
> >>> OpenSM).
> >>>
> >>> BTW, what svn rev is OpenSM up to ?
> >>>
> >>> -- Hal
> >>
> >> So, how about a patch to opensm to report what svn rev it was built
> >> from ;)
> >
> > Can you do svn info in the userspace/management/osm directory ?
>
> Path: .
> URL: https://openib.org/svn/gen2/trunk/src/linux-kernel/infiniband
> Repository UUID: 21a7a0b7-18d7-0310-8e21-e8b31bdbf5cd
> Revision: 3493
> Node Kind: directory
> Schedule: normal
> Last Changed Author: roland
> Last Changed Rev: 3487
> Last Changed Date: 2005-09-19 17:59:27 -0500 (Mon, 19 Sep 2005)
> Properties Last Updated: 2005-02-15 16:24:20 -0600 (Tue, 15 Feb 2005)
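
For anyone comparing revisions across several working copies, here is a minimal Python sketch that pulls the fields out of `svn info` output like the one quoted above. The function name `parse_svn_info` is hypothetical, and the sketch assumes the plain `Key: value` layout shown in the quoted output (optionally prefixed by email quote markers).

```python
def parse_svn_info(text):
    """Return a dict of the 'Key: value' fields found in `svn info` output."""
    fields = {}
    for line in text.splitlines():
        # Drop leading email quote markers such as "> " or "> > ".
        line = line.lstrip("> ").strip()
        # Split on the first ": " only, so URLs and dates stay intact.
        key, sep, value = line.partition(": ")
        if sep:
            fields[key] = value
    return fields


# Sample taken from the output quoted in this thread.
sample = """\
Path: .
URL: https://openib.org/svn/gen2/trunk/src/linux-kernel/infiniband
Revision: 3493
Last Changed Rev: 3487
"""

info = parse_svn_info(sample)
print(info["URL"])
print(int(info["Revision"]), int(info["Last Changed Rev"]))
```

Printing the URL next to the revision numbers shows which part of the tree a given working copy tracks, which helps when several directories are checked out separately.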

If you update and rebuild OpenSM, you will get rid of messages like:

Oct 13 10:35:38 366848 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x8:0x0] for guid:0x0002c90108ccc571.
Oct 13 10:35:38 366866 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x13:0x1] for guid:0x0002550000039e80.
Oct 13 10:35:38 366880 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x5:0x0] for guid:0x00066a00a0000441.
Oct 13 10:35:38 366894 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x10:0x1] for guid:0x0002c90108cd0b71.
Oct 13 10:35:38 366907 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x11:0x1] for guid:0x00066a00a000044e.
Oct 13 10:35:38 366921 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x14:0x1] for guid:0x0002550000038500.
Oct 13 10:35:38 366934 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x9:0x0] for guid:0x0002c90200402782.
Oct 13 10:35:38 366948 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0xa:0x0] for guid:0x0002c90108cd98c1.
Oct 13 10:35:38 366961 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0xd:0x0] for guid:0x0002c90108cd84a1.
Oct 13 10:35:38 366975 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0xe:0x0] for guid:0x0002c90200402917.
Oct 13 10:35:38 366988 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x1:0x0] for guid:0x0002c90200402781.
Oct 13 10:35:38 367001 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0xb:0x0] for guid:0x0002c90108cd9bd1.
Oct 13 10:35:38 367015 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x15:0x1] for guid:0x0002550000038580.
Oct 13 10:35:38 367028 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x2:0x0] for guid:0x0002c90200402915.
Oct 13 10:35:38 367042 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x6:0x0] for guid:0x00066a00a0000444.
Oct 13 10:35:38 367055 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x4:0x0] for guid:0x00066a00a000043c.
Oct 13 10:35:38 367068 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0xc:0x0] for guid:0x0002c90108cd85f1.
Oct 13 10:35:38 367082 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x7:0x0] for guid:0x00066a00a0000458.
Oct 13 10:35:38 367095 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x3:0x0] for guid:0x0002c900001cee10.

I don't think this causes any harm though. There are some other fixes
you will pick up.

> >> I just discovered another problem.. We have been running pvfs2 over
> >> IPoIB on the same subnet, and in debugging this, I restarted opensm
> >> several times, and somewhere in the stack a PVFS2 write failed. I
> >> wouldn't think that a short downtime of the SM from restarting it
> >> would cause any IPoIB TCP sessions to fall over..
> >
> > As Fab indicated, there are a number of places where the SM/SA is
> > needed:
> > 1. SA PathRecords (used when a path to a new IP end node is needed or
> >    an existing one times out)
> > 2. SA MCMemberRecord joins, queries, and leaves (used when an interface
> >    is up'ed, down'ed, etc.)
> >
> > Is this on an existing TCP session ? Is it OpenIB IPoIB clients at each
> > end ? What svn version is being used for this ?
> >
> > -- Hal
>
> It looks like each client node maintains an open TCP stream to each of
> the servers. pvfs2 appears to not be very robust to failure. However
> the pvfs2 folks just released a new version which changes their network
> protocol somewhat. I plan to get the new version installed next week
> and will see if it handles things a bit more robustly.

Is this running on top of OpenIB IPoIB ? If so, what svn version for
IPoIB ? Is it the same as OpenSM (3487) ?
If so, that should be recent enough and should contain the SA
reregistration fix for IPoIB.

cd linux-kernel/infiniband/ulp/ipoib/
svn info

-- Hal
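
To check whether the ERR 0312 messages quoted earlier in this message are still present after an update, a small Python sketch along these lines could summarize them from a log. The helper name `illegal_lid_ranges` is hypothetical, and the pattern assumes the message layout matches the quoted excerpt exactly. The two bracketed values are returned as printed, without interpreting them.

```python
import re

# Pattern for ERR 0312 entries as they appear in the quoted log excerpt.
ERR_0312 = re.compile(
    r"ERR 0312: Illegal LID range \[(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+)\] "
    r"for guid:(0x[0-9a-fA-F]+)"
)


def illegal_lid_ranges(log_text):
    """Return (first, second, guid) for each ERR 0312 entry in log_text."""
    return [
        (int(first, 16), int(second, 16), guid)
        for first, second, guid in ERR_0312.findall(log_text)
    ]


# Two entries copied from the log excerpt in this message.
sample = (
    "Oct 13 10:35:38 366848 [AB448A30] -> __osm_lid_mgr_validate_db: "
    "ERR 0312: Illegal LID range [0x8:0x0] for guid:0x0002c90108ccc571.\n"
    "Oct 13 10:35:38 366866 [AB448A30] -> __osm_lid_mgr_validate_db: "
    "ERR 0312: Illegal LID range [0x13:0x1] for guid:0x0002550000039e80.\n"
)

for first, second, guid in illegal_lid_ranges(sample):
    print(f"{guid}: [{first:#x}:{second:#x}]")
```

An empty result after rebuilding would suggest the messages are gone, under the assumption that the log format has not changed between revisions.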
