Hi Jim,

On Wed, 2008-06-04 at 15:49 -0600, Jim Schutt wrote:
> Hi Hal,
>
> On Wed, 2008-06-04 at 14:40 -0600, Hal Rosenstock wrote:
> > Hi Jim,
> >
> > On Wed, 2008-06-04 at 13:59 -0600, Jim Schutt wrote:
> > > Hi Hal,
> > >
> > > I've just discovered that what I thought was ofed-1.2 opensm
> > > is really ofed-1.2-rc2, if it matters.
> >
> > I don't recall what the differences were but let's assume it's not
> > significant for now.
>
> OK.
>
> [snip]
>
> > OK; this clearly shows there's only 1 SM at a time here.
> >
> > Did this exact cluster work fine (with OFED 1.2 (rc2) OpenSM) when the
> > end nodes were OFED 1.2 rather than 1.3 and that was the only change ?
>
> We ran for months with the 1.2-rc2 opensm and
> 1.2.5 clients on RHEL4.
>
> Then we updated the clients to RHEL5 and ofed-1.3
> a little while ago.
>
> > Did the cluster size change too by any chance ?
>
> Nope.
>
> > How large a cluster is
> > this ? (There were some fixes here which should help for OFED 1.3).
>
> 128 nodes, 2 CPUs each.
>
> > What opensm command line and config file options are being used to start
> > the OpenSMs ?
> The 1.3 OpenSM is using "opensm -t 200 -g 0" with these config
> file options:
>
> DEBUG=none
> LMC=0
> MAXSMPS=4
> REASSIGN_LIDS="no"
> SWEEP=10
> TIMEOUT=200
> OSM_LOG=/var/log/opensm.log
> VERBOSE="none"
> UPDN="off"
> GUID_FILE="none"
> GUID=0
> OSM_HOSTS=""
> OSM_CACHE_DIR=/var/cache/opensm
> CACHE_OPTIONS="none"
> HONORE_GUID2LID="none"
> RCP=/usr/bin/scp
> RSH=/usr/bin/ssh
> RESCAN_TIME=60
> PORT_NUM=1
> ONBOOT=no
>
> while the 1.2-rc2 is using "opensm -maxsmps 0 -t 200 -g 0" with
> these config file options:
>
> DEBUG=none
> LMC=0
> MAXSMPS=0
> REASSIGN_LIDS="no"
> SWEEP=10
> TIMEOUT=200
> OSM_LOG=/tmp/osm.log
> VERBOSE="none"
> UPDN="off"
> GUID_FILE="none"
> GUID=0
> OSM_HOSTS=""
> OSM_CACHE_DIR=/var/cache/osm
> CACHE_OPTIONS="none"
> HONORE_GUID2LID="none"
> RCP=/usr/bin/scp
> RSH=/usr/bin/ssh
> RESCAN_TIME=60
> PORT_NUM=1
> ONBOOT=no
>
> So from your other email, the "-maxsmps 0" is problematic?
It can be, in a large subnet. I'm not sure yours is large enough for
that to cause this, but you can dial it back to, say, 15 and see if
that makes a difference.

> > Are you trying to stick with the OFED 1.2 (rc2) OpenSM or would the OFED
> > 1.3 OpenSM be OK if it worked in your environment ?
>
> The 1.3 is OK.  We run diskless, and in the process of updating
> the clients it was easiest to leave the admin node, which was
> running the OpenSM, alone at 1.2-rc2.  I can run the sm on one of
> the compute nodes until I have a chance to update the admin
> node as well.
>
> It's just that it was unexpected to see the logs spammed with
> those errors, and it seemed like a good idea to report it.

OpenSM doesn't do a good job of filtering out repetitive errors like
this, but this type of error is unusual.

> Maybe there's some lurking issue that hasn't shown up any
> other way yet?

I'm hoping it's the maxsmps issue. Can you change it from infinite and
see ?

> In any event, the OpenSM from 1.3 seems to be working
> just fine for us.

That's good to know. There were many improvements made for this (from
1.2).

-- Hal

> > Sorry for all the questions but I'm trying to come up with a theory on
> > what's not right.
>
> No worries.  Thanks for checking into it.
>
> -- Jim
>
> > -- Hal

_______________________________________________
ewg mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
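[Editor's sketch of the change discussed above: replacing MAXSMPS=0 (no
limit on outstanding SMPs on the wire) with MAXSMPS=15 in the opensm
config file quoted in the thread. The config file path varies between
OFED installs, so the snippet below works on a temporary copy; the real
edit is the same one-line substitution applied to your installed config.]

```shell
# Demonstrate the MAXSMPS=0 -> MAXSMPS=15 substitution on a scratch copy
# (on a real system, point sed at your installed opensm config instead).
conf=$(mktemp)
printf 'MAXSMPS=0\nTIMEOUT=200\n' > "$conf"

# Cap outstanding SMPs at 15 instead of unlimited (0).
sed -i 's/^MAXSMPS=0$/MAXSMPS=15/' "$conf"

grep MAXSMPS "$conf"   # MAXSMPS=15
rm -f "$conf"
```

The equivalent command-line change is "opensm -maxsmps 15 -t 200 -g 0"
in place of "-maxsmps 0", matching Hal's suggestion of 15 rather than
infinite.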
