On Thu, May 28, 2009 at 7:41 PM, Bob Ciotti <bob.cio...@nasa.gov> wrote:
> On Thu, May 28, 2009 at 02:06:38PM -0500, Hal Rosenstock wrote:
>> On Thu, May 28, 2009 at 1:57 PM, Bob Ciotti <bob.cio...@nasa.gov> wrote:
>> >
>> > Sorry to bounce this off the list - should it be too remedial. I promise
>> > that I've been consuming a lot of the spec and OFA code. Maybe you consider
>> > that a promise or a warning we will be more active :|
>> >
>> > Our configuration is >6000 CA in a mix of infinihostIII/connectx and
>> > longbow extenders and >800 24 port switches on a single subnet. (SGI ICE
>> > with lots of other stuff plugged in). It's DDR everywhere except across the
>> > longbows. Hosts range from a few different generations of x86 xeon, x86
>> > opteron and itanium. We use lustre but have the srp traffic on a separate
>> > subnet.
>> >
>> > A few weeks ago connection setup times were mentioned on this list along
>> > with ARP and path record lookups not being scalable. We experience these
>> > problems as well and need to address these scalability issues. I have
>> > quite a bit of test data and a few different ideas to bounce off the list
>> > RE path records, once I am a little more versed in the spec. There has
>> > already been some work done to limit ARP traffic.
>> >
>> > Today's question has to do with SM errors.
>> > We have been seeing lots of these - sometimes more than others. Digging
>> > around some it appears that the 6777 represents the number of duplicates?
>> > This value fluctuates around some, but not a lot. Comments in the code
>> > indicate that any value >1 is a problem. Question is, is this OK to be
>> > happening, and how does it occur?
>>
>> It's an error (and an error status of too many records is returned to the
>> SA client in the end node).
>>
>> Gets are only allowed to return 1 record (GetTable requests can deal
>> with more than 1 record in the response) yet many were found by the SA
>> that satisfied the request in responding to the Get. Any idea on what
>> the specific get is that causes this to occur ?
>
> That's the problem. At the debug level we are running at, I can't pin down
> the source.
Can you change the debug level ? If not, can you instrument OpenSM
(add some debug info into osm_sa_path_record.c) ?

> Is there a state I can go look for on the clients to see what
> it's trying to do?

Perhaps use madeye.

-- Hal

> bob
>
>
>> -- Hal
>>
>> > We will probably do an update to the 1.4 or 1.4.1 SM in the next few days.
>> > We are currently running a pre 1.4 top of tree pull from back in dec.
>> > bob
>> >
>> >
>> > May 28 00:07:13 324705 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:13 336912 [76EE7940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:13 338067 [CFF50940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:13 570497 [E2448940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:14 417242 [BDA58940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:14 899329 [19330940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:15 245929 [F4940940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:17 076558 [6E38940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:18 118151 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:18 328783 [76EE7940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:18 341440 [CFF50940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:18 578154 [E2448940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:19 425249 [BDA58940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:19 907407 [19330940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:20 267818 [F4940940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> >
>> > ....
>> >
>> >
>> >
>> > -------------------------------------------------------------------------
>> > Robert B. Ciotti                              Supercomputing Systems Lead
>> > NASA Advanced Supercomputing (NAS) Division   TEL (650) 604-4408
>> > NASA Ames Research Center                     FAX (650) 604-4377
>> > Moffett Field, CA 94035-1000                  bob.cio...@nasa.gov
>> > -------------------------------------------------------------------------
>> >
>> > _______________________________________________
>> > general mailing list
>> > general@lists.openfabrics.org
>> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>> >
>> > To unsubscribe, please visit
>> > http://openib.org/mailman/listinfo/openib-general
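
Before raising the debug level, it may help to tally the ERR 4C05 lines already in the log. The following is a rough sketch in Python, not part of OpenSM. The field layout is inferred only from the log lines quoted in this thread, and the names `LINE_RE`, `summarize` and `SAMPLE` are made up for illustration. It assumes the bracketed hex value identifies the logging thread and the number in parentheses is the count of matching records. It cannot identify the requesting client, since the quoted lines do not carry that information.

```python
import re
from collections import Counter

# Assumed layout, inferred only from the log lines quoted in this thread:
#   <Mon DD HH:MM:SS> <usec> [<hex id>] <level> -> <function>: ERR <code>: <message> (<count>)
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<usec>\d+) "
    r"\[(?P<tid>[0-9A-Fa-f]+)\] (?P<level>0x[0-9A-Fa-f]+) -> "
    r"(?P<func>\w+): ERR (?P<code>[0-9A-Fa-f]+): "
    r"(?P<msg>.*?)(?: \((?P<count>\d+)\))?$"
)


def summarize(lines, code="4C05"):
    """Tally matching error lines per second, per bracketed id, and per record count."""
    per_second = Counter()
    per_id = Counter()
    record_counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None or m.group("code") != code:
            continue  # skip lines that do not fit the assumed layout
        per_second[m.group("stamp")] += 1
        per_id[m.group("tid")] += 1
        if m.group("count") is not None:
            record_counts[int(m.group("count"))] += 1
    return per_second, per_id, record_counts


# Three lines copied from the log excerpt in this thread.
SAMPLE = [
    "May 28 00:07:13 324705 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: "
    "Got more than one record for SubnAdmGet (6777)",
    "May 28 00:07:13 336912 [76EE7940] 0x01 -> osm_sa_respond: ERR 4C05: "
    "Got more than one record for SubnAdmGet (6777)",
    "May 28 00:07:18 118151 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: "
    "Got more than one record for SubnAdmGet (6777)",
]

if __name__ == "__main__":
    per_second, per_id, record_counts = summarize(SAMPLE)
    print("errors per second:", dict(per_second))
    print("errors per id:", dict(per_id))
    print("record counts seen:", dict(record_counts))
```

Feeding it the full log instead of `SAMPLE` (for example `summarize(open(path))`, with `path` pointing at wherever the SM log lives on your system) would show whether the errors arrive in regular bursts and whether the reported record count ever changes, which might hint at which periodic client query is behind them.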