On Wed, 2005-02-16 at 11:45, Ronald G. Minnich wrote:
> On Tue, 16 Feb 2005, Hal Rosenstock wrote:
>
> > On Tue, 2005-02-15 at 22:22, Ronald G. Minnich wrote:
> > > On Tue, 15 Feb 2005, Hal Rosenstock wrote:
> > >
> > > > I presume your subnet has 179 HCAs ? Do you know ?
> > >
> > > no errors. It's just that opensm won't run.
> >
> > Won't run or won't do anything on the subnet ?
> >
> > Not sure what you mean by won't run ?
>
> ok, just found it.
>
> There is a sys fail red light on the CPU on the 96-port switch that the
> opensm host attaches to.
>
> What's weird is none of the ib admin tools found anything. ibnetdiscover
> happily walked the whole subnet. The only problem was that opensm would
> not run, but the errors were unclear. So many things appeared to be
> working that it did not occur to me to walk over and look at the switch.
> Stupid of me.

Still not 100% clear on the failure mode. I don't know what the sys fail
light on the CPU means. It may mean that things partially work: the CPU
might have crashed while the IB chips continued to function based on
their current setup. That would depend on the split of functionality
between the CPU and the IB chip firmware (which may depend on the vendor).

If you were able to walk the subnet with the (SMP based) diags, the SM
port had to be at least in init (see ibstat/ibstatus).

The key questions are what the failure mode was, so we can see how this
can be detected better in the future, and what caused the switch CPU to
crash in the first place.

-- Hal

> Now that I've turned that switch off I get this:
> [1108572233:000155763][40BFF970] -> __osm_state_mgr_sm_port_down_msg:
>
>
> ******************************************************************
> ************************** SM PORT DOWN **************************
> ******************************************************************
>
>
> [1108572233:000155778][40BFF970] -> __osm_sm_state_mgr_signal_error: ERR
> 3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state
> IB_SMINFO_STATE_DISCOVERING.
>
> which I assume is its way of telling me that the switch port is down.
>
> ron
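
For what it's worth, the "at least in init" check mentioned above can be scripted. Below is a minimal sketch in Python. It assumes the Linux sysfs layout that ibstatus reads, i.e. a file /sys/class/infiniband/<device>/ports/<port>/state holding a string such as "4: ACTIVE". The device name mthca0, the port number and all function names are illustrative assumptions, not anything taken from opensm or the diags.

```python
from pathlib import Path

# Logical port state numbers as they appear in the sysfs "state" file:
# 1 = DOWN, 2 = INIT, 3 = ARMED, 4 = ACTIVE.
PORT_STATE_INIT = 2


def parse_port_state(text):
    """Split a state string such as '4: ACTIVE' into (4, 'ACTIVE')."""
    number, _, name = text.strip().partition(":")
    return int(number), name.strip()


def port_at_least_init(text):
    """Return True if the port is in INIT or a later state."""
    number, _ = parse_port_state(text)
    return number >= PORT_STATE_INIT


def read_port_state(device="mthca0", port=1, root="/sys/class/infiniband"):
    """Read the raw state string for one port (device and port are assumptions)."""
    path = Path(root) / device / "ports" / str(port) / "state"
    return path.read_text()


if __name__ == "__main__":
    try:
        raw = read_port_state()
    except OSError as exc:
        # No such device or port on this host, or no IB stack loaded.
        print(f"could not read port state: {exc}")
    else:
        number, name = parse_port_state(raw)
        print(f"port state {number} ({name}), "
              f"at least INIT: {port_at_least_init(raw)}")
```

Note that this only looks at the local port. As this thread shows, a port can be in init or better while the switch CPU behind it has crashed, so a check like this would not by itself have caught the failure.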
