On 09:10 Wed 05 Mar, Al Chu wrote:
>
> I can't restart opensm on that cluster at this time. I don't recall any
> port errors. However, I do recall seeing this output from
> __osm_state_mgr_light_sweep_start():
>
>     OSM_LOG(sm->p_log, OSM_LOG_ERROR,
>             "ERR 0108: "
>             "Unknown remote side for node 0x%016"
>             PRIx64
>             "(%s) port %u. Adding to light sweep sampling list\n",
>             cl_ntoh64(osm_node_get_node_guid
>                       (p_node)),
>             p_node->print_desc, port_num);
>
> leading to a call to __osm_state_mgr_get_remote_port_info(), leading to
> what I fixed in osm_pi_rcv_process().
Yes, this is a valid (handled) scenario. What I cannot understand is why
it doesn't reach __osm_pi_rcv_process_switch_port() (where the
ignore_existing_lfts flag should be enforced in accordance with the port
state) after querying ports with "unknown" remotes during a light sweep.

I did some experiments with ibsim and am still not able to reproduce
this. I'm afraid there could be some hidden bug which I have not been
able to catch yet.

> My original assumption was that the remote side for some ports wasn't
> known b/c the remote side ports were down. Is it possible for opensm to
> not know about a remote side even if that remote side port is up/active?

I think yes, some ports could be DOWN during initial discovery and become
INIT later during LID assignment and/or link state setup. Normally (as in
your scenario) the next light sweep catches this and forces a heavy sweep.

Sasha
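
To illustrate the expected behaviour described above, here is a minimal
sketch of the escalation rule: a port whose remote side was unknown at
discovery time is re-queried during the light sweep, and if it is no
longer DOWN the sweep should be escalated to a heavy one. This is not
OpenSM code. The names link_state, port_model and
light_sweep_needs_heavy() are hypothetical and the model is heavily
simplified.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model; NOT the OpenSM data structures. */
enum link_state { LINK_DOWN, LINK_INIT, LINK_ARMED, LINK_ACTIVE };

struct port_model {
    enum link_state state; /* state reported by the latest PortInfo query */
    bool remote_known;     /* was a remote side recorded at discovery? */
};

/*
 * Return true if a light sweep over the given ports should be escalated
 * to a heavy sweep. Assumption: a port with no known remote side that is
 * no longer DOWN has changed since discovery and needs rediscovery.
 */
bool light_sweep_needs_heavy(const struct port_model *ports, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!ports[i].remote_known && ports[i].state != LINK_DOWN)
            return true;
    }
    return false;
}
```

In this model a port that was DOWN at discovery and later reports INIT
triggers the escalation, which is the case the light sweep sampling list
is meant to catch.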
