On Tue, Mar 22, 2011 at 9:23 PM, Albert Chu <[email protected]> wrote:
> Hey Jim, Alex,
>
> Just hit a segfault on the main tree. It appears patch
>
> commit 9ddcf3419eade13bdc0a54f93930c49fe67efd63
> Author: Jim Schutt <[email protected]>
> Date: Fri Sep 3 10:43:12 2010 -0600
>
>     opensm: Avoid havoc in minhop caused by torus-2QoS persistent use of
>     osm_port_t:priv.
>
> segfaults opensm on one of our systems w/ updn routing and lmc > 0
> (would likely segfault dor, minhop, and maybe others too). Our system
> has older switches that do not support enhanced port zero, thus do not
> support LMC > 0. (I imagine setting lmc_esp0 to FALSE, results in the
> same behavior.) Subsequently even if you set LMC > 0 in your opensm
> config file, there can be ports with LMC = 0 and LMC != 0 (e.g. from
> HCAs). Subsequently in alloc_ports_priv(), some ports will have priv set
> to NULL and some will not. Because of assumptions in osm_switch.c about
> priv != NULL when lmc > 0, we hit a segfault. The issue didn't exist
> before b/c we allocated p_port->priv non-NULL no matter what.
>
> The attached patch fixes the problem w/ updn. I haven't looked through
> all of the 2Qos code thoroughly to figure out the consequences of this
> change, so I'm just considering this a starting point for discussion.
>
> In addition, with the possibility that SP0 ports will be LMC = 0, this
> code in osm_ucast_mgr.c ucast_mgr_process_tbl() does not look good.
>
>         lids_per_port = 1 << p_mgr->p_subn->opt.lmc;
>         for (i = 0; i < lids_per_port; i++) {
>                 cl_qlist_t *list = &p_mgr->port_order_list;
>                 cl_list_item_t *item;
>                 for (item = cl_qlist_head(list); item != cl_qlist_end(list);
>                      item = cl_qlist_next(item)) {
>                         osm_port_t *port = cl_item_obj(item, port, list_item);
>                         ucast_mgr_process_port(p_mgr, p_sw, port, i);
>                 }
>         }
>
> It iterates over all ports with the configured LMC, not the LMC of the
> port?

Yes, base SP0 is always LMC 0. Such loops could use either the LMC of
the port, or perhaps 0 for base SP0 and the configured LMC otherwise
(assuming it's an endport). There used to be cases that used the latter
approach. I'm not sure which is more appropriate now.

-- Hal

> I haven't thought about this too deeply or investigated deeply,
> so consider this another starting point for discussion.
>
> Al
>
> --
> Albert Chu
> [email protected]
> Computer Scientist
> High Performance Systems Division
> Lawrence Livermore National Laboratory
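
To make the difference between the two loop bounds discussed above concrete, here is a minimal standalone sketch. It does not use the OpenSM API: `struct sketch_port`, `lids_for_lmc()`, `count_with_configured_lmc()` and `count_with_port_lmc()` are hypothetical names invented for illustration, and the functions only count how many (port, LID offset) pairs each loop shape would visit. The only assumption carried over from the thread is that a port with LMC n covers 1 << n LIDs and that a base SP0 port has LMC 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a port; this is NOT the real osm_port_t. */
struct sketch_port {
	uint16_t base_lid;
	uint8_t lmc;		/* LMC actually in effect on this port */
};

/* Number of LIDs covered by an LMC value (LMC is a 3-bit field, 0..7). */
static unsigned lids_for_lmc(uint8_t lmc)
{
	assert(lmc <= 7);
	return 1u << lmc;
}

/*
 * Loop shape from the quoted code: every port is visited once per LID
 * offset implied by the configured LMC, whatever the port's own LMC is.
 */
static unsigned count_with_configured_lmc(const struct sketch_port *ports,
					  size_t n, uint8_t configured_lmc)
{
	unsigned lids_per_port = lids_for_lmc(configured_lmc);
	unsigned visited = 0;
	unsigned i;
	size_t p;

	(void)ports;		/* the port's own LMC is never consulted */
	for (i = 0; i < lids_per_port; i++)
		for (p = 0; p < n; p++)
			visited++;
	return visited;
}

/*
 * Alternative shape: same outer loop, but a port is skipped for LID
 * offsets beyond the range its own LMC covers.
 */
static unsigned count_with_port_lmc(const struct sketch_port *ports,
				    size_t n, uint8_t configured_lmc)
{
	unsigned lids_per_port = lids_for_lmc(configured_lmc);
	unsigned visited = 0;
	unsigned i;
	size_t p;

	for (i = 0; i < lids_per_port; i++)
		for (p = 0; p < n; p++) {
			if (i >= lids_for_lmc(ports[p].lmc))
				continue;	/* offset not owned by this port */
			visited++;
		}
	return visited;
}
```

With one LMC 0 port (such as a base SP0) and two LMC 2 ports under a configured LMC of 2, the first function visits every port for all four offsets, while the second visits the LMC 0 port only for offset 0. Whether skipping is the right behavior for a real routing engine is exactly the open question in this thread; the sketch only shows where the two approaches diverge.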
