Hey Jim,
On Wed, 2011-03-23 at 09:01 -0700, Jim Schutt wrote:
> Hi Al,
>
> Albert Chu wrote:
> > Hey Jim, Alex,
> >
> > Just hit a segfault on the main tree. It appears patch
> >
> > commit 9ddcf3419eade13bdc0a54f93930c49fe67efd63
> > Author: Jim Schutt <[email protected]>
> > Date: Fri Sep 3 10:43:12 2010 -0600
> >
> > opensm: Avoid havoc in minhop caused by torus-2QoS persistent use of
> > osm_port_t:priv.
> >
> > segfaults opensm on one of our systems w/ updn routing and lmc > 0
> > (would likely segfault dor, minhop, and maybe others too). Our system
> > has older switches that do not support enhanced port zero, thus do not
> > support LMC > 0. (I imagine setting lmc_esp0 to FALSE, results in the
> > same behavior.) Subsequently even if you set LMC > 0 in your opensm
> > config file, there can be ports with LMC = 0 and LMC != 0 (e.g. from
> > HCAs). Subsequently in alloc_ports_priv(), some ports will have priv set
> > to NULL and some will not. Because of assumptions in osm_switch.c about
> > priv != NULL when lmc > 0, we hit a segfault. The issue didn't exist
> > before b/c we allocated p_port->priv non-NULL no matter what.
>
> OK, I think I see. But this segfault can only occur in
> the case where LMC is configured > 0, right?
>
> The issue is in osm_switch_recommend_path() when
> routing_for_lmc is true, but p_port->priv is NULL, right?
Yup.
> >
> > The attached patch fixes the problem w/ updn. I haven't looked through
> > all of the 2Qos code thoroughly to figure out the consequences of this
> > change, so I'm just considering this a starting point for discussion.
>
> Torus-2QoS's use of port->priv is unique because it persists
> between routing sweeps. So if another routing engine runs
> after torus-2QoS and uses port->priv without having ensured
> that it set it itself, there will be trouble. 9ddcf3419ea
> was fixing such an issue.
>
> I can find only two calls of osm_switch_recommend_path(),
> and both seem to be to do the right thing, so I think
> your patch is OK.
Sounds good. When reading over your comments about the 2Qos patches
that affected this area, I wasn't quite sure how you were dealing with
the p_port->priv, so I was unsure how my patch would affect things.
> >
> > In addition, with the possibility that SP0 ports will be LMC = 0, this
> > code in osm_ucast_mgr.c ucast_mgr_process_tbl() does not look good.
> >
> > lids_per_port = 1 << p_mgr->p_subn->opt.lmc;
> > for (i = 0; i < lids_per_port; i++) {
> > cl_qlist_t *list = &p_mgr->port_order_list;
> > cl_list_item_t *item;
> > for (item = cl_qlist_head(list); item != cl_qlist_end(list);
> > item = cl_qlist_next(item)) {
> > osm_port_t *port = cl_item_obj(item, port, list_item);
> > ucast_mgr_process_port(p_mgr, p_sw, port, i);
> > }
> > }
> >
> > It iterates over all ports with the configured LMC, not the LMC of the
> > port? I haven't thought about this too deeply or investigated deeply,
> > so consider this another starting point for discussion.
>
> Hmm, looks like ucast_mgr_process_port() DTRT, though;
> it ignores lids that aren't in the range configured on
> the port?
Ahh, I think you're right. It does appear to do the right thing there.
I don't think it's a problem afterall.
Al
> >
> > Al
> >
> >
> >
> > ------------------------------------------------------------------------
> >
> > Subject:
> > [PATCH] fix segfault corner case w/ updn routing and LMC > 0
> > From:
> > Albert L.Chu <[email protected]>
> > Date:
> > Tue, 22 Mar 2011 17:36:16 -0700
> >
> >
> > Signed-off-by: Albert L. Chu <[email protected]>
>
> Reviewed-by: Jim Schutt <[email protected]>
>
> -- Jim
>
> > ---
> > opensm/osm_ucast_mgr.c | 4 ----
> > 1 files changed, 0 insertions(+), 4 deletions(-)
> >
> > diff --git a/opensm/osm_ucast_mgr.c b/opensm/osm_ucast_mgr.c
> > index 4019589..211d6e0 100644
> > --- a/opensm/osm_ucast_mgr.c
> > +++ b/opensm/osm_ucast_mgr.c
> > @@ -318,10 +318,6 @@ static void alloc_ports_priv(osm_ucast_mgr_t * mgr)
> > item = cl_qmap_next(item)) {
> > port = (osm_port_t *) item;
> > lmc = ib_port_info_get_lmc(&port->p_physp->port_info);
> > - if (!lmc) {
> > - port->priv = NULL;
> > - continue;
> > - }
> > r = malloc(sizeof(*r) + sizeof(r->guids[0]) * (1 << lmc));
> > if (!r) {
> > OSM_LOG(mgr->p_log, OSM_LOG_ERROR, "ERR 3A09: "
>
--
Albert Chu
[email protected]
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html