Hi Al,

Albert Chu wrote:
Hey Jim, Alex,

Just hit a segfault on the main tree. It appears patch
commit 9ddcf3419eade13bdc0a54f93930c49fe67efd63
Author: Jim Schutt <[email protected]>
Date:   Fri Sep 3 10:43:12 2010 -0600

opensm: Avoid havoc in minhop caused by torus-2QoS persistent use of
osm_port_t:priv.

segfaults opensm on one of our systems w/ updn routing and lmc > 0
(would likely segfault dor, minhop, and maybe others too).  Our system
has older switches that do not support enhanced port zero, thus do not
support LMC > 0.  (I imagine setting lmc_esp0 to FALSE, results in the
same behavior.)  Subsequently even if you set LMC > 0 in your opensm
config file, there can be ports with LMC = 0 and LMC != 0 (e.g. from
HCAs). Subsequently in alloc_ports_priv(), some ports will have priv set
to NULL and some will not.  Because of assumptions in osm_switch.c about
priv != NULL when lmc > 0, we hit a segfault.  The issue didn't exist
before b/c we allocated p_port->priv non-NULL no matter what.

OK, I think I see.  But this segfault can only occur in
the case where LMC is configured > 0, right?

The issue is in osm_switch_recommend_path() when
routing_for_lmc is true, but p_port->priv is NULL, right?


The attached patch fixes the problem w/ updn.  I haven't looked through
all of the 2Qos code thoroughly to figure out the consequences of this
change, so I'm just considering this a starting point for discussion.

Torus-2QoS's use of port->priv is unique because it persists
between routing sweeps.  So if another routing engine runs
after torus-2QoS and uses port->priv without having ensured
that it set it itself, there will be trouble.  9ddcf3419ea
was fixing such an issue.

I can find only two calls of osm_switch_recommend_path(),
and both seem to be to do the right thing, so I think
your patch is OK.


In addition, with the possibility that SP0 ports will be LMC = 0, this
code in osm_ucast_mgr.c ucast_mgr_process_tbl() does not look good.

lids_per_port = 1 << p_mgr->p_subn->opt.lmc;
for (i = 0; i < lids_per_port; i++) {
     cl_qlist_t *list = &p_mgr->port_order_list;
     cl_list_item_t *item;
     for (item = cl_qlist_head(list); item != cl_qlist_end(list);
          item = cl_qlist_next(item)) {
          osm_port_t *port = cl_item_obj(item, port, list_item);
          ucast_mgr_process_port(p_mgr, p_sw, port, i);
     }
}

It iterates over all ports with the configured LMC, not the LMC of the
port?  I haven't thought about this too deeply or investigated deeply,
so consider this another starting point for discussion.

Hmm, looks like ucast_mgr_process_port() DTRT, though;
it ignores lids that aren't in the range configured on
the port?


Al



------------------------------------------------------------------------

Subject:
[PATCH] fix segfault corner case w/ updn routing and LMC > 0
From:
Albert L.Chu <[email protected]>
Date:
Tue, 22 Mar 2011 17:36:16 -0700


Signed-off-by: Albert L. Chu <[email protected]>

Reviewed-by: Jim Schutt <[email protected]>

-- Jim

---
 opensm/osm_ucast_mgr.c |    4 ----
 1 files changed, 0 insertions(+), 4 deletions(-)

diff --git a/opensm/osm_ucast_mgr.c b/opensm/osm_ucast_mgr.c
index 4019589..211d6e0 100644
--- a/opensm/osm_ucast_mgr.c
+++ b/opensm/osm_ucast_mgr.c
@@ -318,10 +318,6 @@ static void alloc_ports_priv(osm_ucast_mgr_t * mgr)
             item = cl_qmap_next(item)) {
                port = (osm_port_t *) item;
                lmc = ib_port_info_get_lmc(&port->p_physp->port_info);
-               if (!lmc) {
-                       port->priv = NULL;
-                       continue;
-               }
                r = malloc(sizeof(*r) + sizeof(r->guids[0]) * (1 << lmc));
                if (!r) {
                        OSM_LOG(mgr->p_log, OSM_LOG_ERROR, "ERR 3A09: "

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to