Hey Hal,

> Other than the average bandwidth numbers at the top which show PL >
> non PL, the ones in the table seem to be the other way around. Do
> those numbers indicate bandwidth ?

Sorry, the mvapich/openmpi numbers are in units of time (microseconds, I
think), so lower is better. The "degradation" columns are the percentage
by which each result is worse than LMC=0.

The key thing to see is that originally, when LMC > 0 and multiple paths
are not used, the numbers go up, i.e. it takes more time. With the
preserving-base-lids patch applied, the LMC > 0 numbers are about the
same as the LMC=0 numbers.

Al

On Tue, 2008-06-17 at 07:51 -0700, Hal Rosenstock wrote:
> Al,
>
> On Mon, 2008-06-16 at 10:56 -0700, Al Chu wrote:
> > Hey Sasha,
> >
> > patch series seems to perform well. As a reminder mvapich 0.9.9 does
> > not know how to deal with multiple lids, openMPI 1.2.6 does. The "preserve
> > lids" condition is labeled "PL".
> >
> > With LMC > 0, the preserve lids seems to do its job under mvapich 0.9.9,
> > and at least maintains (sometimes increases) performance of openmpi 1.2.6
> > with LMC > 0. Numbers are attached.
>
> Maybe I'm not reading this right but here's what I see:
>
> Other than the average bandwidth numbers at the top which show PL > non
> PL, the ones in the table seem to be the other way around. Do those
> numbers indicate bandwidth ?
>
> Also, you mentioned that mvapich doesn't know how to deal with LMC yet
> the numbers seem to go up as LMC does.
>
> -- Hal
>
> > I'd give my thumbs up to the patch. Let's commit it.
> >
> > Al
> >
> > On Tue, 2008-06-10 at 05:59 +0300, Sasha Khapyorsky wrote:
> > > Basically this addresses the problem described by Al Chu in:
> > >
> > > http://lists.openfabrics.org/pipermail/general/2008-April/049132.html
> > >
> > > When base lid paths become completely disbalanced on fabrics with
> > > lmc > 0.
> > >
> > > One feedback was from Yiftah Shahar:
> > >
> > > "I think that our requirements should be that even when you are working
> > > with LMC>0 then the base LID routing should not be affected.
> > > One way to achieve this goal is to first run the base-LID routing (so
> > > all base LID improvement will be also in LMC>0) and then start with the
> > > other LIDs as round-robin starting from the base-lid-port + 1
> > > according to the current routing algorithm rules (keeping min-hop, up/down...)."
> > >
> > > We had some discussion with Al and Yiftah about this and considered that
> > > in addition to "pure" base lid paths preservation (which is a good thing by
> > > itself) the proposed method solves the original lid disbalancing problem as well.
> > >
> > > This patch is an implementation of the idea above.
> > >
> > > Signed-off-by: Sasha Khapyorsky <[EMAIL PROTECTED]>
> > > ---
> > > opensm/include/opensm/osm_switch.h | 4 +
> > > opensm/opensm/osm_dump.c | 3 +-
> > > opensm/opensm/osm_switch.c | 8 ++-
> > > opensm/opensm/osm_ucast_mgr.c | 163 ++++++++++++++++--------------------
> > > 4 files changed, 86 insertions(+), 92 deletions(-)
> > >
> > > diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h
> > > index 0e9c5fa..c1521a6 100644
> > > --- a/opensm/include/opensm/osm_switch.h
> > > +++ b/opensm/include/opensm/osm_switch.h
> > > @@ -981,6 +981,7 @@ uint8_t
> > > osm_switch_recommend_path(IN const osm_switch_t * const p_sw,
> > > IN osm_port_t * p_port,
> > > IN const uint16_t lid_ho,
> > > + IN unsigned start_from,
> > > IN const boolean_t ignore_existing,
> > > IN const boolean_t dor);
> > > /*
> > > @@ -995,6 +996,9 @@ osm_switch_recommend_path(IN const osm_switch_t * const p_sw,
> > > * lid_ho
> > > * [in] LID value (host order) for which to get a path advisory.
> > > *
> > > +* start_from
> > > +* [in] Port number from where to start balance counting.
> > > +*
> > > * ignore_existing
> > > * [in] Set to cause the switch to choose the optimal route
> > > * regardless of existing paths.
> > > diff --git a/opensm/opensm/osm_dump.c b/opensm/opensm/osm_dump.c
> > > index b96984b..60c6d25 100644
> > > --- a/opensm/opensm/osm_dump.c
> > > +++ b/opensm/opensm/osm_dump.c
> > > @@ -218,7 +218,8 @@ static void dump_ucast_routes(cl_map_item_t *p_map_item, FILE *file, void *cxt)
> > > else {
> > > /* No LMC Optimization */
> > > best_port = osm_switch_recommend_path(p_sw, p_port,
> > > - lid_ho, TRUE, dor);
> > > + lid_ho, 1, TRUE,
> > > + dor);
> > > fprintf(file, "No %u hop path possible via port %u!",
> > > best_hops, best_port);
> > > }
> > > diff --git a/opensm/opensm/osm_switch.c b/opensm/opensm/osm_switch.c
> > > index 58936e3..a9d13c8 100644
> > > --- a/opensm/opensm/osm_switch.c
> > > +++ b/opensm/opensm/osm_switch.c
> > > @@ -274,6 +274,7 @@ uint8_t
> > > osm_switch_recommend_path(IN const osm_switch_t * const p_sw,
> > > IN osm_port_t * p_port,
> > > IN const uint16_t lid_ho,
> > > + IN unsigned start_from,
> > > IN const boolean_t ignore_existing,
> > > IN const boolean_t dor)
> > > {
> > > @@ -294,6 +295,7 @@ osm_switch_recommend_path(IN const osm_switch_t * const p_sw,
> > > uint8_t port_num;
> > > uint8_t num_ports;
> > > uint32_t least_paths = 0xFFFFFFFF;
> > > + unsigned i;
> > > /*
> > > The follwing will track the least paths if the
> > > route should go through a new system/node
> > > @@ -397,8 +399,10 @@ osm_switch_recommend_path(IN const osm_switch_t * const p_sw,
> > > */
> > >
> > > /* port number starts with one and num_ports is 1 + num phys ports */
> > > - for (port_num = 1; port_num < num_ports; port_num++) {
> > > - if (osm_switch_get_hop_count(p_sw, base_lid, port_num) !=
> > > + for (i = start_from; i < start_from + num_ports; i++) {
> > > + port_num = i%num_ports;
> > > + if (!port_num ||
> > > + osm_switch_get_hop_count(p_sw, base_lid, port_num) !=
> > > least_hops)
> > > continue;
> > >
> > > diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
> > > index c073037..2aae6d5 100644
> > > --- a/opensm/opensm/osm_ucast_mgr.c
> > > +++ b/opensm/opensm/osm_ucast_mgr.c
> > > @@ -208,7 +208,8 @@ find_and_add_remote_sys(osm_switch_t *sw, uint8_t port,
> > > static void
> > > __osm_ucast_mgr_process_port(IN osm_ucast_mgr_t * const p_mgr,
> > > IN osm_switch_t * const p_sw,
> > > - IN osm_port_t * const p_port)
> > > + IN osm_port_t * const p_port,
> > > + IN unsigned lid_offset)
> > > {
> > > uint16_t min_lid_ho;
> > > uint16_t max_lid_ho;
> > > @@ -217,19 +218,14 @@ __osm_ucast_mgr_process_port(IN osm_ucast_mgr_t * const p_mgr,
> > > boolean_t is_ignored_by_port_prof;
> > > ib_net64_t node_guid;
> > > struct osm_routing_engine *p_routing_eng;
> > > - /*
> > > - The following are temporary structures that will aid
> > > - in providing better routing in LMC > 0 situations
> > > - */
> > > - uint16_t lids_per_port = 1 << p_mgr->p_subn->opt.lmc;
> > > - struct osm_remote_node *p_remote_guid_used = NULL;
> > > + unsigned start_from = 1;
> > >
> > > OSM_LOG_ENTER(p_mgr->p_log);
> > >
> > > osm_port_get_lid_range_ho(p_port, &min_lid_ho, &max_lid_ho);
> > >
> > > - /* If the lids are zero - then there was some problem with the initialization.
> > > - Don't handle this port. */
> > > + /* If the lids are zero - then there was some problem with
> > > + * the initialization. Don't handle this port. */
> > > if (min_lid_ho == 0 || max_lid_ho == 0) {
> > > OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR 3A04: "
> > > "Port 0x%" PRIx64 " has LID 0. An initialization "
> > > @@ -238,16 +234,22 @@ __osm_ucast_mgr_process_port(IN osm_ucast_mgr_t * const p_mgr,
> > > goto Exit;
> > > }
> > >
> > > - if (osm_log_is_active(p_mgr->p_log, OSM_LOG_DEBUG)) {
> > > + lid_ho = min_lid_ho + lid_offset;
> > > +
> > > + if (lid_ho > max_lid_ho)
> > > + goto Exit;
> > > +
> > > + if (lid_offset)
> > > + /* ignore potential overflow - it is handled in osm_switch.c */
> > > + start_from = osm_switch_get_port_by_lid(p_sw, lid_ho - 1) + 1;
> > > +
> > > + if (osm_log_is_active(p_mgr->p_log, OSM_LOG_DEBUG))
> > > OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
> > > - "Processing port 0x%" PRIx64 ", LIDs [0x%X,0x%X]\n",
> > > - cl_ntoh64(osm_port_get_guid(p_port)),
> > > + "Processing port 0x%" PRIx64 ", LID %u [0x%X,0x%X]\n",
> > > + cl_ntoh64(osm_port_get_guid(p_port)), lid_ho,
> > > min_lid_ho, max_lid_ho);
> > > - }
> > >
> > > - /*
> > > - TO DO - This should be runtime error, not a CL_ASSERT()
> > > - */
> > > + /* TODO - This should be runtime error, not a CL_ASSERT() */
> > > CL_ASSERT(max_lid_ho < osm_switch_get_fwd_tbl_size(p_sw));
> > >
> > > node_guid = osm_node_get_node_guid(p_sw->p_node);
> > > @@ -260,80 +262,62 @@ __osm_ucast_mgr_process_port(IN osm_ucast_mgr_t * const p_mgr,
> > > how best to distribute the LID range across the ports
> > > that can reach those LIDs.
> > > */
> > > - for (lid_ho = min_lid_ho; lid_ho <= max_lid_ho; lid_ho++) {
> > > - /* Use the enhanced algorithm only for LMC > 0 */
> > > - if (lids_per_port > 1) {
> > > - port = osm_switch_recommend_path(p_sw, p_port, lid_ho,
> > > - p_mgr->p_subn->
> > > - ignore_existing_lfts,
> > > - p_mgr->is_dor);
> > > - if (port > 0 && port != OSM_NO_PATH && p_port->priv)
> > > - p_remote_guid_used =
> > > - find_and_add_remote_sys(p_sw, port,
> > > - p_port->priv);
> > > + port = osm_switch_recommend_path(p_sw, p_port, lid_ho, start_from,
> > > + p_mgr->p_subn->ignore_existing_lfts,
> > > + p_mgr->is_dor);
> > > +
> > > + if (port == OSM_NO_PATH) {
> > > + /* do not try to overwrite the ppro of non existing port ... */
> > > + is_ignored_by_port_prof = TRUE;
> > > +
> > > + /* Up/Down routing can cause unreachable routes between some
> > > + switches so we do not report that as an error in that case */
> > > + if (!p_routing_eng->build_lid_matrices) {
> > > + OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR 3A08: "
> > > + "No path to get to LID 0x%X from switch 0x%"
> > > + PRIx64 "\n", lid_ho, cl_ntoh64(node_guid));
> > > + /* trigger a new sweep - try again ... */
> > > + p_mgr->p_subn->subnet_initialization_error = TRUE;
> > > } else
> > > - port = osm_switch_recommend_path(p_sw, p_port, lid_ho,
> > > - p_mgr->p_subn->
> > > - ignore_existing_lfts,
> > > - p_mgr->is_dor);
> > > + OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
> > > + "No path to get to LID 0x%X from switch 0x%"
> > > + PRIx64 "\n", lid_ho, cl_ntoh64(node_guid));
> > > + } else {
> > > + OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
> > > + "Routing LID 0x%X to port 0x%X"
> > > + " for switch 0x%" PRIx64 "\n",
> > > + lid_ho, port, cl_ntoh64(node_guid));
> > >
> > > /*
> > > - There might be no path to the target
> > > + we would like to optionally ignore this port in equalization
> > > + as in the case of the Mellanox Anafa Internal PCI TCA port
> > > */
> > > - if (port == OSM_NO_PATH) {
> > > - /* do not try to overwrite the ppro of non existing port ... */
> > > - is_ignored_by_port_prof = TRUE;
> > > -
> > > - /* Up/Down routing can cause unreachable routes between some
> > > - switches so we do not report that as an error in that case */
> > > - if (!p_routing_eng->build_lid_matrices) {
> > > - OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR 3A08: "
> > > - "No path to get to LID 0x%X from switch 0x%"
> > > - PRIx64 "\n", lid_ho,
> > > - cl_ntoh64(node_guid));
> > > - /* trigger a new sweep - try again ... */
> > > - p_mgr->p_subn->subnet_initialization_error =
> > > - TRUE;
> > > - } else
> > > - OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
> > > - "No path to get to LID 0x%X from switch 0x%"
> > > - PRIx64 "\n", lid_ho,
> > > - cl_ntoh64(node_guid));
> > > - } else {
> > > - OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
> > > - "Routing LID 0x%X to port 0x%X"
> > > - " for switch 0x%" PRIx64 "\n",
> > > - lid_ho, port, cl_ntoh64(node_guid));
> > > -
> > > - /*
> > > - we would like to optionally ignore this port in equalization
> > > - as in the case of the Mellanox Anafa Internal PCI TCA port
> > > - */
> > > - is_ignored_by_port_prof =
> > > - osm_port_prof_is_ignored_port(p_mgr->p_subn,
> > > - node_guid, port);
> > > -
> > > - /*
> > > - We also would ignore this route if the target lid is of a switch
> > > - and the port_profile_switch_node is not TRUE
> > > - */
> > > - if (!p_mgr->p_subn->opt.port_profile_switch_nodes) {
> > > - is_ignored_by_port_prof |=
> > > - (osm_node_get_type(p_port->p_node) ==
> > > - IB_NODE_TYPE_SWITCH);
> > > - }
> > > - }
> > > + is_ignored_by_port_prof =
> > > + osm_port_prof_is_ignored_port(p_mgr->p_subn,
> > > + node_guid, port);
> > >
> > > /*
> > > - We have selected the port for this LID.
> > > - Write it to the forwarding tables.
> > > + We also would ignore this route if the target lid is of
> > > + a switch and the port_profile_switch_node is not TRUE
> > > */
> > > - p_mgr->lft_buf[lid_ho] = port;
> > > - if (!is_ignored_by_port_prof) {
> > > - osm_switch_count_path(p_sw, port);
> > > - if (p_remote_guid_used)
> > > - p_remote_guid_used->forwarded_to++;
> > > - }
> > > + if (!p_mgr->p_subn->opt.port_profile_switch_nodes)
> > > + is_ignored_by_port_prof |=
> > > + (osm_node_get_type(p_port->p_node) ==
> > > + IB_NODE_TYPE_SWITCH);
> > > + }
> > > +
> > > + /*
> > > + We have selected the port for this LID.
> > > + Write it to the forwarding tables.
> > > + */
> > > + p_mgr->lft_buf[lid_ho] = port;
> > > + if (!is_ignored_by_port_prof) {
> > > + struct osm_remote_node *rem_node_used;
> > > + osm_switch_count_path(p_sw, port);
> > > + if (port > 0 && p_port->priv &&
> > > + (rem_node_used = find_and_add_remote_sys(p_sw, port,
> > > + p_port->priv)))
> > > + rem_node_used->forwarded_to++;
> > > }
> > >
> > > Exit:
> > > @@ -512,6 +496,7 @@ __osm_ucast_mgr_process_tbl(IN cl_map_item_t * const p_map_item,
> > > osm_node_t *p_node;
> > > osm_port_t *p_port;
> > > const cl_qmap_t *p_port_tbl;
> > > + unsigned i, lids_per_port;
> > >
> > > OSM_LOG_ENTER(p_mgr->p_log);
> > >
> > > @@ -538,12 +523,12 @@ __osm_ucast_mgr_process_tbl(IN cl_map_item_t * const p_map_item,
> > > Iterate through every port setting LID routes for each
> > > port based on base LID and LMC value.
> > > */
> > > -
> > > - for (p_port = (osm_port_t *) cl_qmap_head(p_port_tbl);
> > > - p_port != (osm_port_t *) cl_qmap_end(p_port_tbl);
> > > - p_port = (osm_port_t *) cl_qmap_next(&p_port->map_item)) {
> > > - __osm_ucast_mgr_process_port(p_mgr, p_sw, p_port);
> > > - }
> > > + lids_per_port = 1 << p_mgr->p_subn->opt.lmc;
> > > + for (i = 0; i < lids_per_port; i++)
> > > + for (p_port = (osm_port_t *) cl_qmap_head(p_port_tbl);
> > > + p_port != (osm_port_t *) cl_qmap_end(p_port_tbl);
> > > + p_port = (osm_port_t *) cl_qmap_next(&p_port->map_item))
> > > + __osm_ucast_mgr_process_port(p_mgr, p_sw, p_port, i);
> > >
> > > osm_ucast_mgr_set_fwd_table(p_mgr, p_sw);
> > >
>
--
Albert Chu
[EMAIL PROTECTED]
925-422-5311
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
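
For anyone following the osm_switch.c hunk in the quoted patch, the wrapped port scan it introduces can be sketched in isolation roughly as below. This is only a simplified illustration, not the OpenSM code: pick_port(), the hops[] and paths[] arrays, and NO_PATH are hypothetical stand-ins (for osm_switch_recommend_path(), the per-port hop counts and path counters, and OSM_NO_PATH), and the real function applies further balancing rules (remote system/node tracking) that are left out here.

```c
#include <assert.h>
#include <stdint.h>

#define NO_PATH 0xFF /* hypothetical stand-in for OSM_NO_PATH */

/*
 * Simplified sketch: choose the least-loaded port among those that reach
 * the destination in least_hops hops, scanning ports in wrapped order
 * beginning at start_from.
 *
 * hops[p]  - hop count to the destination via port p (assumed input)
 * paths[p] - number of paths already routed through port p (assumed input)
 * num_ports counts port 0 as well, i.e. it is 1 + number of physical ports.
 */
static uint8_t pick_port(const uint8_t *hops, const uint32_t *paths,
                         unsigned num_ports, uint8_t least_hops,
                         unsigned start_from)
{
    uint32_t least_paths = UINT32_MAX;
    uint8_t best = NO_PATH;
    unsigned i;

    assert(num_ports > 1 && num_ports <= NO_PATH);

    for (i = start_from; i < start_from + num_ports; i++) {
        /* wrap around; start_from may be past the last port */
        unsigned port_num = i % num_ports;

        /* skip port 0 and ports that are not on a min-hop path */
        if (!port_num || hops[port_num] != least_hops)
            continue;

        /* strict '<' means the first port in scan order wins ties */
        if (paths[port_num] < least_paths) {
            least_paths = paths[port_num];
            best = (uint8_t)port_num;
        }
    }
    return best;
}
```

Because ties go to the first candidate in scan order, passing start_from = 1 for the base LID and "port chosen for the previous LID + 1" for each following LID, as the osm_ucast_mgr.c hunk does, tends to move the extra LIDs of a port onto the next candidate ports while leaving the base LID choice as it would be with LMC=0.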
