I mean that if you still get the traps at a high rate (after verifying the OPVL), then you should consider disabling the link.
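A rough sketch of that ordering, in case it helps the discussion: count the traps first, check the OpVLs on both ends once the threshold is reached, and only then fall back to disabling the link. Everything here is hypothetical and not OpenSM code: the `port_view_t` type, the `babble_decide()` helper and the action names are made up for illustration, and the threshold simply reuses the 250-trap figure from the patch description. Treating an unresponsive peer SMA as "cannot verify, so disable" is an assumption on my part.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold; reuses the 250-trap figure from the patch description. */
#define BABBLING_TRAP_THRESHOLD 250

/* Hypothetical, simplified view of one end of a link (not an OpenSM type). */
typedef struct {
    uint8_t op_vls;       /* OperationalVLs value as read from PortInfo */
    bool    sma_responds; /* false if the SMA did not answer the query */
} port_view_t;

typedef enum {
    ACTION_NONE,         /* below the threshold: keep counting */
    ACTION_FIX_OPVL,     /* both ends readable and their OpVLs differ */
    ACTION_DISABLE_LINK  /* OpVLs match or cannot be verified: disable */
} babble_action_t;

/* Decide what to do after 'consecutive_traps' traps in a row on this link. */
babble_action_t
babble_decide(const port_view_t *port, const port_view_t *peer,
              unsigned consecutive_traps)
{
    if (consecutive_traps < BABBLING_TRAP_THRESHOLD)
        return ACTION_NONE;

    /* An OpVL mismatch can only be established if both SMAs answered. */
    if (port->sma_responds && peer->sma_responds &&
        port->op_vls != peer->op_vls)
        return ACTION_FIX_OPVL;

    /* Assumption: an unverifiable peer is handled like a verified one. */
    return ACTION_DISABLE_LINK;
}
```

With a peer that never answers, this always ends in ACTION_DISABLE_LINK once the threshold is hit, which is the case Hal describes below.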

-----Original Message-----
From: Hal Rosenstock [mailto:[EMAIL PROTECTED]
Sent: Monday, July 09, 2007 7:58 PM
To: Amit Krig
Cc: [email protected]; Eitan Zahavi; Yevgeny Kliteynik
Subject: RE: [PATCH] OpenSM handling of "Babbling" Ports

Hi Amit,

On Mon, 2007-07-09 at 12:40, Amit Krig wrote:
> Hi Hal
>
> I was only talking on logical link == Active state.
> In this state the watchdog can bring the physical link to recovery
> state while the logical link will bounce between Active and ActiveDefer.

OK; I follow this but I'm not sure what you are saying about "applying"
it to the patch in question.

> Regarding the responsive issue, OpenSM in this scenario should move
> the logical link in the responsive side to Init state rather than
> disabling it on some threshold.

What about the other similar traps 129 and 130 ? How should they be
handled ?

> that way the watchdog will stop bringing down the link and then do the
> checks

I think the checks will still fail but this seems like it would stop the
traps from being generated (so fast).

-- Hal

> Amit
>
> -----Original Message-----
> From: Hal Rosenstock [mailto:[EMAIL PROTECTED]
> Sent: Monday, July 09, 2007 4:43 PM
> To: Amit Krig
> Cc: [email protected]; Eitan Zahavi; Yevgeny Kliteynik
> Subject: RE: [PATCH] OpenSM handling of "Babbling" Ports
>
> Hi Amit,
>
> On Mon, 2007-07-09 at 09:27, Amit Krig wrote:
> > Hi Hal,
> >
> > In such case OpenSM should first check that the OPVL fields of the
> > ports (the one that sends the traps and its peer) are identical, If
> > you have a mismatch in the OPVL field, the link watchdog mechanism
> > will retrain the logical link in high rate
>
> OpVLs only takes "effect" if set after link active only if the link is
> bounced (not if it stays active).
>
> Also and more significantly, in terms of the specific issue, the peer
> SMA is often non responsive or shortly becomes non responsive so the
> peer OpVLs cannot readily be verified post this being detected.
>
> -- Hal
>
> > Amit
> >
> >
> > -----Original Message-----
> > From: Hal Rosenstock [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, July 05, 2007 3:58 PM
> > To: [email protected]
> > Cc: Eitan Zahavi; Yevgeny Kliteynik
> > Subject: [PATCH] OpenSM handling of "Babbling" Ports
> >
> > A "babbling" port is a port which causes traps to be generated
> > frequently.
> > It may directly be "this" port which generates the traps or the peer
> > port detecting the issue and that the SMA on switch port 0 generates
> > the traps.
> > This has only currently been observed for trap 131 but will also
> > apply for traps 129 and 130 as well which are other urgent and
> > similar traps.
> >
> > Note that there appears to be a bug in Mellanox firmware for both
> > Anafa-2 and Tavor at a minimum which causes the max trap rate not to
> > be adhered to and relief for this does not appear to be in short
> > term sight.
> >
> > Policy
> > When a babbling port is detected, OpenSM will disable the port or
> > its peer switch port (depending on which trap) which should
> > terminate the trap storm.
> >
> > Detection
> > 250 consecutive traps of this type will be used as the (initial)
> > threshold. The reason for this is so as to not prematurely detect
> > this and disable a port.
> >
> > Recovery
> > Admin would reenable port when OK again. (This usually involves
> > rebooting the node causing the trap to be indicated.)
> >
> > Signed-off-by: Hal Rosenstock <[EMAIL PROTECTED]>
> >
> > diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h
> > index bedd63f..1150703 100644
> > --- a/opensm/include/opensm/osm_subnet.h
> > +++ b/opensm/include/opensm/osm_subnet.h
> > @@ -286,6 +286,7 @@ typedef struct _osm_subn_opt
> >    boolean_t honor_guid2lid_file;
> >    boolean_t daemon;
> >    boolean_t sm_inactive;
> > +  boolean_t babbling_port_policy;
> >    osm_qos_options_t qos_options;
> >    osm_qos_options_t qos_ca_options;
> >    osm_qos_options_t qos_sw0_options;
> > @@ -487,6 +488,9 @@ typedef struct _osm_subn_opt
> >  *
> >  * sm_inactive
> >  *   OpenSM will start with SM in not active state.
> > +*
> > +* babbling_port_policy
> > +*   OpenSM will enforce its "babbling" port policy.
> >  *
> >  * perfmgr
> >  *   Enable or disable the performance manager
> > diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
> > index 726b665..87b71e5 100644
> > --- a/opensm/opensm/osm_subnet.c
> > +++ b/opensm/opensm/osm_subnet.c
> > @@ -472,6 +472,7 @@ osm_subn_set_default_opt(
> >    p_opt->honor_guid2lid_file = FALSE;
> >    p_opt->daemon = FALSE;
> >    p_opt->sm_inactive = FALSE;
> > +  p_opt->babbling_port_policy = FALSE;
> >  #ifdef ENABLE_OSM_PERF_MGR
> >    p_opt->perfmgr = FALSE;
> >    p_opt->perfmgr_sweep_time_s = OSM_PERFMGR_DEFAULT_SWEEP_TIME_S;
> > @@ -1358,6 +1359,10 @@ osm_subn_parse_conf_file(
> >          "sm_inactive",
> >          p_key, p_val, &p_opts->sm_inactive);
> >
> > +      __osm_subn_opts_unpack_boolean(
> > +        "babbling_port_policy",
> > +        p_key, p_val, &p_opts->babbling_port_policy);
> > +
> >  #ifdef ENABLE_OSM_PERF_MGR
> >        __osm_subn_opts_unpack_boolean(
> >          "perfmgr",
> > @@ -1631,9 +1636,12 @@ osm_subn_write_conf_file(
> >      "# Daemon mode\n"
> >      "daemon %s\n\n"
> >      "# SM Inactive\n"
> > -    "sm_inactive %s\n\n",
> > +    "sm_inactive %s\n\n"
> > +    "# Babbling Port Policy\n"
> > +    "babbling_port_policy %s\n\n",
> >      p_opts->daemon ? "TRUE" : "FALSE",
> > -    p_opts->sm_inactive ? "TRUE" : "FALSE"
> > +    p_opts->sm_inactive ? "TRUE" : "FALSE",
> > +    p_opts->babbling_port_policy ? "TRUE" : "FALSE"
> >      );
> >
> >  #ifdef ENABLE_OSM_PERF_MGR
> > diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c
> > index 5900c51..fbb6dac 100644
> > --- a/opensm/opensm/osm_trap_rcv.c
> > +++ b/opensm/opensm/osm_trap_rcv.c
> > @@ -1,5 +1,5 @@
> >  /*
> > - * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved.
> > + * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved.
> >   * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved.
> >   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> >   *
> > @@ -548,6 +548,61 @@ __osm_trap_rcv_process_request(
> >        }
> >        else
> >        {
> > +        /* When babbling port policy option is enabled and
> > +           Threshold for disabling a "babbling" port is exceeded */
> > +        if ( p_rcv->p_subn->opt.babbling_port_policy &&
> > +             num_received >= 250 )
> > +        {
> > +          uint8_t payload[IB_SMP_DATA_SIZE];
> > +          ib_port_info_t* p_pi = (ib_port_info_t*)payload;
> > +          const ib_port_info_t* p_old_pi;
> > +          osm_madw_context_t context;
> > +
> > +          /* If trap 131, might want to disable peer port if available */
> > +          /* but peer port has been observed not to respond to SM requests */
> > +
> > +          osm_log( p_rcv->p_log, OSM_LOG_ERROR,
> > +                   "__osm_trap_rcv_process_request: ERR 3810: "
> > +                   " Disabling physical port lid:0x%02X num:%u\n",
> > +                   cl_ntoh16(p_ntci->data_details.ntc_129_131.lid),
> > +                   p_ntci->data_details.ntc_129_131.port_num
> > +                   );
> > +
> > +          p_old_pi = &p_physp->port_info;
> > +          memcpy( payload, p_old_pi, sizeof(ib_port_info_t) );
> > +
> > +          /* Set port to disabled/down */
> > +          ib_port_info_set_port_state( p_pi, IB_LINK_DOWN );
> > +          ib_port_info_set_port_phys_state( IB_PORT_PHYS_STATE_DISABLED, p_pi );
> > +
> > +          context.pi_context.node_guid = osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) );
> > +          context.pi_context.port_guid = osm_physp_get_port_guid( p_physp );
> > +          context.pi_context.set_method = TRUE;
> > +          context.pi_context.update_master_sm_base_lid = FALSE;
> > +          context.pi_context.light_sweep = FALSE;
> > +          context.pi_context.active_transition = FALSE;
> > +
> > +          status = osm_req_set( &p_rcv->p_subn->p_osm->sm.req,
> > +                                osm_physp_get_dr_path_ptr( p_physp ),
> > +                                payload,
> > +                                sizeof(payload),
> > +                                IB_MAD_ATTR_PORT_INFO,
> > +                                cl_hton32(osm_physp_get_port_num( p_physp )),
> > +                                CL_DISP_MSGID_NONE,
> > +                                &context );
> > +
> > +          if( status == IB_SUCCESS )
> > +          {
> > +            goto Exit;
> > +          }
> > +          else
> > +          {
> > +            osm_log( p_rcv->p_log, OSM_LOG_ERROR,
> > +                     "__osm_trap_rcv_process_request: ERR 3811: "
> > +                     "Request to set PortInfo failed\n" );
> > +          }
> > +        }
> > +
> >          osm_log( p_rcv->p_log, OSM_LOG_VERBOSE,
> >                   "__osm_trap_rcv_process_request: "
> >                   "Marking unhealthy physical port by lid:0x%02X num:%u\n",
> >
>

_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
