Hector,

On 12/14/2011 1:41 PM, Hector Abrach wrote:
> Hal,
>
> Sorry for the multiple emails, but I was thinking how it may be a
> "freeze/stall" rather than a time out. One reason is that it doesn't
> send an error message; it is as if the log completely dies.

So nothing interesting in the log...

> However, in
> file osm_vendor_ibumad.c under function umad_receiver there is an
> infinite loop "for(;;)" which seems to die when I get to that previously
> discussed vl15_poller. I checked to see if it breaks out of the loop but
> it doesn't seem to.

It never breaks out of that loop except when OpenSM is shutting down.
That's the basic receive loop.

-- Hal

> I'm not sure if this may be an additional hint.
> Thank you
>
> Hector Abrach
>
>
> From: Hector Abrach <[email protected]>
> To: Hal Rosenstock <[email protected]>
> Cc: [email protected]
> Date: 12/14/2011 11:15 AM
> Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem
> Sent by: [email protected]
>
> ------------------------------------------------------------------------
>
> Hal,
>
> Thank you very much for the support, I am the same person from the gmail
> account so I will respond through here.
>
> Attached is a picture of the switch serial number:
>
> I am indeed using OFED 1.5.4-rc3. My experiment consists of a 7 server
> system which I reboot via a script over and over again. Technically
> speaking the switch is not being powered off or physically rebooted. My
> server system is what is being rebooted. I am running OpenSM on one of
> the 7 servers. This means I'm constantly shutting down and rebooting
> OpenSM. I am running OpenSM on QNX but we have not had this problem
> until we decided to upgrade to this switch.
>
> The problem is that in roughly 1 out of every 15 of these remote reboots
> OpenSM stalls or times out because stats->qp0_mads_outstanding did not
> reach zero. Please excuse my ignorance as I'm relatively new at this but
> how do I verify if it is a timeout problem vs a stall?
>
> You also mentioned that you'd like to see the Verbose output of OpenSM;
> however, when I run in Verbose mode I don't see the problem.
> It appears as if the verbose output stalls long enough to give the switch
> time to do whatever it needs to do and hence not have the problem occur.
> But this is the last I see when the problem occurs:
>
> -------------------------------------------------
> OpenSM 3.3.12
> Command Line Arguments:
> Log file max size is 5 MBytes
> Log File: /tmp/opensm.log
> -------------------------------------------------
> OpenSM 3.3.12
>
> Entering DISCOVERING state
>
> Using default GUID 0x2c9020023277d
>
> The problem occurs in function osm_vl15intf.c -> vl15_poller in the else
> statement.
>
> if (p_madw != (osm_madw_t *) cl_qlist_end(p_fifo)) {
>         OSM_LOG(p_vl->p_log, OSM_LOG_DEBUG,
>                 "Servicing p_madw = %p\n", p_madw);
>         if (osm_log_is_active(p_vl->p_log, OSM_LOG_FRAMES))
>                 osm_dump_dr_smp(p_vl->p_log,
>                                 osm_madw_get_smp_ptr(p_madw),
>                                 OSM_LOG_FRAMES);
>
>         vl15_send_mad(p_vl, p_madw);
> } else
>         /*
>            The VL15 FIFO is empty, so we have nothing left to do.
>          */
>         status = cl_event_wait_on(&p_vl->signal,
>                                   EVENT_NO_TIMEOUT, TRUE);
>
> It won't move forward from the cl_event_wait_on in this line of code.
> However, there are other locations, such as wait_for_pending_transactions
> in the do_sweep function, that it won't move forward from either. But I
> believe this to be a side effect of the problem I'm mentioning.
>
> When you mention what is my timeout, I'm guessing you refer to
> max_smps_timeout which is used in the second while loop within
> vl15_poller? For this setting I am using the default which is defined in
> osm_subnet.c as:
>
> p_opt->transaction_timeout = OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC;
> p_opt->transaction_retries = OSM_DEFAULT_RETRY_COUNT;
> p_opt->max_smps_timeout = 1000 * p_opt->transaction_timeout
>         * p_opt->transaction_retries;
>
> Would you explain to me what are the advantages or disadvantages of
> OSM_DEFAULT_SMP_MAX_ON_WIRE? Does this parameter change my bandwidth
> performance at all?
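
For reference, the max_smps_timeout expression quoted above can be worked
through as plain arithmetic. This is only a sketch: the two default values
below (200 ms and 3 retries) are assumptions that are not stated anywhere
in this thread, and the reading of the factor 1000 as a millisecond to
microsecond conversion is likewise an assumption. The macro and function
names are hypothetical; check osm_base.h in your own tree for the real
defaults.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMED values, not taken from the thread; verify against your tree. */
#define ASSUMED_TRANS_TIMEOUT_MILLISEC 200u
#define ASSUMED_RETRY_COUNT 3u

/*
 * Hypothetical helper mirroring the expression quoted from osm_subnet.c:
 *   max_smps_timeout = 1000 * transaction_timeout * transaction_retries
 * The factor 1000 is assumed to convert milliseconds to microseconds.
 */
static uint32_t max_smps_timeout_usec(uint32_t transaction_timeout_ms,
                                      uint32_t transaction_retries)
{
    assert(transaction_retries > 0);
    return 1000u * transaction_timeout_ms * transaction_retries;
}
```

Under those assumed defaults the result is 600000, i.e. about 0.6 seconds
if the unit really is microseconds.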
>
> I noticed that when using the default setting of 4 I get into the else
> of the above if statement when there are 4 qp0_mads_outstanding. I
> noticed that if I change OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 I don't get
> the failure I'm mentioning at all. Partly (I think) because I don't
> enter the else in the if statement until there is 1 qp0_mads_outstanding.
>
> I hope this explains the problem well enough and it may be a time out
> problem but I'd like to understand why the problem is occurring.
> Thank you very much,
>
> Hector Abrach
>
> From: Hal Rosenstock <[email protected]>
> To: Hector Abrach <[email protected]>
> Cc: [email protected]
> Date: 12/14/2011 08:03 AM
> Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem
>
> ------------------------------------------------------------------------
>
> Hi,
>
> On 12/13/2011 2:35 PM, Hector Abrach wrote:
>> Hello,
>>
>> I have a boot problem with OpenSM
>
> Are you saying the switch is booted rather than OpenSM ?
>
> What is the OpenSM running on and in what environment ?
>
>> the problem occurs seldom and
>> started to occur when we started using a new Mellanox MT1118X03342 switch.
>> The problem occurs during the discovery phase within
>> state_mgr_sweep_hop_1.
>>
>> However, I discovered that the actual location is because the
>> qp0_mads_outstanding stalls at 1 occasionally.
>
> Is it stuck or after timeout/retry does this get updated properly ?
>
>> Within file osm_vl15intf.c in function vl15_poller it checks the
>> rfifo and if the qlist still has items it applies function vl15_send_mad
>> which later on triggers the signal.
>> With the current default setting of 4 for OSM_DEFAULT_SMP_MAX_ON_WIRE I
>> noticed that cl_qlist_end reaches zero before
>> stats->qp0_mads_outstanding does. This causes a stall in
>> cl_event_wait_on. The rfifo always reaches 0 when there are 4
>> qp0_mads_outstanding; however, when it fails it always fails when there is
>> 1 qp0_mads_outstanding.
>
> Is some (request) SMP that OpenSM sent timing out (not being responded to) ?
>
>> Have you seen this failure? By the way, I see this failure once every 15
>> reboots approximately.
>>
>> I discovered that changing OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 fixes the
>> problem.
>
> What do you mean exactly by fixes the problem ? I'm not sure I
> understand what the problem is yet.
>
> -- Hal
>
>> My guess is that there is a race condition when the switch sends 4 SMPs
>> in parallel. Also, this failure only appears to occur at reboot. Another
>> solution, which is not acceptable, is that when I add a delay in the
>> process the failure goes away. It is as if the switch needed more time
>> to do something.
>>
>> I would really appreciate your help and insight.
>> Thank you
>>
>> Hector Abrach
>> ______________________________________________________________________
>> This email has been scanned by the Symantec Email Security.cloud service.
>> For more information please visit http://www.symanteccloud.com
>> ______________________________________________________________________
>>
>> _______________________________________________
>> ewg mailing list
>> [email protected]
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
>
> [attachment "2011-12-13_10-18-25_182.jpg" deleted by Hector Abrach/Software/TMRU]

_______________________________________________
ewg mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
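
The "stuck versus updated after timeout/retry" distinction asked about in
the thread can be illustrated with a small toy model. This is a sketch
under stated assumptions, not OpenSM code: toy_sweep and all of its
parameters are hypothetical names, the model is single-threaded, and it
simply assumes that exactly one request (index "lost") never gets a
response. It says nothing about why that would happen on a real fabric.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model only (hypothetical, not OpenSM's implementation).
 *   total         requests queued at start (stands in for the VL15 FIFO)
 *   max_on_wire   cap on unanswered requests (cf. OSM_DEFAULT_SMP_MAX_ON_WIRE)
 *   lost          index of the one request that is never answered, or -1
 *   timeout_fires whether a transaction timeout retires the lost request
 * Returns the outstanding count once no further progress is possible
 * (cf. qp0_mads_outstanding).
 */
static int toy_sweep(int total, int max_on_wire, int lost, bool timeout_fires)
{
    int queued = total;
    int next_id = 0;
    int outstanding = 0;
    bool lost_in_flight = false;

    assert(total >= 0 && max_on_wire > 0);

    for (;;) {
        /* Send phase: drain the queue while the cap allows it. */
        while (queued > 0 && outstanding < max_on_wire) {
            if (next_id == lost)
                lost_in_flight = true;
            next_id++;
            queued--;
            outstanding++;
        }

        /* Completion phase: everything in flight except the lost one. */
        int answered = outstanding - (lost_in_flight ? 1 : 0);
        if (lost_in_flight && timeout_fires) {
            answered++;             /* the timeout path retires it */
            lost_in_flight = false;
        }
        if (answered == 0)
            break;                  /* nothing will ever wake the poller */
        outstanding -= answered;
    }
    return outstanding;
}
```

In this model, toy_sweep(8, 4, 5, false) ends with an empty queue and one
request still outstanding, which matches the reported symptom, while
toy_sweep(8, 4, 5, true) drains to zero. If the real counter never drops
even after the transaction timeout should have expired, that would point
toward a stall rather than an ordinary timeout.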
