Hector,

On 12/15/2011 12:49 PM, Hector Abrach wrote:
> Hal,
>
> Thank you for the response. To address your questions:
>
>> So the switch stays up and the servers (including the one OpenSM is on)
>> is rebooted, right ?
>
> Right.
>
>> Do the servers run QNX rather than Linux ? Are you saying all OpenSM
>> code is the same as stock OpenSM 3.3.12 (OFED 1.5.4-rc3) ?
>
> Yes, all 7 servers run QNX. The OpenSM code is 99% the same, the only
> changes I had to make were made to some #define libraries.
> The big changes were made for the driver, not so much OpenSM.

I would think there are also changes for porting of complib to QNX. Do
you use osm_vendor_ibumad.c as the OpenSM vendor layer ?

> I'm using IBNet 1.3.

What's IBNet 1.3 ? I'm not familiar with that.

> OpenSM always runs on the same one server, the others don't
> run it.

Understood.

>> Is the topology the 7 servers and the 1 switch and if you use other
>> switches you don't see this issue ?
>
> That's correct, the topology is 7 servers and 1 switch. We typically use
> fewer servers (4) for our application but the problem is more easily
> reproducible with more servers so we have a 7 server setup with 1
> switch. We don't have a great selection of switches but I know our
> previous switch did not cause this problem. Our intention is to go to
> production with this new switch but we can't release until we find an
> acceptable solution.
>
>> I can see the responses but not the requests. What verbosity level did
>> you use ?
>
> I ran OpenSM with level -D 0x06 (error, info, verbose). I don't want to
> do -D 0xFF because I know this fixes the problem for sure.

I think -D 0x23 (error, info, frames) would do the trick...

> -------------------------
>
> In summary:
> 1. knowing that the system gets stuck for osm_vendor_ibumad.c ->
> umad_receiver() -> "for(;;)" but keeps running properly for function
> main.c -> osm_manager_loop().
> 2. If I use -D 0xFF the problem is completely fixed
> 3. if I use OSM_DEFAULT_SMP_MAX_ON_WIRE of 1 instead of any other
> value the problem is completely fixed
> 4. The failure always occurs with qp0_mads_outstanding of 1
> remaining
> what do you think could be wrong?
> Do you think the driver could be the problem?

Yes; the likely suspect, which may be missing and causing this issue, is
the timeout/retry code for MAD transactions (built in to the kernel MAD
layer in Linux), which triggers a send error (callback) if the
timeout/retries are exhausted. Is that implemented ?
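
To make the timeout/retry question concrete, here is a minimal sketch of
the kind of per-transaction bookkeeping meant above. It is not the Linux
kernel MAD code and not anything from OpenSM or libibumad: mad_txn,
txn_start, txn_tick, send_fn and simulate_lost_response are hypothetical
names, and the timeout and retry values are only illustrative. The point
is that a request whose response never arrives must eventually end in a
send error, otherwise nothing ever decrements the outstanding count.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout; an illustration, not a real MAD API. */

enum txn_result { TXN_PENDING, TXN_RESENT, TXN_SEND_ERROR };

struct mad_txn {
	uint64_t tid;          /* transaction ID of the request */
	uint32_t timeout_ms;   /* per-attempt response timeout */
	uint32_t retries_left; /* resends still allowed */
	uint64_t deadline_ms;  /* absolute expiry of the current attempt */
	bool done;             /* response arrived or error was reported */
};

typedef void (*send_fn)(uint64_t tid);

/* Send the request once and arm the first timeout. */
static void txn_start(struct mad_txn *t, uint64_t tid, uint32_t timeout_ms,
		      uint32_t retries, uint64_t now_ms, send_fn send)
{
	t->tid = tid;
	t->timeout_ms = timeout_ms;
	t->retries_left = retries;
	t->deadline_ms = now_ms + timeout_ms;
	t->done = false;
	send(tid);
}

/* Called periodically by whichever layer owns the timers. */
static enum txn_result txn_tick(struct mad_txn *t, uint64_t now_ms,
				send_fn send)
{
	if (t->done || now_ms < t->deadline_ms)
		return TXN_PENDING;
	if (t->retries_left > 0) {
		/* Attempt timed out: resend and re-arm the timer. */
		t->retries_left--;
		t->deadline_ms = now_ms + t->timeout_ms;
		send(t->tid);
		return TXN_RESENT;
	}
	/* Retries exhausted: this is where the send error callback would
	 * run so the caller can drop its outstanding count. */
	t->done = true;
	return TXN_SEND_ERROR;
}

static unsigned sends;

static void count_send(uint64_t tid)
{
	(void)tid;
	sends++;
}

/* Simulate a request whose response is lost. Returns the number of
 * sends made before the send error is reported, or 0 if no error was
 * reported within the simulated time. */
static unsigned simulate_lost_response(uint32_t timeout_ms, uint32_t retries)
{
	struct mad_txn t;
	uint64_t limit = (uint64_t)timeout_ms * (retries + 2);
	uint64_t now;

	sends = 0;
	txn_start(&t, 1, timeout_ms, retries, 0, count_send);
	for (now = 1; now <= limit; now++)
		if (txn_tick(&t, now, count_send) == TXN_SEND_ERROR)
			return sends;
	return 0;
}
```

With a 200 ms timeout and 3 retries, simulate_lost_response(200, 3)
makes the original send plus three resends and then reports the error.
If the port has no equivalent of the TXN_SEND_ERROR step, a single lost
response would leave one MAD outstanding forever.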
However, I don't have a good explanation for why you see this now and
not before with your other switches but maybe that's not important.

> What debug command should I use to see the sent requests?

See above.

-- Hal

> Thank you
>
> Hector Abrach
>
> From: Hal Rosenstock <[email protected]>
> To: Hector Abrach <[email protected]>
> Cc: [email protected]
> Date: 12/14/2011 08:23 PM
> Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem
>
> ------------------------------------------------------------------------
>
> Hector,
>
> On 12/14/2011 1:41 PM, Hector Abrach wrote:
>> Hal,
>>
>> Sorry for the multiple emails, but I was thinking how it may be a
>> "freeze /stall" rather than a time out. One reason is that it doesn't
>> send an error message, is as if the log completely dies.
>
> So nothing interesting in the log...
>
>> However, in
>> file osm_vendor_ibumad.c under function umad_receiver there is an
>> infinite loop "for(;;)" which seems to die when I get to that previously
>> discussed vl15_poller. I checked to see if it breaks out of the loop but
>> it doesn't seem to.
>
> It never breaks out of that loop except when OpenSM is shutting down.
> That's the basic receive loop.
>
> -- Hal
>
>> I'm not sure if this may be an additional hint.
>> Thank you
>>
>> Hector Abrach
>>
>> From: Hector Abrach <[email protected]>
>> To: Hal Rosenstock <[email protected]>
>> Cc: [email protected]
>> Date: 12/14/2011 11:15 AM
>> Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem
>> Sent by: [email protected]
>>
>> ------------------------------------------------------------------------
>>
>> Hal,
>>
>> Thank you very much for the support, I am the same person from the gmail
>> account so I will respond through here.
>>
>> Attached is a picture of the switch serial number:
>>
>> I am indeed using OFED 1.5.4-rc3. My experiment consists of a 7 server
>> system which I reboot via a script over and over again.
>> Technically speaking the switch is not being powered off or physically
>> rebooted. My server system is what is being rebooted. I am running
>> OpenSM on one of the 7 servers. This means I'm constantly shutting down
>> and rebooting OpenSM. I am running OpenSM on QNX but we have not had
>> this problem until we decided to upgrade to this switch.
>>
>> The problem is that in roughly 1 out of every 15 of these remote
>> reboots OpenSM stalls or times out because
>> stats->qp0_mads_outstanding did not reach zero. Please excuse my
>> ignorance as I'm relatively new at this but how do I verify if it is a
>> timeout problem vs a stall?
>>
>> You also mentioned that you'd like to see the Verbose output of OpenSM;
>> however, when I run in Verbose mode I don't see the problem. It appears
>> as if the verbose output stalls enough time to give the switch time to
>> do whatever it needs to do and hence not have the problem occur. But
>> this is the last I see when the problem occurs:
>>
>> -------------------------------------------------
>> OpenSM 3.3.12
>> Command Line Arguments:
>> Log file max size is 5 MBytes
>> Log File: /tmp/opensm.log
>> -------------------------------------------------
>> OpenSM 3.3.12
>>
>> Entering DISCOVERING state
>>
>> Using default GUID 0x2c9020023277d
>>
>> The problem occurs in function osm_vl15intf.c -> vl15_poller in the else
>> statement.
>>
>> if (p_madw != (osm_madw_t *) cl_qlist_end(p_fifo)) {
>>         OSM_LOG(p_vl->p_log, OSM_LOG_DEBUG,
>>                 "Servicing p_madw = %p\n", p_madw);
>>         if (osm_log_is_active(p_vl->p_log, OSM_LOG_FRAMES))
>>                 osm_dump_dr_smp(p_vl->p_log,
>>                                 osm_madw_get_smp_ptr(p_madw),
>>                                 OSM_LOG_FRAMES);
>>
>>         vl15_send_mad(p_vl, p_madw);
>> } else
>>         /*
>>            The VL15 FIFO is empty, so we have nothing left to do.
>>          */
>>         status = cl_event_wait_on(&p_vl->signal,
>>                                   EVENT_NO_TIMEOUT, TRUE);
>>
>> It won't move forward from the cl_event_wait_on in this line of code.
>> However, there are other locations, such as
>> wait_for_pending_transactions in the do_sweep function, that it won't
>> move forward from either. But I believe this to be a side effect of
>> the problem I'm mentioning.
>>
>> When you ask what my timeout is, I'm guessing you refer to
>> max_smps_timeout which is used in the second while loop within
>> vl15_poller? For this setting I am using the default which is defined in
>> osm_subnet.c as:
>>
>> p_opt->transaction_timeout = OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC;
>> p_opt->transaction_retries = OSM_DEFAULT_RETRY_COUNT;
>> p_opt->max_smps_timeout = 1000 * p_opt->transaction_timeout
>>         * p_opt->transaction_retries;
>>
>> Would you explain to me what are the advantages or disadvantages of
>> OSM_DEFAULT_SMP_MAX_ON_WIRE? Does this parameter change my bandwidth
>> performance at all?
>>
>> I noticed that when using the default setting of 4 I get into the else
>> of the above if statement when there are 4 qp0_mads_outstanding. I
>> noticed that if I change OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 I don't get
>> the failure I'm mentioning at all. Partly (I think) because I don't
>> enter the else in the if statement until there is 1 qp0_mads_outstanding.
>>
>> I hope this explains the problem well enough and it may be a time out
>> problem but I'd like to understand why the problem is occurring.
>> Thank you very much,
>>
>> Hector Abrach
>>
>> From: Hal Rosenstock <[email protected]>
>> To: Hector Abrach <[email protected]>
>> Cc: [email protected]
>> Date: 12/14/2011 08:03 AM
>> Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem
>>
>> ------------------------------------------------------------------------
>>
>> Hi,
>>
>> On 12/13/2011 2:35 PM, Hector Abrach wrote:
>>> Hello,
>>>
>>> I have a boot problem with OpenSM
>>
>> Are you saying the switch is booted rather than OpenSM ?
>>
>> What is the OpenSM running on and in what environment ?
>>
>>> the problem occurs seldom and
>>> started to occur when we started using a new Mellanox MT1118X03342 switch.
>>> The problem occurs during the discovery phase within
>>> state_mgr_sweep_hop_1.
>>>
>>> However, I discovered that the actual cause is that
>>> qp0_mads_outstanding occasionally stalls at 1.
>>
>> Is it stuck or after timeout/retry does this get updated properly ?
>>
>>> Within file osm_vl15intf.c in function vl15_poller it checks the
>>> rfifo and if the qlist still has items it applies function vl15_send_mad
>>> which later on triggers the signal.
>>> With the current default setting of 4 for OSM_DEFAULT_SMP_MAX_ON_WIRE I
>>> noticed that cl_qlist_end reaches zero before
>>> stats->qp0_mads_outstanding does. This causes a stall in
>>> cl_event_wait_on. The rfifo always reaches 0 when there are 4
>>> qp0_mads_outstanding; however, when it fails it always fails when there
>>> is 1 qp0_mads_outstanding.
>>
>> Is some (request) SMP that OpenSM sent timing out (not being responded
>> to) ?
>>
>>> Have you seen this failure? By the way, I see this failure once every 15
>>> reboots approximately.
>>>
>>> I discovered that changing OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 fixes the
>>> problem.
>>
>> What do you mean exactly by fixes the problem ? I'm not sure I
>> understand what the problem is yet.
>>
>> -- Hal
>>
>>> My guess is that there is a race condition when the switch sends 4 SMPs
>>> in parallel. Also, this failure only appears to occur at reboot. Another
>>> solution, which is not acceptable, is that when I add a delay in the
>>> process the failure goes away. It is as if the switch needed more time
>>> to do something.
>>>
>>> I would really appreciate your help and insight.
>>> Thank you
>>>
>>> Hector Abrach
>>> ______________________________________________________________________
>>> This email has been scanned by the Symantec Email Security.cloud service.
>>> For more information please visit http://www.symanteccloud.com
>>> ______________________________________________________________________
>>>
>>> _______________________________________________
>>> ewg mailing list
>>> [email protected]
>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
>>
>> [attachment "2011-12-13_10-18-25_182.jpg" deleted by Hector Abrach/Software/TMRU]
