Hi Hector,

A few more questions. Does this happen only when you try to shut down OpenSM on reboot? What is the host CPU architecture: x86, x86_64, or ppc?
> -----Original Message-----
> From: [email protected] [mailto:ewg-[email protected]] On Behalf Of Hal Rosenstock
> Sent: Thursday, December 15, 2011 9:06 PM
> To: Hector Abrach
> Cc: [email protected]
> Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem
>
> Hector,
>
> On 12/15/2011 12:49 PM, Hector Abrach wrote:
> > Hal,
> >
> > Thank you for the response. To address your questions:
> >
> >> So the switch stays up and the servers (including the one OpenSM is
> >> on) are rebooted, right ?
> >
> > Right.
> >
> >> Do the servers run QNX rather than Linux ? Are you saying all OpenSM
> >> code is the same as stock OpenSM 3.3.12 (OFED 1.5.4-rc3) ?
> >
> > Yes, all 7 servers run QNX. The OpenSM code is 99% the same, the only
> > changes I had to make were made to some #define libraries.
> > The big changes were made for the driver, not so much OpenSM.
>
> I would think there are also changes for porting of complib to QNX. Do you
> use osm_vendor_ibumad.c as the OpenSM vendor layer ?
>
> > I'm using IBNet 1.3.
>
> What's IBNet 1.3 ? I'm not familiar with that.
>
> > OpenSM always runs on the same one server, the others don't run it.
>
> Understood.
>
> >> Is the topology the 7 servers and the 1 switch and if you use other
> >> switches you don't see this issue ?
> >
> > That's correct, the topology is 7 servers and 1 switch. We typically
> > use fewer servers (4) for our application but the problem is more
> > easily reproducible with more servers so we have a 7 server setup with
> > 1 switch. We don't have a great selection of switches but I know our
> > previous switch did not cause this problem. Our intention is to go to
> > production with this new switch but we can't release until we find an
> > acceptable solution.
> >
> >> I can see the responses but not the requests. What verbosity level did
> >> you use ?
> >
> > I ran OpenSM with level -D 0x06 (error, info, verbose). I don't want
> > to do -D 0xFF because I know this fixes the problem for sure.
>
> I think -D 0x23 (error, info, frames) would do the trick...
>
> > -------------------------
> >
> > In summary:
> > 1. Knowing that the system gets stuck for osm_vendor_ibumad.c ->
> > umad_receiver() -> "for(;;)" but keeps running properly for function
> > main.c -> osm_manager_loop().
> > 2. If I use -D 0xFF the problem is completely fixed
> > 3. If I use OSM_DEFAULT_SMP_MAX_ON_WIRE of 1 instead of any other
> > value the problem is completely fixed
> > 4. The failure always occurs with qp0_mads_outstanding of 1
> > remaining
> > What do you think could be wrong?
> > Do you think the driver could be the problem?
>
> Yes; The thing that I think is a likely suspect and may be missing and causing
> this issue is the (built in to kernel MAD in Linux) timeout retry code for MAD
> transactions which if the timeout/retries are exhausted triggers a send error
> (callback). Is that implemented ?
>
> However, I don't have a good explanation for why you see this now and not
> before with your other switches but maybe that's not important.
>
> > What debug command should I use to see the sent requests?
>
> See above.
>
> -- Hal
>
> > Thank you
> >
> > Hector Abrach
> >
> >
> > From: Hal Rosenstock <[email protected]>
> > To: Hector Abrach <[email protected]>
> > Cc: [email protected]
> > Date: 12/14/2011 08:23 PM
> > Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem
> >
> > ------------------------------------------------------------------------
> >
> > Hector,
> >
> > On 12/14/2011 1:41 PM, Hector Abrach wrote:
> >> Hal,
> >>
> >> Sorry for the multiple emails, but I was thinking how it may be a
> >> "freeze/stall" rather than a time out. One reason is that it
> >> doesn't send an error message; it is as if the log completely dies.
> >
> > So nothing interesting in the log...
> >
> >> However, in
> >> file osm_vendor_ibumad.c under function umad_receiver there is an
> >> infinite loop "for(;;)" which seems to die when I get to that
> >> previously discussed vl15_poller. I checked to see if it breaks out
> >> of the loop but it doesn't seem to.
> >
> > It never breaks out of that loop except when OpenSM is shutting down.
> > That's the basic receive loop.
> >
> > -- Hal
> >
> >> I'm not sure if this may be an additional hint.
> >> Thank you
> >>
> >> Hector Abrach
> >>
> >>
> >> From: Hector Abrach <[email protected]>
> >> To: Hal Rosenstock <[email protected]>
> >> Cc: [email protected]
> >> Date: 12/14/2011 11:15 AM
> >> Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem
> >> Sent by: [email protected]
> >>
> >> ------------------------------------------------------------------------
> >>
> >> Hal,
> >>
> >> Thank you very much for the support. I am the same person from the
> >> gmail account so I will respond through here.
> >>
> >> Attached is a picture of the switch serial number:
> >>
> >> I am indeed using OFED 1.5.4-rc3. My experiment consists of a 7
> >> server system which I reboot via a script over and over again.
> >> Technically speaking the switch is not being powered off or
> >> physically rebooted. My server system is what is being rebooted. I am
> >> running OpenSM on one of the 7 servers. This means I'm constantly
> >> shutting down and rebooting OpenSM. I am running OpenSM on QNX but we
> >> have not had this problem until we decided to upgrade to this switch.
> >>
> >> The problem is that in roughly 1 out of every 15 of these remote
> >> reboots OpenSM stalls or times out because
> >> stats->qp0_mads_outstanding did not reach zero. Please excuse my
> >> ignorance as I'm relatively new at this but how do I verify if it is
> >> a timeout problem vs a stall?
> >>
> >> You also mentioned that you'd like to see the Verbose output of
> >> OpenSM; however, when I run in Verbose mode I don't see the problem.
> >> It appears as if the verbose output stalls long enough to give the
> >> switch time to do whatever it needs to do and hence not have the
> >> problem occur. But this is the last I see when the problem occurs:
> >>
> >> -------------------------------------------------
> >> OpenSM 3.3.12
> >> Command Line Arguments:
> >> Log file max size is 5 MBytes
> >> Log File: /tmp/opensm.log
> >> -------------------------------------------------
> >> OpenSM 3.3.12
> >>
> >> Entering DISCOVERING state
> >>
> >> Using default GUID 0x2c9020023277d
> >>
> >> The problem occurs in function osm_vl15intf.c -> vl15_poller in the
> >> else statement.
> >>
> >>     if (p_madw != (osm_madw_t *) cl_qlist_end(p_fifo)) {
> >>         OSM_LOG(p_vl->p_log, OSM_LOG_DEBUG,
> >>                 "Servicing p_madw = %p\n", p_madw);
> >>         if (osm_log_is_active(p_vl->p_log, OSM_LOG_FRAMES))
> >>             osm_dump_dr_smp(p_vl->p_log,
> >>                             osm_madw_get_smp_ptr(p_madw),
> >>                             OSM_LOG_FRAMES);
> >>
> >>         vl15_send_mad(p_vl, p_madw);
> >>     } else
> >>         /*
> >>            The VL15 FIFO is empty, so we have nothing left to do.
> >>          */
> >>         status = cl_event_wait_on(&p_vl->signal,
> >>                                   EVENT_NO_TIMEOUT, TRUE);
> >>
> >> It won't move forward from the cl_event_wait_on in this line of code.
> >> However, there are other locations, such as
> >> wait_for_pending_transactions in the do_sweep function, that it won't
> >> move forward from. But I believe this to be a side effect of the
> >> problem I'm mentioning.
> >>
> >> When you mention what is my timeout, I'm guessing you refer to
> >> max_smps_timeout which is used in the second while loop within
> >> vl15_poller?
> >> For this setting I am using the default which is defined
> >> in osm_subnet.c as:
> >>
> >>     p_opt->transaction_timeout = OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC;
> >>     p_opt->transaction_retries = OSM_DEFAULT_RETRY_COUNT;
> >>     p_opt->max_smps_timeout = 1000 * p_opt->transaction_timeout
> >>         * p_opt->transaction_retries;
> >>
> >> Would you explain to me what are the advantages or disadvantages of
> >> OSM_DEFAULT_SMP_MAX_ON_WIRE? Does this parameter change my bandwidth
> >> performance at all?
> >>
> >> I noticed that when using the default setting of 4 I get into the
> >> else of the above if statement when there are 4 qp0_mads_outstanding.
> >> I noticed that if I change OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 I don't
> >> get the failure I'm mentioning at all. Partly (I think) because I
> >> don't enter the else in the if statement until there is 1
> >> qp0_mads_outstanding.
> >>
> >> I hope this explains the problem well enough and it may be a time out
> >> problem but I'd like to understand why the problem is occurring.
> >> Thank you very much,
> >>
> >> Hector Abrach
> >>
> >> From: Hal Rosenstock <[email protected]>
> >> To: Hector Abrach <[email protected]>
> >> Cc: [email protected]
> >> Date: 12/14/2011 08:03 AM
> >> Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem
> >>
> >> ------------------------------------------------------------------------
> >>
> >> Hi,
> >>
> >> On 12/13/2011 2:35 PM, Hector Abrach wrote:
> >>> Hello,
> >>>
> >>> I have a boot problem with OpenSM
> >>
> >> Are you saying the switch is booted rather than OpenSM ?
> >>
> >> What is the OpenSM running on and in what environment ?
> >>
> >>> the problem occurs seldom and
> >>> started to occur when we started using a new Mellanox MT1118X03342
> >>> switch. The problem occurs during the discovery phase within
> >>> state_mgr_sweep_hop_1.
> >>>
> >>> However, I discovered that the actual location is because the
> >>> qp0_mads_outstanding stalls at 1 occasionally.
> >>
> >> Is it stuck or after timeout/retry does this get updated properly ?
> >>
> >>> Within file osm_vl15intf.c in function vl15_poller it checks the
> >>> rfifo and if the qlist still has items it applies function
> >>> vl15_send_mad which later on triggers the signal.
> >>> With the current default setting of 4 for
> >>> OSM_DEFAULT_SMP_MAX_ON_WIRE I noticed that cl_qlist_end reaches zero
> >>> before stats->qp0_mads_outstanding does. This causes a stall in
> >>> cl_event_wait_on. The rfifo always reaches 0 when there are 4
> >>> qp0_mads_outstanding; however, when it fails it always fails when
> >>> there is 1 qp0_mads_outstanding.
> >>
> >> Is some (request) SMP that OpenSM sent timing out (not being
> >> responded to) ?
> >>
> >>> Have you seen this failure? By the way, I see this failure once
> >>> every 15 reboots approximately.
> >>>
> >>> I discovered that changing OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 fixes
> >>> the problem.
> >>
> >> What do you mean exactly by fixes the problem ? I'm not sure I
> >> understand what the problem is yet.
> >>
> >> -- Hal
> >>
> >>> My guess is that there is a race condition when the switch sends 4
> >>> SMPs in parallel. Also, this failure only appears to occur at
> >>> reboot. Another solution, which is not acceptable, is to add a
> >>> delay in the process; the failure then goes away. It is as if the
> >>> switch needed more time to do something.
> >>>
> >>> I would really appreciate your help and insight.
> >>> Thank you
> >>>
> >>> Hector Abrach
> >>>
> >>> ______________________________________________________________________
> >>> This email has been scanned by the Symantec Email Security.cloud
> >>> service.
> >>> For more information please visit http://www.symanteccloud.com
> >>> ______________________________________________________________________
> >>>
> >>> _______________________________________________
> >>> ewg mailing list
> >>> [email protected]
> >>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
> >>
> >> [attachment "2011-12-13_10-18-25_182.jpg" deleted by Hector Abrach/Software/TMRU]

_______________________________________________
ewg mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
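
For reference, the timeout/retry behaviour Hal asks about in the thread (a send error callback once the timeout/retries are exhausted) could be sketched roughly as below. This is only an illustration under stated assumptions: every name in it (mad_tracker, tracker_tick, sm_stats and so on) is hypothetical and is not an OpenSM, libibumad, or QNX driver API, and the table size is arbitrary. It shows why, without such a mechanism, one lost response can leave an outstanding counter stuck at 1 forever.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* All names here are hypothetical; this is not OpenSM or driver code. */
#define MAX_PENDING 8

typedef void (*send_err_cb)(uint64_t tid, void *ctx);

struct pending_mad {
    bool in_use;
    uint64_t tid;          /* transaction ID of the request */
    uint64_t deadline_ms;  /* when the current attempt times out */
    unsigned retries_left; /* resends still allowed */
};

struct mad_tracker {
    struct pending_mad slot[MAX_PENDING];
    uint32_t timeout_ms;
    unsigned max_retries;
    unsigned resends;      /* resends issued so far (illustration only) */
    send_err_cb on_error;  /* called when retries are exhausted */
    void *ctx;
};

void tracker_init(struct mad_tracker *t, uint32_t timeout_ms,
                  unsigned max_retries, send_err_cb cb, void *ctx)
{
    assert(t != NULL && cb != NULL);
    for (size_t i = 0; i < MAX_PENDING; i++)
        t->slot[i].in_use = false;
    t->timeout_ms = timeout_ms;
    t->max_retries = max_retries;
    t->resends = 0;
    t->on_error = cb;
    t->ctx = ctx;
}

/* Record a request that was just sent. Returns false if the table is full. */
bool tracker_sent(struct mad_tracker *t, uint64_t tid, uint64_t now_ms)
{
    for (size_t i = 0; i < MAX_PENDING; i++) {
        struct pending_mad *p = &t->slot[i];
        if (!p->in_use) {
            p->in_use = true;
            p->tid = tid;
            p->deadline_ms = now_ms + t->timeout_ms;
            p->retries_left = t->max_retries;
            return true;
        }
    }
    return false;
}

/* A response arrived: stop tracking it. Returns false for unknown TIDs. */
bool tracker_response(struct mad_tracker *t, uint64_t tid)
{
    for (size_t i = 0; i < MAX_PENDING; i++) {
        if (t->slot[i].in_use && t->slot[i].tid == tid) {
            t->slot[i].in_use = false;
            return true;
        }
    }
    return false;
}

/* Call periodically: retry expired requests, or report a send error. */
void tracker_tick(struct mad_tracker *t, uint64_t now_ms)
{
    for (size_t i = 0; i < MAX_PENDING; i++) {
        struct pending_mad *p = &t->slot[i];
        if (!p->in_use || now_ms < p->deadline_ms)
            continue;
        if (p->retries_left > 0) {
            p->retries_left--;
            p->deadline_ms = now_ms + t->timeout_ms;
            t->resends++;  /* a real implementation would resend here */
        } else {
            p->in_use = false;
            t->on_error(p->tid, t->ctx);
        }
    }
}

/* Example consumer: a counter similar in spirit to qp0_mads_outstanding. */
struct sm_stats { int outstanding; int send_errors; };

void on_send_error(uint64_t tid, void *ctx)
{
    struct sm_stats *s = ctx;
    (void)tid;
    s->outstanding--;  /* lets the outstanding count drain to zero */
    s->send_errors++;
}
```

In this sketch a caller would invoke tracker_sent() for each request, tracker_response() when a reply arrives, and tracker_tick() from a timer. A request that never gets a reply is retried and then reported through on_send_error(), so the counter is decremented instead of staying at 1.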
