Re: [ewg] OpenSM 1.5.4 Boot Problem

Alex Netes Fri, 16 Dec 2011 01:16:11 -0800

Hi Hector,

A few more questions.
Does this happen only when you try to shut down OpenSM on reboot?
What is the host CPU architecture: x86, x86_64, or ppc?



> -----Original Message-----
> From: [email protected] [mailto:ewg-
> [email protected]] On Behalf Of Hal Rosenstock
> Sent: Thursday, December 15, 2011 9:06 PM
> To: Hector Abrach
> Cc: [email protected]
> Subject: Re: [ewg] OpenSM 1.5.4 Boot Problem
> 
> Hector,
> 
> On 12/15/2011 12:49 PM, Hector Abrach wrote:
> > Hal,
> >
> > Thank you for the response. To address your questions:
> >
> >> So the switch stays up and the servers (including the one OpenSM is
> >> on) are rebooted, right ?
> >
> > Right.
> >
> >> Do the servers run QNX rather than Linux ? Are you saying all OpenSM
> >> code is the same as stock OpenSM 3.3.12 (OFED 1.5.4-rc3) ?
> >
> > Yes, all 7 servers run QNX. The OpenSM code is 99% the same; the only
> > changes I had to make were to some #define libraries.
> > The big changes were made for the driver, not so much OpenSM.
> 
> I would think there are also changes for porting of complib to QNX. Do you
> use osm_vendor_ibumad.c as the OpenSM vendor layer ?
> 
> > I'm using IBNet 1.3.
> 
> What's IBNet 1.3 ? I'm not familiar with that.
> 
> > OpenSM always runs on the same one server, the others don't run it.
> 
> Understood.
> 
> >> Is the topology the 7 servers and the 1 switch and if you use other
> >> switches you don't see this issue ?
> >
> > That's correct, the topology is 7 servers and 1 switch. We typically
> > use fewer servers (4) for our application, but the problem is more
> > easily reproducible with more servers, so we have a 7-server setup with
> > 1 switch. We don't have a great selection of switches, but I know our
> > previous switch did not cause this problem. Our intention is to go to
> > production with this new switch, but we can't release until we find an
> > acceptable solution.
> >
> >> I can see the responses but not the requests. What verbosity level did
> >> you use ?
> >
> > I ran OpenSM with level -D 0x06 (error, info, verbose). I don't want
> > to use -D 0xFF because I know that makes the problem disappear.
> 
> I think -D 0x23 (error, info, frames) would do the trick...
> 
> > -------------------------
> >
> > In summary:
> > 1. The system gets stuck in osm_vendor_ibumad.c -> umad_receiver() ->
> >    "for(;;)" but keeps running properly in main.c -> osm_manager_loop().
> > 2. If I use -D 0xFF the problem is completely fixed.
> > 3. If I use an OSM_DEFAULT_SMP_MAX_ON_WIRE of 1 instead of any other
> >    value the problem is completely fixed.
> > 4. The failure always occurs with 1 qp0_mads_outstanding remaining.
> >
> > What do you think could be wrong?
> > Do you think the driver could be the problem?
> 
> Yes. A likely suspect is the timeout/retry code for MAD transactions, which
> in Linux is built into the kernel MAD layer: if the timeout and retries are
> exhausted, it triggers a send error (callback). That may be missing here
> and causing this issue. Is that implemented ?
> 
> However, I don't have a good explanation for why you see this now and not
> before with your other switches but maybe that's not important.
> 
> > What debug command should I use to see the sent requests?
> 
> See above.
> 
> -- Hal
> 
> > Thank you
> >
> > Hector Abrach
> >
> >
> >
> >
> > From:       Hal Rosenstock <[email protected]>
> > To:         Hector Abrach <[email protected]>
> > Cc:         [email protected]
> > Date:       12/14/2011 08:23 PM
> > Subject:    Re: [ewg] OpenSM 1.5.4 Boot Problem
> >
> >
> > ------------------------------------------------------------------------
> >
> >
> >
> > Hector,
> >
> > On 12/14/2011 1:41 PM, Hector Abrach wrote:
> >> Hal,
> >>
> >> Sorry for the multiple emails, but I was thinking that it may be a
> >> "freeze/stall" rather than a timeout. One reason is that it
> >> doesn't send an error message; it is as if the log completely dies.
> >
> > So nothing interesting in the log...
> >
> >> However, in
> >> file osm_vendor_ibumad.c under function umad_receiver there is an
> >> infinite loop "for(;;)" which seems to die when I get to that
> >> previously discussed vl15_poller. I checked to see if it breaks out
> >> of the loop but it doesn't seem to.
> >
> > It never breaks out of that loop except when OpenSM is shutting down.
> > That's the basic receive loop.
> >
> > -- Hal
> >
> >> I'm not sure if this may be an additional hint.
> >> Thank you
> >>
> >> Hector Abrach
> >>
> >>
> >> From:                  Hector Abrach <[email protected]>
> >> To:                  Hal Rosenstock <[email protected]>
> >> Cc:                  [email protected]
> >> Date:                  12/14/2011 11:15 AM
> >> Subject:                  Re: [ewg] OpenSM 1.5.4 Boot Problem
> >> Sent by:                  [email protected]
> >>
> >>
> >> ------------------------------------------------------------------------
> >>
> >>
> >>
> >> Hal,
> >>
> >> Thank you very much for the support, I am the same person from the
> >> gmail account so I will respond through here.
> >>
> >> Attached is a picture of the switch serial number:
> >>
> >>
> >>
> >> I am indeed using OFED 1.5.4-rc3. My experiment consists of a 7
> >> server system which I reboot via a script over and over again.
> >> Technically speaking the switch is not being powered off or
> >> physically rebooted. My server system is what is being rebooted. I am
> >> running OpenSM on one of the 7 servers. This means I'm constantly
> >> shutting down and rebooting OpenSM. I am running OpenSM on QNX, but
> >> we did not have this problem until we decided to upgrade to this switch.
> >>
> >> The problem is that in roughly 1 out of every 15 of these remote
> >> reboots, OpenSM stalls or times out because
> >> stats->qp0_mads_outstanding does not reach zero. Please excuse my
> >> ignorance, as I'm relatively new at this, but how do I verify whether
> >> it is a timeout problem or a stall?
> >>
> >> You also mentioned that you'd like to see the verbose output of
> >> OpenSM; however, when I run in verbose mode I don't see the problem.
> >> It appears as if the verbose output delays things long enough to give
> >> the switch time to do whatever it needs to do, so the problem does not
> >> occur. But this is the last output I see when the problem occurs:
> >>
> >>
> >>
> >> -------------------------------------------------
> >> OpenSM 3.3.12
> >> Command Line Arguments:
> >> Log file max size is 5 MBytes
> >> Log File: /tmp/opensm.log
> >> -------------------------------------------------
> >> OpenSM 3.3.12
> >>
> >> Entering DISCOVERING state
> >>
> >> Using default GUID 0x2c9020023277d
> >>
> >>
> >>
> >> The problem occurs in function osm_vl15intf.c -> vl15_poller in the
> >> else statement.
> >>
> >> if (p_madw != (osm_madw_t *) cl_qlist_end(p_fifo)) {
> >>        OSM_LOG(p_vl->p_log, OSM_LOG_DEBUG,
> >>                "Servicing p_madw = %p\n", p_madw);
> >>        if (osm_log_is_active(p_vl->p_log, OSM_LOG_FRAMES))
> >>                osm_dump_dr_smp(p_vl->p_log,
> >>                                osm_madw_get_smp_ptr(p_madw),
> >>                                OSM_LOG_FRAMES);
> >>
> >>        vl15_send_mad(p_vl, p_madw);
> >> } else
> >>        /*
> >>           The VL15 FIFO is empty, so we have nothing left to do.
> >>         */
> >>        status = cl_event_wait_on(&p_vl->signal,
> >>                  EVENT_NO_TIMEOUT, TRUE);
> >>
> >> It won't move forward from the cl_event_wait_on in this line of code.
> >> However, there are other locations, such as
> >> wait_for_pending_transactions in the do_sweep function, that it also
> >> won't move forward from. But I believe this to be a side effect of the
> >> problem I'm mentioning.
> >>
> >> When you ask what my timeout is, I'm guessing you are referring to
> >> max_smps_timeout, which is used in the second while loop within
> >> vl15_poller? For this setting I am using the default, which is defined
> >> in osm_subnet.c as:
> >>
> >> p_opt->transaction_timeout = OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC;
> >> p_opt->transaction_retries = OSM_DEFAULT_RETRY_COUNT;
> >> p_opt->max_smps_timeout = 1000 * p_opt->transaction_timeout *
> >>                           p_opt->transaction_retries;
> >>
> >> Would you explain to me the advantages and disadvantages of
> >> OSM_DEFAULT_SMP_MAX_ON_WIRE? Does this parameter change my
> >> bandwidth performance at all?
> >>
> >> I noticed that when using the default setting of 4 I get into the
> >> else of the above if statement when there are 4 qp0_mads_outstanding.
> >> I noticed that if I change OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 I don't
> >> get the failure I'm mentioning at all. Partly (I think) because I
> >> don't enter the else in the if statement until there is 1
> >> qp0_mads_outstanding.
> >>
> >> I hope this explains the problem well enough. It may be a timeout
> >> problem, but I'd like to understand why the problem is occurring.
> >> Thank you very much,
> >>
> >> Hector Abrach
> >>
> >> From:                 Hal Rosenstock <[email protected]>
> >> To:                 Hector Abrach <[email protected]>
> >> Cc:                 [email protected]
> >> Date:                 12/14/2011 08:03 AM
> >> Subject:                 Re: [ewg] OpenSM 1.5.4 Boot Problem
> >>
> >>
> >>
> >> ------------------------------------------------------------------------
> >>
> >>
> >>
> >> Hi,
> >>
> >> On 12/13/2011 2:35 PM, Hector Abrach wrote:
> >>> Hello,
> >>>
> >>> I have a boot problem with OpenSM
> >>
> >> Are you saying the switch is booted rather than OpenSM ?
> >>
> >> What is the OpenSM running on and in what environment ?
> >>
> >>> the problem occurs seldom and
> >>> started to occur when we started using a new Mellanox MT1118X03342
> >>> switch.
> >>> The problem occurs during the discovery phase within
> >>> state_mgr_sweep_hop_1.
> >>>
> >>> However, I discovered that the actual cause is that
> >>> qp0_mads_outstanding occasionally stalls at 1.
> >>
> >> Is it stuck or after timeout/retry does this get updated properly ?
> >>
> >>> Within file osm_vl15intf.c, function vl15_poller checks the
> >>> rfifo and, if the qlist still has items, calls
> >>> vl15_send_mad, which later on triggers the signal.
> >>> With the current default setting of 4 for
> >>> OSM_DEFAULT_SMP_MAX_ON_WIRE I noticed that the qlist becomes empty
> >>> before stats->qp0_mads_outstanding reaches zero. This causes a stall
> >>> in cl_event_wait_on. The rfifo always reaches 0 when there are 4
> >>> qp0_mads_outstanding; however, when it fails it always fails when
> >>> there is 1 qp0_mads_outstanding.
> >>
> >> Is some (request) SMP that OpenSM sent timing out (not being
> >> responded to) ?
> >>
> >>> Have you seen this failure? By the way, I see this failure once
> >>> every 15 reboots approximately.
> >>>
> >>> I discovered that changing OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 fixes
> >>> the problem.
> >>
> >> What do you mean exactly by fixes the problem ? I'm not sure I
> >> understand what the problem is yet.
> >>
> >> -- Hal
> >>
> >>> My guess is that there is a race condition when the switch sends 4
> >>> SMPs in parallel. Also, this failure only appears to occur at
> >>> reboot. Another workaround, which is not acceptable, is adding a
> >>> delay in the process, which makes the failure go away. It is as if
> >>> the switch needed more time to do something.
> >>>
> >>> I would really appreciate your help and insight.
> >>> Thank you
> >>>
> >>> Hector Abrach
> >>>
> >>> ______________________________________________________________________
> >>> This email has been scanned by the Symantec Email Security.cloud
> >>> service.
> >>> For more information please visit http://www.symanteccloud.com
> >>> ______________________________________________________________________
> >>>
> >>>
> >>> _______________________________________________
> >>> ewg mailing list
> >>> [email protected]
> >>> _http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg_
> >>
> >>
> >>
> >> [attachment "2011-12-13_10-18-25_182.jpg" deleted by Hector Abrach/Software/TMRU]
> 
