Hi,

On Tue, Feb 19, 2008 at 06:49:18PM -0600, Michael Brennen wrote:
> On Tuesday 19 February 2008, Dave Blaschke wrote:
> > Dejan Muhamedagic wrote:
> > > Hi,
> > >
> > > On Tue, Feb 19, 2008 at 03:38:46AM -0600, Michael Brennen wrote:
> > >> On Sun, 17 Feb 2008, Michael Brennen wrote:
> > >>> Heartbeat 2.1.3, crm enabled
> > >>>
> > >>> I've built an initial drbd master/slave on two systems, lvc7 and lvc8,
> > >>> following http://www.linux-ha.org/DRBD/HowTov2.  The drbd is coming
> > >>> alive in P/S mode, but it will not fail over when I kill the master;
> > >>> the slave stays in secondary.  stonith was not working, so I've decided
> > >>> to make that work in v2 (I had it working in a v1 build about a month
> > >>> ago.) .....
> > >>> Then, I am trying to define the stonith device in xml; the source to
> > >>> the apc3.xml file is attached.  I am adding from a separate command
> > >>> line:
> > >>
> > >> Just to clear this up, I found my own silly problem: I had misspelled
> > >> one of the parameters (password, not passwd) for the apcmaster stonith
> > >> device in the xml.
> > >
> > > Does that mean that the patch you posted is not necessary?
> 
> See below.
> 
> > There have been several reports of this issue over the last few years
> > where the plugin is looking for the 'Escape char' string before the user
> > name but it isn't there.  I'm pretty sure it depends on what flavor of
> > telnet is being used, but I still believe the correct patch is to look
> > for both possibilities instead of just one or the other - something like
> > the following:
> >
> > static struct Etoken EscapeChar[] =     { {"Escape character is '^]'.", 0, 0}
> >                                        ,       {"User Name :", 1, 0}
> >                                        ,       {NULL,0,0}};
> >
> >
> >
> > static int
> > MSLogin(struct pluginDevice * ms)
> > {
> >       int rc;
> >
> >        /*
> >         * Apparently some telnet apps display the escape character while
> >         * others don't, so we need to handle both possibilities...
> >         *
> >         * rc == 0 : "Escape character is '^]'." found
> >         * rc == 1 : "User Name :" found
> >         * rc <  0 : Neither found or timeout
> >         */
> >        if ((rc = StonithLookFor(ms->rdfd, EscapeChar, 10)) < 0) {
> >                return(errno == ETIMEDOUT ? S_TIMEOUT : S_OOPS);
> >        } else if (rc == 0) {
> >                /*
> >                 * We should be looking at something like this:
> >                 *      User Name :
> >                 */
> >                EXPECT(ms->rdfd, login, 10);
> >        }
> >        SEND(ms->wrfd, ms->user);
> >
> > I sent out a patch similar to this - or maybe exactly like this :-) - to
> > a couple folks to ask for verification that it worked, never got any
> > feedback so it never made it in to production.
> 
> Looking closer at the logs last night (there is a lot going on in there :) I 
> found that stonithd was erroring with a message that it could not find the 
> string 'User Name :', so it was sometimes, not all the time, losing sync with 
> the prompts from the device.  When that happened I think one of the stonithd 
> instances held onto the connection until timeout, and the other one timed out as well.
> 
> I had already found the threads where you mentioned this patch, but you must 
> have sent the patch privately. :)
> 
> I modified the apcmaster.c code with your code, and stonithd stayed up several
> minutes this time without losing sync with the prompts.  A diff -u patch is
> attached; I think this probably should go into source.  Thanks much; this
> helped.
> 
> stonithd did fail after several minutes, but because I had configured the apc 
> to notify me by email of events I managed to correlate the stonithd failure 
> with an internal warmstart by the apc: "Severe - System: Warmstart".  So the 
> apc crowbarred and that caused problems with stonithd, as well it might.
> 
> I was using a 15s monitor then; I am running now with a 2m monitor interval to
> see if it will stay up.  If it runs all night I might have a marginal
> confidence in the thing.  At least I can reliably run failover testing now.

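On the EscapeChar change quoted above: the idea is to wait for whichever
of the two strings arrives first and to ask for the "User Name :" prompt
separately only if the telnet banner was the one that matched. Below is a
minimal standalone sketch of that logic, assuming the device output is
already in a string. The names struct token, look_for() and
wait_for_user_prompt() are made up for illustration; this is not the
plugin's StonithLookFor() and it does no I/O or timeout handling.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the plugin's struct Etoken: a string to
 * look for and the value to report when it is the one that matched. */
struct token {
	const char *string;
	int         value;
};

/*
 * Scan `input` for the earliest occurrence of any token in `tokens`
 * (the list ends with a NULL string).  Returns that token's value, or
 * -1 if none occurs.  On a match, *rest (if non-NULL) is set to point
 * just past the matched text.
 */
static int
look_for(const char *input, const struct token *tokens, const char **rest)
{
	const char *best = NULL;
	size_t      best_len = 0;
	int         rc = -1;

	assert(input != NULL && tokens != NULL);

	for (const struct token *t = tokens; t->string != NULL; t++) {
		const char *hit = strstr(input, t->string);
		if (hit != NULL && (best == NULL || hit < best)) {
			best = hit;
			best_len = strlen(t->string);
			rc = t->value;
		}
	}
	if (best != NULL && rest != NULL)
		*rest = best + best_len;
	return rc;
}

static const struct token escape_or_user[] = {
	{ "Escape character is '^]'.", 0 },
	{ "User Name :", 1 },
	{ NULL, 0 }
};

static const struct token user_only[] = {
	{ "User Name :", 0 },
	{ NULL, 0 }
};

/* Returns 0 once the "User Name :" prompt has been seen, -1 otherwise. */
static int
wait_for_user_prompt(const char *input)
{
	const char *rest = input;
	int rc = look_for(input, escape_or_user, &rest);

	if (rc < 0)
		return -1;	/* neither string showed up */
	if (rc == 0 && look_for(rest, user_only, NULL) < 0)
		return -1;	/* banner seen, but the prompt never came */
	return 0;		/* prompt seen; safe to send the user name */
}
```

With this, wait_for_user_prompt() succeeds both when the banner precedes
the prompt and when the prompt arrives on its own, which is the case the
original single-token EXPECT could not handle.
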
Actually, I think a 2m interval is still too frequent. Devices of
this kind are generally not happy to serve requests that often. I
really don't know why that is (unreliable serial line?); perhaps
Dave could say more. Anyway, don't you think that running the
monitor, say, once an hour would be enough?

> This is a 9211 masterswitch with the 9606 web/telnet/snmp card; it is an older 
> device and long out of support.  I checked last night that the apc has the 
> latest firmware, so there isn't much I can do with it except replace it or 
> figure out how to prevent it from warmstarting.
> 
> 
> FWIW the apcmastersnmp.c code in 2.1.3 will not compile; it fails with a 
> reference to a missing lha_internal.h.

But apcmastersnmp is part of the package so it should compile:

$ stonith -L
apcmaster
apcmastersnmp
...

Thanks,

Dejan

> 
> > >> The stonith daemons start successfully now, but with a monitor interval
> > >> of 15s one of the two fails fairly quickly.  The apc (9211 masterswitch)
> > >> only allows a single login, and I wonder if the two daemons aren't
> > >> colliding, and one is timing out and giving up.
> > >
> > > Did you take a look at the logs to confirm this?
> >
> > You should be able to see something in the logs to this effect, did you
> > add "debug 1" to your ha.cf files and look?  Based on timestamps you
> > should be able to see if one tries to login while the other one IS
> > logged in.
> >
> > > Thanks,
> > >
> > > Dejan
> > >
> > >> Fortunately this is a test cluster;
> > >> from what I have seen I would never put this pdu in production.  I will
> > >> work with an external/ssh stonith setup to see if I can't avoid the
> > >> problems with the apc.
> 
> -- 
> 
>    -- Michael

> --- apcmaster_orig.c  2007-12-21 09:32:27.000000000 -0600
> +++ apcmaster.c       2008-02-19 18:24:48.000000000 -0600
> @@ -147,8 +147,11 @@
>  
>  #define APCMSSTR     "American Power Conversion"
>  
> -static struct Etoken EscapeChar[] =  { {"Escape character is '^]'.", 0, 0}
> -                                     ,       {NULL,0,0}};
> +static struct Etoken EscapeChar[] = { {"Escape character is '^]'.", 0, 0},
> +                                      {"User Name :", 1, 0},
> +                                      {NULL,0,0}
> +                                      };
> +
>  static struct Etoken login[] =               { {"User Name :", 0, 0}, {NULL,0,0}};
>  static struct Etoken password[] =    { {"Password  :", 0, 0} ,{NULL,0,0}};
>  static struct Etoken Prompt[] =      { {"> ", 0, 0} ,{NULL,0,0}};
> @@ -187,13 +190,26 @@
>  static int
>  MSLogin(struct pluginDevice * ms)
>  {
> -        EXPECT(ms->rdfd, EscapeChar, 10);
> +     int rc;
>  
> -     /* 
> -      * We should be looking at something like this:
> -         *   User Name :
> +     /* Patch from Dave Blaschke
> +      * Apparently some telnet apps display the escape character while
> +      * others don't, so we need to handle both possibilities...
> +      *
> +      * rc == 0 : "Escape character is '^]'." found
> +      * rc == 1 : "User Name :" found
> +      * rc <  0 : Neither found or timeout
>        */
> -     EXPECT(ms->rdfd, login, 10);
> +     if ((rc = StonithLookFor(ms->rdfd, EscapeChar, 10)) < 0) {
> +             return(errno == ETIMEDOUT ? S_TIMEOUT : S_OOPS);
> +     } else if (rc == 0) {
> +             /*
> +              * We should be looking at something like this:
> +              *      User Name :
> +              */
> +             EXPECT(ms->rdfd, login, 10);
> +     }
> +
>       SEND(ms->wrfd, ms->user);       
>       SEND(ms->wrfd, "\r");
>  




> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems