Hi,
On Tue, Feb 19, 2008 at 06:49:18PM -0600, Michael Brennen wrote:
> On Tuesday 19 February 2008, Dave Blaschke wrote:
> > Dejan Muhamedagic wrote:
> > > Hi,
> > >
> > > On Tue, Feb 19, 2008 at 03:38:46AM -0600, Michael Brennen wrote:
> > >> On Sun, 17 Feb 2008, Michael Brennen wrote:
> > >>> Heartbeat 2.1.3, crm enabled
> > >>>
> > >>> I've built an initial drbd master/slave on two systems, lvc7 and lvc8,
> > >>> following http://www.linux-ha.org/DRBD/HowTov2. The drbd is coming
> > >>> alive in P/S mode, but it will not fail over when I kill the master;
> > >>> the slave stays in secondary. stonith was not working, so I've decided
> > >>> to make that work in v2 (I had it working in a v1 build about a month
> > >>> ago.) .....
> > >>> Then, I am trying to define the stonith device in xml; the source to
> > >>> the apc3.xml file is attached. I am adding from a separate command
> > >>> line:
> > >>
> > >> Just to clear this up, I found my own silly problem: I had misspelled
> > >> one of the parameters (password, not passwd) for the apcmaster stonith
> > >> device in the xml.
> > >
> > > Does that mean that the patch you posted is not necessary?
>
> See below.
>
> > There have been several reports of this issue over the last few years
> > where the plugin is looking for the 'Escape char' string before the user
> > name but it isn't there. I'm pretty sure it depends on what flavor of
> > telnet is being used, but I still believe the correct patch is to look
> > for both possibilities instead of just one or the other - something like
> > the following:
> >
> > static struct Etoken EscapeChar[] = { {"Escape character is '^]'.", 0, 0}
> >                                     , {"User Name :", 1, 0}
> >                                     , {NULL,0,0}};
> >
> >
> >
> > static int
> > MSLogin(struct pluginDevice * ms)
> > {
> >     int rc;
> >
> >     /*
> >      * Apparently some telnet apps display the escape character while
> >      * others don't, so we need to handle both possibilities...
> >      *
> >      * rc == 0 : "Escape character is '^]'." found
> >      * rc == 1 : "User Name :" found
> >      * rc < 0  : Neither found or timeout
> >      */
> >     if ((rc = StonithLookFor(ms->rdfd, EscapeChar, 10)) < 0) {
> >         return(errno == ETIMEDOUT ? S_TIMEOUT : S_OOPS);
> >     } else if (rc == 0) {
> >         /*
> >          * We should be looking at something like this:
> >          * User Name :
> >          */
> >         EXPECT(ms->rdfd, login, 10);
> >     }
> >     SEND(ms->wrfd, ms->user);
> >
> > I sent out a patch similar to this - or maybe exactly like this :-) - to
> > a couple folks to ask for verification that it worked, never got any
> > feedback so it never made it in to production.
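
For anyone following along, here is a minimal standalone sketch of the
"accept either string" idea behind Dave's change. It is not the plugin
code: struct token, look_for_any() and first_prompt are hypothetical names
made up for illustration, and it scans an in-memory buffer instead of
reading from a file descriptor with a timeout as StonithLookFor does. It
only shows how one lookup can report which of several expected strings
turned up first.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one entry of the plugin's token tables. */
struct token {
    const char *text;   /* string to look for; NULL ends the table */
    int         id;     /* value reported when this string is seen */
};

/* Both strings that may legitimately come first after connecting. */
const struct token first_prompt[] = {
    { "Escape character is '^]'.", 0 },
    { "User Name :",               1 },
    { NULL,                        0 }
};

/*
 * Return the id of whichever token occurs earliest in buf,
 * or -1 if none of the tokens occurs at all.
 */
int look_for_any(const char *buf, const struct token *toks)
{
    const char *best = NULL;   /* earliest match found so far */
    int best_id = -1;

    assert(buf != NULL && toks != NULL);

    for (; toks->text != NULL; toks++) {
        const char *hit = strstr(buf, toks->text);

        if (hit != NULL && (best == NULL || hit < best)) {
            best = hit;
            best_id = toks->id;
        }
    }
    return best_id;
}
```

With a buffer that holds the telnet banner followed by the prompt,
look_for_any() returns 0, so the caller would still wait for
"User Name :". With a buffer that holds only the prompt it returns 1 and
the caller can send the user name straight away.
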
>
> Looking closer at the logs last night (there is a lot going on in there :) I
> found that stonithd was erroring with a message that it could not find the
> string 'User Name :', so it was sometimes, not all the time, losing sync with
> the prompts from the device. When that happened I think one of the stonithd
> instances held onto the connection until timeout, and the other one timed out
> as well.
>
> I had already found the threads where you mentioned this patch, but you must
> have sent the patch privately. :)
>
> I modified the apcmaster.c code with your code, and stonithd stayed up
> several minutes this time without losing sync with the prompts. A diff -u
> patch is attached; I think this probably should go into source. Thanks much;
> this helped.
>
> stonithd did fail after several minutes, but because I had configured the apc
> to notify me by email of events I managed to correlate the stonithd failure
> with an internal warmstart by the apc: "Severe - System: Warmstart". So the
> apc crowbarred and that caused problems with stonithd, as well it might.
>
> I was using a 15s monitor then; I am running now with a 2m monitor interval
> to see if it will stay up. If it runs all night I might have a marginal
> confidence in the thing. At least I can reliably run failover testing now.

Actually, I think that even a 2m interval is too frequent. Devices
of this kind are generally not happy to serve requests that often.
I really don't know why that is (an unreliable serial line?);
perhaps Dave could say more. Anyway, don't you think that running
the monitor, say, once an hour would be enough?

> This a 9211 masterswitch with the 9606 web/telnet/snmp card; it is an older
> device and long out of support. I checked last night that the apc has the
> latest firmware, so there isn't much I can do with it except replace it or
> figure out how to prevent it from warmstarting.
>
>
> FWIW the apcmastersnmp.c code in 2.1.3 will not compile; it fails with a
> reference to a missing lha_internal.h.

But apcmastersnmp is part of the package, so it should compile:

$ stonith -L
apcmaster
apcmastersnmp
...

Thanks,

Dejan

>
> > >> The stonith daemons start successfully now, but with a monitor interval
> > >> of 15s one of the two fails fairly quickly. The apc (9211 masterswitch)
> > >> only allows a single login, and I wonder if the two daemons aren't
> > >> colliding, and one is timing out and giving up.
> > >
> > > Did you take a look at the logs to confirm this?
> >
> > You should be able to see something in the logs to this effect, did you
> > add "debug 1" to your ha.cf files and look? Based on timestamps you
> > should be able to see if one tries to login while the other one IS
> > logged in.
> >
> > > Thanks,
> > >
> > > Dejan
> > >
> > >> Fortunately this is a test cluster;
> > >> from what I have seen I would never put this pdu in production. I will
> > >> work with an external/ssh stonith setup to see if I can't avoid the
> > >> problems with the apc.
>
> --
>
> -- Michael
> --- apcmaster_orig.c 2007-12-21 09:32:27.000000000 -0600
> +++ apcmaster.c 2008-02-19 18:24:48.000000000 -0600
> @@ -147,8 +147,11 @@
>
> #define APCMSSTR "American Power Conversion"
>
> -static struct Etoken EscapeChar[] = { {"Escape character is '^]'.", 0, 0}
> - , {NULL,0,0}};
> +static struct Etoken EscapeChar[] = { {"Escape character is '^]'.", 0, 0},
> + {"User Name :", 1, 0},
> + {NULL,0,0}
> + };
> +
> static struct Etoken login[] = { {"User Name :", 0, 0},
> {NULL,0,0}};
> static struct Etoken password[] = { {"Password :", 0, 0} ,{NULL,0,0}};
> static struct Etoken Prompt[] = { {"> ", 0, 0} ,{NULL,0,0}};
> @@ -187,13 +190,26 @@
> static int
> MSLogin(struct pluginDevice * ms)
> {
> - EXPECT(ms->rdfd, EscapeChar, 10);
> + int rc;
>
> - /*
> - * We should be looking at something like this:
> - * User Name :
> + /* Patch from Dave Blaschke
> + * Apparently some telnet apps display the escape character while
> + * others don't, so we need to handle both possibilities...
> + *
> + * rc == 0 : "Escape character is '^]'." found
> + * rc == 1 : "User Name :" found
> + * rc < 0 : Neither found or timeout
> */
> - EXPECT(ms->rdfd, login, 10);
> + if ((rc = StonithLookFor(ms->rdfd, EscapeChar, 10)) < 0) {
> + return(errno == ETIMEDOUT ? S_TIMEOUT : S_OOPS);
> + } else if (rc == 0) {
> + /*
> + * We should be looking at something like this:
> + * User Name :
> + */
> + EXPECT(ms->rdfd, login, 10);
> + }
> +
> SEND(ms->wrfd, ms->user);
> SEND(ms->wrfd, "\r");
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems