On Tuesday 19 February 2008, Dave Blaschke wrote:
> Dejan Muhamedagic wrote:
> > Hi,
> >
> > On Tue, Feb 19, 2008 at 03:38:46AM -0600, Michael Brennen wrote:
> >> On Sun, 17 Feb 2008, Michael Brennen wrote:
> >>> Heartbeat 2.1.3, crm enabled
> >>>
> >>> I've built an initial drbd master/slave on two systems, lvc7 and lvc8,
> >>> following http://www.linux-ha.org/DRBD/HowTov2.  The drbd is coming
> >>> alive in P/S mode, but it will not fail over when I kill the master;
> >>> the slave stays in secondary.  stonith was not working, so I've decided
> >>> to make that work in v2 (I had it working in a v1 build about a month
> >>> ago.) .....
> >>> Then, I am trying to define the stonith device in xml; the source to
> >>> the apc3.xml file is attached.  I am adding from a separate command
> >>> line:
> >>
> >> Just to clear this up, I found my own silly problem: I had misspelled
> >> one of the parameters (password, not passwd) for the apcmaster stonith
> >> device in the xml.
> >
> > Does that mean that the patch you posted is not necessary?

See below.

> There have been several reports of this issue over the last few years
> where the plugin is looking for the 'Escape char' string before the user
> name but it isn't there.  I'm pretty sure it depends on what flavor of
> telnet is being used, but I still believe the correct patch is to look
> for both possibilities instead of just one or the other - something like
> the following:
>
> static struct Etoken EscapeChar[] =     { {"Escape character is '^]'.", 0, 0}
>                                        ,       {"User Name :", 1, 0}
>                                        ,       {NULL,0,0}};
>
>
>
> static int
> MSLogin(struct pluginDevice * ms)
> {
>       int rc;
>
>        /*
>         * Apparently some telnet apps display the escape character while
>         * others don't, so we need to handle both possibilities...
>         *
>         * rc == 0 : "Escape character is '^]'." found
>         * rc == 1 : "User Name :" found
>         * rc <  0 : Neither found or timeout
>         */
>        if ((rc = StonithLookFor(ms->rdfd, EscapeChar, 10)) < 0) {
>                return(errno == ETIMEDOUT ? S_TIMEOUT : S_OOPS);
>        } else if (rc == 0) {
>                /*
>                 * We should be looking at something like this:
>                 *      User Name :
>                 */
>                EXPECT(ms->rdfd, login, 10);
>        }
>        SEND(ms->wrfd, ms->user);
>
> I sent out a patch similar to this - or maybe exactly like this :-) - to
> a couple folks to ask for verification that it worked, never got any
> feedback so it never made it in to production.

Looking more closely at the logs last night (there is a lot going on in
there :) I found that stonithd was logging an error that it could not find
the string 'User Name :', so it was intermittently losing sync with the
prompts from the device.  When that happened, I think one of the stonithd
instances held onto the connection until it timed out, and the other one
timed out as well.

I had already found the threads where you mentioned this patch, but you must 
have sent the patch privately. :)

I modified the apcmaster.c code with your code, and stonithd stayed up several 
minutes this time without losing sync with the prompts.  A diff -u patch is 
attached; I think this probably should go into source.  Thanks much; this 
helped.
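
For anyone following the thread without the source handy, here is a minimal
standalone sketch of the idea behind the patch: search for several strings at
once and branch on whichever one turns up first.  It is not the plugin code.
The names struct token, look_for and need_login_prompt are made up for
illustration, and it scans an in-memory string rather than reading from the
telnet file descriptor with a timeout the way the StonithLookFor call in the
patch does.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a token table entry (not the real Etoken). */
struct token {
    const char *text;   /* string to search for; NULL ends the table */
    int         code;   /* value returned when this string matches   */
};

enum { SAW_ESCAPE = 0, SAW_USER_NAME = 1 };

static const struct token banner[] = {
    { "Escape character is '^]'.", SAW_ESCAPE },
    { "User Name :",               SAW_USER_NAME },
    { NULL, 0 }
};

/*
 * Return the code of the token whose match ends earliest in `input`,
 * i.e. the one a reader consuming the stream would complete first.
 * Returns -1 if none of the tokens occurs.
 */
static int
look_for(const char *input, const struct token *tokens)
{
    const struct token *t;
    size_t best_end = 0;
    int found = 0;
    int best_code = -1;

    for (t = tokens; t->text != NULL; t++) {
        const char *hit = strstr(input, t->text);
        size_t end;

        if (hit == NULL)
            continue;
        end = (size_t)(hit - input) + strlen(t->text);
        if (!found || end < best_end) {
            found = 1;
            best_end = end;
            best_code = t->code;
        }
    }
    return best_code;
}

/*
 * Mirror of the branch in the patch:
 *   1  escape banner seen, the "User Name :" prompt is still to come
 *   0  "User Name :" already seen, go straight to sending the user name
 *  -1  neither string found
 */
static int
need_login_prompt(const char *input)
{
    switch (look_for(input, banner)) {
    case SAW_ESCAPE:    return 1;
    case SAW_USER_NAME: return 0;
    default:            return -1;
    }
}
```

With a telnet that prints the escape banner, need_login_prompt() says to wait
for the prompt; with one that goes straight to "User Name :", it says the
prompt has already been consumed, which is the case the original code missed.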

stonithd did fail after several minutes, but because I had configured the apc 
to notify me by email of events I managed to correlate the stonithd failure 
with an internal warmstart by the apc: "Severe - System: Warmstart".  So the 
apc crowbarred and that caused problems with stonithd, as well it might.

I was using a 15s monitor interval then; I am now running with a 2m monitor
interval to see if it will stay up.  If it runs all night I might have
marginal confidence in the thing.  At least I can reliably run failover
testing now.

This is a 9211 masterswitch with the 9606 web/telnet/snmp card; it is an older 
device and long out of support.  I checked last night that the apc has the 
latest firmware, so there isn't much I can do with it except replace it or 
figure out how to prevent it from warmstarting.


FWIW the apcmastersnmp.c code in 2.1.3 will not compile; it fails with a 
reference to a missing lha_internal.h.


> >> The stonith daemons start successfully now, but with a monitor interval
> >> of 15s one of the two fails fairly quickly.  The apc (9211 masterswitch)
> >> only allows a single login, and I wonder if the two daemons aren't
> >> colliding, and one is timing out and giving up.
> >
> > Did you take a look at the logs to confirm this?
>
> You should be able to see something in the logs to this effect, did you
> add "debug 1" to your ha.cf files and look?  Based on timestamps you
> should be able to see if one tries to login while the other one IS
> logged in.
>
> > Thanks,
> >
> > Dejan
> >
> >> Fortunately this is a test cluster;
> >> from what I have seen I would never put this pdu in production.  I will
> >> work with an external/ssh stonith setup to see if I can't avoid the
> >> problems with the apc.

-- 

   -- Michael
--- apcmaster_orig.c	2007-12-21 09:32:27.000000000 -0600
+++ apcmaster.c	2008-02-19 18:24:48.000000000 -0600
@@ -147,8 +147,11 @@
 
 #define APCMSSTR	"American Power Conversion"
 
-static struct Etoken EscapeChar[] =	{ {"Escape character is '^]'.", 0, 0}
-					,	{NULL,0,0}};
+static struct Etoken EscapeChar[] = { {"Escape character is '^]'.", 0, 0},
+                                      {"User Name :", 1, 0},
+                                      {NULL,0,0}
+                                      };
+
 static struct Etoken login[] = 		{ {"User Name :", 0, 0}, {NULL,0,0}};
 static struct Etoken password[] =	{ {"Password  :", 0, 0} ,{NULL,0,0}};
 static struct Etoken Prompt[] =	{ {"> ", 0, 0} ,{NULL,0,0}};
@@ -187,13 +190,26 @@
 static int
 MSLogin(struct pluginDevice * ms)
 {
-        EXPECT(ms->rdfd, EscapeChar, 10);
+	int rc;
 
-  	/* 
-	 * We should be looking at something like this:
-         *	User Name :
+ 	/* Patch from Dave Blaschke
+	 * Apparently some telnet apps display the escape character while
+	 * others don't, so we need to handle both possibilities...
+	 *
+	 * rc == 0 : "Escape character is '^]'." found
+	 * rc == 1 : "User Name :" found
+	 * rc <  0 : Neither found or timeout
 	 */
-	EXPECT(ms->rdfd, login, 10);
+	if ((rc = StonithLookFor(ms->rdfd, EscapeChar, 10)) < 0) {
+		return(errno == ETIMEDOUT ? S_TIMEOUT : S_OOPS);
+	} else if (rc == 0) {
+		/*
+		 * We should be looking at something like this:
+		 *      User Name :
+		 */
+		EXPECT(ms->rdfd, login, 10);
+	}
+
 	SEND(ms->wrfd, ms->user);       
 	SEND(ms->wrfd, "\r");
 
