On Tuesday 19 February 2008, Dave Blaschke wrote:
> Dejan Muhamedagic wrote:
> > Hi,
> >
> > On Tue, Feb 19, 2008 at 03:38:46AM -0600, Michael Brennen wrote:
> >> On Sun, 17 Feb 2008, Michael Brennen wrote:
> >>> Heartbeat 2.1.3, crm enabled
> >>>
> >>> I've built an initial drbd master/slave on two systems, lvc7 and lvc8,
> >>> following http://www.linux-ha.org/DRBD/HowTov2. The drbd is coming
> >>> alive in P/S mode, but it will not fail over when I kill the master;
> >>> the slave stays in secondary. stonith was not working, so I've decided
> >>> to make that work in v2 (I had it working in a v1 build about a month
> >>> ago.) .....
> >>> Then, I am trying to define the stonith device in xml; the source to
> >>> the apc3.xml file is attached. I am adding from a separate command
> >>> line:
> >>
> >> Just to clear this up, I found my own silly problem: I had misspelled
> >> one of the parameters (password, not passwd) for the apcmaster stonith
> >> device in the xml.
> >
> > Does that mean that the patch you posted is not necessary?
See below.
> There have been several reports of this issue over the last few years
> where the plugin is looking for the 'Escape char' string before the user
> name but it isn't there. I'm pretty sure it depends on what flavor of
> telnet is being used, but I still believe the correct patch is to look
> for both possibilities instead of just one or the other - something like
> the following:
>
> static struct Etoken EscapeChar[] = { {"Escape character is '^]'.", 0, 0}
> , {"User Name :", 1, 0}
> , {NULL,0,0}};
>
>
>
> static int
> MSLogin(struct pluginDevice * ms)
> {
> int rc;
>
> /*
> * Apparently some telnet apps display the escape character while
> * others don't, so we need to handle both possibilities...
> *
> * rc == 0 : "Escape character is '^]'." found
> * rc == 1 : "User Name :" found
> * rc < 0 : Neither found or timeout
> */
> if ((rc = StonithLookFor(ms->rdfd, EscapeChar, 10)) < 0) {
> return(errno == ETIMEDOUT ? S_TIMEOUT : S_OOPS);
> } else if (rc == 0) {
> /*
> * We should be looking at something like this:
> * User Name :
> */
> EXPECT(ms->rdfd, login, 10);
> }
> SEND(ms->wrfd, ms->user);
>
> I sent out a patch similar to this - or maybe exactly like this :-) - to
> a couple folks to ask for verification that it worked, never got any
> feedback so it never made it in to production.
Looking closer at the logs last night (there is a lot going on in there :) I
found that stonithd was erroring with a message that it could not find the
string 'User Name :', so it was sometimes, not all the time, losing sync with
the prompts from the device. When that happened I think one of the stonithd
instances held onto the connection until timeout, and the other one timed out
as well.
I had already found the threads where you mentioned this patch, but you must
have sent the patch privately. :)
I modified the apcmaster.c code with your code, and stonithd stayed up several
minutes this time without losing sync with the prompts. A diff -u patch is
attached; I think this probably should go into source. Thanks much; this
helped.
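
For anyone reading along without the plugin source, the idea behind the change
can be sketched as a standalone C fragment. This is only an illustration under
assumptions: `struct token` and `first_token()` are hypothetical names, not part
of the stonith plugin API, and the real StonithLookFor() reads from a file
descriptor with a timeout, which this sketch does not attempt.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the plugin's Etoken: a string plus the
 * code to report when that string is seen. */
struct token {
	const char *text;
	int code;
};

/*
 * Return the code of whichever token appears earliest in buf,
 * or -1 if none of them appears. The token list ends with a NULL text.
 * This works on text that has already been read; it knows nothing
 * about file descriptors or timeouts.
 */
static int
first_token(const char *buf, const struct token *tokens)
{
	const char *best = NULL;
	int code = -1;
	size_t i;

	assert(buf != NULL && tokens != NULL);
	for (i = 0; tokens[i].text != NULL; i++) {
		const char *hit = strstr(buf, tokens[i].text);
		if (hit != NULL && (best == NULL || hit < best)) {
			best = hit;
			code = tokens[i].code;
		}
	}
	return code;
}

/* The two ways a login session can start, as described in the patch. */
static const struct token login_start[] = {
	{"Escape character is '^]'.", 0},
	{"User Name :", 1},
	{NULL, 0}
};
```

When the telnet banner is missing and the input begins with the "User Name :"
prompt, first_token() reports code 1, so the caller can skip the second wait
for the login prompt. That is the case the original single-token EXPECT could
not handle.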
stonithd did fail after several minutes, but because I had configured the apc
to notify me by email of events I managed to correlate the stonithd failure
with an internal warmstart by the apc: "Severe - System: Warmstart". So the
apc crowbarred and that caused problems with stonithd, as well it might.
I was using a 15s monitor then; I am running now with a 2m monitor interval to
see if it will stay up. If it runs all night I might have marginal
confidence in the thing. At least I can reliably run failover testing now.
This is a 9211 masterswitch with the 9606 web/telnet/snmp card; it is an older
device and long out of support. I checked last night that the apc has the
latest firmware, so there isn't much I can do with it except replace it or
figure out how to prevent it from warmstarting.
FWIW the apcmastersnmp.c code in 2.1.3 will not compile; it fails with a
reference to a missing lha_internal.h.
> >> The stonith daemons start successfully now, but with a monitor interval
> >> of 15s one of the two fails fairly quickly. The apc (9211 masterswitch)
> >> only allows a single login, and I wonder if the two daemons aren't
> >> colliding, and one is timing out and giving up.
> >
> > Did you take a look at the logs to confirm this?
>
> You should be able to see something in the logs to this effect, did you
> add "debug 1" to your ha.cf files and look? Based on timestamps you
> should be able to see if one tries to login while the other one IS
> logged in.
>
> > Thanks,
> >
> > Dejan
> >
> >> Fortunately this is a test cluster;
> >> from what I have seen I would never put this pdu in production. I will
> >> work with an external/ssh stonith setup to see if I can't avoid the
> >> problems with the apc.
--
-- Michael
--- apcmaster_orig.c 2007-12-21 09:32:27.000000000 -0600
+++ apcmaster.c 2008-02-19 18:24:48.000000000 -0600
@@ -147,8 +147,11 @@
#define APCMSSTR "American Power Conversion"
-static struct Etoken EscapeChar[] = { {"Escape character is '^]'.", 0, 0}
- , {NULL,0,0}};
+static struct Etoken EscapeChar[] = { {"Escape character is '^]'.", 0, 0},
+ {"User Name :", 1, 0},
+ {NULL,0,0}
+ };
+
static struct Etoken login[] = { {"User Name :", 0, 0}, {NULL,0,0}};
static struct Etoken password[] = { {"Password :", 0, 0} ,{NULL,0,0}};
static struct Etoken Prompt[] = { {"> ", 0, 0} ,{NULL,0,0}};
@@ -187,13 +190,26 @@
static int
MSLogin(struct pluginDevice * ms)
{
- EXPECT(ms->rdfd, EscapeChar, 10);
+ int rc;
- /*
- * We should be looking at something like this:
- * User Name :
+ /* Patch from Dave Blaschke
+ * Apparently some telnet apps display the escape character while
+ * others don't, so we need to handle both possibilities...
+ *
+ * rc == 0 : "Escape character is '^]'." found
+ * rc == 1 : "User Name :" found
+ * rc < 0 : Neither found or timeout
*/
- EXPECT(ms->rdfd, login, 10);
+ if ((rc = StonithLookFor(ms->rdfd, EscapeChar, 10)) < 0) {
+ return(errno == ETIMEDOUT ? S_TIMEOUT : S_OOPS);
+ } else if (rc == 0) {
+ /*
+ * We should be looking at something like this:
+ * User Name :
+ */
+ EXPECT(ms->rdfd, login, 10);
+ }
+
SEND(ms->wrfd, ms->user);
SEND(ms->wrfd, "\r");
