Re: [Linux-HA] R2 Two-node apache cluster with STONITH

Bjorn Oglefjorn Tue, 03 Apr 2007 11:54:34 -0700

Anyone? Help?
--BO

On 4/2/07, Bjorn Oglefjorn <[EMAIL PROTECTED]> wrote:


Any ideas as to what's going wrong here?
--BO

On 3/30/07, Bjorn Oglefjorn <[EMAIL PROTECTED]> wrote:
>
> I've made the OCF apache RA work by editing the script's parameters for
> now.  This is just testing anyway.  Attached are my configs and a tarball
> of the logs from the two nodes in question.  The logs show one complete run
> of heartbeat, from start to stop.  What I did during that time is as
> follows:
>
> 1. Start heartbeat
> 2. Wait for deadtime to expire and resources to start
> 3. Simulate node failure on test-2 by shutting down networking (via
> console)
> 4. Watch as STONITH fails repeatedly
> 5. Start networking on test-2
> 6. Watch cluster recover and resources move to test-1
> 7. Stop heartbeat
>
> I've made some changes to my cib.xml recently.  The largest change is
> that I've made my STONITH declarations normal primitive directives instead
> of clones.  The reason is that each STONITH device needs a unique
> definition within the CIB.  A DRAC is embedded in the chassis of each node
> and can only work on that particular node (e.g., test-1.drac can only
> reset test-1.domain).  I don't want test-1_DRAC running on the node
> test-1.domain, as a reset operation would then result in suicide (is this
> ever desirable?).
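
A placement rule of the kind described above might look roughly like the following. This is only a sketch in Heartbeat 2.x CIB syntax, not the contents of the attached cib.xml: the resource id test-1_DRAC and the node name test-1.domain are taken from the description, while the constraint, rule, and expression ids are made up for illustration. The value is assumed to match the node name as the cluster sees it (uname -n).

```xml
<!-- Sketch only: keep the STONITH primitive for test-1's DRAC off test-1 itself. -->
<!-- The ids beginning with loc_ are hypothetical placeholders. -->
<rsc_location id="loc_test-1_DRAC" rsc="test-1_DRAC">
  <rule id="loc_test-1_DRAC_rule" score="-INFINITY">
    <expression id="loc_test-1_DRAC_expr" attribute="#uname"
                operation="eq" value="test-1.domain"/>
  </rule>
</rsc_location>
```

A mirror-image constraint would presumably be needed for the other node's device, and both would sit in the constraints section of the CIB.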
>
> Normal resource recovery and failover is working as expected as far as I
> can tell.  I'm only having a problem with STONITH.
>
> I'm struggling to see where I've gone wrong.  Please help me figure this
> out as it's integral to several projects I have on my plate at this time.
> Thanks again; you've all been great.
>
> --BO
>
>
> On 3/30/07, Bjorn Oglefjorn <[EMAIL PROTECTED] > wrote:
> >
> > > I took a look at the apache RA, but it makes a lot of assumptions
> > > about the environment which are mostly untrue on Red Hat.  How can I
> > > configure this RA short of making changes to the script?  Can I set
> > > environment variables?  I tried setting what's shown in the 'meta-data'
> > > output, but with no luck.
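
For reference, OCF resource agents normally receive their parameters as OCF_RESKEY_<name> environment variables, which the cluster sets from the resource's instance attributes in the CIB, so variables exported by hand in a shell do not reach an agent started by the cluster. A minimal sketch in Heartbeat 2.x syntax follows. It assumes the apache RA accepts the configfile and httpd parameters (the agent's meta-data output lists what a given version supports), and the nvpair ids and the Red Hat paths are illustrative assumptions.

```xml
<!-- Sketch only: pass parameters to the OCF apache RA through the CIB. -->
<!-- The ids and paths are assumptions; adjust them to the local installation. -->
<primitive id="httpd" class="ocf" provider="heartbeat" type="apache">
  <instance_attributes id="httpd_attrs">
    <attributes>
      <!-- Reaches the agent as OCF_RESKEY_configfile -->
      <nvpair id="httpd_configfile" name="configfile"
              value="/etc/httpd/conf/httpd.conf"/>
      <!-- Reaches the agent as OCF_RESKEY_httpd -->
      <nvpair id="httpd_binary" name="httpd" value="/usr/sbin/httpd"/>
    </attributes>
  </instance_attributes>
</primitive>
```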
> >
> > Thanks as always,
> > --BO
> >
> > On 3/29/07, Alan Robertson < [EMAIL PROTECTED]> wrote:
> > >
> > > Bjorn Oglefjorn wrote:
> > > > Thanks for the reply Dejan.  My responses are inline.
> > > > --BO
> > > >
> > > > On 3/28/07, Dejan Muhamedagic < [EMAIL PROTECTED]> wrote:
> > > >>
> > > >> On Wed, Mar 28, 2007 at 11:29:35AM -0400, Bjorn Oglefjorn wrote:
> > > > >> > I believe I've corrected some issues, but now I'm getting more of this:
> > > > >> > Mar 28 11:02:37 test-1 lrmd: [22008]: ERROR: RA lsb:httpd:monitor (process
> > > > >> > 24472) failed to redirect stdout for its background child (daemon)
> > > > >> > processes. This will likely cause those processes to die mysteriously at
> > > > >> > some later time (terminated by signal SIGPIPE).
> > > >>
> > > >> Hmm, I think that this has been addressed as Alan had already
> > > >> pointed out, probably after the 2.0.7 release. If you can, please
> > > >> upgrade to 2.0.8.
> > > >
> > > >
> > > > > I'd prefer to stick with the package that comes from CentOS extras (2.0.7).
> > > > > I don't get this error all the time, so I'm not sure why it's happening.
> > > > > Can someone give me a deeper explanation of what the lrmd doesn't like
> > > > > here?
> > > >
> > > > >> > When I attempt to move resources to another node (using crm_standby) I get
> > > > >> > these errors:
> > > > >> > Mar 28 10:56:04 test-1 crmd: [22011]: info: do_lrm_rsc_op:lrm.c Performing
> > > > >> > op stop on httpd (interval=0ms, key=28:66532759-6190-4321-9be3-07730b15aeae)
> > > > >> > Mar 28 10:56:04 test-1 lrmd: [22773]: WARN: For LSB init script, no
> > > > >> > additional parameters are needed.
> > > >>
> > > >> Can't say unless you show me this rsc definition, but it seems
> > > >> like bad usage. I found one below, but that one should not cause
> > > >> this problem:
> > > >
> > > >
> > > > It's slightly different now (is provider="heartbeat" bad here?):
> > > >
> > > >         <primitive class="lsb" id="httpd" provider="heartbeat"
> > > > type="httpd-lsb">
> > > >           <operations>
> > > >             <op id="httpd_mon" interval="5s" name="monitor"
> > > timeout="20s"
> > > > on_fail="restart"/>
> > > >             <op id="httpd_start" name="start" timeout="20s"
> > > > on_fail="restart" prereq="fencing"/>
> > > >             <op id="httpd_stop" name="stop" timeout="20s"
> > > on_fail="restart"
> > > > prereq="fencing"/>
> > > >           </operations>
> > > >         </primitive>
> > > >
> > > > >> > <primitive class="lsb" id="httpd" provider="heartbeat" type="httpd">
> > > > >> > <operations>
> > > > >> > <op id="httpd_status" interval="5s" name="status" timeout="20s" on_fail="fence"/>
> > > > >> > </operations>
> > > > >> > </primitive>
> > > >>
> > > > >> One thing that looks odd is the 5s interval and 20s timeout. The
> > > > >> timeout is probably OK, but the interval is a bit exaggerated.
> > > > >> What I mean is that, apart from putting extra strain on your host,
> > > > >> which may or may not be an issue, a 5-second monitoring interval
> > > > >> won't bring you much. In other words, how about your own response
> > > > >> time in case a problem occurs? Is it of the same order?
> > > >
> > > >
> > > > > Would it make more sense to have the timeout and interval equal?  I can see
> > > > > your point.
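
One way to read this advice is to keep the timeout and relax the interval. Below is a sketch of the monitor op from the definition above with only the interval changed. The 30s value is an arbitrary illustration, not a figure recommended anywhere in this thread.

```xml
<!-- Sketch only: same monitor op as in the definition above, with a longer interval. -->
<!-- 30s is an illustrative value; pick one that matches how fast anyone can react. -->
<op id="httpd_mon" interval="30s" name="monitor" timeout="20s" on_fail="restart"/>
```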
> > > >
> > > > >> > Mar 28 10:56:04 test-1 lrmd: [22008]: ERROR: RA lsb:httpd:stop (process
> > > > >> > 22773) failed to redirect stdout for its background child (daemon)
> > > > >> > processes. This will likely cause those processes to die mysteriously at
> > > > >> > some later time (terminated by signal SIGPIPE).
> > > > >> > Mar 28 10:56:04 test-1 lrmd: [22008]: info: RA output: (httpd:stop:stdout)
> > > > >> > httpd (pid 22165 22164 22163 22162 22161 22160 22159 22157 22155) is
> > > > >> > running...
> > > > >> > Mar 28 10:56:04 test-1 crmd: [22011]: WARN: process_lrm_event: lrm.c LRM
> > > > >> > operation (44) stop_0 on httpd Error: (1) unknown error
> > > >>
> > > > >> I'd strongly recommend that you use the OCF RA instead of your
> > > > >> distribution's init script. It is otherwise rather difficult to
> > > > >> figure out what this error means apart from the fact that the stop
> > > > >> op failed. I wonder why it showed up as WARN and not ERROR.
> > >
> > > I agree.  Also, our resource agent monitors apache much better than
> > > status on the LSB init script.
> > >
> > >
> > > --
> > >     Alan Robertson <[EMAIL PROTECTED]>
> > >
> > > "Openness is the foundation and preservative of friendship...  Let
> > > me
> > > claim from you at all times your undisguised opinions." - William
> > > Wilberforce
> > > _______________________________________________
> > > Linux-HA mailing list
> > > [email protected]
> > > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > > See also: http://linux-ha.org/ReportingProblems
> > >
> >
>

