Anyone? Help?
--BO

On 4/2/07, Bjorn Oglefjorn <[EMAIL PROTECTED]> wrote:
Any ideas as to what's going wrong here?
--BO

On 3/30/07, Bjorn Oglefjorn <[EMAIL PROTECTED]> wrote:
>
> I've made the OCF apache RA work by editing the script's parameters for
> now. This is just testing anyway. Attached are my configs and a tarball
> of the logs from the two nodes in question. The logs show one complete
> run of heartbeat, from start to stop. What I did during that time is as
> follows:
>
> 1. Start heartbeat
> 2. Wait for deadtime to expire and resources to start
> 3. Simulate node failure on test-2 by shutting down networking (via
>    console)
> 4. Watch as STONITH fails repeatedly
> 5. Start networking on test-2
> 6. Watch cluster recover and resources move to test-1
> 7. Stop heartbeat
>
> I've made some changes to my cib.xml recently. The largest change is
> that I've made my STONITH declarations normal primitive directives
> instead of clones. The reason is that each STONITH device needs a unique
> definition within the CIB. A DRAC is embedded in the chassis of each node
> and can only work on that particular node (e.g. test-1.drac can only
> reset test-1.domain). I don't want test-1_DRAC being run on the node
> test-1.domain, as a reset operation would then result in suicide (is
> this ever desirable?).
>
> Normal resource recovery and failover is working as expected as far as I
> can tell. I'm only having a problem with STONITH.
>
> I'm struggling to see where I've gone wrong. Please help me figure this
> out as it's integral to several projects I have on my plate at this time.
> Thanks again; you've all been great.
>
> --BO
>
>
> On 3/30/07, Bjorn Oglefjorn <[EMAIL PROTECTED]> wrote:
> >
> > I took a look at the apache RA, but it makes a lot of assumptions
> > about the environment which are mostly untrue in Red Hat. How can I
> > configure this RA short of making changes to the script? Can I set
> > environment variables? I tried setting what's shown in the 'meta-data'
> > output, but with no luck.
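On configuring the apache RA without editing the script: OCF agents take their parameters as OCF_RESKEY_<name> environment variables, and the cluster sets those from the resource's instance attributes in the CIB, so variables exported in your own shell would not normally reach the agent. A rough sketch, assuming the Heartbeat 2.0.x CIB syntax and stock Red Hat paths; all the ids here are made up, and the parameter names (configfile, httpd) should be checked against what the agent's meta-data output lists on your version:

```xml
<!-- Sketch only: ids are hypothetical; paths assume a stock Red Hat/CentOS httpd package -->
<primitive class="ocf" id="httpd" provider="heartbeat" type="apache">
  <instance_attributes id="httpd_attrs">
    <attributes>
      <!-- reaches the agent as OCF_RESKEY_configfile -->
      <nvpair id="httpd_configfile" name="configfile" value="/etc/httpd/conf/httpd.conf"/>
      <!-- reaches the agent as OCF_RESKEY_httpd -->
      <nvpair id="httpd_binary" name="httpd" value="/usr/sbin/httpd"/>
    </attributes>
  </instance_attributes>
</primitive>
```

Note that class is "ocf" here, not "lsb"; the lrmd warning quoted further down ("For LSB init script, no additional parameters are needed") suggests that parameters on an lsb resource are not used.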
> >
> > Thanks as always,
> > --BO
> >
> > On 3/29/07, Alan Robertson <[EMAIL PROTECTED]> wrote:
> > >
> > > Bjorn Oglefjorn wrote:
> > > > Thanks for the reply Dejan. My responses are inline.
> > > > --BO
> > > >
> > > > On 3/28/07, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> > > >>
> > > >> On Wed, Mar 28, 2007 at 11:29:35AM -0400, Bjorn Oglefjorn wrote:
> > > >> > I believe I've corrected some issues, but now I'm getting more
> > > >> > of this:
> > > >> > Mar 28 11:02:37 test-1 lrmd: [22008]: ERROR: RA lsb:httpd:monitor (process 24472) failed to redirect stdout for its background child (daemon) processes. This will likely cause those processes to die mysteriously at some later time (terminated by signal SIGPIPE).
> > > >>
> > > >> Hmm, I think that this has been addressed, as Alan had already
> > > >> pointed out, probably after the 2.0.7 release. If you can, please
> > > >> upgrade to 2.0.8.
> > > >
> > > > I'd prefer to stick with the package that comes from CentOS extras
> > > > (2.0.7). I don't get this error all the time, so I'm not sure why
> > > > it's happening. Can someone give me a deeper explanation of what
> > > > the lrmd doesn't like here?
> > > >
> > > >> > When I attempt to move resources to another node (using
> > > >> > crm_standby) I get these errors:
> > > >> > Mar 28 10:56:04 test-1 crmd: [22011]: info: do_lrm_rsc_op:lrm.cPerforming op stop on httpd (interval=0ms, key=28:66532759-6190-4321-9be3-07730b15aeae)
> > > >> > Mar 28 10:56:04 test-1 lrmd: [22773]: WARN: For LSB init script, no additional parameters are needed.
> > > >>
> > > >> Can't say unless you show me this rsc definition, but it seems
> > > >> like bad usage.
> > > >> I found one below, but that one should not cause this problem:
> > > >
> > > > It's slightly different now (is provider="heartbeat" bad here?):
> > > >
> > > > <primitive class="lsb" id="httpd" provider="heartbeat" type="httpd-lsb">
> > > >   <operations>
> > > >     <op id="httpd_mon" interval="5s" name="monitor" timeout="20s" on_fail="restart"/>
> > > >     <op id="httpd_start" name="start" timeout="20s" on_fail="restart" prereq="fencing"/>
> > > >     <op id="httpd_stop" name="stop" timeout="20s" on_fail="restart" prereq="fencing"/>
> > > >   </operations>
> > > > </primitive>
> > > >
> > > >> > <primitive class="lsb" id="httpd" provider="heartbeat" type="httpd">
> > > >> >   <operations>
> > > >> >     <op id="httpd_status" interval="5s" name="status" timeout="20s" on_fail="fence"/>
> > > >> >   </operations>
> > > >> > </primitive>
> > > >>
> > > >> One thing that looks odd is the 5s interval and 20s timeout. The
> > > >> timeout is probably OK, but the interval is a bit exaggerated.
> > > >> What I mean is that, apart from putting extra strain on your host,
> > > >> which may or may not be an issue, a 5-second monitoring interval
> > > >> won't bring you much. In other words, how about your response
> > > >> time in case a problem occurs? Is it of the same order?
> > > >
> > > > Would it make more sense to have the timeout and interval equal?
> > > > I can see your point.
> > > >
> > > >> > Mar 28 10:56:04 test-1 lrmd: [22008]: ERROR: RA lsb:httpd:stop (process 22773) failed to redirect stdout for its background child (daemon) processes. This will likely cause those processes to die mysteriously at some later time (terminated by signal SIGPIPE).
> > > >> > Mar 28 10:56:04 test-1 lrmd: [22008]: info: RA output: (httpd:stop:stdout) httpd (pid 22165 22164 22163 22162 22161 22160 22159 22157 22155) is running...
> > > >> > Mar 28 10:56:04 test-1 crmd: [22011]: WARN: process_lrm_event: lrm.c LRM operation (44) stop_0 on httpd Error: (1) unknown error
> > > >>
> > > >> I'd strongly recommend that you use the OCF RA instead of your
> > > >> distribution's init script. It is otherwise rather difficult to
> > > >> figure out what this error means apart from the fact that the stop
> > > >> op failed. I wonder why it showed up as WARN and not ERROR.
> > >
> > > I agree. Also, our resource agent monitors apache much better than
> > > status on the LSB init script.
> > >
> > > --
> > > Alan Robertson <[EMAIL PROTECTED]>
> > >
> > > "Openness is the foundation and preservative of friendship... Let me
> > > claim from you at all times your undisguised opinions." - William
> > > Wilberforce
> > > _______________________________________________
> > > Linux-HA mailing list
> > > [email protected]
> > > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > > See also: http://linux-ha.org/ReportingProblems
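On the concern raised in the 3/30 message about test-1_DRAC ending up on test-1.domain: one way to keep a resource off a particular node is a location constraint with a -INFINITY score. A minimal sketch, assuming the Heartbeat 2.0.x CIB syntax; the constraint, rule, and expression ids are hypothetical, and the value has to match whatever `uname -n` reports on that node:

```xml
<!-- Sketch only: ids are made up; rsc and node names are taken from the thread -->
<rsc_location id="loc_test-1_DRAC" rsc="test-1_DRAC">
  <!-- a -INFINITY score forbids the resource on any node matching the expression -->
  <rule id="loc_test-1_DRAC_rule" score="-INFINITY">
    <expression id="loc_test-1_DRAC_expr" attribute="#uname" operation="eq" value="test-1.domain"/>
  </rule>
</rsc_location>
```

A matching constraint for the other node's DRAC resource would be needed as well. This only covers placement; it does not by itself explain why the STONITH operation fails.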
