Re: [Linux-HA] R2 Two-node apache cluster with STONITH

Dejan Muhamedagic Wed, 28 Mar 2007 12:34:01 -0800

On Wed, Mar 28, 2007 at 02:33:28PM -0400, Bjorn Oglefjorn wrote:
> Thanks for the reply Dejan.  My responses are inline.
> --BO
> 
> On 3/28/07, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> >
> >On Wed, Mar 28, 2007 at 11:29:35AM -0400, Bjorn Oglefjorn wrote:
> >> I believe I've corrected some issues, but now I'm getting more of this:
> >> Mar 28 11:02:37 test-1 lrmd: [22008]: ERROR: RA lsb:httpd:monitor (process
> >> 24472) failed to redirect stdout for its background child (daemon)
> >> processes. This will likely cause those processes to die mysteriously at
> >> some later time (terminated by signal SIGPIPE).
> >
> >Hmm, I think that this has been addressed as Alan had already
> >pointed out, probably after the 2.0.7 release. If you can, please
> >upgrade to 2.0.8.
> 
> 
> I'd prefer to stick with the package that comes from CentOS extras (2.0.7).


I'd prefer the other way around :) Unless you have a very good
support contract with your supplier, but in that case we would
probably be hearing from them and not from you :)  For better or
worse (probably the latter, because you people would prefer a more
stable thing :-/), heartbeat development is very fast and the
number of things fixed from one release to the next is
substantial.

> I don't get this error all the time, so I'm not sure why it's happening.
> Can someone give me a deeper explanation of what the lrmd doesn't like here?

I can't. If popen(3), read(2), EAGAIN, and SIGPIPE make any sense
to you, perhaps you can figure it out :) Seriously though, I think
the way lrmd treated EAGAIN in 2.0.7 was a bug.

> >> When I attempt to move resources to another node (using crm_standby) I get
> >> these errors:
> >> Mar 28 10:56:04 test-1 crmd: [22011]: info: do_lrm_rsc_op:lrm.c Performing
> >> op stop on httpd (interval=0ms, key=28:66532759-6190-4321-9be3-07730b15aeae)
> >> Mar 28 10:56:04 test-1 lrmd: [22773]: WARN: For LSB init script, no
> >> additional parameters are needed.
> >
> >Can't say unless you show me this rsc definition, but it seems
> >like bad usage. I found one below, but that one should not cause
> >this problem:
> 
> 
> It's slightly different now (is provider="heartbeat" bad here?):
> 
>         <primitive class="lsb" id="httpd" provider="heartbeat" type="httpd-lsb">
>           <operations>
>             <op id="httpd_mon" interval="5s" name="monitor" timeout="20s" on_fail="restart"/>
>             <op id="httpd_start" name="start" timeout="20s" on_fail="restart" prereq="fencing"/>
>             <op id="httpd_stop" name="stop" timeout="20s" on_fail="restart" prereq="fencing"/>
>           </operations>
>         </primitive>
> 
> >> <primitive class="lsb" id="httpd" provider="heartbeat" type="httpd">
> >> <operations>
> >> <op id="httpd_status" interval="5s" name="status" timeout="20s" on_fail="fence"/>
> >> </operations>
> >> </primitive>
> >
> >One thing that looks odd is 5s interval and 20s timeout. The
> >timeout is probably OK, but the interval is a bit exaggerated.
> >What I mean is that, apart from putting extra strain on your host
> >which may or may not be an issue, a 5 seconds monitoring interval
> >won't bring you much, or, in other words, how about your response
> >time in case a problem occurs? Is it of the same order?
> 
> 
> Would it make more sense to have the timeout and interval equal?  I can see
> your point.

Yes, it would. Actually, you should be OK with the interval set to
5min. Of course, it depends on how critical your app is and what
the _admin's_ response requirements are. If you have something like
a 30 min average response time, which would be quite good I guess,
then 5 min should be fine.

The timeout, on the other hand, depends on what you test. The two
are not related.
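
To make that concrete, here is a rough sketch of the monitor op with the
interval stretched to five minutes. The id is the one from your
definition; interval="300s" is just the 5min from above written in
seconds, and the 20s timeout is carried over unchanged as an assumption
about how long your status check may take, not a recommendation.

```xml
<!-- Sketch only: monitor every 5 minutes, allow each check up to 20s.
     The timeout is taken from the original definition and should be
     sized to what the check actually does, independent of the interval. -->
<op id="httpd_mon" name="monitor" interval="300s" timeout="20s" on_fail="restart"/>
```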

> >> Mar 28 10:56:04 test-1 lrmd: [22008]: ERROR: RA lsb:httpd:stop (process
> >> 22773) failed to redirect stdout for its background child (daemon)
> >> processes. This will likely cause those processes to die mysteriously at
> >> some later time (terminated by signal SIGPIPE).
> >> Mar 28 10:56:04 test-1 lrmd: [22008]: info: RA output: (httpd:stop:stdout)
> >> httpd (pid 22165 22164 22163 22162 22161 22160 22159 22157 22155) is
> >> running...
> >> Mar 28 10:56:04 test-1 crmd: [22011]: WARN: process_lrm_event:lrm.c LRM
> >> operation (44) stop_0 on httpd Error: (1) unknown error
> >
> >I'd strongly recommend that you use the OCF RA instead of your
> >distribution's init script. It is otherwise rather difficult to
> >figure out what this error means apart from the fact that the stop
> >op failed. I wonder why it showed up as WARN and not ERROR.
> 
> 
> I have written a wrapper script for the Red Hat httpd init script so that it
> conforms to the LSB standard.  I have tested it and it conforms properly.
> You can find it attached to this email.

I'm afraid that I'm not in a position to read/verify random init
scripts unless there's no other way around it; there are just too
many to choose from. That's the primary reason I suggested using
the one delivered with HB: It has been designed to work with HB.
It has been tested with HB. And, in case it doesn't work for you,
there's a fair chance that somebody would fix it. That doesn't
sound like a bad deal to me. After all, that's why there's a bunch
of RAs delivered with the package. Besides, you probably won't
believe me, but it is hard enough to keep up as it is.

In your case, the script fails to stop httpd, but never mentions
any reason for that; we can only see that the exit code is 1. I'm
afraid I can't be of much help there.
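
For comparison, a primitive using the apache OCF RA shipped with
heartbeat might look roughly like the sketch below. The ids, the
configfile path, and the op values are assumptions for illustration
(the path is a guess at the usual Red Hat location), so adjust them to
your setup. As far as I know, the RA's monitor checks the server-status
page, so that would need to be enabled in your httpd configuration.

```xml
<!-- Sketch only: ids, path, and timings are illustrative assumptions. -->
<primitive class="ocf" id="httpd" provider="heartbeat" type="apache">
  <operations>
    <op id="httpd_mon" name="monitor" interval="300s" timeout="20s"/>
  </operations>
  <instance_attributes id="httpd_attrs">
    <attributes>
      <!-- configfile is the RA parameter naming the httpd config file -->
      <nvpair id="httpd_configfile" name="configfile" value="/etc/httpd/conf/httpd.conf"/>
    </attributes>
  </instance_attributes>
</primitive>
```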

> >> Mar 28 10:56:04 test-1 cib: [22007]: info: cib_diff_notify:notify.c Update
> >> (client: 22011, call:63): 0.1.86 -> 0.1.87 (ok)
> >> Mar 28 10:56:04 test-1 cib: [22782]: info: write_cib_contents:io.c Wrote
> >> version 0.1.87 of the CIB to disk (digest: 66fffd666a5f41ab337a4a6580bc4f21)
> >>
> >> I have already written a wrapper for apache since the Red Hat init script
> >> does not comply with the LSB standard.  What are these errors trying to tell
> >> me?
> >
> >>
> >> Thanks,
> >> --BO
> >>
> >>
> >> On 3/21/07, Bjorn Oglefjorn <[EMAIL PROTECTED]> wrote:
> >> >
> >> >Does this make more sense?  I've changed the constraints to the way you've
> >> >suggested and I've also changed the scores to INFINITY.  I have since added
> >> >debug logging and made some changes to my STONITH RA.  It kind of works at
> >> >this point, but eventually both nodes get shot in the head if I shut down
> >> >apache on the active node.  I think that the active node is fenced properly
> >> >and the other node then acquires the necessary resources, but then for some
> >> >reason the new active node thinks something is wrong and tries to fence the
> >> >node which was just fenced but ends up fencing itself. *breathes*
> >> >
> >> >I'll have to dig up some logs to illustrate this effect, but can you all
> >> >think of anything off the top of your collective head that would cause
> >> >this?
> >> >--BO
> >> >
> >> >     <constraints>
> >> >       <rsc_location id="test_group_location" rsc="test_group">
> >> >         <rule id="prefered_location_test_group" score="100">
> >> >           <expression attribute="#uname" id="prefered_location_test_group_expr_1" operation="eq" value="ldap-1.host.cdc.advance.net"/>
> >> >         </rule>
> >> >       </rsc_location>
> >> >       <rsc_location id="test-1_drac_location" rsc="test-1_drac">
> >> >         <rule id="prefered_location_test-1_drac" score="INFINITY">
> >> >           <expression attribute="#uname" id="prefered_location_test-1_drac_expr_1" operation="eq" value="ldap-2.host.cdc.advance.net"/>
> >> >         </rule>
> >> >       </rsc_location>
> >> >       <rsc_location id="test-2_drac_location" rsc="test-2_drac">
> >> >         <rule id="prefered_location_test-2_drac" score="INFINITY">
> >> >           <expression attribute="#uname" id="prefered_location_test-2_drac_expr_1" operation="eq" value="ldap-1.host.cdc.advance.net"/>
> >> >         </rule>
> >> >       </rsc_location>
> >> >     </constraints>
> >> >
> >> >On 3/20/07, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> >> >>
> >> >> Hi,
> >> >>
> >> >> I think it does, bar the constraints. Basically, you'd want to
> >> >> have test-2_drac run on test-1, right? Also, I don't know anything
> >> >> about that drac: you should pay attention to whether it supports
> >> >> simultaneous connections. Those devices are typically quite picky
> >> >> when it comes to communicating with the rest of the world.
> >> >> Finally, I think that you've mentioned implementing the RA for
> >> >> this device yourself: why don't you do some debugging output in
> >> >> it.
> >> >>
> >> >> >
> >> >> > Thanks,
> >> >> > --BO
> >> >> >
> >> >> > On 3/20/07, Bjorn Oglefjorn <[EMAIL PROTECTED]> wrote:
> >> >> > >
> >> >> > >More logs.  No debug yet.  I'll get to that today.
> >> >> > >--BO
> >> >> > >
> >> >> > >On 3/20/07, Bjorn Oglefjorn <[EMAIL PROTECTED] > wrote:
> >> >> > >>
> >> >> > >> Odd.  I've changed that op to be "monitor" and now I get this error:
> >> >> > >>
> >> >> > >> Mar 20 13:10:14 test-2 lrmd: [28651]: ERROR: RA lsb:httpd:monitor
> >> >> > >> (process 29131) failed to redirect stdout for its background child (daemon)
> >> >> > >> processes. This will likely cause those processes to die mysteriously at
> >> >> > >> some later time (terminated by signal SIGPIPE).
> >> >> > >>
> >> >> > >> --BO
> >> >> > >>
> >> >> > >> On 3/20/07, Andrew Beekhof < [EMAIL PROTECTED] > wrote:
> >> >> > >> >
> >> >> > >> > On 3/19/07, Bjorn Oglefjorn <[EMAIL PROTECTED]> wrote:
> >> >> > >> > > I'll give the debug logging a chance today.
> >> >> > >> > >
> >> >> > >> > > The only "status" operation I have defined is for the LSB apache script
> >> >> > >> > > here:
> >> >> > >> > >          <primitive class="lsb" id="httpd" provider="heartbeat" type="httpd">
> >> >> > >> > >            <operations>
> >> >> > >> > >              <op id="httpd_status" interval="5s" name="status" timeout="20s" on_fail="fence"/>
> >> >> > >> > >            </operations>
> >> >> > >> > >          </primitive>
> >> >> > >> > >
> >> >> > >> > > How can I tell if this is what that log entry is referring to?  Thanks
> >> >> > >> > > again.
> >> >> > >> > > --BO
> >> >> > >> >
> >> >> > >> > it will be that one.  just call it monitor everywhere.  we'll take
> >> >> > >> > care of translating that into "status" if we need to
> >> >> > >> > _______________________________________________
> >> >> > >> > Linux-HA mailing list
> >> >> > >> > [email protected]
> >> >> > >> > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> >> >> > >> > See also: http://linux-ha.org/ReportingProblems
> >> >> > >> >
> >> >> > >>
> >> >> > >>
> >> >> > >
> >> >> > >
> >> >>
> >> >> --
> >> >> Dejan
> >> >>
> >> >
> >> >
> >
> >--
> >Dejan
> >

-- 
Dejan
