Re: [Linux-HA] "Clones, Stonith and Suicide" The SysAdmin who had a nervous breakdown.

Dave Blaschke Wed, 03 Oct 2007 08:13:34 -0700

Dejan Muhamedagic wrote:

Hi,


On Tue, Oct 02, 2007 at 10:55:03PM +0100, Peter Farrell wrote:

On 02/10/2007, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:

Hi,

On Tue, Oct 02, 2007 at 05:17:38PM +0100, Peter Farrell wrote:

Can someone verify my CIB please?

It's not working as intended and the more I read the less I understand...
I've stared at the config for the past 2 days hoping to be struck by
sudden understanding... hasn't happened yet.

Don't worry, the learning curve is extremely steep. We all need
quite some patience.

I don't understand how you make a rule, and then call that rule as a
result of an action. I used the bit from the pingd FAQ page:
http://www.linux-ha.org/v2/faq/pingd
"Quickstart - Only Run my_resource on Nodes with Access to at Least
One Ping Node"

So - for my pingd clone, the operation is 'monitor' and 'on_fail=fence'
<op id="pingd-child-monitor" name="monitor" interval="20s"
timeout="40s" prereq="nothing" on_fail="fence"/>

I assume that this literally means:
"ask the LRM to see if pingd is running every 20s, if after 40s pingd
is not running, call it 'failed', and as it's 'failed' - fence it off,
which forces the resource to migrate to another node and marks this
one as 'degraded' and will not allow resource to run until it's been
cleaned up"

Is that right? If so, then this bit I'm OK with.

No, not exactly. The monitor operation may fail (i.e. the
resource agent says that the resource isn't running) or time out
(that's what you described). Of course, both are considered to be
failures by CRM. on_fail=fence means that if this operation
fails, the node will be fenced, i.e. rebooted if you have an
operational stonith device. Perhaps a tad harsh for a monitor
failure.
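
For comparison, a less drastic variant of the same monitor operation might look like the fragment below. This is only a sketch: it reuses the op quoted earlier in the thread and changes nothing but on_fail, on the assumption that "restart" is among the values the Heartbeat v2 CIB DTD accepts for that attribute. Check the crm.dtd shipped with your version before relying on it.

```xml
<!-- Same monitor op as above; only on_fail differs.             -->
<!-- Assumption: on_fail="restart" is valid in your CIB's DTD.   -->
<!-- A failed or timed-out monitor then restarts the resource    -->
<!-- instead of fencing (rebooting) the whole node.              -->
<op id="pingd-child-monitor" name="monitor" interval="20s"
    timeout="40s" prereq="nothing" on_fail="restart"/>
```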

1. The approach for me is (this is a test cluster - but I want to use
it to replace a production one) - if either of the load balancers
can't ping one or two routers in my DMZ, then this must mean they're
dead. I figured if they can't see the router - how the hell can they
see the apache servers they're meant to be managing?
Is this 'correct political thought' or a sloppy foundation to begin with?


It's just that the resources _are_ going to move. No need to kill
the cooperating node.

2. I didn't know that fence meant 'rebooted'. I thought it was sort of
'fenced off' and left in a degraded state should someone want to poke
around a bit.
RE: Perhaps a tad harsh for a monitor failure - I agree. But what's a
girl to do?
Am I on the right track here? Do I want it rebooting? Do I just want
Heartbeat to restart? Does it matter? If it comes up and the link is
still dead - will it cycle forever w/ reboots?


Not sure, but could be. Whenever a node comes up all resources
are probed, i.e. one monitor operation is fired.

3. the real bit I'm missing: Let's say I want it rebooted after
fencing.


Fencing _is_ rebooting.

Sorry to jump in the middle of this thread, but can't you also power off
the node by setting stonith_action to poweroff instead of reboot? Of
course you need a stonith device that supports ST_POWEROFF... I haven't
read through the code but I'd assume that option works.
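
A rough sketch of where such a setting could live in a Heartbeat v2 CIB is below. The option name stonith_action and the value poweroff come from the paragraph above; the surrounding crm_config layout and the id values are assumptions for illustration and may differ between versions, and, as noted, whether poweroff actually takes effect depends on the stonith device.

```xml
<!-- Sketch only: the ids below are hypothetical, and the nesting -->
<!-- assumes the usual Heartbeat v2 crm_config layout.            -->
<crm_config>
  <cluster_property_set id="cib-bootstrap-options">
    <attributes>
      <!-- Ask for power-off rather than reboot when a node is fenced. -->
      <nvpair id="cib-bootstrap-options-stonith_action"
              name="stonith_action" value="poweroff"/>
    </attributes>
  </cluster_property_set>
</crm_config>
```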


_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
