Re: [Linux-HA] "Clones, Stonith and Suicide" The SysAdmin who had a nervous breakdown.

Dejan Muhamedagic Wed, 03 Oct 2007 06:11:32 -0700

Hi,

On Tue, Oct 02, 2007 at 10:55:03PM +0100, Peter Farrell wrote:
> On 02/10/2007, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > On Tue, Oct 02, 2007 at 05:17:38PM +0100, Peter Farrell wrote:
> > > Can someone verify my CIB please?
> > >
> > > It's not working as intended and the more I read the less I understand...
> > > I've stared at the config for the past 2 days hoping to be struck by
> > > sudden understanding... hasn't happened yet.
> >
> > Don't worry, the learning curve is extremely steep. We all need
> > quite some patience.
> >
> > > I don't understand how you make a rule, and then call that rule as a
> > > result of an action. I used the bit from the pingd FAQ page:
> > > http://www.linux-ha.org/v2/faq/pingd
> > > "Quickstart - Only Run my_resource on Nodes with Access to at Least
> > > One Ping Node"
> > >
> > > So - for my pingd clone, the operation is 'monitor' and 'on_fail=fence'
> > > <op id="pingd-child-monitor" name="monitor" interval="20s"
> > > timeout="40s" prereq="nothing" on_fail="fence"/>
> > >
> > > I assume that this literally means:
> > > "ask the LRM to see if pingd is running every 20s, if after 40s pingd
> > > is not running, call it 'failed', and as it's 'failed' - fence it off,
> > > which forces the resource to migrate to another node and marks this
> > > one as 'degraded' and will not allow resource to run until it's been
> > > cleaned up"
> > >
> > > Is that right? If so, then this bit I'm OK with.
> >
> > No, not exactly. The monitor operation may fail (i.e. the
> > resource agent says that the resource isn't running) or timeout
> > (that's what you described). Of course, both are considered to be
> > failures by CRM. on_fail=fence means that if this operation
> > fails, the node will be fenced, i.e. rebooted if you have an
> > operational stonith device. Perhaps a tad harsh for a monitor
> > failure.
> 
> 1. The approach for me is (this is a test cluster - but I want to use
> it to replace a production one) - if either of the load balancers
> can't ping one or two routers in my DMZ, then this must mean they're
> dead. I figured if they can't see the router - how the hell can they
> see the apache servers they're meant to be managing?
> Is this 'correct political thought' or a sloppy foundation to begin with?


It's just that the resources _are_ going to move. No need to kill
the cooperating node.
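
As a rough sketch of a gentler setup: the op below is your own monitor
operation with only the on_fail value changed. It assumes the on_fail
values from the Heartbeat 2.x CIB DTD, where "restart" recovers the
resource instead of the node.

```xml
<!-- assumption: on_fail="restart" recovers pingd only; the node is not fenced -->
<op id="pingd-child-monitor" name="monitor" interval="20s"
    timeout="40s" prereq="nothing" on_fail="restart"/>
```

Your -INFINITY rule on the pingd attribute is what moves group_1 off a
node without connectivity, so the monitor op itself doesn't need to
fence anything for that to happen.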

> 2. I didn't know that fence meant 'rebooted'. I thought it was sort of
> 'fenced off' and left in a degraded state should someone want to poke
> around a bit.
> RE: Perhaps a tad harsh for a monitor failure - I agree. But what's a
> girl to do?
> Am I on the right track here? Do I want it rebooting? Do I just want
> Heartbeat to restart? Does it matter? If it comes up and the link is
> still dead - will it cycle forever w/ reboots?

Not sure, but could be. Whenever a node comes up all resources
are probed, i.e. one monitor operation is fired.

> 3. the real bit I'm missing: Let's say I want it rebooted after
> fencing.

Fencing _is_ rebooting.

> What 'commands it' to do so? Just the flag 'on_fail=fence'?

Yes. Other things trigger fencing too. For example, if a node has
quorum but cannot establish the state of another node, it kills
the other node to make sure.

> Does that automatically look for a started stonith device  or resource
> and if it finds one, it just uses it?

Yes.

> I mean - how does the stonith
> suicide (which doesn't work - but suppose for a minute it did) - how
> is it connected to another operational directive?

Not sure why this confuses you. Once a decision has been
reached that a node should be fenced (rebooted), the cluster will
try to find a means to do that. That means is a stonith resource.
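
For the cluster to act on such a decision, stonith also has to be
switched on in the cluster options. A minimal sketch follows. It assumes
the option is spelled stonith_enabled in your release (some 2.x
releases use stonith-enabled instead), and the ids are made up:

```xml
<crm_config>
  <cluster_property_set id="cib-bootstrap-options">
    <attributes>
      <!-- assumption: option name as used by early Heartbeat 2.x releases -->
      <nvpair id="cib-bootstrap-options-stonith_enabled"
              name="stonith_enabled" value="true"/>
    </attributes>
  </cluster_property_set>
</crm_config>
```

It's worth checking the crm_config section of your CIB for this if the
stonith resource is started but never called.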

> > > But - the 'dampen and multiplier' - I don't get.
> > > <nvpair id="pingd-dampen" name="dampen" value="5s"/>
> > > Does this mean: once pingd says "there's nothing out there", wait
> > > 5 seconds, then write it out to the CIB and let any actions take
> > > place?
> >
> > Yes, the cluster sort of stands back a bit until everything
> > settles.
> >
> > > <nvpair id="pingd-multiplier" name="multiplier" value="100"/>
> > > This is a weighted score thing right? It's adding 100 to each node
> > > that 'can' ping?
> >
> > Right.
> 
> Can you control the frequency of the pings themselves? What
> constitutes a timeout in this case?  (n) packets lost? latency?

I don't know. But it's supposed to do the "right thing".

> > > So if one can't ping, then the score gets knocked down and the
> > > resource wants to move to a "higher scoring" node?? I completely don't
> > > understand this... What if you already have a constraint set for a
> > > node preference, does this override it? Conflict with it?
> >
> > The node with the highest score is chosen to run the service. If
> > there's more than one with the same score, then one's chosen at
> > (pseudo)random. If no score is non-negative then the resource
> > can't run anywhere.
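
To make the multiplier count toward placement, the pingd FAQ page has
a second recipe that uses the attribute itself as the score. A sketch,
assuming the score_attribute rule syntax from that page (the ids here
are made up):

```xml
<rsc_location id="group_1:connected" rsc="group_1">
  <!-- each reachable ping node adds the multiplier (100) to the pingd
       attribute; this rule adds that value to the node's score -->
  <rule id="group_1:prefer_connected:rule" score_attribute="pingd">
    <expression id="group_1:prefer_connected:expr:defined"
                attribute="pingd" operation="defined"/>
  </rule>
</rsc_location>
```

Scores from all matching rules are added up, so a rule like this
competes with the score="200" preference for dmz1 rather than
overriding it.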
> >
> > > In any case - now that my node has no ping, and is fenced, I saw
> > > another bit of code called 'DoFencing' which I modified thinking it
> > > would now cause the node to commit suicide since it had no
> > > connectivity. But I've no idea about how it's meant to work... It's
> > > saying "your clone DoFencing is stonith via suicide" right?
> >
> > I don't know until I see the code you're talking about. Typically
> > though stonith resources are configured to reboot other nodes and
> > not commit suicides. There's a special stonith agent called
> > suicide for this purpose.
> >
> > > What do the clone_max and clone_node_max mean?
> > > Is clone_max = 2, mean that there are a maximum of 2 nodes that use
> > > it? 2 stonith daemons that run on each node? What? Ditto for
> > > clone_node_max?
> >
> > clone_max: The maximum number of instances of this clone in the cluster.
> 
> What is the guidance for this? Should you have one per machine? One in 
> general?

There's no guidance. Clones are just useful if you want to have
more than one instance of a resource. Typically this is set to
the number of nodes.

> > clone_node_max: The maximum number of instances of this clone at one node.
> 
> Ditto above: Should I have a stonith clone per resource / per node? Or
> just one?

One typical example for clones is an NFS filesystem. If one wants
it mounted on all nodes, a cloned Filesystem resource suffices.
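
As an illustration only, such a clone could look like the sketch below
on a two-node cluster. The ids, the NFS export and the mount point are
hypothetical; device, directory and fstype are the usual parameters of
the Filesystem resource agent.

```xml
<clone id="nfs_clone">
  <instance_attributes id="nfs_clone_inst_attr">
    <attributes>
      <!-- two instances in the whole cluster, at most one per node -->
      <nvpair id="nfs_clone_clone_max" name="clone_max" value="2"/>
      <nvpair id="nfs_clone_clone_node_max" name="clone_node_max" value="1"/>
    </attributes>
  </instance_attributes>
  <primitive id="nfs_mount" class="ocf" provider="heartbeat" type="Filesystem">
    <instance_attributes id="nfs_mount_inst_attr">
      <attributes>
        <!-- hypothetical NFS export and mount point -->
        <nvpair id="nfs_mount_device" name="device" value="nfsserver:/export/data"/>
        <nvpair id="nfs_mount_directory" name="directory" value="/mnt/data"/>
        <nvpair id="nfs_mount_fstype" name="fstype" value="nfs"/>
      </attributes>
    </instance_attributes>
  </primitive>
</clone>
```

The same pattern (clone_max = number of nodes, clone_node_max = 1)
applies to a stonith clone that should be available on every node.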

> > > As for the operations on the DoFencing clone - what are they
> > > triggering? The timeouts are for what? the stonith daemon itself? Am I
> > > calling the stonith daemon itself to commit suicide? If so - why would
> > > I have a monitor or start operation?
> >
> > This is admittedly a bit confusing. The start operation doesn't
> > do anything with the device, just makes it available. The stop
> > operation is the opposite. In other words, in order for the
> > stonith device to be used it must first be started.
> 
> So for any stonith resource, (using suicide / ssh methods) I'll always
> want to have
> monitor, start & stop?

Just start and monitor. Normally, you don't have to use stop.

> Monitor for the cluster to use it, start to see it and stop is
> effectively the 'reboot' bit?

No, the stop bit is to stop the stonith resource.

> > The monitor operation is essential because the cluster wants to
> > make sure that the stonith device is operational. Typically, it
> > consists of logging into the device and requesting some kind of
> > status.
> >
> > The timeouts are for the operations on which they are defined.
> > The start operation implies a monitor.
> >
> > > Do you need a constraint with a rule to 'start' this resource? ie.
> > > kill myself? Does it just 'know' to do this? I'm really not getting
> > > it.
> >
> > Under some circumstances it is necessary to ensure that a node
> > has relinquished resources. A typical example is a failed stop
> > operation. In that case the CRM will issue a RESET or POWEROFF
> > request to the eligible stonith device.
> 
> So - the previous 'on_fail=fence' for the pingd clone - where would
> that go ideally?

It's really simple: on_fail tells the cluster what to do if the
operation it is set on fails.

> (I mean - on which operation?)
> Ping needs a monitor and needs a start. Does it need a stop?

No.

> > > <clone id="DoFencing">
> > >  <instance_attributes>
> > >   <attributes>
> > >     <nvpair name="clone_max" value="2"/>
> > >     <nvpair name="clone_node_max" value="1"/>
> > >   </attributes>
> > >  </instance_attributes>
> > > <primitive class="stonith" id="child_DoFencing" type="suicide"
> > > provider="heartbeat">
> > >  <operations>
> > >   <op name="monitor" interval="5s" timeout="20s" prereq="nothing"/>
> > >   <op name="start" timeout="20s" prereq="nothing"/>
> > >  </operations>
> > > </primitive>
> > > </clone>
> >
> > The suicide stonith device is not exactly the best approach.
> > Ultimately it is not reliable, so it should not be used on
> > production clusters. If you can afford it, get a real (hardware)
> > stonith device.
> 
> Can't. No budget. Advice taken - I'll have to kill these via SSH or suicide.

Note that once the cluster decides to stonith (reset) a node, it
will keep trying forever. Hence, if your stonith device is not
operational at that time, the cluster will basically block.
That's also why using ssh as a stonith device is dangerous. For
example, if the power supply of a node fails, ssh can no longer
reach it, the reset never succeeds, and the surviving node will
never take over the resources.
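
For a test cluster only, an ssh-based stonith clone might be sketched
like this. It mirrors your DoFencing clone; hostlist is the ssh
plugin's parameter naming the nodes it may reset, and the second node
name and all ids are hypothetical.

```xml
<clone id="ssh_fencing">
  <instance_attributes id="ssh_fencing_inst_attr">
    <attributes>
      <nvpair id="ssh_fencing_clone_max" name="clone_max" value="2"/>
      <nvpair id="ssh_fencing_clone_node_max" name="clone_node_max" value="1"/>
    </attributes>
  </instance_attributes>
  <primitive id="child_ssh_fencing" class="stonith" type="ssh" provider="heartbeat">
    <operations>
      <op id="child_ssh_fencing_mon" name="monitor" interval="5s"
          timeout="20s" prereq="nothing"/>
      <op id="child_ssh_fencing_start" name="start" timeout="20s"
          prereq="nothing"/>
    </operations>
    <instance_attributes id="child_ssh_fencing_inst_attr">
      <attributes>
        <!-- nodes this device may reset; dmz2 is a made-up name -->
        <nvpair id="child_ssh_fencing_hostlist" name="hostlist"
                value="dmz1.scarceskills.com dmz2.scarceskills.com"/>
      </attributes>
    </instance_attributes>
  </primitive>
</clone>
```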

> I set up ssh keys for every user, root, haclient, hacluster - they
> always fail authentication.
> How can you tell which user / method it's using?

ssh uses the root user. Check for yourself that root can log in
to the other node without a password.

> Can you set which
> interface they use (in order to force it (ssh) down the crossover
> cables?)

No. It's as if you run ssh on the command line.

> >
> > > Intended actions:
> > > > node1 loses ping, (which in my world means that it's dead)
> > > > resources migrate to node2
> > > > node1 reboots (what I really want is for the fenced resource to be 
> > > > 'cleaned up' so that it can run again on this node - I'm not fussy 
> > > > about how I achieve that)
> > > > resource migrates back to node1 once ping (connectivity is restored).
> >
> > Rebooting a node should imply a resource cleanup. In the next
> > release the cluster will also be able to "forget" after some time
> > about the failure.
> >
> > > actual actions:
> > > > node1 loses ping,
> > > > resource migrates to node2.
> >
> > And node1 is not rebooted? Then there's a problem with the
> > stonith setup. Any errors in logs?
> 
> It's never called. I've cocked up the config by experimenting via
> 'cut-n-paste' rather than taking the time to understand the thing
> properly. Having said that I've read the docs, watched Alan down
> under, trawled the lists and re-arranged others configs, but it's
> still pretty random. Plus it's been 2 weeks and I'm an instant
> gratification kind of guy, so I'm out of my comfort zone and getting a
> little pissed (w/ myself) now :-)
> 
> I just don't get how it (stonith) is called in relation to another
> resource failing. The mechanism, the relationships. It's not just
> stonith, for example if the ping failed and I wanted to start apache
> on a cluster node to take over all IP addresses and serve up a 'temp.
> out of service' page I wouldn't have the foggiest.

Well, it definitely takes some time to get used to it.

Thanks,

Dejan

> > > > node2 loses ping but 'resource cannot run anywhere' ensues and both 
> > > > nodes are 'active' but no resources are being run.
> > >
> > > I think fundamentally my approach is wrong and that I should leave it
> > > to fail and have human intervention to clean it up rather than hope it
> > > will flip flop between nodes.
> >
> > That depends on your needs of course. At any rate, it should be
> > possible to configure the cluster to fit those needs.
> >
> > There is also the meatware stonith device which will prompt a
> > human to clean up/reboot.
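
A minimal sketch of a meatware stonith resource, assuming it takes the
same hostlist parameter as the ssh plugin (ids and the second node name
are hypothetical):

```xml
<primitive id="meatware_fencing" class="stonith" type="meatware" provider="heartbeat">
  <operations>
    <op id="meatware_fencing_start" name="start" timeout="20s" prereq="nothing"/>
  </operations>
  <instance_attributes id="meatware_fencing_inst_attr">
    <attributes>
      <!-- nodes a human operator may be asked to reset by hand -->
      <nvpair id="meatware_fencing_hostlist" name="hostlist"
              value="dmz1.scarceskills.com dmz2.scarceskills.com"/>
    </attributes>
  </instance_attributes>
</primitive>
```

With this device the cluster waits until the operator confirms the
reset (as far as I know via the meatclient tool), so resources stay
blocked until somebody acts.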
> >
> > > But - I'd like to have a better grasp of
> > > how V2 works in general before making the choice to fall back to a
> > > simpler config.
> >
> > HTH.
> 
> It has. Thanks a lot.
> 
> -Peter
> 
> > Thanks,
> >
> > Dejan
> >
> > > -Peter
> > >
> > >
> > > Active / Passive set up.
> > > 2 nodes, one resource (ldirectord) balancing traffic for IP addresses
> > > on 2 web servers.
> > > 2 nics [eth0: dmz facing - eth1: crossover cable, on 10.0.0.1/2]
> > >
> > > This relates to the previous post:
> > > "How can you clean up a degraded node w/out killing it (and not 
> > > manually)?"
> > >
> > > Versions:
> > > heartbeat-stonith-2.1.2-3.el4.centos
> > > heartbeat-pils-2.1.2-3.el4.centos
> > > heartbeat-ldirectord-2.1.2-3.el4.centos
> > > heartbeat-2.1.2-3.el4.centos
> >
> > > <resources>
> > >       <group id="group_1">
> > >               <primitive class="ocf" id="IPaddr_212_140_130_37" 
> > > provider="heartbeat" type="IPaddr">
> > >                       <operations>
> > >                               <op id="IPaddr_212_140_130_37_mon" 
> > > interval="5s" name="monitor" timeout="5s"/>
> > >                       </operations>
> > >                       <instance_attributes 
> > > id="IPaddr_212_140_130_37_inst_attr">
> > >                               <attributes>
> > >                                       <nvpair 
> > > id="IPaddr_212_140_130_37_attr_0" name="ip" value="212.140.130.37"/>
> > >                               </attributes>
> > >                       </instance_attributes>
> > >               </primitive>
> > >               <primitive class="ocf" id="IPaddr_212_140_130_38" 
> > > provider="heartbeat" type="IPaddr">
> > >                       <operations>
> > >                               <op id="IPaddr_212_140_130_38_mon" 
> > > interval="5s" name="monitor" timeout="5s"/>
> > >                       </operations>
> > >                       <instance_attributes 
> > > id="IPaddr_212_140_130_38_inst_attr">
> > >                               <attributes>
> > >                                       <nvpair 
> > > id="IPaddr_212_140_130_38_attr_0" name="ip" value="212.140.130.38"/>
> > >                               </attributes>
> > >                       </instance_attributes>
> > >               </primitive>
> > >               <primitive class="ocf" id="ldirectord_3" 
> > > provider="heartbeat" type="ldirectord">
> > >                       <operations>
> > >                               <op id="ldirectord_3_mon" interval="120s" 
> > > name="monitor" timeout="60s"/>
> > >                       </operations>
> > >                       <instance_attributes id="ldirectord_3_inst_attr">
> > >                               <attributes>
> > >                                       <nvpair id="ldirectord_3_attr_1" 
> > > name="1" value="ldirectord.cf"/>
> > >                               </attributes>
> > >                       </instance_attributes>
> > >               </primitive>
> > >       </group>
> > >       <clone id="pingd">
> > >               <instance_attributes id="pingd">
> > >                       <attributes>
> > >                               <nvpair id="pingd-clone_node_max" 
> > > name="clone_node_max" value="1"/>
> > >                       </attributes>
> > >               </instance_attributes>
> > >               <primitive id="pingd-child" provider="heartbeat" 
> > > class="ocf" type="pingd">
> > >                       <operations>
> > >                               <op id="pingd-child-monitor" name="monitor" 
> > > interval="20s" timeout="40s" prereq="nothing" on_fail="fence"/>
> > >                       </operations>
> > >                       <instance_attributes id="pingd_inst_attr">
> > >                               <attributes>
> > >                                       <nvpair id="pingd-dampen" 
> > > name="dampen" value="5s"/>
> > >                                       <nvpair id="pingd-multiplier" 
> > > name="multiplier" value="100"/>
> > >                               </attributes>
> > >                       </instance_attributes>
> > >               </primitive>
> > >       </clone>
> > >       <clone id="DoFencing">
> > >               <instance_attributes>
> > >                       <attributes>
> > >                               <nvpair name="clone_max" value="2"/>
> > >                               <nvpair name="clone_node_max" value="1"/>
> > >                       </attributes>
> > >               </instance_attributes>
> > >               <primitive id="child_DoFencing" class="stonith" 
> > > type="suicide" provider="heartbeat">
> > >                       <operations>
> > >                               <op name="monitor" interval="5s" 
> > > timeout="20s" prereq="nothing"/>
> > >                               <op name="start" timeout="20s" 
> > > prereq="nothing"/>
> > >                       </operations>
> > >               </primitive>
> > >       </clone>
> > > </resources>
> > > <constraints>
> > >       <rsc_location rsc="group_1" id="rsc_location_group_1">
> > >               <rule id="prefered_location_group_1" score="200">
> > >                       <expression attribute="#uname" 
> > > id="prefered_location_group_1_expr" operation="eq" 
> > > value="dmz1.scarceskills.com"/>
> > >               </rule>
> > >               <rule id="group_1:connected:rule" score="-INFINITY" 
> > > boolean_op="and">
> > >                       <expression id="my_resource:connected:expr:zero" 
> > > attribute="pingd" operation="lte" value="0"/>
> > >               </rule>
> > >       </rsc_location>
> > >       <rsc_location id="cli-prefer-group_1" rsc="group_1">
> > >               <rule id="cli-prefer-rule-group_1" score="INFINITY">
> > >                       <expression id="cli-prefer-expr-group_1" 
> > > attribute="#uname" operation="eq" value="dmz1.scarceskills.com" 
> > > type="string"/>
> > >               </rule>
> > >       </rsc_location>
> > > </constraints>
> > > _______________________________________________
> > > Linux-HA mailing list
> > > [email protected]
> > > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > > See also: http://linux-ha.org/ReportingProblems