On Wed, Apr 16, 2008 at 5:45 PM, Alt, Martin <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I am still experimenting with node fencing in the case of several
> simultaneous node failures and have a question about the stonith
> configuration.
>
> I am failing four nodes simultaneously on a 16-node cluster (the
> four nodes are four HP blades in an enclosure which is switched
> off). Altogether, it works surprisingly well; all resources are
> back online within 50 seconds :-)
>
> However, looking at the logs, it looks like the four nodes are
> shot sequentially, one after the other. The relevant log entries
> on the DC (the node failure is detected at 17:08:50):
>
> pengine[29350]: 2008/04/16_17:09:01 WARN: stage6: Scheduling Node bladed4 for STONITH
> pengine[29350]: 2008/04/16_17:09:01 WARN: stage6: Scheduling Node bladed3 for STONITH
> pengine[29350]: 2008/04/16_17:09:01 WARN: stage6: Scheduling Node bladed2 for STONITH
> pengine[29350]: 2008/04/16_17:09:01 WARN: stage6: Scheduling Node bladed1 for STONITH
Actually, this is the PE, which only says what happens, not in which
order. However, stonith operations are serialized because the stonith
daemon can't handle more than one at a time. Once this limitation no
longer exists, the PE will of course take full advantage of the
parallelism.

> stonithd[3596]: 2008/04/16_17:09:01 info: client tengine [pid: 29349] want a STONITH operation RESET to node bladed4.
> stonithd[3596]: 2008/04/16_17:09:07 info: Succeeded to STONITH the node bladed4: optype=RESET. whodoit: bladea4
> stonithd[3596]: 2008/04/16_17:09:07 info: client tengine [pid: 29349] want a STONITH operation RESET to node bladed3.
> stonithd[3596]: 2008/04/16_17:09:14 info: Succeeded to STONITH the node bladed3: optype=RESET. whodoit: bladeb4
> stonithd[3596]: 2008/04/16_17:09:14 info: client tengine [pid: 29349] want a STONITH operation RESET to node bladed2.
> stonithd[3596]: 2008/04/16_17:09:20 info: Succeeded to STONITH the node bladed2: optype=RESET. whodoit: bladea4
> stonithd[3596]: 2008/04/16_17:09:20 info: client tengine [pid: 29349] want a STONITH operation RESET to node bladed1.
> stonithd[3596]: 2008/04/16_17:09:27 info: Succeeded to STONITH the node bladed1: optype=RESET. whodoit: bladea4
>
> Unfortunately, the stonith agents take their time and since they
> are called one after the other, it takes some 25 seconds to bring
> all hosts down.
>
> Now I wonder if I can somehow configure heartbeat to reset all
> nodes immediately, without waiting for the other nodes.
> Here are the stonith-relevant parts of the cib:
>
>   <crm_config>
>     <cluster_property_set>
>       <attributes>
>         <nvpair name="stonith-enabled" value="true"/>
>       </attributes>
>     </cluster_property_set>
>   </crm_config>
>
>   <resources>
>     <clone id="fencing_enclosure_a">
>       <instance_attributes>
>         <attributes>
>           <nvpair name="clone_max" value="2"/>
>           <nvpair name="clone_node_max" value="1"/>
>         </attributes>
>       </instance_attributes>
>       <primitive id="stonith-hpoa-encla" class="stonith" provider="heartbeat" type="external/hpoa-encla">
>         <operations>
>           <op id="stonih-hpoa-encla_on" name="on" timeout="15s"/>
>           <op id="stonih-hpoa-encla_off" name="off" timeout="15s"/>
>           <op id="stonih-hpoa-encla_status" name="status" timeout="15s"/>
>           <op id="stonih-hpoa-encla_reset" name="reset" timeout="15s"/>
>         </operations>
>       </primitive>
>     </clone>
>     <clone id="fencing_enclosure_b">
>       ... same for enclosure b with Stonith Agent "external/hpoa-enclb"
>     </clone>
>     <clone id="fencing_enclosure_c">
>       ...
>     </clone>
>     <clone id="fencing_enclosure_d">
>       ...
>     </clone>
>   </resources>
>
> Is there any parameter I can tweak so nodes can be shot
> without waiting for other stonith actions in progress?
>
> Thanks and best regards,
> Martin
>
> --
> Dr. Martin Alt
> System und Softwarearchitektur
> Plath GmbH
> Gotenstrasse 18
> D - 20097 Hamburg
> Tel: +49 40/237 34-361
> Fax: +49 40/237 34-173
> Email: [EMAIL PROTECTED]
> http://www.plath.de
>
> Hamburg HRB7401
> Geschäftsführer: Dipl.-Kfm. Nico Scharfe

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
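P.S. To make the serialization cost concrete: each reset in the logs above takes roughly 6-7 seconds, so four serialized resets take ~25 seconds, while four concurrent resets would take about as long as one. A small illustrative sketch (Python, with a scaled-down sleep standing in for the agent call; the node names and delay are hypothetical, this is not what stonithd actually runs):

```python
import threading
import time

AGENT_DELAY = 0.2  # stand-in for the ~6 s a real hpoa reset takes (scaled down)
NODES = ["bladed1", "bladed2", "bladed3", "bladed4"]

def reset(node):
    # hypothetical stand-in for invoking the stonith agent against one node
    time.sleep(AGENT_DELAY)

# serialized, as stonithd currently does it: one reset at a time
t0 = time.time()
for n in NODES:
    reset(n)
serial = time.time() - t0

# parallel, what the PE could request once stonithd handles concurrent ops
t0 = time.time()
threads = [threading.Thread(target=reset, args=(n,)) for n in NODES]
for t in threads:
    t.start()
for t in threads:
    t.join()
parallel = time.time() - t0

print("serial: %.1fs, parallel: %.1fs" % (serial, parallel))
```

With four nodes the serialized loop takes roughly four times as long as the threaded version, which matches the ~25 s vs. ~6 s difference seen in the logs.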
