RE: [Linux-HA] Failover not working as I expected

Jerome Yanga Thu, 29 Jan 2009 10:35:15 -0800

Good evening to you, Dominik.  :)

I apologize for being persistent.  I can work around the situations that I have 
encountered by writing scripts.  However, I thought that there might be 
something in the configuration that I could tweak to make it work.  You have 
been very helpful and that is greatly appreciated.  In fact, you have resolved 
all the situations I encountered except the one you asked me to file a bug 
report on, which I have done so that the product will get better.  Besides, you 
would probably hate to see the project I am working on fall back to MSCS 
(Microsoft Cluster Service) as much as I would.  Oooh...just the thought of the 
project resorting to a Microsoft solution makes me feel like I am losing my 
freedom (I certainly do not want this to happen and will try hard to prevent 
it).


I have submitted this to Bugzilla as you have recommended.  It is registered as 
Bug 2047.

Thank you for your support.

Regards,
jerome
        




-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Dominik Klein
Sent: Wednesday, January 28, 2009 11:19 PM
To: General Linux-HA mailing list
Subject: Re: [Linux-HA] Failover not working as I expected

Good morning Jerome

we should make this a daily thing, shouldn't we?

Jerome Yanga wrote:
> Dominik,
> 
> I apologize for leaving resource-stickiness out.  I had it there previously 
> but, after all the trial and error I went through in the crm shell, I forgot 
> to re-add it.  Nevertheless, adding it to my cib.xml file does not 
> seem to work.
> 
> Here is the chain of events.  This happens on either Nomen or Rubric.
> 
> 01)  Nomen (one of the two nodes) owns the group resource, called 
> Directory_Server.  In the meantime, Rubric (the other node) is just there 
> waiting for the resources to come to him.  :)
> 02)  I stop heartbeat on Nomen and the Directory_Server resource group fails 
> over to Rubric.
> 03)  Nomen's status changes from "running(dc)" to "stopped"
> 04)  After waiting for step #3 to finish its transition, I start heartbeat 
> back up on Nomen.
> 05)  Nomen's status changes from "stopped" to "running-standby" to "running".
> 06)  Rubric retains all the resources.  However, all the resources on Rubric 
> bounce/restart when Nomen's status changes from "running-standby" to 
> "running".

With the configuration you posted below, this should not happen. The
configuration looks good for what you want. If you are sure that this is
what you did and what you got, please file a bug about it and include an
hb_report.

http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker

> Is there a way to prevent the resources on Rubric from bouncing/restarting 
> when Nomen rejoins the cluster?
> 
> Help.
> 
> 
> 
> On the other hand, you pointed me in the right direction regarding the MailTo 
> OCF agent.
> 
> This is what the variable looked like in .ocf-binaries when it was not working.
> 
> rubric ~]# grep MAIL /usr/lib/ocf/resource.d/heartbeat/.ocf-binaries
> : ${MAILCMD:=}
> 
> I assigned the exact path of the mail command to the variable.  Now, I get 
> emailed every time a failover happens.  Wooot!  Wooot!  :)
> 
> rubric ~]# grep MAIL /usr/lib/ocf/resource.d/heartbeat/.ocf-binaries
> : ${MAILCMD:=/bin/mail}

Good. I think this was on the lists earlier. Apparently a packaging issue.
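
For anyone else who runs into this: the line in .ocf-binaries uses the shell's
"assign default" expansion, which only sets the variable when it is unset or
empty. With an empty default, MAILCMD stays empty. What follows is a minimal
sketch of that behaviour, not the agent's actual code. The helper name
resolve_mailcmd and the /usr/bin/mailx path are made up for illustration.

```shell
#!/bin/sh
# resolve_mailcmd DEFAULT [PRESET]
# Hypothetical helper that mimics the ': ${MAILCMD:=DEFAULT}' line from
# .ocf-binaries and prints the value MAILCMD ends up with.
resolve_mailcmd() {
    default=$1
    MAILCMD=${2-}               # optional value set before the file is sourced
    : "${MAILCMD:=$default}"    # assign the default only if unset or empty
    printf '%s\n' "$MAILCMD"
}

# Empty default, as in the broken package: the variable stays empty.
echo "empty default: '$(resolve_mailcmd "")'"

# Default set to the real path, as in the fixed file.
echo "fixed default: '$(resolve_mailcmd /bin/mail)'"

# A value that is already set is left alone (example path is an assumption).
echo "preset value:  '$(resolve_mailcmd /bin/mail /usr/bin/mailx)'"
```

The third case suggests that exporting MAILCMD before the file is sourced
might be an alternative to editing the packaged file. That is an assumption
about how the agent is invoked, so it would need testing.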

Regards
Dominik

> Thanks.
> 
> 
> Below is my current cib.xml file.
> 
> <cib admin_epoch="0" validate-with="pacemaker-1.0" crm_feature_set="3.0" 
> have-quorum="1" dc-uuid="27f54ec3-b626-4b4f-b8a6-4ed0b768513c" epoch="102" 
> num_updates="0" cib-last-written="Wed Jan 28 08:32:39 2009">
>   <configuration>
>     <crm_config>
>       <cluster_property_set id="cib-bootstrap-options">
>         <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" 
> value="1.0.1-node: 6fc5ce8302abf145a02891ec41e5a492efbe8efe"/>
>       </cluster_property_set>
>     </crm_config>
>     <nodes>
>       <node id="5e3e3c2d-55e7-4c51-90be-5c4a1912bf3e" uname="nomen.esri.com" 
> type="normal">
>         <instance_attributes id="nodes-5e3e3c2d-55e7-4c51-90be-5c4a1912bf3e">
>           <nvpair id="standby-5e3e3c2d-55e7-4c51-90be-5c4a1912bf3e" 
> name="standby" value="off"/>
>         </instance_attributes>
>       </node>
>       <node id="27f54ec3-b626-4b4f-b8a6-4ed0b768513c" uname="rubric.esri.com" 
> type="normal">
>         <instance_attributes id="nodes-27f54ec3-b626-4b4f-b8a6-4ed0b768513c">
>           <nvpair id="standby-27f54ec3-b626-4b4f-b8a6-4ed0b768513c" 
> name="standby" value="off"/>
>         </instance_attributes>
>       </node>
>     </nodes>
>     <resources>
>       <group id="Directory_Server">
>         <meta_attributes id="Directory_Server-meta_attributes">
>           <nvpair id="Directory_Server-meta_attributes-collocated" 
> name="collocated" value="true"/>
>           <nvpair id="Directory_Server-meta_attributes-ordered" 
> name="ordered" value="true"/>
>           <nvpair id="Directory_Server-meta_attributes-migration-threshold" 
> name="migration-threshold" value="1"/>
>           <nvpair id="Directory_Server-meta_attributes-failure-timeout" 
> name="failure-timeout" value="10s"/>
>           <nvpair id="Directory_Server-meta_attributes-resource-stickiness" 
> name="resource-stickiness" value="10"/>
>         </meta_attributes>
>         <primitive class="ocf" id="VIP" provider="heartbeat" type="IPaddr">
>           <instance_attributes id="VIP-instance_attributes">
>             <nvpair id="VIP-instance_attributes-ip" name="ip" 
> value="10.50.26.250"/>
>           </instance_attributes>
>           <operations id="VIP-ops">
>             <op id="VIP-monitor-5s" interval="5s" name="monitor" 
> timeout="5s"/>
>           </operations>
>         </primitive>
>         <primitive class="ocf" id="ECAS" provider="esri" type="ecas">
>           <operations id="ECAS-ops">
>             <op id="ECAS-monitor-3s" interval="3s" name="monitor" 
> timeout="3s"/>
>           </operations>
>         </primitive>
>         <primitive class="ocf" id="FDS_Admin" provider="esri" type="fdsadm">
>           <operations id="FDS_Admin-ops">
>             <op id="FDS_Admin-monitor-3s" interval="3s" name="monitor" 
> timeout="3s"/>
>           </operations>
>         </primitive>
>         <primitive class="ocf" id="Emergency_Contact" provider="heartbeat" 
> type="MailTo">
>           <instance_attributes id="Emergency_Contact-instance_attributes">
>             <nvpair id="Emergency_Contact-instance_attributes-email" 
> name="email" value="[email protected]"/>
>             <nvpair id="Emergency_Contact-instance_attributes-subject" 
> name="subject" value="Failover Occured"/>
>           </instance_attributes>
>           <operations id="Emergency_Contact-ops">
>             <op id="Emergency_Contact-monitor-3s" interval="3s" 
> name="monitor" timeout="3s"/>
>           </operations>
>         </primitive>
>       </group>
>     </resources>
>     <constraints/>
>   </configuration>
> </cib>
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
