Dominik,

I apologize for leaving resource-stickiness out.  I had it there previously, but 
due to the trial and error I had performed in the crm shell, I had forgotten 
to re-add it.  Unfortunately, adding it back to my cib.xml file does not seem to help.

Here is the chain of events.  This happens on either Nomen or Rubric.

01)  Nomen (one of the two nodes) owns the resource group, called 
Directory_Server.  In the meantime, Rubric (the other node) is just there 
waiting for the resources to come to it.  :)
02)  I stop heartbeat on Nomen and the Directory_Server resource group fails 
over to Rubric.
03)  Nomen's status changes from "running(dc)" to "stopped"
04)  After waiting for step #3 to finish its transition, I start heartbeat back 
up on Nomen.
05)  Nomen's status changes from "stopped" to "running-standby" to "running".
06)  Rubric retains all the resources.  However, all the resources on Rubric 
bounce/restart when Nomen's status changes from "running-standby" to 
"running".

Is there a way to prevent the resources on Rubric from bouncing/restarting when 
Nomen rejoins the cluster?

Help.
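For reference, besides the per-group resource-stickiness meta attribute in the 
cib.xml below, my understanding is that Pacemaker 1.0 also accepts a 
cluster-wide default via the default-resource-stickiness cluster property.  A 
minimal sketch of the crm_config nvpair (the id and the value 100 are just 
illustrative; I have not verified this on my cluster):

```xml
<cluster_property_set id="cib-bootstrap-options">
  <!-- illustrative: cluster-wide default stickiness;
       per-resource resource-stickiness still overrides it -->
  <nvpair id="cib-bootstrap-options-default-resource-stickiness"
          name="default-resource-stickiness" value="100"/>
</cluster_property_set>
```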



On the other hand, you pointed me in the right direction regarding the MailTo 
OCF agent.

This is how the variable looked in .ocf-binaries when it was not working:

rubric ~]# grep MAIL /usr/lib/ocf/resource.d/heartbeat/.ocf-binaries
: ${MAILCMD:=}

I assigned the full path of the mail command to the variable.  Now I get an 
email every time a failover happens.  Wooot!  Wooot!  :)

rubric ~]# grep MAIL /usr/lib/ocf/resource.d/heartbeat/.ocf-binaries
: ${MAILCMD:=/bin/mail}
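Incidentally, that line uses the POSIX default-assignment idiom: `: ${MAILCMD:=/bin/mail}` 
assigns the value only when MAILCMD is unset or empty (the leading `:` is a no-op 
command that merely evaluates its arguments), which is why the earlier 
`: ${MAILCMD:=}` effectively left it blank.  A quick illustration (the 
/usr/bin/mailx path is just an example):

```shell
# ": ${VAR:=default}" assigns the default only when VAR is unset or empty.
unset MAILCMD
: ${MAILCMD:=/bin/mail}
echo "$MAILCMD"          # -> /bin/mail

# When the variable is already set, the default is NOT applied.
MAILCMD=/usr/bin/mailx
: ${MAILCMD:=/bin/mail}
echo "$MAILCMD"          # -> /usr/bin/mailx
```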

Thanks.


Below is my current cib.xml file.

<cib admin_epoch="0" validate-with="pacemaker-1.0" crm_feature_set="3.0" 
have-quorum="1" dc-uuid="27f54ec3-b626-4b4f-b8a6-4ed0b768513c" epoch="102" 
num_updates="0" cib-last-written="Wed Jan 28 08:32:39 2009">
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" 
value="1.0.1-node: 6fc5ce8302abf145a02891ec41e5a492efbe8efe"/>
      </cluster_property_set>
    </crm_config>
    <nodes>
      <node id="5e3e3c2d-55e7-4c51-90be-5c4a1912bf3e" uname="nomen.esri.com" 
type="normal">
        <instance_attributes id="nodes-5e3e3c2d-55e7-4c51-90be-5c4a1912bf3e">
          <nvpair id="standby-5e3e3c2d-55e7-4c51-90be-5c4a1912bf3e" 
name="standby" value="off"/>
        </instance_attributes>
      </node>
      <node id="27f54ec3-b626-4b4f-b8a6-4ed0b768513c" uname="rubric.esri.com" 
type="normal">
        <instance_attributes id="nodes-27f54ec3-b626-4b4f-b8a6-4ed0b768513c">
          <nvpair id="standby-27f54ec3-b626-4b4f-b8a6-4ed0b768513c" 
name="standby" value="off"/>
        </instance_attributes>
      </node>
    </nodes>
    <resources>
      <group id="Directory_Server">
        <meta_attributes id="Directory_Server-meta_attributes">
          <nvpair id="Directory_Server-meta_attributes-collocated" 
name="collocated" value="true"/>
          <nvpair id="Directory_Server-meta_attributes-ordered" name="ordered" 
value="true"/>
          <nvpair id="Directory_Server-meta_attributes-migration-threshold" 
name="migration-threshold" value="1"/>
          <nvpair id="Directory_Server-meta_attributes-failure-timeout" 
name="failure-timeout" value="10s"/>
          <nvpair id="Directory_Server-meta_attributes-resource-stickiness" 
name="resource-stickiness" value="10"/>
        </meta_attributes>
        <primitive class="ocf" id="VIP" provider="heartbeat" type="IPaddr">
          <instance_attributes id="VIP-instance_attributes">
            <nvpair id="VIP-instance_attributes-ip" name="ip" 
value="10.50.26.250"/>
          </instance_attributes>
          <operations id="VIP-ops">
            <op id="VIP-monitor-5s" interval="5s" name="monitor" timeout="5s"/>
          </operations>
        </primitive>
        <primitive class="ocf" id="ECAS" provider="esri" type="ecas">
          <operations id="ECAS-ops">
            <op id="ECAS-monitor-3s" interval="3s" name="monitor" timeout="3s"/>
          </operations>
        </primitive>
        <primitive class="ocf" id="FDS_Admin" provider="esri" type="fdsadm">
          <operations id="FDS_Admin-ops">
            <op id="FDS_Admin-monitor-3s" interval="3s" name="monitor" 
timeout="3s"/>
          </operations>
        </primitive>
        <primitive class="ocf" id="Emergency_Contact" provider="heartbeat" 
type="MailTo">
          <instance_attributes id="Emergency_Contact-instance_attributes">
            <nvpair id="Emergency_Contact-instance_attributes-email" 
name="email" value="[email protected]"/>
            <nvpair id="Emergency_Contact-instance_attributes-subject" 
name="subject" value="Failover Occured"/>
          </instance_attributes>
          <operations id="Emergency_Contact-ops">
            <op id="Emergency_Contact-monitor-3s" interval="3s" name="monitor" 
timeout="3s"/>
          </operations>
        </primitive>
      </group>
    </resources>
    <constraints/>
  </configuration>
</cib>


Regards,
jerome



-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Dominik Klein
Sent: Tuesday, January 27, 2009 10:48 PM
To: General Linux-HA mailing list
Subject: Re: [Linux-HA] Failover not working as I expected

Jerome Yanga wrote:
> Dominik,
>
> Here is the status of the two concerns I needed help on.
>
> 01)  When a node comes back up after a restart of heartbeat, resources get 
> bounced when it rejoins the cluster.
> STATUS:  The resources still get bounced when a node joins the cluster even 
> though I have deleted all the constraints.

Well, your configuration lacks resource-stickiness ;) I think I already
mentioned this in an earlier email.

> 02)  Stopping one resource in a group does not failover the group to the 
> other node.
> STATUS:  migration-threshold works like a charm.  :)  Thanks.
>
> If I may, I have another concern that popped up.
>
> 03)  I cannot seem to get MailTo to work.  I am trying to add this resource 
> under the Directory_Server group so that every time a failover happens, 
> it notifies me.

The configuration of the agent is - as far as I can see - okay. You'd
have to look at the logs to see what it was doing, or trying to do, when it failed.

Also:
Look up your $MAILCMD in /usr/lib/ocf/resource.d/heartbeat/.ocf-binaries
and then try to do something like:

echo "some text for the test email" | $MAILCMD -s "failover occurred" 
[email protected]

If that works (i.e. you receive the email), the agent should also work.

Regards
Dominik

> Below is the current cib.xml file I have.
>
> <cib admin_epoch="0" validate-with="pacemaker-1.0" crm_feature_set="3.0" 
> have-quorum="1" dc-uuid="27f54ec3-b626-4b4f-b8a6-4ed0b768513c" epoch="99" 
> num_updates="0" cib-last-written="Tue Jan 27 12:59:21 2009">
>   <configuration>
>     <crm_config>
>       <cluster_property_set id="cib-bootstrap-options">
>         <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" 
> value="1.0.1-node: 6fc5ce8302abf145a02891ec41e5a492efbe8efe"/>
>       </cluster_property_set>
>     </crm_config>
>     <nodes>
>       <node id="5e3e3c2d-55e7-4c51-90be-5c4a1912bf3e" uname="nomen.esri.com" 
> type="normal">
>         <instance_attributes id="nodes-5e3e3c2d-55e7-4c51-90be-5c4a1912bf3e">
>           <nvpair id="standby-5e3e3c2d-55e7-4c51-90be-5c4a1912bf3e" 
> name="standby" value="off"/>
>         </instance_attributes>
>       </node>
>       <node id="27f54ec3-b626-4b4f-b8a6-4ed0b768513c" uname="rubric.esri.com" 
> type="normal">
>         <instance_attributes id="nodes-27f54ec3-b626-4b4f-b8a6-4ed0b768513c">
>           <nvpair id="standby-27f54ec3-b626-4b4f-b8a6-4ed0b768513c" 
> name="standby" value="off"/>
>         </instance_attributes>
>       </node>
>     </nodes>
>     <resources>
>       <group id="Directory_Server">
>         <meta_attributes id="Directory_Server-meta_attributes">
>           <nvpair id="Directory_Server-meta_attributes-collocated" 
> name="collocated" value="true"/>
>           <nvpair id="Directory_Server-meta_attributes-ordered" 
> name="ordered" value="true"/>
>           <nvpair id="Directory_Server-meta_attributes-migration-threshold" 
> name="migration-threshold" value="1"/>
>           <nvpair id="Directory_Server-meta_attributes-failure-timeout" 
> name="failure-timeout" value="10s"/>
>         </meta_attributes>
>         <primitive class="ocf" id="VIP" provider="heartbeat" type="IPaddr">
>           <instance_attributes id="VIP-instance_attributes">
>             <nvpair id="VIP-instance_attributes-ip" name="ip" 
> value="10.50.26.250"/>
>           </instance_attributes>
>           <operations id="VIP-ops">
>             <op id="VIP-monitor-5s" interval="5s" name="monitor" 
> timeout="5s"/>
>           </operations>
>         </primitive>
>         <primitive class="ocf" id="ECAS" provider="esri" type="ecas">
>           <operations id="ECAS-ops">
>             <op id="ECAS-monitor-3s" interval="3s" name="monitor" 
> timeout="3s"/>
>           </operations>
>         </primitive>
>         <primitive class="ocf" id="FDS_Admin" provider="esri" type="fdsadm">
>           <operations id="FDS_Admin-ops">
>             <op id="FDS_Admin-monitor-3s" interval="3s" name="monitor" 
> timeout="3s"/>
>           </operations>
>         </primitive>
>         <primitive class="ocf" provider="heartbeat" type="MailTo" 
> id="Emergency_Contact">
>           <instance_attributes id="Emergency_Contact-instance_attributes">
>             <nvpair id="Emergency_Contact-instance_attributes-email" 
> name="email" value="[email protected]"/>
>             <nvpair id="Emergency_Contact-instance_attributes-subject" 
> name="subject" value="Failover Occured"/>
>           </instance_attributes>
>           <operations id="Emergency_Contact-ops">
>             <op interval="3s" name="monitor" timeout="3s" 
> id="Emergency_Contact-monitor-3s"/>
>           </operations>
>         </primitive>
>       </group>
>     </resources>
>     <constraints/>
>   </configuration>
> </cib>
>
> Help.
>
> Regards,
> jerome
>
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of Dominik Klein
> Sent: Monday, January 26, 2009 10:52 PM
> To: General Linux-HA mailing list
> Subject: Re: [Linux-HA] Failover not working as I expected
>
> Jerome Yanga wrote:
>> Andrew,
>>
>> I apologize for sending my previous email abruptly.
>>
>> I have followed your recommendation and installed Pacemaker.
>>
>> Here is my config.
>>
>> Packages Installed:
>> heartbeat-2.99.2-6.1
>> heartbeat-common-2.99.2-6.1
>> heartbeat-debug-2.99.2-6.1
>> heartbeat-ldirectord-2.99.2-6.1
>> heartbeat-resources-2.99.2-6.1
>> libheartbeat2-2.99.2-6.1
>> libpacemaker3-1.0.1-3.1
>> pacemaker-1.0.1-3.1
>> pacemaker-debug-1.0.1-3.1
>> pacemaker-pygui-1.4-11.9
>> pacemaker-pygui-debug-1.4-11.9
>>
>>
>>
>> ha.cf:
>> # Logging
>> debug                        1
>> use_logd                     false
>> logfacility                  daemon
>>
>> # Misc Options
>> traditional_compression      off
>> compression                  bz2
>> coredumps                    true
>>
>> # Communications
>> udpport                      691
>> bcast                        eth1 eth0
>> autojoin                     any
>>
>> # Thresholds (in seconds)
>> keepalive                    1
>> warntime                     6
>> deadtime                     10
>> initdead                     15
>>
>> ping 10.50.254.254
>> crm respawn
>>  apiauth     mgmtd   uid=root
>>  respawn     root    /usr/lib/heartbeat/mgmtd -v
>>
>>
>> cib.xml:
>> <cib admin_epoch="0" validate-with="pacemaker-1.0" crm_feature_set="3.0" 
>> have-quorum="1" epoch="57" dc-uuid="5e3e3c2d-55e7-4c51-90be-5c4a1912bf3e" 
>> num_updates="0" cib-last-written="Mon Jan 26 13:57:32 2009">
>>   <configuration>
>>     <crm_config>
>>       <cluster_property_set id="cib-bootstrap-options">
>>         <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" 
>> value="1.0.1-node: 6fc5ce8302abf145a02891ec41e5a492efbe8efe"/>
>>       </cluster_property_set>
>>     </crm_config>
>>     <nodes>
>>       <node id="5e3e3c2d-55e7-4c51-90be-5c4a1912bf3e" uname="nomen.esri.com" 
>> type="normal">
>>         <instance_attributes id="nodes-5e3e3c2d-55e7-4c51-90be-5c4a1912bf3e">
>>           <nvpair id="standby-5e3e3c2d-55e7-4c51-90be-5c4a1912bf3e" 
>> name="standby" value="off"/>
>>         </instance_attributes>
>>       </node>
>>       <node id="27f54ec3-b626-4b4f-b8a6-4ed0b768513c" 
>> uname="rubric.esri.com" type="normal">
>>         <instance_attributes id="nodes-27f54ec3-b626-4b4f-b8a6-4ed0b768513c">
>>           <nvpair id="standby-27f54ec3-b626-4b4f-b8a6-4ed0b768513c" 
>> name="standby" value="off"/>
>>         </instance_attributes>
>>       </node>
>>     </nodes>
>>     <resources>
>>       <group id="Directory_Server">
>>         <meta_attributes id="Directory_Server-meta_attributes">
>>           <nvpair id="Directory_Server-meta_attributes-collocated" 
>> name="collocated" value="true"/>
>>           <nvpair id="Directory_Server-meta_attributes-ordered" 
>> name="ordered" value="true"/>
>>           <nvpair id="Directory_Server-meta_attributes-resource_stickiness" 
>> name="resource_stickiness" value="100"/>
>>         </meta_attributes>
>>         <primitive class="ocf" id="VIP" provider="heartbeat" type="IPaddr">
>>           <instance_attributes id="VIP-instance_attributes">
>>             <nvpair id="VIP-instance_attributes-ip" name="ip" 
>> value="10.50.26.250"/>
>>           </instance_attributes>
>>           <operations id="VIP-ops">
>>             <op id="VIP-monitor-5s" interval="5s" name="monitor" 
>> timeout="5s"/>
>>           </operations>
>>         </primitive>
>>         <primitive class="ocf" id="ECAS" provider="esri" type="ecas">
>>           <operations id="ECAS-ops">
>>             <op id="ECAS-monitor-3s" interval="3s" name="monitor" 
>> timeout="3s"/>
>>           </operations>
>>           <meta_attributes id="ECAS-meta_attributes">
>>             <nvpair id="ECAS-meta_attributes-target-role" name="target-role" 
>> value="Started"/>
>>           </meta_attributes>
>>         </primitive>
>>         <primitive class="ocf" id="FDS_Admin" provider="esri" type="fdsadm">
>>           <operations id="FDS_Admin-ops">
>>             <op id="FDS_Admin-monitor-3s" interval="3s" name="monitor" 
>> timeout="3s"/>
>>           </operations>
>>         </primitive>
>>       </group>
>>     </resources>
>>     <constraints>
>>       <rsc_location id="cli-prefer-Directory_Server" rsc="Directory_Server">
>>         <rule id="cli-prefer-rule-Directory_Server" score="INFINITY" 
>> boolean-op="and">
>>           <expression id="cli-prefer-expr-Directory_Server" 
>> attribute="#uname" operation="eq" value="rubric.esri.com" type="string"/>
>>         </rule>
>>       </rsc_location>
>>       <rsc_location id="cli-prefer-FDS_Admin" rsc="FDS_Admin">
>>         <rule id="cli-prefer-rule-FDS_Admin" score="INFINITY" 
>> boolean-op="and">
>>           <expression id="cli-prefer-expr-FDS_Admin" attribute="#uname" 
>> operation="eq" value="nomen.esri.com" type="string"/>
>>         </rule>
>>       </rsc_location>
>>     </constraints>
>>   </configuration>
>> </cib>
>>
>>
>>
>> I still have the following issues when I only had heartbeat 2.1.3-1.  My 
>> concerns are still as follows:
>>
>> 01)  When a node comes back up after a restart of heartbeat, resources get 
>> bounced when it rejoins the cluster.
>
> Well, you have defined rsc_location constraints with a score of
> INFINITY, so that is expected.
>
>> 02)  Stopping one resource in a group does not failover the group to the 
>> other node.
>
> Look up migration-threshold.
>
> Regards
> Dominik
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>
