> Assaf N wrote:
> > Hello,
> >
> > I started a small test cluster using heartbeat 2.1.1. The cluster contains
> one simple master/slave resource.
> >
> > While playing around with this cluster, I've noticed that whenever the
> resource is promoted to be the master on a machine, Heartbeat stops calling
> its monitor operation on this node. A quick look on the ha-debug log reveals
> that the monitor op is stopped intentionally, because of the resource
> promotion. However, there is no restarting of this op once the node becomes
> the master. When a second node starts and its resource takes the master
> role, our demoted resource starts to be monitored again.
> >
> > I'm attaching my cib.xml, ha-debug and the resource agent script. Do I
> have a configuration error, or have I encountered a bug?
>
> please refer to the following conversation and tell us whether this
> resolves your issue:
>
> http://www.gossamer-threads.com/lists/linuxha/users/42529
>
Thanks, it does resolve my issue. How embarrassing to discover it was answered
a few days ago... I searched the list a few days before it was posted, and
neglected to search again before sending my question... :-)
Now I've encountered a new issue. The 'success' return code from the monitor
function is supposed to be 0 while the resource is a slave and 8 while it's a
master, right? This holds when the resource is first started, but after the
resource has been promoted and then demoted, Heartbeat still treats 8 as the
success return value, even though the resource is no longer a master. If I
return 0 instead, the resource is stopped and restarted, and the expected
success value becomes 0 again. Is this intentional?
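For reference, the return codes in question come from the OCF resource agent
conventions: 0 (OCF_SUCCESS) means the resource is running as a slave, 7
(OCF_NOT_RUNNING) means it is cleanly stopped, and 8 (OCF_RUNNING_MASTER) means
it is running as master. A minimal sketch of a monitor function reporting these
codes (the state file and function name here are hypothetical stand-ins; a real
agent would probe the actual service):

```shell
#!/bin/sh
# OCF return codes relevant to a master/slave monitor action
OCF_SUCCESS=0          # resource is running (as slave)
OCF_NOT_RUNNING=7      # resource is cleanly stopped
OCF_RUNNING_MASTER=8   # resource is running in the Master role

# Hypothetical state file standing in for a real service probe
STATE_FILE="${STATE_FILE:-/tmp/rsc_smith.state}"

smith_monitor() {
    # Not running at all: report OCF_NOT_RUNNING
    [ -f "$STATE_FILE" ] || return $OCF_NOT_RUNNING
    # Running as master: report OCF_RUNNING_MASTER
    if grep -q '^master$' "$STATE_FILE"; then
        return $OCF_RUNNING_MASTER
    fi
    # Otherwise running as slave: report OCF_SUCCESS
    return $OCF_SUCCESS
}

# Example runs:
echo slave > "$STATE_FILE"
smith_monitor; echo "slave monitor rc=$?"    # prints: slave monitor rc=0
echo master > "$STATE_FILE"
smith_monitor; echo "master monitor rc=$?"   # prints: master monitor rc=8
```

The point of the question above is that, after a demote, one would expect the
agent to go back to returning 0 from monitor and have that accepted as success.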
I'm also experiencing strange behavior in the following scenario: one node
(rh4vm2, the DC) is running the master instance of a resource, and the second
node (rh4vm1) is running the slave instance. When I stop the Heartbeat service
on the DC, it takes about a hundred seconds to shut down, and it complains
about the monitor action running on rh4vm1:
crmd[11630]: 2007/10/02_12:46:58 info: stop_subsystem: Sent -TERM to tengine: [11680]
crmd[11630]: 2007/10/02_12:46:58 info: do_shutdown: Waiting for subsystems to exit
tengine[11680]: 2007/10/02_12:47:06 WARN: action_timer_callback: Timer popped (abort_level=1000000, complete=false)
tengine[11680]: 2007/10/02_12:47:06 WARN: print_elem: Action missed its timeout [Action 2]: In-flight (id: rsc_smith:0_monitor_3000, loc: rh4vm1, priority: 20)
tengine[11680]: 2007/10/02_12:48:37 WARN: global_timer_callback: Timer popped (abort_level=1000000, complete=false)
tengine[11680]: 2007/10/02_12:48:37 info: unconfirmed_actions: Action rsc_smith:0_monitor_3000 2 unconfirmed from peer
tengine[11680]: 2007/10/02_12:48:37 ERROR: unconfirmed_actions: Waiting on 1 unconfirmed actions
tengine[11680]: 2007/10/02_12:48:37 WARN: global_timer_callback: Transition abort timeout reached... marking transition complete.
tengine[11680]: 2007/10/02_12:48:37 info: notify_crmd: Exiting after transition
tengine[11680]: 2007/10/02_12:48:37 WARN: global_timer_callback: Writing 1 unconfirmed actions to the CIB
tengine[11680]: 2007/10/02_12:48:37 info: unconfirmed_actions: Action rsc_smith:0_monitor_3000 2 unconfirmed from peer
tengine[11680]: 2007/10/02_12:48:37 ERROR: unconfirmed_actions: Waiting on 1 unconfirmed actions
Any idea why this happens?
Thanks for your help,
Assaf
My cib:
<cib admin_epoch="0" have_quorum="false" ignore_dtd="false" num_peers="0" cib_feature_revision="1.3" generated="false" epoch="1385" num_updates="1" cib-last-written="Tue Oct 2 12:50:31 2007">
  <configuration>
    <crm_config>
      <cluster_property_set id="cluster_properties">
        <attributes>
          <nvpair id="default-resource-stickiness" name="default-resource-stickiness" value="70"/>
          <nvpair id="default-resource-failure-stickiness" name="default-resource-failure-stickiness" value="-100"/>
        </attributes>
      </cluster_property_set>
      <cluster_property_set id="cib-bootstrap-options">
        <attributes>
          <nvpair name="last-lrm-refresh" id="cib-bootstrap-options-last-lrm-refresh" value="1191307342"/>
        </attributes>
      </cluster_property_set>
    </crm_config>
    <nodes>
      <node id="0441b161-2421-4218-8b03-0c044937e197" uname="rh4vm1" type="normal">
        <instance_attributes id="master-0441b161-2421-4218-8b03-0c044937e197">
          <attributes>
            <nvpair id="nodes-master-rsc_smith:1-0441b161-2421-4218-8b03-0c044937e197" name="master-rsc_smith:1" value="20"/>
            <nvpair id="nodes-master-rsc_smith:0-0441b161-2421-4218-8b03-0c044937e197" name="master-rsc_smith:0" value="20"/>
          </attributes>
        </instance_attributes>
      </node>
      <node uname="rh4vm2" type="normal" id="f55d8a1b-6931-4a84-989c-7f241ce2897e">
        <instance_attributes id="master-f55d8a1b-6931-4a84-989c-7f241ce2897e">
          <attributes>
            <nvpair name="master-rsc_smith:0" id="nodes-master-rsc_smith:0-f55d8a1b-6931-4a84-989c-7f241ce2897e" value="20"/>
            <nvpair name="master-rsc_smith:1" id="nodes-master-rsc_smith:1-f55d8a1b-6931-4a84-989c-7f241ce2897e" value="30"/>
          </attributes>
        </instance_attributes>
      </node>
    </nodes>
    <resources>
      <master_slave id="master_slave_mvap" ordered="false" interleave="false" notify="false">
        <instance_attributes id="ia_clone_ip">
          <attributes>
            <nvpair id="nvpair_ms_grp_mvap_clone_max" name="clone_max" value="2"/>
            <nvpair id="nvpair_ms_grp_mvap_clone_node_max" name="clone_node_max" value="1"/>
            <nvpair id="nvpair_ms_grp_mvap_master_max" name="master_max" value="1"/>
            <nvpair id="nvpair_ms_grp_mvap_master_node_max" name="master_node_max" value="1"/>
          </attributes>
        </instance_attributes>
        <primitive id="rsc_smith" class="ocf" type="smith2_agent" provider="ML">
          <operations>
            <op id="op_smith_monitor_special" name="monitor" timeout="3s" interval="3000ms" start_delay="6s">
              <instance_attributes id="ia_smith_monitor_special">
                <attributes>
                  <nvpair id="nvpair_smith_monitor_special_action" name="monitor_action" value="BIT1"/>
                </attributes>
              </instance_attributes>
            </op>
            <op id="op_smith_monitor_master" name="monitor" timeout="3s" interval="3001ms" start_delay="6s" role="Master">
              <instance_attributes id="ia_smith_monitor_master">
                <attributes>
                  <nvpair id="nvpair_smith_monitor_master_action" name="monitor_action" value="BIT2"/>
                  <nvpair id="nvpair_smith_monitor_master_state" name="master_monitor" value="master"/>
                </attributes>
              </instance_attributes>
            </op>
          </operations>
        </primitive>
      </master_slave>
    </resources>
    <constraints>
      <rsc_location id="loc_smith0" rsc="rsc_smith:0">
        <rule id="loc_smith0_rule_run" score="INFINITY">
          <expression id="loc_smith0_expression_run" attribute="#uname" operation="eq" value="rh4vm1"/>
        </rule>
        <rule id="loc_smith0_rule_norun" score="-INFINITY">
          <expression id="loc_smith0_expression_norun" attribute="#uname" operation="ne" value="rh4vm1"/>
        </rule>
      </rsc_location>
      <rsc_location id="loc_smith1" rsc="rsc_smith:1">
        <rule id="loc_smith1_rule_run" score="INFINITY">
          <expression id="loc_smith1_expression_run" attribute="#uname" operation="eq" value="rh4vm2"/>
        </rule>
        <rule id="loc_smith1_rule_norun" score="-INFINITY">
          <expression id="loc_smith1_expression_norun" attribute="#uname" operation="ne" value="rh4vm2"/>
        </rule>
      </rsc_location>
    </constraints>
  </configuration>
</cib>
> cheers,
> raoul bhatia
> --
> ____________________________________________________________________
> DI (FH) Raoul Bhatia M.Sc. email. [EMAIL PROTECTED]
> Technischer Leiter
>
> IPAX - Aloy Bhatia Hava OEG web. http://www.ipax.at
> Barawitzkagasse 10/2/2/11 email. [EMAIL PROTECTED]
> 1190 Wien tel. +43 1 3670030
> FN 277995t HG Wien fax. +43 1 3670030 15
> ____________________________________________________________________
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>