Hi,

On Tue, Oct 02, 2007 at 11:27:59AM -0700, Assaf N wrote:
> > Assaf N wrote:
> > > Hello,
> > >
> > > I started a small test cluster using Heartbeat 2.1.1. The cluster
> > > contains one simple master/slave resource.
> > >
> > > While playing around with this cluster, I've noticed that whenever
> > > the resource is promoted to master on a machine, Heartbeat stops
> > > calling its monitor operation on that node. A quick look at the
> > > ha-debug log reveals that the monitor op is stopped intentionally
> > > because of the resource promotion. However, the op is not restarted
> > > once the node becomes the master. When a second node starts and its
> > > resource takes the master role, our demoted resource starts being
> > > monitored again.
> > >
> > > I'm attaching my cib.xml, ha-debug and the resource agent script.
> > > Do I have a configuration error, or have I encountered a bug?
> >
> > Please refer to the following conversation and tell us whether this
> > resolves your issue:
> >
> > http://www.gossamer-threads.com/lists/linuxha/users/42529
>
> Thanks, it does resolve my issue. How embarrassing to discover it was
> answered a few days ago... I searched the list a few days before it was
> posted, and neglected to search again before sending my question... :-)
>
> Now I've encountered a new issue: the 'success' return code from the
> monitor function is supposed to be 0 when the resource is a slave and 8
> when it's a master, right? Well, this is true when the resource is
> first started, but after the resource is promoted and then demoted,
> Heartbeat still considers 8 to be the success return value, although
> the resource is no longer a master. If I return 0, the resource is
> stopped and started, and the success return value is 0 again. Is this
> on purpose?
>
> I'm experiencing another strange behavior in the following scenario:
> one node is the DC and running the master instance of a resource, and
> the second is running the slave instance.
> When I stop the heartbeat service on the first node (rh4vm2, the DC),
> it takes a hundred seconds to go down, and it complains about the
> monitor action running on the second node (rh4vm1):
>
> crmd[11630]: 2007/10/02_12:46:58 info: stop_subsystem: Sent -TERM to tengine: [11680]
> crmd[11630]: 2007/10/02_12:46:58 info: do_shutdown: Waiting for subsystems to exit
> tengine[11680]: 2007/10/02_12:47:06 WARN: action_timer_callback: Timer popped (abort_level=1000000, complete=false)
> tengine[11680]: 2007/10/02_12:47:06 WARN: print_elem: Action missed its timeout [Action 2]: In-flight (id: rsc_smith:0_monitor_3000, loc: rh4vm1, priority: 20)
> tengine[11680]: 2007/10/02_12:48:37 WARN: global_timer_callback: Timer popped (abort_level=1000000, complete=false)
> tengine[11680]: 2007/10/02_12:48:37 info: unconfirmed_actions: Action rsc_smith:0_monitor_3000 2 unconfirmed from peer
> tengine[11680]: 2007/10/02_12:48:37 ERROR: unconfirmed_actions: Waiting on 1 unconfirmed actions
> tengine[11680]: 2007/10/02_12:48:37 WARN: global_timer_callback: Transition abort timeout reached... marking transition complete.
> tengine[11680]: 2007/10/02_12:48:37 info: notify_crmd: Exiting after transition
> tengine[11680]: 2007/10/02_12:48:37 WARN: global_timer_callback: Writing 1 unconfirmed actions to the CIB
> tengine[11680]: 2007/10/02_12:48:37 info: unconfirmed_actions: Action rsc_smith:0_monitor_3000 2 unconfirmed from peer
> tengine[11680]: 2007/10/02_12:48:37 ERROR: unconfirmed_actions: Waiting on 1 unconfirmed actions
>
> Any idea why this happens?
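[Editor's note: for readers unfamiliar with the return-code convention
Assaf describes, here is a minimal sketch of a master/slave monitor
action. The agent name (smith_monitor) and the role state file are
hypothetical; only the exit codes follow the OCF convention (0 for a
running slave, 8 for a running master, 7 for not running).]

```shell
#!/bin/sh
# Sketch of an OCF monitor action for a master/slave resource agent.
# smith_monitor and the state-file mechanism are illustrative only;
# the real agent in this thread (smith2_agent) is not shown on the list.

OCF_SUCCESS=0          # resource running as slave
OCF_NOT_RUNNING=7      # resource not running at all
OCF_RUNNING_MASTER=8   # resource running as master

smith_monitor() {
    # $1: path to a file recording the current role ("master" or "slave").
    # A real agent would probe the service itself instead.
    state=$(cat "$1" 2>/dev/null)
    case "$state" in
        master) return $OCF_RUNNING_MASTER ;;
        slave)  return $OCF_SUCCESS ;;
        *)      return $OCF_NOT_RUNNING ;;
    esac
}
```

After a demote, the agent is expected to go back to returning 0 from
monitor; the behavior Assaf observed (8 still treated as success) is what
he is asking about.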
No. The CRM expected a reply from the LRM about the outcome of an action,
but received none. There were bugs in this area before, but they should
have been fixed. Is this a test cluster? Can you turn debug on? You can
then use the brand-new hb_report utility to collect all the information
(see http://marc.info/?l=linux-ha&m=119091078501042&w=2) and finally
file a bug report.

Thanks,

Dejan

> Thanks for your help,
> Assaf
>
> my cib:
>
> <cib admin_epoch="0" have_quorum="false" ignore_dtd="false" num_peers="0"
>      cib_feature_revision="1.3" generated="false" epoch="1385"
>      num_updates="1" cib-last-written="Tue Oct 2 12:50:31 2007">
>   <configuration>
>     <crm_config>
>       <cluster_property_set id="cluster_properties">
>         <attributes>
>           <nvpair id="default-resource-stickiness"
>                   name="default-resource-stickiness" value="70"/>
>           <nvpair id="default-resource-failure-stickiness"
>                   name="default-resource-failure-stickiness" value="-100"/>
>         </attributes>
>       </cluster_property_set>
>       <cluster_property_set id="cib-bootstrap-options">
>         <attributes>
>           <nvpair name="last-lrm-refresh"
>                   id="cib-bootstrap-options-last-lrm-refresh"
>                   value="1191307342"/>
>         </attributes>
>       </cluster_property_set>
>     </crm_config>
>     <nodes>
>       <node id="0441b161-2421-4218-8b03-0c044937e197" uname="rh4vm1"
>             type="normal">
>         <instance_attributes id="master-0441b161-2421-4218-8b03-0c044937e197">
>           <attributes>
>             <nvpair id="nodes-master-rsc_smith:1-0441b161-2421-4218-8b03-0c044937e197"
>                     name="master-rsc_smith:1" value="20"/>
>             <nvpair id="nodes-master-rsc_smith:0-0441b161-2421-4218-8b03-0c044937e197"
>                     name="master-rsc_smith:0" value="20"/>
>           </attributes>
>         </instance_attributes>
>       </node>
>       <node uname="rh4vm2" type="normal"
>             id="f55d8a1b-6931-4a84-989c-7f241ce2897e">
>         <instance_attributes id="master-f55d8a1b-6931-4a84-989c-7f241ce2897e">
>           <attributes>
>             <nvpair name="master-rsc_smith:0"
>                     id="nodes-master-rsc_smith:0-f55d8a1b-6931-4a84-989c-7f241ce2897e"
>                     value="20"/>
>             <nvpair name="master-rsc_smith:1"
>                     id="nodes-master-rsc_smith:1-f55d8a1b-6931-4a84-989c-7f241ce2897e"
>                     value="30"/>
>           </attributes>
>         </instance_attributes>
>       </node>
>     </nodes>
>     <resources>
>       <master_slave id="master_slave_mvap" ordered="false"
>                     interleave="false" notify="false">
>         <instance_attributes id="ia_clone_ip">
>           <attributes>
>             <nvpair id="nvpair_ms_grp_mvap_clone_max"
>                     name="clone_max" value="2"/>
>             <nvpair id="nvpair_ms_grp_mvap_clone_node_max"
>                     name="clone_node_max" value="1"/>
>             <nvpair id="nvpair_ms_grp_mvap_master_max"
>                     name="master_max" value="1"/>
>             <nvpair id="nvpair_ms_grp_mvap_master_node_max"
>                     name="master_node_max" value="1"/>
>           </attributes>
>         </instance_attributes>
>         <primitive id="rsc_smith" class="ocf" type="smith2_agent"
>                    provider="ML">
>           <operations>
>             <op id="op_smith_monitor_special" name="monitor" timeout="3s"
>                 interval="3000ms" start_delay="6s">
>               <instance_attributes id="ia_smith_monitor_special">
>                 <attributes>
>                   <nvpair id="nvpair_smith_monitor_special_action"
>                           name="monitor_action" value="BIT1"/>
>                 </attributes>
>               </instance_attributes>
>             </op>
>             <op id="op_smith_monitor_master" name="monitor" timeout="3s"
>                 interval="3001ms" start_delay="6s" role="Master">
>               <instance_attributes id="ia_smith_monitor_master">
>                 <attributes>
>                   <nvpair id="nvpair_smith_monitor_master_action"
>                           name="monitor_action" value="BIT2"/>
>                   <nvpair id="nvpair_smith_monitor_master_state"
>                           name="master_monitor" value="master"/>
>                 </attributes>
>               </instance_attributes>
>             </op>
>           </operations>
>         </primitive>
>       </master_slave>
>     </resources>
>     <constraints>
>       <rsc_location id="loc_smith0" rsc="rsc_smith:0">
>         <rule id="loc_smith0_rule_run" score="INFINITY">
>           <expression id="loc_smith0_expression_run" attribute="#uname"
>                       operation="eq" value="rh4vm1"/>
>         </rule>
>         <rule id="loc_smith0_rule_norun" score="-INFINITY">
>           <expression id="loc_smith0_expression_norun" attribute="#uname"
>                       operation="ne" value="rh4vm1"/>
>         </rule>
>       </rsc_location>
>       <rsc_location id="loc_smith1" rsc="rsc_smith:1">
>         <rule id="loc_smith1_rule_run" score="INFINITY">
>           <expression id="loc_smith1_expression_run" attribute="#uname"
>                       operation="eq" value="rh4vm2"/>
>         </rule>
>         <rule id="loc_smith1_rule_norun" score="-INFINITY">
>           <expression id="loc_smith1_expression_norun" attribute="#uname"
>                       operation="ne" value="rh4vm2"/>
>         </rule>
>       </rsc_location>
>     </constraints>
>   </configuration>
> </cib>
>
> > cheers,
> > raoul bhatia
> > --
> > ____________________________________________________________________
> > DI (FH) Raoul Bhatia M.Sc.        email. [EMAIL PROTECTED]
> > Technischer Leiter
> >
> > IPAX - Aloy Bhatia Hava OEG       web.   http://www.ipax.at
> > Barawitzkagasse 10/2/2/11         email. [EMAIL PROTECTED]
> > 1190 Wien                         tel.   +43 1 3670030
> > FN 277995t HG Wien                fax.   +43 1 3670030 15
> > ____________________________________________________________________
> > _______________________________________________
> > Linux-HA mailing list
> > [email protected]
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > See also: http://linux-ha.org/ReportingProblems
