On Wed, Oct 14, 2009 at 11:03 PM, hj lee <kerd...@gmail.com> wrote:
>
>
> On Wed, Oct 14, 2009 at 12:26 AM, Andrew Beekhof <and...@beekhof.net> wrote:
>>
>> On Wed, Oct 14, 2009 at 2:48 AM, hj lee <kerd...@gmail.com> wrote:
>> > Hi,
>> >
>> > I configured a two-node cluster on RHEL 5.3 with the following resources.
>> > Note that I am using pacemaker-1.0.6.
>> > - IPMI stonith as a clone. Each IPMI clone instance monitors the other node.
>> > - One Master/Slave resource: the Master is running on node1, the Slave is
>> > running on node2.
>> > - One FakeIPMI resource.
>> >
>> > When I manually trigger a failure in the monitor and stop operations of
>> > FakeIPMI at node1, the IPMI stonith running on node2 correctly detects its
>> > state as unclean, tries to demote the Master resource on node1, and resets
>> > node1. The problem I am seeing is that the promotion happens 60 sec after
>> > the stonith reset of node1 has completed successfully.
>> >
>> > I want the Slave to be promoted immediately after the stonith reset
>> > returns successfully! From the log,
>>
>> You mean the one we can't see or comment on?
>
> Hi,
>
> I attached the logs. Here is a summary of the logs:
>
> 1. I manually triggered testdummy-res:0 to fail on its monitor and stop
> operations. The pengine detected it:
> Oct 14 13:10:24 node1 pengine: [4033]: info: unpack_rsc_op: testdummy-res:0_monitor_10000 on node2 returned 1 (unknown error) instead of the expected value: 0 (ok)
>
> 2. The stop failed on testdummy-res:0:
> Oct 14 13:10:24 node1 crmd: [4034]: WARN: status_from_rc: Action 6 (testdummy-res:0_stop_0) on node2 failed (target: 0 vs. rc: 1): Error
>
> 3. The stonith operation is scheduled for node2:
> Oct 14 13:10:24 node1 pengine: [4033]: WARN: stage6: Scheduling Node node2 for STONITH
>
> 4. The demote/promote is scheduled:
> Oct 14 13:10:24 node1 pengine: [4033]: notice: LogActions: Stop resource testdummy-res:0#011(node2)
> Oct 14 13:10:24 node1 pengine: [4033]: notice: LogActions: Leave resource testdummy-res:1#011(Started node1)
> Oct 14 13:10:24 node1 pengine: [4033]: notice: LogActions: Stop resource ipmi-stonith-res:0#011(node2)
> Oct 14 13:10:24 node1 pengine: [4033]: notice: LogActions: Leave resource ipmi-stonith-res:1#011(Started node1)
> Oct 14 13:10:24 node1 pengine: [4033]: notice: LogActions: Demote vmrd-res:0#011(Master -> Stopped node2)
> Oct 14 13:10:24 node1 pengine: [4033]: notice: LogActions: Stop resource vmrd-res:0#011(node2)
> Oct 14 13:10:24 node1 pengine: [4033]: notice: LogActions: Promote vmrd-res:1#011(Slave -> Master node1)
> Oct 14 13:10:24 node1 pengine: [4033]: notice: LogActions: Move resource vsstvm-res#011(Started node2 -> node1)
>
> 5. The stonith reset is issued and returns successfully:
> Oct 14 13:10:24 node1 stonithd: [4029]: info: stonith_operate_locally::2688: sending fencing op RESET for node2 to ipmi-stonith-res:1 (external/ipmi) (pid=12365)
> ...........
> Oct 14 13:10:24 node1 stonithd: [12365]: debug: external_run_cmd: '/usr/lib/stonith/plugins/external/ipmi reset node2' output: Chassis Power Control: Reset
> Oct 14 13:10:24 node1 stonithd: [12365]: debug: external_reset_req: running 'ipmi reset' returned 0
>
> 6. The demote is started at node2:
> Oct 14 13:10:24 node1 crmd: [4034]: info: te_rsc_command: Initiating action 23: demote vmrd-res:0_demote_0 on node2
That looks like a bug.  Can you attach /var/lib/pengine/pe-warn-26646.bz2
from node1 please?

> 7. Pacemaker knows that the demote will fail because node2 is offline:
> Oct 14 13:10:24 node1 crmd: [4034]: notice: fail_incompletable_actions: Action 23 (23) is scheduled for node2 (offline)
>
> 8. This action_timer_callback seems to trigger the promote:
> Oct 14 13:11:30 node1 crmd: [4034]: WARN: action_timer_callback: Timer popped (timeout=6000, abort_level=1000000, complete=false)
>
> 9. Eventually node1 is promoted:
> Oct 14 13:11:31 node1 crmd: [4034]: info: te_rsc_command: Initiating action 63: notify vmrd-res:1_pre_notify_promote_0 on node1 (local)
>
> There is a 66 sec gap between step 7 and step 8. I think that once stonith
> has reset node2 successfully, all the operations scheduled on that node
> should be canceled immediately and the Slave should be promoted immediately.
>
> Thank you very much in advance
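
For anyone trying to reproduce this, the setup described at the top of the
thread would look roughly like the following in crm shell syntax. Only the
primitive names and the 10s monitor interval are taken from the logs; the
IPMI parameters, the container names, and the use of the Dummy/Stateful demo
agents as stand-ins for the real FakeIPMI/testdummy and vmrd agents are
guesses, since the actual CIB was not posted:

  # Hypothetical reconstruction, not the poster's actual CIB.
  # external/ipmi parameters (hostname/ipaddr/userid/passwd) are placeholders.
  primitive ipmi-stonith-res stonith:external/ipmi \
          params hostname="node2" ipaddr="10.0.0.2" userid="admin" passwd="secret" \
          op monitor interval="60s"
  clone ipmi-stonith-clone ipmi-stonith-res \
          meta clone-max="2" clone-node-max="1"

  # The FakeIPMI/testdummy resource; the 10s monitor matches the
  # testdummy-res:0_monitor_10000 entry in the logs.
  primitive testdummy-res ocf:pacemaker:Dummy \
          op monitor interval="10s"
  clone testdummy-clone testdummy-res

  # Master/Slave resource; ocf:pacemaker:Stateful is only a stand-in here.
  primitive vmrd-res ocf:pacemaker:Stateful \
          op monitor interval="10s" role="Master" \
          op monitor interval="11s" role="Slave"
  ms vmrd-ms vmrd-res \
          meta master-max="1" master-node-max="1" clone-max="2" notify="true"

Note that a plain clone gives both external/ipmi instances the same
hostname/ipaddr, so "each clone monitoring the other node" would in practice
need per-instance attributes or two separate primitives pinned with location
constraints; that detail is left out of the sketch.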
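
And for completeness, the "manually triggered" monitor/stop failure from
step 1 can be reproduced with a Dummy-style agent that starts returning
OCF_ERR_GENERIC (rc=1, which is what the logs show) once a flag file exists.
This is only a minimal sketch of such an agent, not the FakeIPMI/testdummy
agent actually used here; the flag-file path is made up:

  #!/bin/sh
  # Minimal fault-injection agent (sketch): behaves like Dummy until the
  # flag file below is created, then monitor and stop both return rc=1.
  FLAG="/tmp/testdummy-fail"            # touch this to inject the failure
  STATE="/var/run/testdummy-res.state"

  case "$1" in
  start)
          touch "$STATE"
          exit 0                        # OCF_SUCCESS
          ;;
  stop)
          [ -e "$FLAG" ] && exit 1      # OCF_ERR_GENERIC -> failed stop
          rm -f "$STATE"
          exit 0
          ;;
  monitor)
          [ -e "$FLAG" ] && exit 1      # OCF_ERR_GENERIC, "returned 1 (unknown error)"
          [ -e "$STATE" ] && exit 0 || exit 7   # OCF_NOT_RUNNING when stopped
          ;;
  meta-data)
          cat <<'END'
  <?xml version="1.0"?>
  <resource-agent name="testdummy">
    <version>0.1</version>
    <actions>
      <action name="start"     timeout="20s"/>
      <action name="stop"      timeout="20s"/>
      <action name="monitor"   timeout="20s" interval="10s"/>
      <action name="meta-data" timeout="5s"/>
    </actions>
  </resource-agent>
  END
          exit 0
          ;;
  *)
          exit 3                        # OCF_ERR_UNIMPLEMENTED
          ;;
  esac

Failing the monitor alone would only make the cluster try to restart the
resource; it is the subsequent stop failure (step 2) that marks the node
unclean and gets it fenced, which is why both operations fail while the flag
file is present.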