Hi,

I do not know if this is related, but I've just seen something
"similar" on a different two-node cluster. For some reason (out of
scope here) the corosync ring was broken, one cluster node was fenced,
and then resources began to start on the other node (s01-0). Then I
see the following:

=============
Aug 9 18:17:21 s01-0 crmd: [1728]: info: abort_transition_graph: te_update_diff:149 - Triggered transition abort (complete=0, tag=nvpair, id=status-s01-0-last-failure-config-templates-lv, magic=NA, cib=0.6816.474) : Transient attribute: update
Aug 9 18:17:21 s01-0 crmd: [1728]: info: update_abort_priority: Abort priority upgraded from 1 to 1000000
Aug 9 18:17:21 s01-0 crmd: [1728]: info: update_abort_priority: 'Event failed' abort superceeded
Aug 9 18:17:36 s01-0 pengine: [1727]: WARN: unpack_rsc_op: Processing failed op config-templates-lv_start_0 on s01-0: not running (7)
Aug 9 18:17:36 s01-0 pengine: [1727]: WARN: unpack_rsc_op: Processing failed op config-templates-lv_stop_0 on s01-0: unknown error (1)
Aug 9 18:17:36 s01-0 pengine: [1727]: WARN: pe_fence_node: Node s01-0 will be fenced to recover from resource failure(s)
=============

But config-templates-lv_stop_0 was never run.
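For anyone who wants to poke at such a state: a minimal sketch using
the crm shell (assuming crmsh is available; resource and node names are
taken from the log above, and whether a cleanup is appropriate in a
given situation is of course a separate question):

=============
# show the fail count recorded for the volume on s01-0
crm resource failcount config-templates-lv show s01-0

# clear the stale failed-op record so the PE re-evaluates
# (only do this once you are sure the resource is really stopped)
crm resource cleanup config-templates-lv
=============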
At the same time (18:17:22) pacemaker began to stop all other logical
volume resources (except config-templates-lv), and those stops
succeeded.

09.08.2011 11:40, Vladislav Bogdanov wrote:
> Hi Andrew, all,
>
> I've just found some slightly strange crmd behaviour - the DC timed
> out a stop operation (which led to node fencing) by itself, without
> even waiting for lrmd (on another node) to finish that operation.
>
> Here is what it did:
> ===============
> Aug 9 07:30:07 mgmt01 crmd: [15781]: WARN: action_timer_callback: Timer popped (timeout=600000, abort_level=1000000, complete=false)
> Aug 9 07:30:07 mgmt01 crmd: [15781]: ERROR: print_elem: Aborting transition, action lost: [Action 228]: In-flight (id: vptest22.vds-ok.com-vm_stop_0, loc: v03-a, priority: 0)
> Aug 9 07:30:07 mgmt01 crmd: [15781]: info: abort_transition_graph: action_timer_callback:486 - Triggered transition abort (complete=0) : Action lost
> Aug 9 07:30:07 mgmt01 crmd: [15781]: WARN: cib_action_update: rsc_op 228: vptest22.vds-ok.com-vm_stop_0 on v03-a timed out
> Aug 9 07:30:07 mgmt01 crmd: [15781]: info: create_operation_update: cib_action_update: Updating resouce vptest22.vds-ok.com-vm after Timed Out stop op (interval=0)
> Aug 9 07:30:07 mgmt01 crmd: [15781]: notice: run_graph: Transition 1500 (Complete=64, Pending=0, Fired=0, Skipped=40, Incomplete=0, Source=/var/lib/pengine/pe-input-19.bz2): Stopped
> Aug 9 07:30:07 mgmt01 crmd: [15781]: info: te_graph_trigger: Transition 1500 is now complete
> Aug 9 07:30:07 mgmt01 crmd: [15781]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
> Aug 9 07:30:07 mgmt01 crmd: [15781]: info: do_state_transition: All 3 cluster nodes are eligible to run resources.
> Aug 9 07:30:07 mgmt01 crmd: [15781]: info: process_graph_event: Action vptest22.vds-ok.com-vm_stop_0 arrived after a completed transition
> Aug 9 07:30:07 mgmt01 crmd: [15781]: info: abort_transition_graph: process_graph_event:482 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=vptest22.vds-ok.com-vm_last_failure_0, magic=2:1;228:1500:0:c9c55d75-3a1e-4fcc-af7f-81ede42fa918, cib=0.4519.88) : Inactive graph
> Aug 9 07:30:07 mgmt01 crmd: [15781]: WARN: update_failcount: Updating failcount for vptest22.vds-ok.com-vm on v03-a after failed stop: rc=1 (update=INFINITY, time=1312875007)
> Aug 9 07:30:08 mgmt01 crmd: [15781]: info: abort_transition_graph: te_update_diff:142 - Triggered transition abort (complete=1, tag=nvpair, id=status-v03-a-fail-count-vptest22.vds-ok.com-vm, name=fail-count-vptest22.vds-ok.com-vm, value=INFINITY, magic=NA, cib=0.4519.91) : Transient attribute: update
> Aug 9 07:30:10 mgmt01 pengine: [15780]: WARN: pe_fence_node: Node v03-a will be fenced to recover from resource failure(s)
> Aug 9 07:30:10 mgmt01 pengine: [15780]: WARN: unpack_rsc_op: Processing failed op vptest22.vds-ok.com-vm_last_failure_0 on v03-a: unknown error (1)
> ==============
>
> That was a mass stop of 60 virtual machines on two 12-core nodes. I
> suspect the nodes had a very high load at that time, so it is
> possible that crmd and lrmd on the node being fenced had no chance to
> get enough CPU cycles from the scheduler. I saw messages from lrmd
> there that some operations stayed in the op list for a very long time
> (~10 min). But the last such message was logged 1 minute before the
> DC decided to time out that stop operation. During that minute only
> some informative messages from some resource agents were logged,
> nothing from lrmd or crmd.
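As a data point, here is roughly what could be tried to soften such a
mass stop; a sketch in crm shell syntax (the batch-limit value and the
stop timeout are guesses, and the VirtualDomain definition below is a
hypothetical example, not the actual configuration of the cluster
above):

=============
# limit how many actions the transition engine runs in parallel
# cluster-wide, so 60 simultaneous stops cannot starve a node
crm configure property batch-limit="10"

# give a heavily loaded node more time to finish a stop
crm configure primitive vptest22.vds-ok.com-vm ocf:heartbeat:VirtualDomain \
    params config="/etc/libvirt/qemu/vptest22.vds-ok.com.xml" \
    op stop interval="0" timeout="900s"
=============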
> What could cause such behaviour? High CPU load?
>
> Is it intended that the DC may do that (mark an operation as timed
> out by itself)?
>
> What can be done to prevent such things in the future? A bigger
> scheduling priority (up to real-time) for lrmd (or for the whole
> pacemaker subsystem)? Bigger timeouts?
>
> Should I post an hb_report for that (if it is not intended, of
> course)?
>
> Thank you,
> Vladislav
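If the scheduler really is starving lrmd under load, something along
these lines might be worth an experiment (plain shell, untested; the
priority values are arbitrary, and whether lrmd behaves well under
real-time scheduling is an open question):

=============
# raise lrmd to a real-time round-robin priority
chrt -r -p 10 $(pidof lrmd)

# a softer alternative: just renice it
renice -n -10 -p $(pidof lrmd)
=============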
