It appears that it is trying to start the ClusterIP resource on node2, but it never does, and I don't see any error preventing it. In my logs I see the following about every 30 seconds:
Aug 13 11:16:48 node2 tengine: [14098]: info: tengine_stonith_callback: call=-100, optype=1, node_name=node1, result=2, node_list=, action=5:114:a3663c3f-0b44-40c4-bd07-99d3ff079344
Aug 13 11:16:48 node2 crmd: [14085]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_IPC_MESSAGE origin=route_message ]
Aug 13 11:16:48 node2 tengine: [14098]: info: update_abort_priority: Abort priority upgraded to 1000000
Aug 13 11:16:48 node2 crmd: [14085]: info: do_state_transition: All 1 cluster nodes are eligible to run resources.
Aug 13 11:16:48 node2 tengine: [14098]: info: update_abort_priority: Abort action 0 superceeded by 2
Aug 13 11:16:48 node2 tengine: [14098]: info: run_graph: ====================================================
Aug 13 11:16:48 node2 tengine: [14098]: notice: run_graph: Transition 114: (Complete=1, Pending=0, Fired=0, Skipped=2, Incomplete=0)
Aug 13 11:16:48 node2 pengine: [14099]: notice: cluster_option: Using default value 'stop' for cluster option 'no-quorum-policy'
Aug 13 11:16:48 node2 pengine: [14099]: notice: cluster_option: Using default value 'true' for cluster option 'symmetric-cluster'
Aug 13 11:16:48 node2 crmd: [14085]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=route_message ]
Aug 13 11:16:48 node2 pengine: [14099]: notice: cluster_option: Using default value 'reboot' for cluster option 'stonith-action'
Aug 13 11:16:48 node2 pengine: [14099]: notice: cluster_option: Using default value '0' for cluster option 'default-resource-stickiness'
Aug 13 11:16:48 node2 pengine: [14099]: notice: cluster_option: Using default value '0' for cluster option 'default-resource-failure-stickiness'
Aug 13 11:16:48 node2 pengine: [14099]: notice: cluster_option: Using default value 'true' for cluster option 'is-managed-default'
Aug 13 11:16:48 node2 pengine: [14099]: notice: cluster_option: Using default value '60s' for cluster option 'cluster-delay'
Aug 13 11:16:48 node2 pengine: [14099]: notice: cluster_option: Using default value '20s' for cluster option 'default-action-timeout'
Aug 13 11:16:48 node2 pengine: [14099]: notice: cluster_option: Using default value 'true' for cluster option 'stop-orphan-resources'
Aug 13 11:16:48 node2 pengine: [14099]: notice: cluster_option: Using default value 'true' for cluster option 'stop-orphan-actions'
Aug 13 11:16:48 node2 pengine: [14099]: notice: cluster_option: Using default value 'false' for cluster option 'remove-after-stop'
Aug 13 11:16:48 node2 pengine: [14099]: notice: cluster_option: Using default value '-1' for cluster option 'pe-error-series-max'
Aug 13 11:16:48 node2 pengine: [14099]: notice: cluster_option: Using default value '-1' for cluster option 'pe-warn-series-max'
Aug 13 11:16:48 node2 pengine: [14099]: notice: cluster_option: Using default value '-1' for cluster option 'pe-input-series-max'
Aug 13 11:16:48 node2 tengine: [14098]: info: unpack_graph: Unpacked transition 115: 3 actions in 3 synapses
Aug 13 11:16:48 node2 pengine: [14099]: notice: cluster_option: Using default value 'true' for cluster option 'startup-fencing'
Aug 13 11:16:48 node2 pengine: [14099]: info: determine_online_status: Node node2 is online
Aug 13 11:16:48 node2 tengine: [14098]: info: te_fence_node: Executing reboot fencing operation (5) on node1 (timeout=30000)
Aug 13 11:16:48 node2 pengine: [14099]: WARN: determine_online_status_fencing: Node node1 (24378a9e-3483-4ea4-bd7e-40a59a73a0e7) is un-expectedly down
Aug 13 11:16:48 node2 pengine: [14099]: info: determine_online_status_fencing: ^Iha_state=dead, ccm_state=false, crm_state=offline, join_state=down, expected=member
Aug 13 11:16:48 node2 stonithd: [14083]: info: client tengine [pid: 14098] want a STONITH operation RESET to node node1.
Aug 13 11:16:48 node2 pengine: [14099]: WARN: determine_online_status: Node node1 is unclean
Aug 13 11:16:48 node2 stonithd: [14083]: info: Broadcasting the message succeeded: require others to stonith node node1.
Aug 13 11:16:48 node2 pengine: [14099]: info: native_print: ClusterIP^I(heartbeat::ocf:IPaddr2):^IStarted node1
Aug 13 11:16:48 node2 pengine: [14099]: notice: NoRoleChange: Move resource ClusterIP^I(node1 -> node2)
Aug 13 11:16:48 node2 pengine: [14099]: WARN: custom_action: Action ClusterIP_stop_0 on node1 is unrunnable (offline)
Aug 13 11:16:48 node2 pengine: [14099]: WARN: custom_action: Marking node node1 unclean
Aug 13 11:16:48 node2 pengine: [14099]: notice: StartRsc: node2^IStart ClusterIP
Aug 13 11:16:48 node2 pengine: [14099]: WARN: stage6: Scheduling Node node1 for STONITH
Aug 13 11:16:48 node2 pengine: [14099]: info: native_stop_constraints: ClusterIP_stop_0 is implicit after node1 is fenced
Aug 13 11:16:48 node2 pengine: [14099]: WARN: process_pe_message: Transition 115: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/heartbeat/pengine/pe-warn-100.raw
Aug 13 11:16:48 node2 pengine: [14099]: info: process_pe_message: Configuration WARNINGs found during PE processing. Please run "crm_verify -L" to identify issues.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of David Brossard
Sent: Monday, August 13, 2007 11:07 AM
To: General Linux-HA mailing list
Subject: [Linux-HA] IP resource never fails over during outage

Now I am having some more weirdness I cannot figure out. I have set up a single resource of an IP address. It comes up fine, and I can move it between nodes using crm_resource -M -R ClusterIP. However, if I reboot node1 while it is hosting the IP, the resource never fails over to node2.
The GUI shows it running on node1 even though node1 is offline:

[EMAIL PROTECTED]:/var/lib/heartbeat$ crm_resource -L -V
crm_resource[15028]: 2007/08/13_11:04:39 info: Invoked: crm_resource -L -V
crm_resource[15028]: 2007/08/13_11:04:39 WARN: determine_online_status_fencing: Node node1 (24378a9e-3483-4ea4-bd7e-40a59a73a0e7) is un-expectedly down
crm_resource[15028]: 2007/08/13_11:04:39 WARN: determine_online_status: Node node1 is unclean
ClusterIP (heartbeat::ocf:IPaddr2)

[EMAIL PROTECTED]:/var/lib/heartbeat$ crm_resource -x -r ClusterIP
crm_resource[15029]: 2007/08/13_11:05:04 info: Invoked: crm_resource -x -r ClusterIP
ClusterIP (heartbeat::ocf:IPaddr2): Started node1
raw xml:
<primitive id="ClusterIP" class="ocf" type="IPaddr2" provider="heartbeat">
  <instance_attributes id="ClusterIP_instance_attrs">
    <attributes>
      <nvpair id="3a5e34c7-d7dc-477f-8624-cc98ee7e1c41" name="ip" value="172.31.252.7"/>
    </attributes>
  </instance_attributes>
</primitive>

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
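
The repeating "tengine_stonith_callback: ... result=2" entries above suggest the STONITH (fencing) of node1 keeps failing. Because ClusterIP_stop_0 on node1 is only implicit after node1 is fenced, the policy engine keeps aborting the transition rather than starting ClusterIP on node2. As a diagnostic step only (assuming a Heartbeat 2.x CRM with no working fence device; the option name is standard but the ids below are illustrative, not taken from this cluster's CIB), fencing can be switched off by setting stonith-enabled to false in the crm_config section, for example:

<!-- Sketch: crm_config fragment that disables fencing so the PE no longer
     waits for a successful STONITH of node1 before moving ClusterIP.
     Loadable with something like: cibadmin -M -o crm_config -X '...' -->
<cluster_property_set id="cib-bootstrap-options">
  <attributes>
    <nvpair id="bootstrap-stonith-enabled" name="stonith-enabled" value="false"/>
  </attributes>
</cluster_property_set>

This is only for isolating the problem: without fencing, a split brain could leave both nodes claiming 172.31.252.7, so the long-term fix is a STONITH device that can actually reset node1.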
