Andrew Beekhof wrote:
> On 9/19/07, FG <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> I use heartbeat 2.1.1 in an active/passive configuration.
>>
>> I am testing different failover scenarios and how stonith reacts.
>>
>> When my active node is down (hardware failure or simply kill -9 PID(hb
>> master)), stonith with apcmastersnmp on my standby node shoots the active
>> node and the resources fail over to the standby node. GOOD...
>>
>> Now my problem:
>> 1- If I unplug the network card, pingd reacts and fails the resources over
>> to the other node, BUT stonith doesn't shoot the active node first.
>>
>
> how many communication paths do you have?
> is heartbeat using (only?) the network you unplugged?
>
I have two communication paths: eth0 for the production network and eth1 for heartbeat communication (plus a serial line soon). When I unplug eth0, I get "ping node dead", so the resources fail over to the standby node, but without the active node being shot.
The logs when this situation happens:

attrd[19320]: 2007/09/20_10:38:19 info: attrd_ha_callback: flush message from castor
attrd[19320]: 2007/09/20_10:38:19 info: attrd_perform_update: Sent update 7: pingd=200
tengine[19328]: 2007/09/20_10:38:19 info: extract_event: Aborting on transient_attributes changes for 47cb4e3e-7c8f-4dc0-9da8-d9744815ed53
tengine[19328]: 2007/09/20_10:38:19 info: update_abort_priority: Abort priority upgraded to 1000000
crmd[19321]: 2007/09/20_10:38:19 info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_IPC_MESSAGE origin=route_message ]
tengine[19328]: 2007/09/20_10:38:19 info: te_update_diff: Aborting on transient_attributes deletions
crmd[19321]: 2007/09/20_10:38:19 info: do_state_transition: All 2 cluster nodes are eligible to run resources.
pengine[19329]: 2007/09/20_10:38:20 notice: cluster_option: Using default value '60s' for cluster option 'cluster-delay'
pengine[19329]: 2007/09/20_10:38:20 notice: cluster_option: Using default value '-1' for cluster option 'pe-error-series-max'
pengine[19329]: 2007/09/20_10:38:20 notice: cluster_option: Using default value '-1' for cluster option 'pe-warn-series-max'
pengine[19329]: 2007/09/20_10:38:20 notice: cluster_option: Using default value '-1' for cluster option 'pe-input-series-max'
pengine[19329]: 2007/09/20_10:38:20 notice: cluster_option: Using default value 'true' for cluster option 'startup-fencing'
pengine[19329]: 2007/09/20_10:38:20 info: determine_online_status: Node pollux is online
pengine[19329]: 2007/09/20_10:38:20 info: determine_online_status: Node castor is online
pengine[19329]: 2007/09/20_10:38:20 info: group_print: Resource Group: group_1
pengine[19329]: 2007/09/20_10:38:20 info: native_print: IPaddr_147_210_36_7 (heartbeat::ocf:IPaddr): Started castor
pengine[19329]: 2007/09/20_10:38:20 info: native_print: Filesystem_2 (heartbeat::ocf:Filesystem): Started castor
pengine[19329]: 2007/09/20_10:38:20 info: native_print: cyrus-imapd_3 (lsb:cyrus-imapd): Started castor
pengine[19329]: 2007/09/20_10:38:20 info: native_print: saslauthd_4 (lsb:saslauthd): Started castor
pengine[19329]: 2007/09/20_10:38:20 info: clone_print: Clone Set: pingd
pengine[19329]: 2007/09/20_10:38:20 info: native_print: pingd-child:0 (heartbeat::ocf:pingd): Started pollux
pengine[19329]: 2007/09/20_10:38:20 info: native_print: pingd-child:1 (heartbeat::ocf:pingd): Started castor
pengine[19329]: 2007/09/20_10:38:20 info: clone_print: Clone Set: DoFencing
pengine[19329]: 2007/09/20_10:38:20 info: native_print: child_DoFencing:0 (stonith:apcmastersnmp): Started pollux
pengine[19329]: 2007/09/20_10:38:20 info: native_print: child_DoFencing:1 (stonith:apcmastersnmp): Started castor
pengine[19329]: 2007/09/20_10:38:20 notice: NoRoleChange: Move resource IPaddr_147_210_36_7 (castor -> pollux)
pengine[19329]: 2007/09/20_10:38:20 notice: StopRsc: castor Stop IPaddr_147_210_36_7
pengine[19329]: 2007/09/20_10:38:20 notice: StartRsc: pollux Start IPaddr_147_210_36_7
pengine[19329]: 2007/09/20_10:38:20 notice: RecurringOp: pollux IPaddr_147_210_36_7_monitor_5000
pengine[19329]: 2007/09/20_10:38:20 notice: NoRoleChange: Move resource Filesystem_2 (castor -> pollux)
pengine[19329]: 2007/09/20_10:38:20 notice: StopRsc: castor Stop Filesystem_2
pengine[19329]: 2007/09/20_10:38:20 notice: StartRsc: pollux Start Filesystem_2
pengine[19329]: 2007/09/20_10:38:20 notice: RecurringOp: pollux Filesystem_2_monitor_60000
pengine[19329]: 2007/09/20_10:38:20 notice: NoRoleChange: Move resource cyrus-imapd_3 (castor -> pollux)
pengine[19329]: 2007/09/20_10:38:20 notice: StopRsc: castor Stop cyrus-imapd_3
pengine[19329]: 2007/09/20_10:38:20 notice: StartRsc: pollux Start cyrus-imapd_3
pengine[19329]: 2007/09/20_10:38:20 notice: RecurringOp: pollux cyrus-imapd_3_monitor_60000
pengine[19329]: 2007/09/20_10:38:20 notice: NoRoleChange: Move resource saslauthd_4 (castor -> pollux)
pengine[19329]: 2007/09/20_10:38:20 notice: StopRsc: castor Stop saslauthd_4
pengine[19329]: 2007/09/20_10:38:20 notice: StartRsc: pollux Start saslauthd_4
pengine[19329]: 2007/09/20_10:38:20 notice: RecurringOp: pollux saslauthd_4_monitor_60000
pengine[19329]: 2007/09/20_10:38:20 notice: NoRoleChange: Leave resource pingd-child:0 (pollux)
pengine[19329]: 2007/09/20_10:38:20 notice: NoRoleChange: Leave resource pingd-child:1 (castor)
pengine[19329]: 2007/09/20_10:38:20 notice: NoRoleChange: Leave resource child_DoFencing:0 (pollux)
pengine[19329]: 2007/09/20_10:38:20 notice: NoRoleChange: Leave resource child_DoFencing:1 (castor)
pengine[19329]: 2007/09/20_10:38:20 info: process_pe_message: Transition 5: PEngine Input stored in: /var/lib/heartbeat/pengine/pe-input-221.raw
crmd[19321]: 2007/09/20_10:38:20 info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=route_message ]
tengine[19328]: 2007/09/20_10:38:20 info: unpack_graph: Unpacked transition 5: 16 actions in 16 synapses
tengine[19328]: 2007/09/20_10:38:20 info: te_pseudo_action: Pseudo action 26 fired and confirmed
tengine[19328]: 2007/09/20_10:38:20 info: send_rsc_command: Initiating action 21: saslauthd_4_stop_0 on castor
tengine[19328]: 2007/09/20_10:38:22 info: match_graph_event: Action saslauthd_4_stop_0 (21) confirmed on castor
tengine[19328]: 2007/09/20_10:38:22 info: send_rsc_command: Initiating action 18: cyrus-imapd_3_stop_0 on castor
tengine[19328]: 2007/09/20_10:38:24 info: match_graph_event: Action cyrus-imapd_3_stop_0 (18) confirmed on castor
tengine[19328]: 2007/09/20_10:38:24 info: send_rsc_command: Initiating action 15: Filesystem_2_stop_0 on castor
tengine[19328]: 2007/09/20_10:38:26 info: match_graph_event: Action Filesystem_2_stop_0 (15) confirmed on castor
tengine[19328]: 2007/09/20_10:38:26 info: send_rsc_command: Initiating action 12: IPaddr_147_210_36_7_stop_0 on castor
tengine[19328]: 2007/09/20_10:38:27 info: match_graph_event: Action IPaddr_147_210_36_7_stop_0 (12) confirmed on castor
tengine[19328]: 2007/09/20_10:38:27 info: te_pseudo_action: Pseudo action 27 fired and confirmed
tengine[19328]: 2007/09/20_10:38:27 info: te_pseudo_action: Pseudo action 24 fired and confirmed
tengine[19328]: 2007/09/20_10:38:27 info: send_rsc_command: Initiating action 13: IPaddr_147_210_36_7_start_0 on pollux
crmd[19321]: 2007/09/20_10:38:27 info: do_lrm_rsc_op: Performing op=IPaddr_147_210_36_7_start_0 key=13:5:b9f1026e-93d3-46f1-9b87-857212f2fd7e)
lrmd[19318]: 2007/09/20_10:38:27 info: RA output: (IPaddr_147_210_36_7:start:stderr) Rewrote octal netmask as: 24
IPaddr[19558][19597]: 2007/09/20_10:38:27 DEBUG: Using calculated broadcast for 147.210.36.7: 147.210.36.255
IPaddr[19558][19614]: 2007/09/20_10:38:27 INFO: eval ifconfig eth0:0 147.210.36.7 netmask 255.255.255.0 broadcast 147.210.36.255
IPaddr[19558][19619]: 2007/09/20_10:38:27 DEBUG: Sending Gratuitous Arp for 147.210.36.7 on eth0:0 [eth0]
crmd[19321]: 2007/09/20_10:38:27 info: process_lrm_event: LRM operation IPaddr_147_210_36_7_start_0 (call=14, rc=0) complete
crmd[19321]: 2007/09/20_10:38:27 info: build_operation_update: Digest for 0:0;13:5:b9f1026e-93d3-46f1-9b87-857212f2fd7e (IPaddr_147_210_36_7_start_0) was e03993409a8940d5daa9e68a96ee5f0c
crmd[19321]: 2007/09/20_10:38:27 info: log_data_element: build_operation_update: digest:source <parameters ip="147.210.36.7" netmask="255.255.255.0" nic="eth0"/>
tengine[19328]: 2007/09/20_10:38:27 info: match_graph_event: Action IPaddr_147_210_36_7_start_0 (13) confirmed on pollux
tengine[19328]: 2007/09/20_10:38:27 info: send_rsc_command: Initiating action 14: IPaddr_147_210_36_7_monitor_5000 on pollux
tengine[19328]: 2007/09/20_10:38:27 info: send_rsc_command: Initiating action 16: Filesystem_2_start_0 on pollux
crmd[19321]: 2007/09/20_10:38:27 info: do_lrm_rsc_op: Performing op=IPaddr_147_210_36_7_monitor_5000 key=14:5:b9f1026e-93d3-46f1-9b87-857212f2fd7e)
crmd[19321]: 2007/09/20_10:38:27 info: do_lrm_rsc_op: Performing op=Filesystem_2_start_0 key=16:5:b9f1026e-93d3-46f1-9b87-857212f2fd7e)
crmd[19321]: 2007/09/20_10:38:27 info: process_lrm_event: LRM operation IPaddr_147_210_36_7_monitor_5000 (call=15, rc=0) complete
Filesystem[19645][19693]: 2007/09/20_10:38:27 INFO: Running start for /dev/VolGroup01/maillv on /mailsan
tengine[19328]: 2007/09/20_10:38:27 info: match_graph_event: Action IPaddr_147_210_36_7_monitor_5000 (14) confirmed on pollux
crmd[19321]: 2007/09/20_10:38:27 info: process_lrm_event: LRM operation Filesystem_2_start_0 (call=16, rc=0) complete
crmd[19321]: 2007/09/20_10:38:27 info: build_operation_update: Digest for 0:0;16:5:b9f1026e-93d3-46f1-9b87-857212f2fd7e (Filesystem_2_start_0) was c2fd28c4595a232e7f3843e77f7214e4
crmd[19321]: 2007/09/20_10:38:27 info: log_data_element: build_operation_update: digest:source <parameters directory="/mailsan" fstype="ext3" device="/dev/VolGroup01/maillv" options="noatime"/>
tengine[19328]: 2007/09/20_10:38:27 info: match_graph_event: Action Filesystem_2_start_0 (16) confirmed on pollux
tengine[19328]: 2007/09/20_10:38:27 info: send_rsc_command: Initiating action 17: Filesystem_2_monitor_60000 on pollux
tengine[19328]: 2007/09/20_10:38:27 info: send_rsc_command: Initiating action 19: cyrus-imapd_3_start_0 on pollux
crmd[19321]: 2007/09/20_10:38:27 info: do_lrm_rsc_op: Performing op=Filesystem_2_monitor_60000 key=17:5:b9f1026e-93d3-46f1-9b87-857212f2fd7e)
crmd[19321]: 2007/09/20_10:38:27 info: do_lrm_rsc_op: Performing op=cyrus-imapd_3_start_0 key=19:5:b9f1026e-93d3-46f1-9b87-857212f2fd7e)
lrmd[19716]: 2007/09/20_10:38:27 WARN: For LSB init script, no additional parameters are needed.
lrmd[19318]: 2007/09/20_10:38:27 info: RA output: (cyrus-imapd_3:start:stdout) Importation des bases de données cyrus-imapd
crmd[19321]: 2007/09/20_10:38:27 info: process_lrm_event: LRM operation Filesystem_2_monitor_60000 (call=17, rc=0) complete
tengine[19328]: 2007/09/20_10:38:27 info: match_graph_event: Action Filesystem_2_monitor_60000 (17) confirmed on pollux
lrmd[19318]: 2007/09/20_10:38:29 info: RA output: (cyrus-imapd_3:start:stdout) [
lrmd[19318]: 2007/09/20_10:38:29 info: RA output: (cyrus-imapd_3:start:stdout) OK
lrmd[19318]: 2007/09/20_10:38:29 info: RA output: (cyrus-imapd_3:start:stdout) ]
lrmd[19318]: 2007/09/20_10:38:29 info: RA output: (cyrus-imapd_3:start:stdout)
lrmd[19318]: 2007/09/20_10:38:29 info: RA output: (cyrus-imapd_3:start:stdout)
lrmd[19318]: 2007/09/20_10:38:29 info: RA output: (cyrus-imapd_3:start:stdout) Démarrage de cyrus-imapd :
lrmd[19318]: 2007/09/20_10:38:29 info: RA output: (cyrus-imapd_3:start:stdout) [
lrmd[19318]: 2007/09/20_10:38:29 info: RA output: (cyrus-imapd_3:start:stdout) OK ]
lrmd[19318]: 2007/09/20_10:38:29 info: RA output: (cyrus-imapd_3:start:stdout)
crmd[19321]: 2007/09/20_10:38:29 info: process_lrm_event: LRM operation cyrus-imapd_3_start_0 (call=18, rc=0) complete
crmd[19321]: 2007/09/20_10:38:29 info: build_operation_update: Digest for 0:0;19:5:b9f1026e-93d3-46f1-9b87-857212f2fd7e (cyrus-imapd_3_start_0) was f2317cad3d54cec5d7d7aa7d0bf35cf8
crmd[19321]: 2007/09/20_10:38:29 info: log_data_element: build_operation_update: digest:source <parameters/>
tengine[19328]: 2007/09/20_10:38:29 info: match_graph_event: Action cyrus-imapd_3_start_0 (19) confirmed on pollux
tengine[19328]: 2007/09/20_10:38:29 info: send_rsc_command: Initiating action 20: cyrus-imapd_3_monitor_60000 on pollux
tengine[19328]: 2007/09/20_10:38:29 info: send_rsc_command: Initiating action 22: saslauthd_4_start_0 on pollux
crmd[19321]: 2007/09/20_10:38:29 info: do_lrm_rsc_op: Performing op=cyrus-imapd_3_monitor_60000 key=20:5:b9f1026e-93d3-46f1-9b87-857212f2fd7e)
crmd[19321]: 2007/09/20_10:38:29 info: do_lrm_rsc_op: Performing op=saslauthd_4_start_0 key=22:5:b9f1026e-93d3-46f1-9b87-857212f2fd7e)
lrmd[19870]: 2007/09/20_10:38:29 WARN: For LSB init script, no additional parameters are needed.
lrmd[19318]: 2007/09/20_10:38:29 info: RA output: (saslauthd_4:start:stdout) Démarrage de saslauthd :
lrmd[19318]: 2007/09/20_10:38:29 info: RA output: (saslauthd_4:start:stdout) [
lrmd[19318]: 2007/09/20_10:38:29 info: RA output: (saslauthd_4:start:stdout) OK ]
lrmd[19318]: 2007/09/20_10:38:29 info: RA output: (saslauthd_4:start:stdout)
crmd[19321]: 2007/09/20_10:38:29 info: process_lrm_event: LRM operation cyrus-imapd_3_monitor_60000 (call=19, rc=0) complete
crmd[19321]: 2007/09/20_10:38:29 info: process_lrm_event: LRM operation saslauthd_4_start_0 (call=20, rc=0) complete
crmd[19321]: 2007/09/20_10:38:29 info: build_operation_update: Digest for 0:0;22:5:b9f1026e-93d3-46f1-9b87-857212f2fd7e (saslauthd_4_start_0) was f2317cad3d54cec5d7d7aa7d0bf35cf8
crmd[19321]: 2007/09/20_10:38:29 info: log_data_element: build_operation_update: digest:source <parameters/>
tengine[19328]: 2007/09/20_10:38:29 info: match_graph_event: Action cyrus-imapd_3_monitor_60000 (20) confirmed on pollux
tengine[19328]: 2007/09/20_10:38:29 info: match_graph_event: Action saslauthd_4_start_0 (22) confirmed on pollux
tengine[19328]: 2007/09/20_10:38:29 info: te_pseudo_action: Pseudo action 25 fired and confirmed
tengine[19328]: 2007/09/20_10:38:29 info: send_rsc_command: Initiating action 23: saslauthd_4_monitor_60000 on pollux
crmd[19321]: 2007/09/20_10:38:29 info: do_lrm_rsc_op: Performing op=saslauthd_4_monitor_60000 key=23:5:b9f1026e-93d3-46f1-9b87-857212f2fd7e)
crmd[19321]: 2007/09/20_10:38:29 info: process_lrm_event: LRM operation saslauthd_4_monitor_60000 (call=21, rc=0) complete
tengine[19328]: 2007/09/20_10:38:29 info: match_graph_event: Action saslauthd_4_monitor_60000 (23) confirmed on pollux
tengine[19328]: 2007/09/20_10:38:29 info: run_graph: Transition 5: (Complete=16, Pending=0, Fired=0, Skipped=0, Incomplete=0)
tengine[19328]: 2007/09/20_10:38:29 info: notify_crmd: Transition 5 status: te_complete - <null>
crmd[19321]: 2007/09/20_10:38:29 info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_IPC_MESSAGE origin=route_message ]

Do I need a particular constraint between pingd and stonith?

>
>> 2- Same problem with a resource failure: after the sixth failure
>> (this depends on my stickiness configuration), the resources fail over to
>> the standby node BUT, again, stonith doesn't shoot the node.
>>
>
> as long as the resource stops correctly, there is no need to shoot the node
>
OK, I understand this. The Filesystem resource is correctly unmounted (stopped), so there is no risk of data corruption and no need to shoot the node.

Thanks

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
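
P.S. To check my own understanding of the rule discussed in this thread, here is a minimal Python sketch of how I read it. It is only a simplified model, not Heartbeat's actual code: the `NodeState` class, its two flags and the `needs_fencing` function are hypothetical names I made up for illustration. The assumption is that a node only has to be shot when the survivor cannot be sure the resources are stopped, i.e. the node is unreachable or a stop operation failed.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    """Hypothetical summary of what the surviving node knows about its peer."""
    name: str
    cluster_reachable: bool  # assumption: peer still answers on some heartbeat path (e.g. eth1)
    stops_succeeded: bool    # assumption: every resource stop on the peer returned rc=0


def needs_fencing(node: NodeState) -> bool:
    """Return True when the peer's resources cannot be proven stopped.

    Simplified reading of the thread, not the real policy engine:
    - an unreachable peer might still hold the filesystem, so it must be shot;
    - a reachable peer that stopped everything cleanly is safe to leave alone.
    """
    if not node.cluster_reachable:
        return True
    return not node.stops_succeeded


if __name__ == "__main__":
    cases = [
        NodeState("castor, eth0 unplugged but eth1 up", True, True),
        NodeState("castor, heartbeat killed", False, False),
        NodeState("castor, a stop operation failed", True, False),
    ]
    for node in cases:
        print(f"{node.name}: fence={needs_fencing(node)}")
```

In my eth0 test the first case applies: castor was still reachable over eth1 and all four stop actions in transition 5 were confirmed, which matches the fact that nothing was shot.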
