Andrew Beekhof wrote:
> On 9/19/07, FG <[EMAIL PROTECTED]> wrote:
>   
>> Hi,
>>
>> I use heartbeat 2.1.1 in an active/passive configuration.
>>
>> I'm testing different failover scenarios and how stonith reacts.
>>
>>
>> When my active node goes down (hardware failure or simply kill -9 PID (hb 
>> master)), stonith with apcmastersnmp on my standby node shoots the active 
>> node and the resources fail over to the standby node. GOOD...
>>
>> Now my problem:
>> 1- If I unplug the network card, pingd reacts and fails the resources over to 
>> the other node BUT stonith doesn't shoot the active node first.
>>     
>
> how many communication paths do you have?
> is heartbeat using (only?) the network you unplugged?
>   
I have two communication paths: eth0 for the production
network and eth1 for heartbeat communication (plus a serial
line soon).
When I unplug eth0, I get "ping node dead", so the
resources fail over to the standby node, but without the
active node being shot.

The logs when this situation happens:
attrd[19320]: 2007/09/20_10:38:19 info:
attrd_ha_callback: flush message
from castor
attrd[19320]: 2007/09/20_10:38:19 info:
attrd_perform_update: Sent
update 7: pingd=200
tengine[19328]: 2007/09/20_10:38:19 info:
extract_event: Aborting on
transient_attributes changes for
47cb4e3e-7c8f-4dc0-9da8-d9744815ed53
tengine[19328]: 2007/09/20_10:38:19 info:
update_abort_priority: Abort
priority upgraded to 1000000
crmd[19321]: 2007/09/20_10:38:19 info:
do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
cause=C_IPC_MESSAGE origin=route_message ]
tengine[19328]: 2007/09/20_10:38:19 info:
te_update_diff: Aborting on
transient_attributes deletions
crmd[19321]: 2007/09/20_10:38:19 info:
do_state_transition: All 2
cluster nodes are eligible to run resources.
pengine[19329]: 2007/09/20_10:38:20 notice:
cluster_option: Using
default value '60s' for cluster option 'cluster-delay'
pengine[19329]: 2007/09/20_10:38:20 notice:
cluster_option: Using
default value '-1' for cluster option
'pe-error-series-max'
pengine[19329]: 2007/09/20_10:38:20 notice:
cluster_option: Using
default value '-1' for cluster option
'pe-warn-series-max'
pengine[19329]: 2007/09/20_10:38:20 notice:
cluster_option: Using
default value '-1' for cluster option
'pe-input-series-max'
pengine[19329]: 2007/09/20_10:38:20 notice:
cluster_option: Using
default value 'true' for cluster option
'startup-fencing'
pengine[19329]: 2007/09/20_10:38:20 info:
determine_online_status: Node
pollux is online
pengine[19329]: 2007/09/20_10:38:20 info:
determine_online_status: Node
castor is online
pengine[19329]: 2007/09/20_10:38:20 info:
group_print: Resource Group:
group_1
pengine[19329]: 2007/09/20_10:38:20 info:
native_print:
IPaddr_147_210_36_7 (heartbeat::ocf:IPaddr):
  Started castor
pengine[19329]: 2007/09/20_10:38:20 info:
native_print:
Filesystem_2        (heartbeat::ocf:Filesystem):
  Started castor
pengine[19329]: 2007/09/20_10:38:20 info:
native_print:
cyrus-imapd_3       (lsb:cyrus-imapd):
Started castor
pengine[19329]: 2007/09/20_10:38:20 info:
native_print:     saslauthd_4
(lsb:saslauthd):        Started castor
pengine[19329]: 2007/09/20_10:38:20 info:
clone_print: Clone Set: pingd
pengine[19329]: 2007/09/20_10:38:20 info:
native_print:
pingd-child:0       (heartbeat::ocf:pingd):
Started pollux
pengine[19329]: 2007/09/20_10:38:20 info:
native_print:
pingd-child:1       (heartbeat::ocf:pingd):
Started castor
pengine[19329]: 2007/09/20_10:38:20 info:
clone_print: Clone Set: DoFencing
pengine[19329]: 2007/09/20_10:38:20 info:
native_print:
child_DoFencing:0   (stonith:apcmastersnmp):
  Started pollux
pengine[19329]: 2007/09/20_10:38:20 info:
native_print:
child_DoFencing:1   (stonith:apcmastersnmp):
  Started castor
pengine[19329]: 2007/09/20_10:38:20 notice:
NoRoleChange: Move  resource
IPaddr_147_210_36_7    (castor -> pollux)
pengine[19329]: 2007/09/20_10:38:20 notice:
StopRsc:   castor   Stop
IPaddr_147_210_36_7
pengine[19329]: 2007/09/20_10:38:20 notice:
StartRsc:  pollux   Start
IPaddr_147_210_36_7
pengine[19329]: 2007/09/20_10:38:20 notice:
RecurringOp: pollux
IPaddr_147_210_36_7_monitor_5000
pengine[19329]: 2007/09/20_10:38:20 notice:
NoRoleChange: Move  resource
Filesystem_2   (castor -> pollux)
pengine[19329]: 2007/09/20_10:38:20 notice:
StopRsc:   castor   Stop
Filesystem_2
pengine[19329]: 2007/09/20_10:38:20 notice:
StartRsc:  pollux   Start
Filesystem_2
pengine[19329]: 2007/09/20_10:38:20 notice:
RecurringOp: pollux
Filesystem_2_monitor_60000
pengine[19329]: 2007/09/20_10:38:20 notice:
NoRoleChange: Move  resource
cyrus-imapd_3  (castor -> pollux)
pengine[19329]: 2007/09/20_10:38:20 notice:
StopRsc:   castor   Stop
cyrus-imapd_3
pengine[19329]: 2007/09/20_10:38:20 notice:
StartRsc:  pollux   Start
cyrus-imapd_3
pengine[19329]: 2007/09/20_10:38:20 notice:
RecurringOp: pollux
cyrus-imapd_3_monitor_60000
pengine[19329]: 2007/09/20_10:38:20 notice:
NoRoleChange: Move  resource
saslauthd_4    (castor -> pollux)
pengine[19329]: 2007/09/20_10:38:20 notice:
StopRsc:   castor   Stop
saslauthd_4
pengine[19329]: 2007/09/20_10:38:20 notice:
StartRsc:  pollux   Start
saslauthd_4
pengine[19329]: 2007/09/20_10:38:20 notice:
RecurringOp: pollux
saslauthd_4_monitor_60000
pengine[19329]: 2007/09/20_10:38:20 notice:
NoRoleChange: Leave resource
pingd-child:0  (pollux)
pengine[19329]: 2007/09/20_10:38:20 notice:
NoRoleChange: Leave resource
pingd-child:1  (castor)
pengine[19329]: 2007/09/20_10:38:20 notice:
NoRoleChange: Leave resource
child_DoFencing:0      (pollux)
pengine[19329]: 2007/09/20_10:38:20 notice:
NoRoleChange: Leave resource
child_DoFencing:1      (castor)
pengine[19329]: 2007/09/20_10:38:20 info:
process_pe_message: Transition
5: PEngine Input stored in:
/var/lib/heartbeat/pengine/pe-input-221.raw
crmd[19321]: 2007/09/20_10:38:20 info:
do_state_transition: State
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE
[ input=I_PE_SUCCESS
cause=C_IPC_MESSAGE origin=route_message ]
tengine[19328]: 2007/09/20_10:38:20 info:
unpack_graph: Unpacked
transition 5: 16 actions in 16 synapses
tengine[19328]: 2007/09/20_10:38:20 info:
te_pseudo_action: Pseudo
action 26 fired and confirmed
tengine[19328]: 2007/09/20_10:38:20 info:
send_rsc_command: Initiating
action 21: saslauthd_4_stop_0 on castor
tengine[19328]: 2007/09/20_10:38:22 info:
match_graph_event: Action
saslauthd_4_stop_0 (21) confirmed on castor
tengine[19328]: 2007/09/20_10:38:22 info:
send_rsc_command: Initiating
action 18: cyrus-imapd_3_stop_0 on castor
tengine[19328]: 2007/09/20_10:38:24 info:
match_graph_event: Action
cyrus-imapd_3_stop_0 (18) confirmed on castor
tengine[19328]: 2007/09/20_10:38:24 info:
send_rsc_command: Initiating
action 15: Filesystem_2_stop_0 on castor
tengine[19328]: 2007/09/20_10:38:26 info:
match_graph_event: Action
Filesystem_2_stop_0 (15) confirmed on castor
tengine[19328]: 2007/09/20_10:38:26 info:
send_rsc_command: Initiating
action 12: IPaddr_147_210_36_7_stop_0 on castor
tengine[19328]: 2007/09/20_10:38:27 info:
match_graph_event: Action
IPaddr_147_210_36_7_stop_0 (12) confirmed on castor
tengine[19328]: 2007/09/20_10:38:27 info:
te_pseudo_action: Pseudo
action 27 fired and confirmed
tengine[19328]: 2007/09/20_10:38:27 info:
te_pseudo_action: Pseudo
action 24 fired and confirmed
tengine[19328]: 2007/09/20_10:38:27 info:
send_rsc_command: Initiating
action 13: IPaddr_147_210_36_7_start_0 on pollux
crmd[19321]: 2007/09/20_10:38:27 info:
do_lrm_rsc_op: Performing
op=IPaddr_147_210_36_7_start_0
key=13:5:b9f1026e-93d3-46f1-9b87-857212f2fd7e)
lrmd[19318]: 2007/09/20_10:38:27 info: RA output:
(IPaddr_147_210_36_7:start:stderr) Rewrote octal
netmask as: 24

IPaddr[19558][19597]: 2007/09/20_10:38:27 DEBUG:
Using calculated
broadcast for 147.210.36.7: 147.210.36.255
IPaddr[19558][19614]: 2007/09/20_10:38:27 INFO:
eval ifconfig eth0:0
147.210.36.7 netmask 255.255.255.0 broadcast
147.210.36.255
IPaddr[19558][19619]: 2007/09/20_10:38:27 DEBUG:
Sending Gratuitous Arp
for 147.210.36.7 on eth0:0 [eth0]
crmd[19321]: 2007/09/20_10:38:27 info:
process_lrm_event: LRM operation
IPaddr_147_210_36_7_start_0 (call=14, rc=0) complete
crmd[19321]: 2007/09/20_10:38:27 info:
build_operation_update: Digest
for 0:0;13:5:b9f1026e-93d3-46f1-9b87-857212f2fd7e
(IPaddr_147_210_36_7_start_0) was
e03993409a8940d5daa9e68a96ee5f0c

crmd[19321]: 2007/09/20_10:38:27 info:
log_data_element:
build_operation_update: digest:source <parameters
ip="147.210.36.7"
netmask="255.255.255.0" nic="eth0"/>
tengine[19328]: 2007/09/20_10:38:27 info:
match_graph_event: Action
IPaddr_147_210_36_7_start_0 (13) confirmed on pollux
tengine[19328]: 2007/09/20_10:38:27 info:
send_rsc_command: Initiating
action 14: IPaddr_147_210_36_7_monitor_5000 on pollux
tengine[19328]: 2007/09/20_10:38:27 info:
send_rsc_command: Initiating
action 16: Filesystem_2_start_0 on pollux
crmd[19321]: 2007/09/20_10:38:27 info:
do_lrm_rsc_op: Performing
op=IPaddr_147_210_36_7_monitor_5000
key=14:5:b9f1026e-93d3-46f1-9b87-857212f2fd7e)
crmd[19321]: 2007/09/20_10:38:27 info:
do_lrm_rsc_op: Performing
op=Filesystem_2_start_0
key=16:5:b9f1026e-93d3-46f1-9b87-857212f2fd7e)
crmd[19321]: 2007/09/20_10:38:27 info:
process_lrm_event: LRM operation
IPaddr_147_210_36_7_monitor_5000 (call=15, rc=0)
complete
Filesystem[19645][19693]: 2007/09/20_10:38:27
INFO: Running start for
/dev/VolGroup01/maillv on /mailsan
tengine[19328]: 2007/09/20_10:38:27 info:
match_graph_event: Action
IPaddr_147_210_36_7_monitor_5000 (14) confirmed on
pollux
crmd[19321]: 2007/09/20_10:38:27 info:
process_lrm_event: LRM operation
Filesystem_2_start_0 (call=16, rc=0) complete
crmd[19321]: 2007/09/20_10:38:27 info:
build_operation_update: Digest
for 0:0;16:5:b9f1026e-93d3-46f1-9b87-857212f2fd7e
(Filesystem_2_start_0)
was c2fd28c4595a232e7f3843e77f7214e4

crmd[19321]: 2007/09/20_10:38:27 info:
log_data_element:
build_operation_update: digest:source <parameters
directory="/mailsan"
fstype="ext3" device="/dev/VolGroup01/maillv"
options="noatime"/>
tengine[19328]: 2007/09/20_10:38:27 info:
match_graph_event: Action
Filesystem_2_start_0 (16) confirmed on pollux
tengine[19328]: 2007/09/20_10:38:27 info:
send_rsc_command: Initiating
action 17: Filesystem_2_monitor_60000 on pollux
tengine[19328]: 2007/09/20_10:38:27 info:
send_rsc_command: Initiating
action 19: cyrus-imapd_3_start_0 on pollux
crmd[19321]: 2007/09/20_10:38:27 info:
do_lrm_rsc_op: Performing
op=Filesystem_2_monitor_60000
key=17:5:b9f1026e-93d3-46f1-9b87-857212f2fd7e)
crmd[19321]: 2007/09/20_10:38:27 info:
do_lrm_rsc_op: Performing
op=cyrus-imapd_3_start_0
key=19:5:b9f1026e-93d3-46f1-9b87-857212f2fd7e)
lrmd[19716]: 2007/09/20_10:38:27 WARN: For LSB
init script, no
additional parameters are needed.
lrmd[19318]: 2007/09/20_10:38:27 info: RA output:
(cyrus-imapd_3:start:stdout) Importation des bases
de données cyrus-imapd
crmd[19321]: 2007/09/20_10:38:27 info:
process_lrm_event: LRM operation
Filesystem_2_monitor_60000 (call=17, rc=0) complete
tengine[19328]: 2007/09/20_10:38:27 info:
match_graph_event: Action
Filesystem_2_monitor_60000 (17) confirmed on pollux
lrmd[19318]: 2007/09/20_10:38:29 info: RA output:
(cyrus-imapd_3:start:stdout) [
lrmd[19318]: 2007/09/20_10:38:29 info: RA output:
(cyrus-imapd_3:start:stdout)   OK
lrmd[19318]: 2007/09/20_10:38:29 info: RA output:
(cyrus-imapd_3:start:stdout) ]
lrmd[19318]: 2007/09/20_10:38:29 info: RA output:
(cyrus-imapd_3:start:stdout)
lrmd[19318]: 2007/09/20_10:38:29 info: RA output:
(cyrus-imapd_3:start:stdout)

lrmd[19318]: 2007/09/20_10:38:29 info: RA output:
(cyrus-imapd_3:start:stdout) Démarrage de
cyrus-imapd :
lrmd[19318]: 2007/09/20_10:38:29 info: RA output:
(cyrus-imapd_3:start:stdout) [
lrmd[19318]: 2007/09/20_10:38:29 info: RA output:
(cyrus-imapd_3:start:stdout)   OK  ]
lrmd[19318]: 2007/09/20_10:38:29 info: RA output:
(cyrus-imapd_3:start:stdout)

crmd[19321]: 2007/09/20_10:38:29 info:
process_lrm_event: LRM operation
cyrus-imapd_3_start_0 (call=18, rc=0) complete
crmd[19321]: 2007/09/20_10:38:29 info:
build_operation_update: Digest
for 0:0;19:5:b9f1026e-93d3-46f1-9b87-857212f2fd7e
(cyrus-imapd_3_start_0) was
f2317cad3d54cec5d7d7aa7d0bf35cf8

crmd[19321]: 2007/09/20_10:38:29 info:
log_data_element:
build_operation_update: digest:source <parameters/>
tengine[19328]: 2007/09/20_10:38:29 info:
match_graph_event: Action
cyrus-imapd_3_start_0 (19) confirmed on pollux
tengine[19328]: 2007/09/20_10:38:29 info:
send_rsc_command: Initiating
action 20: cyrus-imapd_3_monitor_60000 on pollux
tengine[19328]: 2007/09/20_10:38:29 info:
send_rsc_command: Initiating
action 22: saslauthd_4_start_0 on pollux
crmd[19321]: 2007/09/20_10:38:29 info:
do_lrm_rsc_op: Performing
op=cyrus-imapd_3_monitor_60000
key=20:5:b9f1026e-93d3-46f1-9b87-857212f2fd7e)
crmd[19321]: 2007/09/20_10:38:29 info:
do_lrm_rsc_op: Performing
op=saslauthd_4_start_0
key=22:5:b9f1026e-93d3-46f1-9b87-857212f2fd7e)
lrmd[19870]: 2007/09/20_10:38:29 WARN: For LSB
init script, no
additional parameters are needed.
lrmd[19318]: 2007/09/20_10:38:29 info: RA output:
(saslauthd_4:start:stdout) Démarrage de saslauthd :
lrmd[19318]: 2007/09/20_10:38:29 info: RA output:
(saslauthd_4:start:stdout) [
lrmd[19318]: 2007/09/20_10:38:29 info: RA output:
(saslauthd_4:start:stdout)   OK  ]
lrmd[19318]: 2007/09/20_10:38:29 info: RA output:
(saslauthd_4:start:stdout)

crmd[19321]: 2007/09/20_10:38:29 info:
process_lrm_event: LRM operation
cyrus-imapd_3_monitor_60000 (call=19, rc=0) complete
crmd[19321]: 2007/09/20_10:38:29 info:
process_lrm_event: LRM operation
saslauthd_4_start_0 (call=20, rc=0) complete
crmd[19321]: 2007/09/20_10:38:29 info:
build_operation_update: Digest
for 0:0;22:5:b9f1026e-93d3-46f1-9b87-857212f2fd7e
(saslauthd_4_start_0)
was f2317cad3d54cec5d7d7aa7d0bf35cf8

crmd[19321]: 2007/09/20_10:38:29 info:
log_data_element:
build_operation_update: digest:source <parameters/>
tengine[19328]: 2007/09/20_10:38:29 info:
match_graph_event: Action
cyrus-imapd_3_monitor_60000 (20) confirmed on pollux
tengine[19328]: 2007/09/20_10:38:29 info:
match_graph_event: Action
saslauthd_4_start_0 (22) confirmed on pollux
tengine[19328]: 2007/09/20_10:38:29 info:
te_pseudo_action: Pseudo
action 25 fired and confirmed
tengine[19328]: 2007/09/20_10:38:29 info:
send_rsc_command: Initiating
action 23: saslauthd_4_monitor_60000 on pollux
crmd[19321]: 2007/09/20_10:38:29 info:
do_lrm_rsc_op: Performing
op=saslauthd_4_monitor_60000
key=23:5:b9f1026e-93d3-46f1-9b87-857212f2fd7e)
crmd[19321]: 2007/09/20_10:38:29 info:
process_lrm_event: LRM operation
saslauthd_4_monitor_60000 (call=21, rc=0) complete
tengine[19328]: 2007/09/20_10:38:29 info:
match_graph_event: Action
saslauthd_4_monitor_60000 (23) confirmed on pollux
tengine[19328]: 2007/09/20_10:38:29 info:
run_graph: Transition 5:
(Complete=16, Pending=0, Fired=0, Skipped=0,
Incomplete=0)
tengine[19328]: 2007/09/20_10:38:29 info:
notify_crmd: Transition 5
status: te_complete - <null>
crmd[19321]: 2007/09/20_10:38:29 info:
do_state_transition: State
transition S_TRANSITION_ENGINE -> S_IDLE [
input=I_TE_SUCCESS
cause=C_IPC_MESSAGE origin=route_message ]
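
To make it easier to see what transition 5 actually did, here is a small Python sketch that pulls the "Initiating action" records out of a log like the one above. It is only an illustration: the function name `initiated_actions` and the two-record `sample` are my own, and the regular expression assumes the tengine wording shown in this log. It tolerates the line wrapping added by the mail client because `\s+` also matches newlines.

```python
import re

# Matches tengine "Initiating action N: <operation> on <node>" records,
# even when the record was wrapped over several lines.
ACTION_RE = re.compile(r"Initiating\s+action\s+(\d+):\s+(\S+)\s+on\s+(\S+)")


def initiated_actions(log_text):
    """Return (action_id, operation, node) tuples in the order they were logged."""
    return [(int(num), op, node) for num, op, node in ACTION_RE.findall(log_text)]


# Two records copied from the log above, wrapping included.
sample = """\
tengine[19328]: 2007/09/20_10:38:20 info:
send_rsc_command: Initiating
action 21: saslauthd_4_stop_0 on castor
tengine[19328]: 2007/09/20_10:38:27 info:
send_rsc_command: Initiating
action 13: IPaddr_147_210_36_7_start_0 on pollux
"""

for num, op, node in initiated_actions(sample):
    print(f"{num:>3}  {op}  on  {node}")
```

Run against the full log, this lists only stop operations on castor followed by start and monitor operations on pollux, which matches what I see: the resources move, but nothing in the transition shoots castor.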

Do I need a particular constraint between pingd
and stonith?

>   
>> 2- Same problem with a resource failure: after the sixth failure 
>> (this depends on my stickiness configuration), the resources fail over to the 
>> standby node BUT again, stonith doesn't shoot the node.
>>     
>
> as long as the resource stops correctly, there is no need to shoot the node
>   
OK, I understand this. The filesystem resource is cleanly
unmounted (stopped), so there is no risk of data corruption
and no need to shoot the node.
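
Put as a rule of thumb, my understanding could be sketched like this in Python. This is a deliberately simplified model of the reasoning in this thread, not heartbeat's actual decision logic; the function `needs_fencing` and its two parameters are hypothetical names of mine.

```python
def needs_fencing(node_reachable: bool, stop_succeeded: bool) -> bool:
    """Simplified model: fence only when a clean stop cannot be confirmed.

    Assumption (not taken from the heartbeat source): a node is shot when
    the cluster cannot reach it at all, or when it is reachable but a
    resource failed to stop there.
    """
    if not node_reachable:
        return True
    return not stop_succeeded


# Active node dead (hardware failure, kill -9): no stop can be confirmed.
print(needs_fencing(node_reachable=False, stop_succeeded=False))  # True

# eth0 unplugged, eth1 still carries heartbeat: resources stop cleanly.
print(needs_fencing(node_reachable=True, stop_succeeded=True))    # False
```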

Thanks
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>
>   

