[Linux-HA] 2-node cluster: loss of node = loss of quorum which kills clustered services

Marc Mon, 19 May 2008 07:58:11 -0700

This works:


   - Start with resource group on main node, backup node is DC
   - Migrate resource group to backup node using
     crm_resource -M -r <grp> -H <backup node name>

I now see all the resources running on the backup node as expected.

Next I stop heartbeat on the main node (I want to perform maintenance).
Heartbeat allows this to happen.  Moments after the main node is gone,
though, the log shows that the cluster has lost quorum, and the resources
that are running on the backup node are stopped.

The total number of nodes in the cluster is 2.  I guess we'll never have
quorum if one of the nodes is down.  I wouldn't expect resources to be
released in this scenario though.  What is the right thing to do here?
Do I need to run a quorumd server?
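
To spell out my guess about the arithmetic, here is a minimal sketch. It
assumes quorum is a plain strict majority of the configured nodes; I have not
confirmed that this is exactly how ccm decides, and has_quorum is a
hypothetical helper of mine, not a Heartbeat API.

```python
def has_quorum(online_nodes: int, total_nodes: int) -> bool:
    """Assumed rule: strictly more than half of the configured nodes are online."""
    return 2 * online_nodes > total_nodes


# Our two-node cluster: both nodes up, then one stopped for maintenance.
print(has_quorum(2, 2))  # True
print(has_quorum(1, 2))  # False

# A three-node cluster for comparison: losing one node keeps the majority.
print(has_quorum(2, 3))  # True
```

Under that assumption a two-node cluster can never keep quorum once either
node is down, which matches what I see in the logs.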

Our communication settings from ha.cf are:
baud   19200
serial /dev/ttyS0
ucast bond1 192.168.50.1

Here are some of the logs:
pengine[10338]: 2008/05/19_10:36:51 info: determine_online_status: Node
dbnya1.mycompany.com is shutting down
pengine[10338]: 2008/05/19_10:36:51 notice: group_print: Resource Group:
pgsql.myapp.group
pengine[10338]: 2008/05/19_10:36:51 notice: native_print:
pgsql.myapp.ip    (heartbeat::ocf:IPaddr):        Started
dbnya2.mycompany.com
pengine[10338]: 2008/05/19_10:36:51 notice: native_print:
pgsql.myapp.fsData        (heartbeat::ocf:Filesystem):    Started
dbnya2.mycompany.com
pengine[10338]: 2008/05/19_10:36:51 notice: native_print:
pgsql.myapp.fsTxnLog      (heartbeat::ocf:Filesystem):    Started
dbnya2.mycompany.com
pengine[10338]: 2008/05/19_10:36:51 notice: native_print:
pgsql.myapp.nfslock       (lsb:nfslock):  Started dbnya2.mycompany.com
pengine[10338]: 2008/05/19_10:36:51 notice: native_print:
pgsql.myapp.nfs   (lsb:nfs-mycompany):     Started dbnya2.mycompany.com
pengine[10338]: 2008/05/19_10:36:51 notice: native_print:
pgsql.myapp.pgsql (heartbeat::ocf:pgsql): Started dbnya2.mycompany.com
pengine[10338]: 2008/05/19_10:36:51 notice: NoRoleChange: Leave resource
pgsql.myapp.ip (dbnya2.mycompany.com)
pengine[10338]: 2008/05/19_10:36:51 notice: NoRoleChange: Leave resource
pgsql.myapp.fsData     (dbnya2.mycompany.com)
pengine[10338]: 2008/05/19_10:36:51 notice: NoRoleChange: Leave resource
pgsql.myapp.fsTxnLog   (dbnya2.mycompany.com)
pengine[10338]: 2008/05/19_10:36:51 notice: NoRoleChange: Leave resource
pgsql.myapp.nfslock    (dbnya2.mycompany.com)
pengine[10338]: 2008/05/19_10:36:51 notice: NoRoleChange: Leave resource
pgsql.myapp.nfs        (dbnya2.mycompany.com)
pengine[10338]: 2008/05/19_10:36:51 notice: NoRoleChange: Leave resource
pgsql.myapp.pgsql      (dbnya2.mycompany.com)
crmd[8999]: 2008/05/19_10:36:51 info: do_state_transition: State transition
S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
cause=C_IPC_MESSAGE origin=route_message ]
pengine[10338]: 2008/05/19_10:36:51 info: stage6: Scheduling Node
dbnya1.mycompany.com for shutdown
tengine[10337]: 2008/05/19_10:36:51 info: unpack_graph: Unpacked transition
3: 1 actions in 1 synapses
tengine[10337]: 2008/05/19_10:36:51 info: te_crm_command: Executing
crm-event (22): do_shutdown on dbnya1.mycompany.com
pengine[10338]: 2008/05/19_10:36:51 info: process_pe_message: Transition 3:
PEngine Input stored in: /var/lib/heartbeat/pengine/pe-input-143.bz2
crmd[8999]: 2008/05/19_10:36:51 notice: crmd_client_status_callback: Status
update: Client dbnya1.mycompany.com/crmd now has status [offline]
cib[8995]: 2008/05/19_10:36:52 info: cib_process_shutdown_req: Shutdown REQ
from dbnya1.mycompany.com
cib[8995]: 2008/05/19_10:36:52 info: cib_client_status_callback: Status
update: Client dbnya1.mycompany.com/cib now has status [leave]
crmd[8999]: 2008/05/19_10:36:52 info: mem_handle_event: Got an event
OC_EV_MS_INVALID from ccm
cib[8995]: 2008/05/19_10:36:52 info: mem_handle_event: Got an event
OC_EV_MS_INVALID from ccm
crmd[8999]: 2008/05/19_10:36:52 info: mem_handle_event: no mbr_track info
cib[8995]: 2008/05/19_10:36:52 info: mem_handle_event: no mbr_track info
crmd[8999]: 2008/05/19_10:36:52 info: mem_handle_event: Got an event
OC_EV_MS_INVALID from ccm
tengine[10337]: 2008/05/19_10:36:52 info: update_abort_priority: Abort
priority upgraded to 1000000
cib[8995]: 2008/05/19_10:36:52 info: mem_handle_event: Got an event
OC_EV_MS_INVALID from ccm
crmd[8999]: 2008/05/19_10:36:52 info: mem_handle_event: instance=10,
nodes=1, new=0, lost=1, n_idx=0, new_idx=1, old_idx=4
tengine[10337]: 2008/05/19_10:36:52 info: update_abort_priority: Abort
action 0 superceeded by 2
cib[8995]: 2008/05/19_10:36:52 info: mem_handle_event: instance=10, nodes=1,
new=0, lost=1, n_idx=0, new_idx=1, old_idx=4
crmd[8999]: 2008/05/19_10:36:52 info: crmd_ccm_msg_callback: Quorum lost
after event=INVALID (id=10)
cib[8995]: 2008/05/19_10:36:52 info: cib_ccm_msg_callback: LOST:
dbnya1.mycompany.com
crmd[8999]: 2008/05/19_10:36:52 info: crmd_ccm_msg_callback: Quorum lost:
triggering transition (INVALID)
cib[8995]: 2008/05/19_10:36:52 info: cib_ccm_msg_callback: PEER:
dbnya2.mycompany.com
crmd[8999]: 2008/05/19_10:36:52 info: ccm_event_detail: INVALID: trans=10,
nodes=1, new=0, lost=1 n_idx=0, new_idx=1, old_idx=4
crmd[8999]: 2008/05/19_10:36:52 info: ccm_event_detail:         CURRENT:
dbnya2.mycompany.com [nodeid=1, born=10]
crmd[8999]: 2008/05/19_10:36:52 info: ccm_event_detail:         LOST:
dbnya1.mycompany.com [nodeid=0, born=9]
tengine[10337]: 2008/05/19_10:36:52 info: run_graph: Transition 3:
(Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0)
crmd[8999]: 2008/05/19_10:36:52 info: do_state_transition: State transition
S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC
cause=C_IPC_MESSAGE origin=route_message ]
crmd[8999]: 2008/05/19_10:36:52 info: do_state_transition: All 1 cluster
nodes are eligible to run resources.
pengine[10338]: 2008/05/19_10:36:52 WARN: cluster_status: We do not have
quorum - fencing and resource management disabled
pengine[10338]: 2008/05/19_10:36:52 info: determine_online_status: Node
dbnya2.mycompany.com is online
pengine[10338]: 2008/05/19_10:36:52 notice: group_print: Resource Group:
pgsql.myapp.group
pengine[10338]: 2008/05/19_10:36:52 notice: native_print:
pgsql.myapp.ip    (heartbeat::ocf:IPaddr):        Started
dbnya2.mycompany.com
pengine[10338]: 2008/05/19_10:36:52 notice: native_print:
pgsql.myapp.fsData        (heartbeat::ocf:Filesystem):    Started
dbnya2.mycompany.com
pengine[10338]: 2008/05/19_10:36:52 notice: native_print:
pgsql.myapp.fsTxnLog      (heartbeat::ocf:Filesystem):    Started
dbnya2.mycompany.com
pengine[10338]: 2008/05/19_10:36:52 notice: native_print:
pgsql.myapp.nfslock       (lsb:nfslock):  Started dbnya2.mycompany.com
pengine[10338]: 2008/05/19_10:36:52 notice: native_print:
pgsql.myapp.nfs   (lsb:nfs-mycompany):     Started dbnya2.mycompany.com
pengine[10338]: 2008/05/19_10:36:52 notice: native_print:
pgsql.myapp.pgsql (heartbeat::ocf:pgsql): Started dbnya2.mycompany.com
pengine[10338]: 2008/05/19_10:36:52 notice: StopRsc:
dbnya2.mycompany.com
Stop pgsql.myapp.ip
pengine[10338]: 2008/05/19_10:36:52 notice: StopRsc:
dbnya2.mycompany.com
Stop pgsql.myapp.fsData
pengine[10338]: 2008/05/19_10:36:52 notice: StopRsc:
dbnya2.mycompany.com
Stop pgsql.myapp.fsTxnLog
pengine[10338]: 2008/05/19_10:36:52 notice: StopRsc:
dbnya2.mycompany.com
Stop pgsql.myapp.nfslock
pengine[10338]: 2008/05/19_10:36:52 notice: StopRsc:
dbnya2.mycompany.com
Stop pgsql.myapp.nfs
pengine[10338]: 2008/05/19_10:36:52 notice: StopRsc:
dbnya2.mycompany.com
Stop pgsql.myapp.pgsql
crmd[8999]: 2008/05/19_10:36:52 info: do_state_transition: State transition
S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
cause=C_IPC_MESSAGE origin=route_message ]
tengine[10337]: 2008/05/19_10:36:52 info: unpack_graph: Unpacked transition
4: 9 actions in 9 synapses
tengine[10337]: 2008/05/19_10:36:52 info: te_pseudo_action: Pseudo action 18
fired and confirmed
tengine[10337]: 2008/05/19_10:36:52 info: send_rsc_command: Initiating
action 14: pgsql.myapp.pgsql_stop_0 on dbnya2.mycompany.com
crmd[8999]: 2008/05/19_10:36:52 info: do_lrm_rsc_op: Performing
op=pgsql.myapp.pgsql_stop_0 key=14:4:60fbb97d-8b53-401c-85c6-7f7e57de9b00)
lrmd[8996]: 2008/05/19_10:36:52 info: rsc:pgsql.myapp.pgsql: stop
pengine[10338]: 2008/05/19_10:36:52 info: process_pe_message: Transition 4:
PEngine Input stored in: /var/lib/heartbeat/pengine/pe-input-144.bz2
pgsql[19654][19679]: 2008/05/19_10:36:53 INFO: PostgreSQL is down
crmd[8999]: 2008/05/19_10:36:54 info: process_lrm_event: LRM operation
pgsql.myapp.pgsql_stop_0 (call=27, rc=0) complete
tengine[10337]: 2008/05/19_10:36:54 info: match_graph_event: Action
pgsql.myapp.pgsql_stop_0 (14) confirmed on dbnya2.mycompany.com (rc=0)
tengine[10337]: 2008/05/19_10:36:54 info: send_rsc_command: Initiating
action 12: pgsql.myapp.nfs_stop_0 on dbnya2.mycompany.com
crmd[8999]: 2008/05/19_10:36:54 info: do_lrm_rsc_op: Performing
op=pgsql.myapp.nfs_stop_0 key=12:4:60fbb97d-8b53-401c-85c6-7f7e57de9b00)
lrmd[8996]: 2008/05/19_10:36:54 info: rsc:pgsql.myapp.nfs: stop
lrmd[19683]: 2008/05/19_10:36:54 WARN: For LSB init script, no additional
parameters are needed.
lrmd[8996]: 2008/05/19_10:36:54 info: RA output:
(pgsql.myapp.nfs:stop:stdout) Shutting down NFS mountd:
lrmd[8996]: 2008/05/19_10:36:55 info: RA output:
(pgsql.myapp.nfs:stop:stdout) [
lrmd[8996]: 2008/05/19_10:36:55 info: RA output:
(pgsql.myapp.nfs:stop:stdout)   OK  ]
lrmd[8996]: 2008/05/19_10:36:55 info: RA output:
(pgsql.myapp.nfs:stop:stdout)
lrmd[8996]: 2008/05/19_10:36:55 info: RA output:
(pgsql.myapp.nfs:stop:stdout)

lrmd[8996]: 2008/05/19_10:36:55 info: RA output:
(pgsql.myapp.nfs:stop:stdout) Shutting down NFS daemon:
lrmd[8996]: 2008/05/19_10:36:55 info: RA output:
(pgsql.myapp.nfs:stop:stdout) [
lrmd[8996]: 2008/05/19_10:36:55 info: RA output:
(pgsql.myapp.nfs:stop:stdout)   OK
lrmd[8996]: 2008/05/19_10:36:55 info: RA output:
(pgsql.myapp.nfs:stop:stdout) ]
lrmd[8996]: 2008/05/19_10:36:55 info: RA output:
(pgsql.myapp.nfs:stop:stdout)
lrmd[8996]: 2008/05/19_10:36:58 info: RA output:
(pgsql.myapp.nfs:stop:stdout) nfsd (pid 19578 19577 19576 19575 19574 19573
19572 19571) is running...

lrmd[8996]: 2008/05/19_10:36:58 info: RA output:
(pgsql.myapp.nfs:stop:stdout) Force-killing nfs daemon:
lrmd[8996]: 2008/05/19_10:37:02 info: RA output:
(pgsql.myapp.nfs:stop:stdout) [
lrmd[8996]: 2008/05/19_10:37:02 info: RA output:
(pgsql.myapp.nfs:stop:stdout)   OK  ]
lrmd[8996]: 2008/05/19_10:37:02 info: RA output:
(pgsql.myapp.nfs:stop:stdout)

lrmd[8996]: 2008/05/19_10:37:02 info: RA output:
(pgsql.myapp.nfs:stop:stdout) Shutting down NFS quotas:
lrmd[8996]: 2008/05/19_10:37:02 info: RA output:
(pgsql.myapp.nfs:stop:stdout) [
lrmd[8996]: 2008/05/19_10:37:02 info: RA output:
(pgsql.myapp.nfs:stop:stdout)   OK  ]
lrmd[8996]: 2008/05/19_10:37:02 info: RA output:
(pgsql.myapp.nfs:stop:stdout)

lrmd[8996]: 2008/05/19_10:37:02 info: RA output:
(pgsql.myapp.nfs:stop:stdout) Shutting down NFS services:
lrmd[8996]: 2008/05/19_10:37:02 info: RA output:
(pgsql.myapp.nfs:stop:stdout) [
lrmd[8996]: 2008/05/19_10:37:02 info: RA output:
(pgsql.myapp.nfs:stop:stdout)   OK
lrmd[8996]: 2008/05/19_10:37:02 info: RA output:
(pgsql.myapp.nfs:stop:stdout) ]
lrmd[8996]: 2008/05/19_10:37:02 info: RA output:
(pgsql.myapp.nfs:stop:stdout)
lrmd[8996]: 2008/05/19_10:37:02 info: RA output:
(pgsql.myapp.nfs:stop:stdout)

crmd[8999]: 2008/05/19_10:37:02 info: process_lrm_event: LRM operation
pgsql.myapp.nfs_stop_0 (call=28, rc=0) complete
tengine[10337]: 2008/05/19_10:37:02 info: match_graph_event: Action
pgsql.myapp.nfs_stop_0 (12) confirmed on dbnya2.mycompany.com (rc=0)
tengine[10337]: 2008/05/19_10:37:02 info: send_rsc_command: Initiating
action 10: pgsql.myapp.nfslock_stop_0 on dbnya2.mycompany.com
crmd[8999]: 2008/05/19_10:37:02 info: do_lrm_rsc_op: Performing
op=pgsql.myapp.nfslock_stop_0 key=10:4:60fbb97d-8b53-401c-85c6-7f7e57de9b00)
lrmd[8996]: 2008/05/19_10:37:02 info: rsc:pgsql.myapp.nfslock: stop
lrmd[19726]: 2008/05/19_10:37:02 WARN: For LSB init script, no additional
parameters are needed.
lrmd[8996]: 2008/05/19_10:37:02 info: RA output:
(pgsql.myapp.nfslock:stop:stdout) Stopping NFS statd:
lrmd[8996]: 2008/05/19_10:37:02 info: RA output:
(pgsql.myapp.nfslock:stop:stdout) [
lrmd[8996]: 2008/05/19_10:37:02 info: RA output:
(pgsql.myapp.nfslock:stop:stdout)   OK  ]
lrmd[8996]: 2008/05/19_10:37:02 info: RA output:
(pgsql.myapp.nfslock:stop:stdout)

crmd[8999]: 2008/05/19_10:37:02 info: process_lrm_event: LRM operation
pgsql.myapp.nfslock_stop_0 (call=29, rc=0) complete
tengine[10337]: 2008/05/19_10:37:02 info: match_graph_event: Action
pgsql.myapp.nfslock_stop_0 (10) confirmed on dbnya2.mycompany.com (rc=0)
tengine[10337]: 2008/05/19_10:37:02 info: send_rsc_command: Initiating
action 8: pgsql.myapp.fsTxnLog_stop_0 on dbnya2.mycompany.com
crmd[8999]: 2008/05/19_10:37:02 info: do_lrm_rsc_op: Performing
op=pgsql.myapp.fsTxnLog_stop_0 key=8:4:60fbb97d-8b53-401c-85c6-7f7e57de9b00)
lrmd[8996]: 2008/05/19_10:37:02 info: rsc:pgsql.myapp.fsTxnLog: stop
Filesystem[19747][19777]: 2008/05/19_10:37:02 INFO: Running stop for
/dev/mapper/md3000-txnlogp1 on /opt/data/md3000/txnlog
Filesystem[19747][19787]: 2008/05/19_10:37:02 INFO: Trying to unmount
/opt/data/md3000/txnlog
Filesystem[19747][19793]: 2008/05/19_10:37:02 INFO: unmounted
/opt/data/md3000/txnlog successfully
crmd[8999]: 2008/05/19_10:37:02 info: process_lrm_event: LRM operation
pgsql.myapp.fsTxnLog_stop_0 (call=30, rc=0) complete
tengine[10337]: 2008/05/19_10:37:02 info: match_graph_event: Action
pgsql.myapp.fsTxnLog_stop_0 (8) confirmed on dbnya2.mycompany.com (rc=0)
tengine[10337]: 2008/05/19_10:37:02 info: send_rsc_command: Initiating
action 6: pgsql.myapp.fsData_stop_0 on dbnya2.mycompany.com
crmd[8999]: 2008/05/19_10:37:02 info: do_lrm_rsc_op: Performing
op=pgsql.myapp.fsData_stop_0 key=6:4:60fbb97d-8b53-401c-85c6-7f7e57de9b00)
lrmd[8996]: 2008/05/19_10:37:02 info: rsc:pgsql.myapp.fsData: stop
Filesystem[19808][19838]: 2008/05/19_10:37:02 INFO: Running stop for
/dev/mapper/md3000-datap1 on /opt/data/md3000/data
Filesystem[19808][19848]: 2008/05/19_10:37:02 INFO: Trying to unmount
/opt/data/md3000/data
Filesystem[19808][19859]: 2008/05/19_10:37:02 INFO: unmounted
/opt/data/md3000/data successfully
crmd[8999]: 2008/05/19_10:37:02 info: process_lrm_event: LRM operation
pgsql.myapp.fsData_stop_0 (call=31, rc=0) complete
tengine[10337]: 2008/05/19_10:37:02 info: match_graph_event: Action
pgsql.myapp.fsData_stop_0 (6) confirmed on dbnya2.mycompany.com (rc=0)
tengine[10337]: 2008/05/19_10:37:02 info: send_rsc_command: Initiating
action 4: pgsql.myapp.ip_stop_0 on dbnya2.mycompany.com
crmd[8999]: 2008/05/19_10:37:02 info: do_lrm_rsc_op: Performing
op=pgsql.myapp.ip_stop_0 key=4:4:60fbb97d-8b53-401c-85c6-7f7e57de9b00)
crmd[8999]: 2008/05/19_10:37:02 info: do_lrm_rsc_op: Performing
op=pgsql.myapp.ip_stop_0 key=4:4:60fbb97d-8b53-401c-85c6-7f7e57de9b00)
lrmd[8996]: 2008/05/19_10:37:02 info: rsc:pgsql.myapp.ip: stop
lrmd[8996]: 2008/05/19_10:37:02 info: RA output:
(pgsql.myapp.ip:stop:stdout) In IP Stop

lrmd[8996]: 2008/05/19_10:37:02 info: RA output:
(pgsql.myapp.ip:stop:stderr) SIOCDELRT: No such process

IPaddr[19869][19884]: 2008/05/19_10:37:02 INFO: ifconfig bond0:0 down
crmd[8999]: 2008/05/19_10:37:02 info: process_lrm_event: LRM operation
pgsql.myapp.ip_stop_0 (call=32, rc=0) complete
tengine[10337]: 2008/05/19_10:37:02 info: match_graph_event: Action
pgsql.myapp.ip_stop_0 (4) confirmed on dbnya2.mycompany.com (rc=0)
tengine[10337]: 2008/05/19_10:37:02 info: te_pseudo_action: Pseudo action 19
fired and confirmed
tengine[10337]: 2008/05/19_10:37:02 info: te_pseudo_action: Pseudo action 3
fired and confirmed
crmd[8999]: 2008/05/19_10:37:02 info: do_state_transition: State transition
S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_IPC_MESSAGE origin=route_message ]
tengine[10337]: 2008/05/19_10:37:02 info: run_graph: Transition 4:
(Complete=9, Pending=0, Fired=0, Skipped=0, Incomplete=0)
tengine[10337]: 2008/05/19_10:37:02 info: notify_crmd: Transition 4 status:
te_complete - <null>
heartbeat[7257]: 2008/05/19_10:37:23 WARN: node dbnya1.mycompany.com: is
dead
crmd[8999]: 2008/05/19_10:37:23 notice: crmd_ha_status_callback: Status
update: Node dbnya1.mycompany.com now has status [dead]
heartbeat[7257]: 2008/05/19_10:37:23 info: Link dbnya1.mycompany.com:/dev/ttyS0
dead.
heartbeat[7257]: 2008/05/19_10:37:23 info: Link dbnya1.mycompany.com:bond1
dead.
cib[8995]: 2008/05/19_10:40:45 info: cib_stats: Processed 29 operations
(2758.00us average, 0% utilization) in the last 10min
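
For anyone skimming the log, the order in which the stops completed can be
pulled out with a short script. This is only a rough sketch: the LOG string
holds the six process_lrm_event lines from above with the mail wrapping
undone, and completed_stops and the regular expression are my own
hypothetical helpers, not part of Heartbeat.

```python
import re

# The six "LRM operation ... complete" lines from the log above, unwrapped.
LOG = """\
crmd[8999]: 2008/05/19_10:36:54 info: process_lrm_event: LRM operation pgsql.myapp.pgsql_stop_0 (call=27, rc=0) complete
crmd[8999]: 2008/05/19_10:37:02 info: process_lrm_event: LRM operation pgsql.myapp.nfs_stop_0 (call=28, rc=0) complete
crmd[8999]: 2008/05/19_10:37:02 info: process_lrm_event: LRM operation pgsql.myapp.nfslock_stop_0 (call=29, rc=0) complete
crmd[8999]: 2008/05/19_10:37:02 info: process_lrm_event: LRM operation pgsql.myapp.fsTxnLog_stop_0 (call=30, rc=0) complete
crmd[8999]: 2008/05/19_10:37:02 info: process_lrm_event: LRM operation pgsql.myapp.fsData_stop_0 (call=31, rc=0) complete
crmd[8999]: 2008/05/19_10:37:02 info: process_lrm_event: LRM operation pgsql.myapp.ip_stop_0 (call=32, rc=0) complete
"""

# Captures the timestamp, the resource name, and the return code of each stop.
STOP_DONE = re.compile(
    r"(\d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2}) info: process_lrm_event: "
    r"LRM operation (\S+)_stop_0 \(call=\d+, rc=(\d+)\) complete"
)


def completed_stops(text):
    """Return (timestamp, resource, rc) for each completed stop, in log order."""
    return [(m.group(1), m.group(2), int(m.group(3)))
            for m in STOP_DONE.finditer(text)]


for timestamp, resource, rc in completed_stops(LOG):
    print(timestamp, resource, "rc=%d" % rc)
```

All six resources in the group stop cleanly (rc=0) within about ten seconds
of the quorum loss, starting with pgsql and ending with the IP address.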
