This works:
- Start with the resource group on the main node; the backup node is the DC.
- Migrate the resource group to the backup node using crm_resource -M -r <grp> -H <backup node name>

I now see all the resources running on the backup node, as expected.

Next, I stop heartbeat on the main node (I want to perform maintenance). Heartbeat allows this to happen. Moments after the main node is gone, though, the log shows that the backup node has lost quorum, and it stops the resources running on it.

The total number of nodes in the cluster is 2, so I guess we'll never have quorum if one of the nodes is down. I wouldn't expect the resources to be released in this scenario, though. What is the right thing to do here? Do I need to run a quorumd server?

Our communication settings from ha.cf are:

baud 19200
serial /dev/ttyS0
ucast bond1 192.168.50.1

Here are some of the logs:

pengine[10338]: 2008/05/19_10:36:51 info: determine_online_status: Node dbnya1.mycompany.com is shutting down
pengine[10338]: 2008/05/19_10:36:51 notice: group_print: Resource Group: pgsql.myapp.group
pengine[10338]: 2008/05/19_10:36:51 notice: native_print: pgsql.myapp.ip (heartbeat::ocf:IPaddr): Started dbnya2.mycompany.com
pengine[10338]: 2008/05/19_10:36:51 notice: native_print: pgsql.myapp.fsData (heartbeat::ocf:Filesystem): Started dbnya2.mycompany.com
pengine[10338]: 2008/05/19_10:36:51 notice: native_print: pgsql.myapp.fsTxnLog (heartbeat::ocf:Filesystem): Started dbnya2.mycompany.com
pengine[10338]: 2008/05/19_10:36:51 notice: native_print: pgsql.myapp.nfslock (lsb:nfslock): Started dbnya2.mycompany.com
pengine[10338]: 2008/05/19_10:36:51 notice: native_print: pgsql.myapp.nfs (lsb:nfs-mycompany): Started dbnya2.mycompany.com
pengine[10338]: 2008/05/19_10:36:51 notice: native_print: pgsql.myapp.pgsql (heartbeat::ocf:pgsql): Started dbnya2.mycompany.com
pengine[10338]: 2008/05/19_10:36:51 notice: NoRoleChange: Leave resource pgsql.myapp.ip (dbnya2.mycompany.com)
pengine[10338]: 2008/05/19_10:36:51 notice: NoRoleChange: Leave resource pgsql.myapp.fsData (dbnya2.mycompany.com)
pengine[10338]: 2008/05/19_10:36:51 notice: NoRoleChange: Leave resource pgsql.myapp.fsTxnLog (dbnya2.mycompany.com)
pengine[10338]: 2008/05/19_10:36:51 notice: NoRoleChange: Leave resource pgsql.myapp.nfslock (dbnya2.mycompany.com)
pengine[10338]: 2008/05/19_10:36:51 notice: NoRoleChange: Leave resource pgsql.myapp.nfs (dbnya2.mycompany.com)
pengine[10338]: 2008/05/19_10:36:51 notice: NoRoleChange: Leave resource pgsql.myapp.pgsql (dbnya2.mycompany.com)
crmd[8999]: 2008/05/19_10:36:51 info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=route_message ]
pengine[10338]: 2008/05/19_10:36:51 info: stage6: Scheduling Node dbnya1.mycompany.com for shutdown
tengine[10337]: 2008/05/19_10:36:51 info: unpack_graph: Unpacked transition 3: 1 actions in 1 synapses
tengine[10337]: 2008/05/19_10:36:51 info: te_crm_command: Executing crm-event (22): do_shutdown on dbnya1.mycompany.com
pengine[10338]: 2008/05/19_10:36:51 info: process_pe_message: Transition 3: PEngine Input stored in: /var/lib/heartbeat/pengine/pe-input-143.bz2
crmd[8999]: 2008/05/19_10:36:51 notice: crmd_client_status_callback: Status update: Client dbnya1.mycompany.com/crmd now has status [offline]
cib[8995]: 2008/05/19_10:36:52 info: cib_process_shutdown_req: Shutdown REQ from dbnya1.mycompany.com
cib[8995]: 2008/05/19_10:36:52 info: cib_client_status_callback: Status update: Client dbnya1.mycompany.com/cib now has status [leave]
crmd[8999]: 2008/05/19_10:36:52 info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
cib[8995]: 2008/05/19_10:36:52 info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
crmd[8999]: 2008/05/19_10:36:52 info: mem_handle_event: no mbr_track info
cib[8995]: 2008/05/19_10:36:52 info: mem_handle_event: no mbr_track info
crmd[8999]: 2008/05/19_10:36:52 info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
tengine[10337]: 2008/05/19_10:36:52 info: update_abort_priority: Abort priority upgraded to 1000000
cib[8995]: 2008/05/19_10:36:52 info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
crmd[8999]: 2008/05/19_10:36:52 info: mem_handle_event: instance=10, nodes=1, new=0, lost=1, n_idx=0, new_idx=1, old_idx=4
tengine[10337]: 2008/05/19_10:36:52 info: update_abort_priority: Abort action 0 superceeded by 2
cib[8995]: 2008/05/19_10:36:52 info: mem_handle_event: instance=10, nodes=1, new=0, lost=1, n_idx=0, new_idx=1, old_idx=4
crmd[8999]: 2008/05/19_10:36:52 info: crmd_ccm_msg_callback: Quorum lost after event=INVALID (id=10)
cib[8995]: 2008/05/19_10:36:52 info: cib_ccm_msg_callback: LOST: dbnya1.mycompany.com
crmd[8999]: 2008/05/19_10:36:52 info: crmd_ccm_msg_callback: Quorum lost: triggering transition (INVALID)
cib[8995]: 2008/05/19_10:36:52 info: cib_ccm_msg_callback: PEER: dbnya2.mycompany.com
crmd[8999]: 2008/05/19_10:36:52 info: ccm_event_detail: INVALID: trans=10, nodes=1, new=0, lost=1 n_idx=0, new_idx=1, old_idx=4
crmd[8999]: 2008/05/19_10:36:52 info: ccm_event_detail: CURRENT: dbnya2.mycompany.com [nodeid=1, born=10]
crmd[8999]: 2008/05/19_10:36:52 info: ccm_event_detail: LOST: dbnya1.mycompany.com [nodeid=0, born=9]
tengine[10337]: 2008/05/19_10:36:52 info: run_graph: Transition 3: (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0)
crmd[8999]: 2008/05/19_10:36:52 info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_IPC_MESSAGE origin=route_message ]
crmd[8999]: 2008/05/19_10:36:52 info: do_state_transition: All 1 cluster nodes are eligible to run resources.
pengine[10338]: 2008/05/19_10:36:52 WARN: cluster_status: We do not have quorum - fencing and resource management disabled
pengine[10338]: 2008/05/19_10:36:52 info: determine_online_status: Node dbnya2.mycompany.com is online
pengine[10338]: 2008/05/19_10:36:52 notice: group_print: Resource Group: pgsql.myapp.group
pengine[10338]: 2008/05/19_10:36:52 notice: native_print: pgsql.myapp.ip (heartbeat::ocf:IPaddr): Started dbnya2.mycompany.com
pengine[10338]: 2008/05/19_10:36:52 notice: native_print: pgsql.myapp.fsData (heartbeat::ocf:Filesystem): Started dbnya2.mycompany.com
pengine[10338]: 2008/05/19_10:36:52 notice: native_print: pgsql.myapp.fsTxnLog (heartbeat::ocf:Filesystem): Started dbnya2.mycompany.com
pengine[10338]: 2008/05/19_10:36:52 notice: native_print: pgsql.myapp.nfslock (lsb:nfslock): Started dbnya2.mycompany.com
pengine[10338]: 2008/05/19_10:36:52 notice: native_print: pgsql.myapp.nfs (lsb:nfs-mycompany): Started dbnya2.mycompany.com
pengine[10338]: 2008/05/19_10:36:52 notice: native_print: pgsql.myapp.pgsql (heartbeat::ocf:pgsql): Started dbnya2.mycompany.com
pengine[10338]: 2008/05/19_10:36:52 notice: StopRsc: dbnya2.mycompany.com Stop pgsql.myapp.ip
pengine[10338]: 2008/05/19_10:36:52 notice: StopRsc: dbnya2.mycompany.com Stop pgsql.myapp.fsData
pengine[10338]: 2008/05/19_10:36:52 notice: StopRsc: dbnya2.mycompany.com Stop pgsql.myapp.fsTxnLog
pengine[10338]: 2008/05/19_10:36:52 notice: StopRsc: dbnya2.mycompany.com Stop pgsql.myapp.nfslock
pengine[10338]: 2008/05/19_10:36:52 notice: StopRsc: dbnya2.mycompany.com Stop pgsql.myapp.nfs
pengine[10338]: 2008/05/19_10:36:52 notice: StopRsc: dbnya2.mycompany.com Stop pgsql.myapp.pgsql
crmd[8999]: 2008/05/19_10:36:52 info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=route_message ]
tengine[10337]: 2008/05/19_10:36:52 info: unpack_graph: Unpacked transition 4: 9 actions in 9 synapses
tengine[10337]: 2008/05/19_10:36:52 info: te_pseudo_action: Pseudo action 18 fired and confirmed
tengine[10337]: 2008/05/19_10:36:52 info: send_rsc_command: Initiating action 14: pgsql.myapp.pgsql_stop_0 on dbnya2.mycompany.com
crmd[8999]: 2008/05/19_10:36:52 info: do_lrm_rsc_op: Performing op=pgsql.myapp.pgsql_stop_0 key=14:4:60fbb97d-8b53-401c-85c6-7f7e57de9b00)
lrmd[8996]: 2008/05/19_10:36:52 info: rsc:pgsql.myapp.pgsql: stop
pengine[10338]: 2008/05/19_10:36:52 info: process_pe_message: Transition 4: PEngine Input stored in: /var/lib/heartbeat/pengine/pe-input-144.bz2
pgsql[19654][19679]: 2008/05/19_10:36:53 INFO: PostgreSQL is down
crmd[8999]: 2008/05/19_10:36:54 info: process_lrm_event: LRM operation pgsql.myapp.pgsql_stop_0 (call=27, rc=0) complete
tengine[10337]: 2008/05/19_10:36:54 info: match_graph_event: Action pgsql.myapp.pgsql_stop_0 (14) confirmed on dbnya2.mycompany.com (rc=0)
tengine[10337]: 2008/05/19_10:36:54 info: send_rsc_command: Initiating action 12: pgsql.myapp.nfs_stop_0 on dbnya2.mycompany.com
crmd[8999]: 2008/05/19_10:36:54 info: do_lrm_rsc_op: Performing op=pgsql.myapp.nfs_stop_0 key=12:4:60fbb97d-8b53-401c-85c6-7f7e57de9b00)
lrmd[8996]: 2008/05/19_10:36:54 info: rsc:pgsql.myapp.nfs: stop
lrmd[19683]: 2008/05/19_10:36:54 WARN: For LSB init script, no additional parameters are needed.
lrmd[8996]: 2008/05/19_10:36:54 info: RA output: (pgsql.myapp.nfs:stop:stdout) Shutting down NFS mountd:
lrmd[8996]: 2008/05/19_10:36:55 info: RA output: (pgsql.myapp.nfs:stop:stdout) [
lrmd[8996]: 2008/05/19_10:36:55 info: RA output: (pgsql.myapp.nfs:stop:stdout) OK ]
lrmd[8996]: 2008/05/19_10:36:55 info: RA output: (pgsql.myapp.nfs:stop:stdout)
lrmd[8996]: 2008/05/19_10:36:55 info: RA output: (pgsql.myapp.nfs:stop:stdout)
lrmd[8996]: 2008/05/19_10:36:55 info: RA output: (pgsql.myapp.nfs:stop:stdout) Shutting down NFS daemon:
lrmd[8996]: 2008/05/19_10:36:55 info: RA output: (pgsql.myapp.nfs:stop:stdout) [
lrmd[8996]: 2008/05/19_10:36:55 info: RA output: (pgsql.myapp.nfs:stop:stdout) OK
lrmd[8996]: 2008/05/19_10:36:55 info: RA output: (pgsql.myapp.nfs:stop:stdout) ]
lrmd[8996]: 2008/05/19_10:36:55 info: RA output: (pgsql.myapp.nfs:stop:stdout)
lrmd[8996]: 2008/05/19_10:36:58 info: RA output: (pgsql.myapp.nfs:stop:stdout) nfsd (pid 19578 19577 19576 19575 19574 19573 19572 19571) is running...
lrmd[8996]: 2008/05/19_10:36:58 info: RA output: (pgsql.myapp.nfs:stop:stdout) Force-killing nfs daemon:
lrmd[8996]: 2008/05/19_10:37:02 info: RA output: (pgsql.myapp.nfs:stop:stdout) [
lrmd[8996]: 2008/05/19_10:37:02 info: RA output: (pgsql.myapp.nfs:stop:stdout) OK ]
lrmd[8996]: 2008/05/19_10:37:02 info: RA output: (pgsql.myapp.nfs:stop:stdout)
lrmd[8996]: 2008/05/19_10:37:02 info: RA output: (pgsql.myapp.nfs:stop:stdout) Shutting down NFS quotas:
lrmd[8996]: 2008/05/19_10:37:02 info: RA output: (pgsql.myapp.nfs:stop:stdout) [
lrmd[8996]: 2008/05/19_10:37:02 info: RA output: (pgsql.myapp.nfs:stop:stdout) OK ]
lrmd[8996]: 2008/05/19_10:37:02 info: RA output: (pgsql.myapp.nfs:stop:stdout)
lrmd[8996]: 2008/05/19_10:37:02 info: RA output: (pgsql.myapp.nfs:stop:stdout) Shutting down NFS services:
lrmd[8996]: 2008/05/19_10:37:02 info: RA output: (pgsql.myapp.nfs:stop:stdout) [
lrmd[8996]: 2008/05/19_10:37:02 info: RA output: (pgsql.myapp.nfs:stop:stdout) OK
lrmd[8996]: 2008/05/19_10:37:02 info: RA output: (pgsql.myapp.nfs:stop:stdout) ]
lrmd[8996]: 2008/05/19_10:37:02 info: RA output: (pgsql.myapp.nfs:stop:stdout)
lrmd[8996]: 2008/05/19_10:37:02 info: RA output: (pgsql.myapp.nfs:stop:stdout)
crmd[8999]: 2008/05/19_10:37:02 info: process_lrm_event: LRM operation pgsql.myapp.nfs_stop_0 (call=28, rc=0) complete
tengine[10337]: 2008/05/19_10:37:02 info: match_graph_event: Action pgsql.myapp.nfs_stop_0 (12) confirmed on dbnya2.mycompany.com (rc=0)
tengine[10337]: 2008/05/19_10:37:02 info: send_rsc_command: Initiating action 10: pgsql.myapp.nfslock_stop_0 on dbnya2.mycompany.com
crmd[8999]: 2008/05/19_10:37:02 info: do_lrm_rsc_op: Performing op=pgsql.myapp.nfslock_stop_0 key=10:4:60fbb97d-8b53-401c-85c6-7f7e57de9b00)
lrmd[8996]: 2008/05/19_10:37:02 info: rsc:pgsql.myapp.nfslock: stop
lrmd[19726]: 2008/05/19_10:37:02 WARN: For LSB init script, no additional parameters are needed.
lrmd[8996]: 2008/05/19_10:37:02 info: RA output: (pgsql.myapp.nfslock:stop:stdout) Stopping NFS statd:
lrmd[8996]: 2008/05/19_10:37:02 info: RA output: (pgsql.myapp.nfslock:stop:stdout) [
lrmd[8996]: 2008/05/19_10:37:02 info: RA output: (pgsql.myapp.nfslock:stop:stdout) OK ]
lrmd[8996]: 2008/05/19_10:37:02 info: RA output: (pgsql.myapp.nfslock:stop:stdout)
crmd[8999]: 2008/05/19_10:37:02 info: process_lrm_event: LRM operation pgsql.myapp.nfslock_stop_0 (call=29, rc=0) complete
tengine[10337]: 2008/05/19_10:37:02 info: match_graph_event: Action pgsql.myapp.nfslock_stop_0 (10) confirmed on dbnya2.mycompany.com (rc=0)
tengine[10337]: 2008/05/19_10:37:02 info: send_rsc_command: Initiating action 8: pgsql.myapp.fsTxnLog_stop_0 on dbnya2.mycompany.com
crmd[8999]: 2008/05/19_10:37:02 info: do_lrm_rsc_op: Performing op=pgsql.myapp.fsTxnLog_stop_0 key=8:4:60fbb97d-8b53-401c-85c6-7f7e57de9b00)
lrmd[8996]: 2008/05/19_10:37:02 info: rsc:pgsql.myapp.fsTxnLog: stop
Filesystem[19747][19777]: 2008/05/19_10:37:02 INFO: Running stop for /dev/mapper/md3000-txnlogp1 on /opt/data/md3000/txnlog
Filesystem[19747][19787]: 2008/05/19_10:37:02 INFO: Trying to unmount /opt/data/md3000/txnlog
Filesystem[19747][19793]: 2008/05/19_10:37:02 INFO: unmounted /opt/data/md3000/txnlog successfully
crmd[8999]: 2008/05/19_10:37:02 info: process_lrm_event: LRM operation pgsql.myapp.fsTxnLog_stop_0 (call=30, rc=0) complete
tengine[10337]: 2008/05/19_10:37:02 info: match_graph_event: Action pgsql.myapp.fsTxnLog_stop_0 (8) confirmed on dbnya2.mycompany.com (rc=0)
tengine[10337]: 2008/05/19_10:37:02 info: send_rsc_command: Initiating action 6: pgsql.myapp.fsData_stop_0 on dbnya2.mycompany.com
crmd[8999]: 2008/05/19_10:37:02 info: do_lrm_rsc_op: Performing op=pgsql.myapp.fsData_stop_0 key=6:4:60fbb97d-8b53-401c-85c6-7f7e57de9b00)
lrmd[8996]: 2008/05/19_10:37:02 info: rsc:pgsql.myapp.fsData: stop
Filesystem[19808][19838]: 2008/05/19_10:37:02 INFO: Running stop for /dev/mapper/md3000-datap1 on /opt/data/md3000/data
Filesystem[19808][19848]: 2008/05/19_10:37:02 INFO: Trying to unmount /opt/data/md3000/data
Filesystem[19808][19859]: 2008/05/19_10:37:02 INFO: unmounted /opt/data/md3000/data successfully
crmd[8999]: 2008/05/19_10:37:02 info: process_lrm_event: LRM operation pgsql.myapp.fsData_stop_0 (call=31, rc=0) complete
tengine[10337]: 2008/05/19_10:37:02 info: match_graph_event: Action pgsql.myapp.fsData_stop_0 (6) confirmed on dbnya2.mycompany.com (rc=0)
tengine[10337]: 2008/05/19_10:37:02 info: send_rsc_command: Initiating action 4: pgsql.myapp.ip_stop_0 on dbnya2.mycompany.com
crmd[8999]: 2008/05/19_10:37:02 info: do_lrm_rsc_op: Performing op=pgsql.myapp.ip_stop_0 key=4:4:60fbb97d-8b53-401c-85c6-7f7e57de9b00)
lrmd[8996]: 2008/05/19_10:37:02 info: rsc:pgsql.myapp.ip: stop
lrmd[8996]: 2008/05/19_10:37:02 info: RA output: (pgsql.myapp.ip:stop:stdout) In IP Stop
lrmd[8996]: 2008/05/19_10:37:02 info: RA output: (pgsql.myapp.ip:stop:stderr) SIOCDELRT: No such process
IPaddr[19869][19884]: 2008/05/19_10:37:02 INFO: ifconfig bond0:0 down
crmd[8999]: 2008/05/19_10:37:02 info: process_lrm_event: LRM operation pgsql.myapp.ip_stop_0 (call=32, rc=0) complete
tengine[10337]: 2008/05/19_10:37:02 info: match_graph_event: Action pgsql.myapp.ip_stop_0 (4) confirmed on dbnya2.mycompany.com (rc=0)
tengine[10337]: 2008/05/19_10:37:02 info: te_pseudo_action: Pseudo action 19 fired and confirmed
tengine[10337]: 2008/05/19_10:37:02 info: te_pseudo_action: Pseudo action 3 fired and confirmed
crmd[8999]: 2008/05/19_10:37:02 info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_IPC_MESSAGE origin=route_message ]
tengine[10337]: 2008/05/19_10:37:02 info: run_graph: Transition 4: (Complete=9, Pending=0, Fired=0, Skipped=0, Incomplete=0)
tengine[10337]: 2008/05/19_10:37:02 info: notify_crmd: Transition 4 status: te_complete - <null>
heartbeat[7257]: 2008/05/19_10:37:23 WARN: node dbnya1.mycompany.com: is dead
crmd[8999]: 2008/05/19_10:37:23 notice: crmd_ha_status_callback: Status update: Node dbnya1.mycompany.com now has status [dead]
heartbeat[7257]: 2008/05/19_10:37:23 info: Link dbnya1.mycompany.com:/dev/ttyS0 dead.
heartbeat[7257]: 2008/05/19_10:37:23 info: Link dbnya1.mycompany.com:bond1 dead.
cib[8995]: 2008/05/19_10:40:45 info: cib_stats: Processed 29 operations (2758.00us average, 0% utilization) in the last 10min

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
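The two-node quorum guess in the question comes down to simple arithmetic. The Python sketch below assumes a plain strict-majority rule with one vote per node; the function names are hypothetical, and this is not Heartbeat's actual CCM quorum code (which may apply other policies, for example an external quorumd server). It only illustrates why one surviving node out of two falls short of a majority.

```python
def has_quorum(active_votes: int, total_votes: int) -> bool:
    """Assumed strict-majority rule: a partition needs more than half of all votes."""
    if total_votes <= 0 or not 0 <= active_votes <= total_votes:
        raise ValueError("need total_votes > 0 and 0 <= active_votes <= total_votes")
    # Integer form of active_votes > total_votes / 2, avoiding floats.
    return 2 * active_votes > total_votes


def min_nodes_for_quorum(total_votes: int) -> int:
    """Smallest partition that is a strict majority (one vote per node assumed)."""
    return total_votes // 2 + 1


if __name__ == "__main__":
    for total in (2, 3, 4, 5):
        survivors = total - 1  # one node stopped for maintenance
        print(f"{total} nodes, {survivors} left: quorum={has_quorum(survivors, total)}, "
              f"need >= {min_nodes_for_quorum(total)}")
```

Under this assumed rule, a two-node cluster can never keep quorum after losing a node, while a three-node cluster can survive the loss of one.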
