We just had one of our two Heartbeat servers - the primary, in this case - do an emergency reboot earlier tonight, and I'm confused as to why. Below is a sanitized copy of the logs from right before the reboot occurred. Can somebody tell me what happened, and what I can do to make sure it doesn't happen again? The reboot left the system in an unstable state, and the cluster didn't fail over to the other server.
Thanks in advance!

Dec 31 00:08:11 SERVER1 crmd: [11492]: info: update_dc: Set DC to SERVER1 (2.0)
Dec 31 00:08:12 SERVER1 lrmd: [11489]: info: Managed s0-drbd:monitor process 17953 exited with return code 0.
Dec 31 00:08:12 SERVER1 crmd: [11492]: info: do_state_transition: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
Dec 31 00:08:12 SERVER1 crmd: [11492]: info: do_state_transition: All 1 cluster nodes responded to the join offer.
Dec 31 00:08:12 SERVER1 attrd: [11491]: info: attrd_local_callback: Sending full refresh
Dec 31 00:08:12 SERVER1 cib: [11488]: info: sync_our_cib: Syncing CIB to all peers
Dec 31 00:08:12 SERVER1 crmd: [11492]: info: update_dc: Set DC to SERVER1 (2.0)
Dec 31 00:08:13 SERVER1 crmd: [11492]: info: do_state_transition: State transition S_FINALIZE_JOIN -> S_INTEGRATION [ input=I_JOIN_REQUEST cause=C_HA_MESSAGE origin=route_message ]
Dec 31 00:08:13 SERVER1 crmd: [11492]: info: update_dc: Unset DC SERVER1
Dec 31 00:08:13 SERVER1 crmd: [11492]: info: do_dc_join_offer_all: join-234: Waiting on 1 outstanding join acks
Dec 31 00:08:13 SERVER1 crmd: [11492]: info: update_dc: Set DC to SERVER1 (2.0)
Dec 31 00:08:13 SERVER1 crmd: [11492]: info: do_state_transition: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
Dec 31 00:08:13 SERVER1 crmd: [11492]: info: do_state_transition: All 1 cluster nodes responded to the join offer.
Dec 31 00:08:13 SERVER1 attrd: [11491]: info: attrd_local_callback: Sending full refresh
Dec 31 00:08:13 SERVER1 cib: [11488]: info: sync_our_cib: Syncing CIB to all peers
Dec 31 00:08:13 SERVER1 cib: [11488]: WARN: cib_peer_callback: Discarding cib_replace message (2bd5ae) from SERVER2: not in our membership
Dec 31 00:08:13 SERVER1 cib: [11488]: WARN: cib_peer_callback: Discarding cib_apply_diff message (2bd5b0) from SERVER2: not in our membership
Dec 31 00:08:13 SERVER1 crmd: [11492]: WARN: crmd_ha_msg_callback: Ignoring HA message (op=join_ack_nack) from SERVER2: not in our membership list (size=1)
Dec 31 00:08:13 SERVER1 cib: [11488]: WARN: cib_peer_callback: Discarding cib_apply_diff message (2bd5b2) from SERVER2: not in our membership
Dec 31 00:08:13 SERVER1 cib: [11488]: WARN: cib_peer_callback: Discarding cib_apply_diff message (2bd5b3) from SERVER2: not in our membership
Dec 31 00:08:13 SERVER1 crmd: [11492]: ERROR: do_cl_join_finalize_respond: Join join-244 with SERVER2 failed. NACK'd
Dec 31 00:08:13 SERVER1 crmd: [11492]: ERROR: do_log: [[FSA]] Input I_ERROR from do_cl_join_finalize_respond() received in state (S_FINALIZE_JOIN)
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: do_state_transition: State transition S_FINALIZE_JOIN -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=do_cl_join_finalize_respond ]
Dec 31 00:08:14 SERVER1 crmd: [11492]: ERROR: do_recover: Action A_RECOVER (0000000001000000) not supported
Dec 31 00:08:14 SERVER1 crmd: [11492]: WARN: do_election_vote: Not voting in election, we're in state S_RECOVERY
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: do_dc_release: DC role released
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: stop_subsystem: Sent -TERM to pengine: [16623]
Dec 31 00:08:14 SERVER1 pengine: [16623]: info: pengine_shutdown: Exiting PEngine (SIGTERM)
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: stop_subsystem: Sent -TERM to tengine: [16622]
Dec 31 00:08:14 SERVER1 tengine: [16622]: info: update_abort_priority: Abort action 2 superceeded by 3
Dec 31 00:08:14 SERVER1 tengine: [16622]: info: notify_crmd: Exiting after transition
Dec 31 00:08:14 SERVER1 tengine: [16622]: info: te_init: Exiting tengine
Dec 31 00:08:14 SERVER1 crmd: [11492]: ERROR: do_log: [[FSA]] Input I_TERMINATE from do_recover() received in state (S_RECOVERY)
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: do_state_transition: State transition S_RECOVERY -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_recover ]
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: do_shutdown: Terminating the pengine
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: stop_subsystem: Sent -TERM to pengine: [16623]
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: do_shutdown: Terminating the tengine
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: stop_subsystem: Sent -TERM to tengine: [16622]
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: do_shutdown: Waiting for subsystems to exit
Dec 31 00:08:14 SERVER1 crmd: [11492]: WARN: register_fsa_input_adv: do_shutdown stalled the FSA with pending inputs
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: do_shutdown: All subsystems stopped, continuing
Dec 31 00:08:14 SERVER1 crmd: [11492]: WARN: do_log: [[FSA]] Input I_PENDING from do_election_vote() received in state (S_TERMINATE)
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: do_shutdown: Terminating the pengine
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: stop_subsystem: Sent -TERM to pengine: [16623]
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: do_shutdown: Terminating the tengine
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: stop_subsystem: Sent -TERM to tengine: [16622]
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: do_shutdown: Waiting for subsystems to exit
Dec 31 00:08:14 SERVER1 crmd: [11492]: WARN: register_fsa_input_adv: do_shutdown stalled the FSA with pending inputs
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: do_shutdown: All subsystems stopped, continuing
Dec 31 00:08:14 SERVER1 crmd: [11492]: WARN: G_SIG_dispatch: Dispatch function for SIGCHLD was delayed 640 ms (> 100 ms) before being called (GSource: 0x9022b70)
Dec 31 00:08:14 SERVER1 cib: [11488]: WARN: cib_peer_callback: Discarding cib_replace message (2bd5b7) from SERVER2: not in our membership
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: G_SIG_dispatch: started at 1008580897 should have started at 1008580833
Dec 31 00:08:14 SERVER1 cib: [11488]: WARN: cib_peer_callback: Discarding cib_apply_diff message (2bd5b9) from SERVER2: not in our membership
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: crmdManagedChildDied: Process tengine:[16622] exited (signal=0, exitcode=0)
Dec 31 00:08:15 SERVER1 cib: [11488]: WARN: cib_peer_callback: Discarding cib_apply_diff message (2bd5ba) from SERVER2: not in our membership
Dec 31 00:08:15 SERVER1 crmd: [11492]: info: crmdManagedChildDied: Process pengine:[16623] exited (signal=0, exitcode=0)
Dec 31 00:08:15 SERVER1 crmd: [11492]: WARN: G_SIG_dispatch: Dispatch function for SIGCHLD took too long to execute: 170 ms (> 30 ms) (GSource: 0x9022b70)
Dec 31 00:08:15 SERVER1 crmd: [11492]: WARN: do_log: [[FSA]] Input I_RELEASE_SUCCESS from do_dc_release() received in state (S_TERMINATE)
Dec 31 00:08:15 SERVER1 crmd: [11492]: info: do_shutdown: All subsystems stopped, continuing
Dec 31 00:08:15 SERVER1 crmd: [11492]: ERROR: verify_stopped: 6 pending LRM operations at shutdown
Dec 31 00:08:15 SERVER1 crmd: [11492]: ERROR: ghash_print_pending: Pending action: s0-fs:34
Dec 31 00:08:15 SERVER1 crmd: [11492]: ERROR: ghash_print_pending: Pending action: s0-drbd:27
Dec 31 00:08:15 SERVER1 crmd: [11492]: ERROR: ghash_print_pending: Pending action: r1-fs:40
Dec 31 00:08:15 SERVER1 crmd: [11492]: ERROR: ghash_print_pending: Pending action: r2-drbd:38
Dec 31 00:08:15 SERVER1 crmd: [11492]: ERROR: ghash_print_pending: Pending action: r2-fs:45
Dec 31 00:08:15 SERVER1 crmd: [11492]: ERROR: ghash_print_pending: Pending action: r1-drbd:31
Dec 31 00:08:15 SERVER1 crmd: [11492]: info: do_lrm_control: Disconnected from the LRM
Dec 31 00:08:15 SERVER1 ccm: [11487]: info: client (pid=11492) removed from ccm
Dec 31 00:08:15 SERVER1 crmd: [11492]: info: do_ha_control: Disconnected from Heartbeat
Dec 31 00:08:15 SERVER1 heartbeat: [11474]: WARN: G_CH_dispatch_int: Dispatch function for API client took too long to execute: 140 ms (> 100 ms) (GSource: 0x9bfd678)
Dec 31 00:08:15 SERVER1 crmd: [11492]: info: do_cib_control: Disconnecting CIB
Dec 31 00:08:15 SERVER1 crmd: [11492]: info: crmd_cib_connection_destroy: Connection to the CIB terminated...
Dec 31 00:08:15 SERVER1 cib: [11488]: info: cib_process_readwrite: We are now in R/O mode
Dec 31 00:08:15 SERVER1 cib: [11488]: WARN: send_via_callback_channel: Client b50a170d-fb93-4f5a-93f4-34f0b097dd19 has disconnected
Dec 31 00:08:15 SERVER1 cib: [11488]: WARN: do_local_notify: A-Sync reply to 11492 failed: client left before we could send reply
Dec 31 00:08:15 SERVER1 cib: [11488]: WARN: cib_peer_callback: Discarding cib_apply_diff message (2bd5bc) from SERVER2: not in our membership
Dec 31 00:08:15 SERVER1 crmd: [11492]: ERROR: verify_stopped: 6 pending LRM operations at shutdown
Dec 31 00:08:15 SERVER1 crmd: [11492]: ERROR: ghash_print_pending: Pending action: s0-fs:34
Dec 31 00:08:15 SERVER1 crmd: [11492]: ERROR: ghash_print_pending: Pending action: s0-drbd:27
Dec 31 00:08:15 SERVER1 crmd: [11492]: ERROR: ghash_print_pending: Pending action: r1-fs:40
Dec 31 00:08:16 SERVER1 crmd: [11492]: ERROR: ghash_print_pending: Pending action: r2-drbd:38
Dec 31 00:08:16 SERVER1 crmd: [11492]: ERROR: ghash_print_pending: Pending action: r2-fs:45
Dec 31 00:08:16 SERVER1 crmd: [11492]: ERROR: ghash_print_pending: Pending action: r1-drbd:31
Dec 31 00:08:16 SERVER1 crmd: [11492]: info: do_exit: Performing A_EXIT_0 - gracefully exiting the CRMd
Dec 31 00:08:16 SERVER1 crmd: [11492]: ERROR: do_exit: Could not recover from internal error
Dec 31 00:08:16 SERVER1 crmd: [11492]: info: free_mem: Dropping I_JOIN_RESULT: [ state=S_TERMINATE cause=C_HA_MESSAGE origin=route_message ]
Dec 31 00:08:16 SERVER1 crmd: [11492]: info: free_mem: Dropping I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_stop ]
Dec 31 00:08:16 SERVER1 crmd: [11492]: info: do_exit: [crmd] stopped (2)
Dec 31 00:08:16 SERVER1 heartbeat: [11474]: WARN: Managed /usr/lib/heartbeat/crmd process 11492 exited with return code 2.
Dec 31 00:08:16 SERVER1 heartbeat: [11474]: ERROR: Client /usr/lib/heartbeat/crmd exited with return code 2.
Dec 31 00:08:16 SERVER1 heartbeat: [11474]: EMERG: Rebooting system. Reason: /usr/lib/heartbeat/crmd

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
