Hi,

On Wed, Dec 31, 2008 at 12:30:41AM -0800, Mike Sweetser - Adhost wrote:
> We just had one of our two Heartbeat servers - in this case,
> the primary server - do an emergency reboot earlier tonight,
> and I'm confused as to why.   Here's the sanitized version of
> the logs from right before the reboot occurred. Can somebody
> tell me what happened and what I can do to make sure it doesn't
> happen again?

The reason for the reboot was that crmd encountered an
unrecoverable condition and exited. It is not clear what happened
to crmd. It could be a communication problem, though there is
nothing from the lower layer (heartbeat) in the logs. BTW, you
can prevent such reboots by replacing "crm yes" with "crm respawn"
in ha.cf, though that alone probably won't fix the underlying
problem.
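
For example, in /etc/ha.d/ha.cf (leave the rest of the file as it
is; "respawn" tells heartbeat to restart a failed crmd instead of
rebooting the node):

  crm respawn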

> The reboot in this case left the system in an
> unstable state, and it didn't fail over to the other server.

A reboot shouldn't leave your system in an unstable state. What
do you mean by "unstable"? Why was there no failover? Can you
please produce an hb_report archive? It would include the
configuration, the logs from both nodes, and all other relevant
information.
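
Something like this should do (run as root on one of the nodes;
adjust the start time and the destination directory to your
environment, and make sure ssh between the nodes works so it can
collect from the other node as well):

  hb_report -f "2008/12/31 00:00" /tmp/hb_report-dec31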

Thanks,

Dejan

> Thanks in advance!
> 
> Dec 31 00:08:11 SERVER1 crmd: [11492]: info: update_dc: Set DC to SERVER1 
> (2.0)
> Dec 31 00:08:12 SERVER1 lrmd: [11489]: info: Managed s0-drbd:monitor process 
> 17953 exited with return code 0.
> Dec 31 00:08:12 SERVER1 crmd: [11492]: info: do_state_transition: State 
> transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED 
> cause=C_FSA_INTERNAL origin=check_join_state ]
> Dec 31 00:08:12 SERVER1 crmd: [11492]: info: do_state_transition: All 1 
> cluster nodes responded to the join offer.
> Dec 31 00:08:12 SERVER1 attrd: [11491]: info: attrd_local_callback: Sending 
> full refresh
> Dec 31 00:08:12 SERVER1 cib: [11488]: info: sync_our_cib: Syncing CIB to all 
> peers
> Dec 31 00:08:12 SERVER1 crmd: [11492]: info: update_dc: Set DC to SERVER1 
> (2.0)
> Dec 31 00:08:13 SERVER1 crmd: [11492]: info: do_state_transition: State 
> transition S_FINALIZE_JOIN -> S_INTEGRATION [ input=I_JOIN_REQUEST 
> cause=C_HA_MESSAGE origin=route_message ]
> Dec 31 00:08:13 SERVER1 crmd: [11492]: info: update_dc: Unset DC SERVER1
> Dec 31 00:08:13 SERVER1 crmd: [11492]: info: do_dc_join_offer_all: join-234: 
> Waiting on 1 outstanding join acks
> Dec 31 00:08:13 SERVER1 crmd: [11492]: info: update_dc: Set DC to SERVER1 
> (2.0)
> Dec 31 00:08:13 SERVER1 crmd: [11492]: info: do_state_transition: State 
> transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED 
> cause=C_FSA_INTERNAL origin=check_join_state ]
> Dec 31 00:08:13 SERVER1 crmd: [11492]: info: do_state_transition: All 1 
> cluster nodes responded to the join offer.
> Dec 31 00:08:13 SERVER1 attrd: [11491]: info: attrd_local_callback: Sending 
> full refresh
> Dec 31 00:08:13 SERVER1 cib: [11488]: info: sync_our_cib: Syncing CIB to all 
> peers
> Dec 31 00:08:13 SERVER1 cib: [11488]: WARN: cib_peer_callback: Discarding 
> cib_replace message (2bd5ae) from SERVER2: not in our membership
> Dec 31 00:08:13 SERVER1 cib: [11488]: WARN: cib_peer_callback: Discarding 
> cib_apply_diff message (2bd5b0) from SERVER2: not in our membership
> Dec 31 00:08:13 SERVER1 crmd: [11492]: WARN: crmd_ha_msg_callback: Ignoring 
> HA message (op=join_ack_nack) from SERVER2: not in our membership list 
> (size=1)
> Dec 31 00:08:13 SERVER1 cib: [11488]: WARN: cib_peer_callback: Discarding 
> cib_apply_diff message (2bd5b2) from SERVER2: not in our membership
> Dec 31 00:08:13 SERVER1 cib: [11488]: WARN: cib_peer_callback: Discarding 
> cib_apply_diff message (2bd5b3) from SERVER2: not in our membership
> Dec 31 00:08:13 SERVER1 crmd: [11492]: ERROR: do_cl_join_finalize_respond: 
> Join join-244 with SERVER2 failed.  NACK'd
> Dec 31 00:08:13 SERVER1 crmd: [11492]: ERROR: do_log: [[FSA]] Input I_ERROR 
> from do_cl_join_finalize_respond() received in state (S_FINALIZE_JOIN)
> Dec 31 00:08:14 SERVER1 crmd: [11492]: info: do_state_transition: State 
> transition S_FINALIZE_JOIN -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL 
> origin=do_cl_join_finalize_respond ]
> Dec 31 00:08:14 SERVER1 crmd: [11492]: ERROR: do_recover: Action A_RECOVER 
> (0000000001000000) not supported
> Dec 31 00:08:14 SERVER1 crmd: [11492]: WARN: do_election_vote: Not voting in 
> election, we're in state S_RECOVERY
> Dec 31 00:08:14 SERVER1 crmd: [11492]: info: do_dc_release: DC role released
> Dec 31 00:08:14 SERVER1 crmd: [11492]: info: stop_subsystem: Sent -TERM to 
> pengine: [16623]
> Dec 31 00:08:14 SERVER1 pengine: [16623]: info: pengine_shutdown: Exiting 
> PEngine (SIGTERM)
> Dec 31 00:08:14 SERVER1 crmd: [11492]: info: stop_subsystem: Sent -TERM to 
> tengine: [16622]
> Dec 31 00:08:14 SERVER1 tengine: [16622]: info: update_abort_priority: Abort 
> action 2 superceeded by 3
> Dec 31 00:08:14 SERVER1 tengine: [16622]: info: notify_crmd: Exiting after 
> transition
> Dec 31 00:08:14 SERVER1 tengine: [16622]: info: te_init: Exiting tengine
> Dec 31 00:08:14 SERVER1 crmd: [11492]: ERROR: do_log: [[FSA]] Input 
> I_TERMINATE from do_recover() received in state (S_RECOVERY)
> Dec 31 00:08:14 SERVER1 crmd: [11492]: info: do_state_transition: State 
> transition S_RECOVERY -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL 
> origin=do_recover ] 
> Dec 31 00:08:14 SERVER1 crmd: [11492]: info: do_shutdown: Terminating the 
> pengine
> Dec 31 00:08:14 SERVER1 crmd: [11492]: info: stop_subsystem: Sent -TERM to 
> pengine: [16623]
> Dec 31 00:08:14 SERVER1 crmd: [11492]: info: do_shutdown: Terminating the 
> tengine
> Dec 31 00:08:14 SERVER1 crmd: [11492]: info: stop_subsystem: Sent -TERM to 
> tengine: [16622]
> Dec 31 00:08:14 SERVER1 crmd: [11492]: info: do_shutdown: Waiting for 
> subsystems to exit 
> Dec 31 00:08:14 SERVER1 crmd: [11492]: WARN: register_fsa_input_adv: 
> do_shutdown stalled the FSA with pending inputs
> Dec 31 00:08:14 SERVER1 crmd: [11492]: info: do_shutdown: All subsystems 
> stopped, continuing
> Dec 31 00:08:14 SERVER1 crmd: [11492]: WARN: do_log: [[FSA]] Input I_PENDING 
> from do_election_vote() received in state (S_TERMINATE)
> Dec 31 00:08:14 SERVER1 crmd: [11492]: info: do_shutdown: Terminating the 
> pengine
> Dec 31 00:08:14 SERVER1 crmd: [11492]: info: stop_subsystem: Sent -TERM to 
> pengine: [16623]
> Dec 31 00:08:14 SERVER1 crmd: [11492]: info: do_shutdown: Terminating the 
> tengine
> Dec 31 00:08:14 SERVER1 crmd: [11492]: info: stop_subsystem: Sent -TERM to 
> tengine: [16622]
> Dec 31 00:08:14 SERVER1 crmd: [11492]: info: do_shutdown: Waiting for 
> subsystems to exit 
> Dec 31 00:08:14 SERVER1 crmd: [11492]: WARN: register_fsa_input_adv: 
> do_shutdown stalled the FSA with pending inputs
> Dec 31 00:08:14 SERVER1 crmd: [11492]: info: do_shutdown: All subsystems 
> stopped, continuing
> Dec 31 00:08:14 SERVER1 crmd: [11492]: WARN: G_SIG_dispatch: Dispatch 
> function for SIGCHLD was delayed 640 ms (> 100 ms) before being called 
> (GSource: 0x9022b70)
> Dec 31 00:08:14 SERVER1 cib: [11488]: WARN: cib_peer_callback: Discarding 
> cib_replace message (2bd5b7) from SERVER2: not in our membership
> Dec 31 00:08:14 SERVER1 crmd: [11492]: info: G_SIG_dispatch: started at 
> 1008580897 should have started at 1008580833
> Dec 31 00:08:14 SERVER1 cib: [11488]: WARN: cib_peer_callback: Discarding 
> cib_apply_diff message (2bd5b9) from SERVER2: not in our membership
> Dec 31 00:08:14 SERVER1 crmd: [11492]: info: crmdManagedChildDied: Process 
> tengine:[16622] exited (signal=0, exitcode=0)
> Dec 31 00:08:15 SERVER1 cib: [11488]: WARN: cib_peer_callback: Discarding 
> cib_apply_diff message (2bd5ba) from SERVER2: not in our membership
> Dec 31 00:08:15 SERVER1 crmd: [11492]: info: crmdManagedChildDied: Process 
> pengine:[16623] exited (signal=0, exitcode=0) 
> Dec 31 00:08:15 SERVER1 crmd: [11492]: WARN: G_SIG_dispatch: Dispatch 
> function for SIGCHLD took too long to execute: 170 ms (> 30 ms) (GSource: 
> 0x9022b70) 
> Dec 31 00:08:15 SERVER1 crmd: [11492]: WARN: do_log: [[FSA]] Input 
> I_RELEASE_SUCCESS from do_dc_release() received in state (S_TERMINATE)
> Dec 31 00:08:15 SERVER1 crmd: [11492]: info: do_shutdown: All subsystems 
> stopped, continuing
> Dec 31 00:08:15 SERVER1 crmd: [11492]: ERROR: verify_stopped: 6 pending LRM 
> operations at shutdown
> Dec 31 00:08:15 SERVER1 crmd: [11492]: ERROR: ghash_print_pending: Pending 
> action: s0-fs:34
> Dec 31 00:08:15 SERVER1 crmd: [11492]: ERROR: ghash_print_pending: Pending 
> action: s0-drbd:27
> Dec 31 00:08:15 SERVER1 crmd: [11492]: ERROR: ghash_print_pending: Pending 
> action: r1-fs:40
> Dec 31 00:08:15 SERVER1 crmd: [11492]: ERROR: ghash_print_pending: Pending 
> action: r2-drbd:38
> Dec 31 00:08:15 SERVER1 crmd: [11492]: ERROR: ghash_print_pending: Pending 
> action: r2-fs:45
> Dec 31 00:08:15 SERVER1 crmd: [11492]: ERROR: ghash_print_pending: Pending 
> action: r1-drbd:31
> Dec 31 00:08:15 SERVER1 crmd: [11492]: info: do_lrm_control: Disconnected 
> from the LRM
> Dec 31 00:08:15 SERVER1 ccm: [11487]: info: client (pid=11492) removed from 
> ccm
> Dec 31 00:08:15 SERVER1 crmd: [11492]: info: do_ha_control: Disconnected from 
> Heartbeat
> Dec 31 00:08:15 SERVER1 heartbeat: [11474]: WARN: G_CH_dispatch_int: Dispatch 
> function for API client took too long to execute: 140 ms (> 100 ms) (GSource: 
> 0x9bfd678)
> Dec 31 00:08:15 SERVER1 crmd: [11492]: info: do_cib_control: Disconnecting CIB
> Dec 31 00:08:15 SERVER1 crmd: [11492]: info: crmd_cib_connection_destroy: 
> Connection to the CIB terminated...
> Dec 31 00:08:15 SERVER1 cib: [11488]: info: cib_process_readwrite: We are now 
> in R/O mode
> Dec 31 00:08:15 SERVER1 cib: [11488]: WARN: send_via_callback_channel: Client 
> b50a170d-fb93-4f5a-93f4-34f0b097dd19 has disconnected
> Dec 31 00:08:15 SERVER1 cib: [11488]: WARN: do_local_notify: A-Sync reply to 
> 11492 failed: client left before we could send reply
> Dec 31 00:08:15 SERVER1 cib: [11488]: WARN: cib_peer_callback: Discarding 
> cib_apply_diff message (2bd5bc) from SERVER2: not in our membership
> Dec 31 00:08:15 SERVER1 crmd: [11492]: ERROR: verify_stopped: 6 pending LRM 
> operations at shutdown
> Dec 31 00:08:15 SERVER1 crmd: [11492]: ERROR: ghash_print_pending: Pending 
> action: s0-fs:34
> Dec 31 00:08:15 SERVER1 crmd: [11492]: ERROR: ghash_print_pending: Pending 
> action: s0-drbd:27
> Dec 31 00:08:15 SERVER1 crmd: [11492]: ERROR: ghash_print_pending: Pending 
> action: r1-fs:40
> Dec 31 00:08:16 SERVER1 crmd: [11492]: ERROR: ghash_print_pending: Pending 
> action: r2-drbd:38
> Dec 31 00:08:16 SERVER1 crmd: [11492]: ERROR: ghash_print_pending: Pending 
> action: r2-fs:45
> Dec 31 00:08:16 SERVER1 crmd: [11492]: ERROR: ghash_print_pending: Pending 
> action: r1-drbd:31
> Dec 31 00:08:16 SERVER1 crmd: [11492]: info: do_exit: Performing A_EXIT_0 - 
> gracefully exiting the CRMd
> Dec 31 00:08:16 SERVER1 crmd: [11492]: ERROR: do_exit: Could not recover from 
> internal error
> Dec 31 00:08:16 SERVER1 crmd: [11492]: info: free_mem: Dropping 
> I_JOIN_RESULT: [ state=S_TERMINATE cause=C_HA_MESSAGE origin=route_message ]
> Dec 31 00:08:16 SERVER1 crmd: [11492]: info: free_mem: Dropping I_TERMINATE: 
> [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_stop ]
> Dec 31 00:08:16 SERVER1 crmd: [11492]: info: do_exit: [crmd] stopped (2)
> Dec 31 00:08:16 SERVER1 heartbeat: [11474]: WARN: Managed 
> /usr/lib/heartbeat/crmd process 11492 exited with return code 2.
> Dec 31 00:08:16 SERVER1 heartbeat: [11474]: ERROR: Client 
> /usr/lib/heartbeat/crmd exited with return code 2.
> Dec 31 00:08:16 SERVER1 heartbeat: [11474]: EMERG: Rebooting system.  Reason: 
> /usr/lib/heartbeat/crmd
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
