We just had one of our two Heartbeat servers - the primary, in this case - do an emergency reboot earlier tonight, and I'm confused as to why. Below is a sanitized copy of the logs from right before the reboot occurred. Can somebody tell me what happened, and what I can do to make sure it doesn't happen again? The reboot left the system in an unstable state, and the cluster didn't fail over to the other server.
Thanks in advance!

Dec 31 00:08:11 SERVER1 crmd: [11492]: info: update_dc: Set DC to SERVER1 (2.0)
Dec 31 00:08:12 SERVER1 lrmd: [11489]: info: Managed s0-drbd:monitor process 17953 exited with return code 0.
Dec 31 00:08:12 SERVER1 crmd: [11492]: info: do_state_transition: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
Dec 31 00:08:12 SERVER1 crmd: [11492]: info: do_state_transition: All 1 cluster nodes responded to the join offer.
Dec 31 00:08:12 SERVER1 attrd: [11491]: info: attrd_local_callback: Sending full refresh
Dec 31 00:08:12 SERVER1 cib: [11488]: info: sync_our_cib: Syncing CIB to all peers
Dec 31 00:08:12 SERVER1 crmd: [11492]: info: update_dc: Set DC to SERVER1 (2.0)
Dec 31 00:08:13 SERVER1 crmd: [11492]: info: do_state_transition: State transition S_FINALIZE_JOIN -> S_INTEGRATION [ input=I_JOIN_REQUEST cause=C_HA_MESSAGE origin=route_message ]
Dec 31 00:08:13 SERVER1 crmd: [11492]: info: update_dc: Unset DC SERVER1
Dec 31 00:08:13 SERVER1 crmd: [11492]: info: do_dc_join_offer_all: join-234: Waiting on 1 outstanding join acks
Dec 31 00:08:13 SERVER1 crmd: [11492]: info: update_dc: Set DC to SERVER1 (2.0)
Dec 31 00:08:13 SERVER1 crmd: [11492]: info: do_state_transition: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
Dec 31 00:08:13 SERVER1 crmd: [11492]: info: do_state_transition: All 1 cluster nodes responded to the join offer.
Dec 31 00:08:13 SERVER1 attrd: [11491]: info: attrd_local_callback: Sending full refresh
Dec 31 00:08:13 SERVER1 cib: [11488]: info: sync_our_cib: Syncing CIB to all peers
Dec 31 00:08:13 SERVER1 cib: [11488]: WARN: cib_peer_callback: Discarding cib_replace message (2bd5ae) from SERVER2: not in our membership
Dec 31 00:08:13 SERVER1 cib: [11488]: WARN: cib_peer_callback: Discarding cib_apply_diff message (2bd5b0) from SERVER2: not in our membership
Dec 31 00:08:13 SERVER1 crmd: [11492]: WARN: crmd_ha_msg_callback: Ignoring HA message (op=join_ack_nack) from SERVER2: not in our membership list (size=1)
Dec 31 00:08:13 SERVER1 cib: [11488]: WARN: cib_peer_callback: Discarding cib_apply_diff message (2bd5b2) from SERVER2: not in our membership
Dec 31 00:08:13 SERVER1 cib: [11488]: WARN: cib_peer_callback: Discarding cib_apply_diff message (2bd5b3) from SERVER2: not in our membership
Dec 31 00:08:13 SERVER1 crmd: [11492]: ERROR: do_cl_join_finalize_respond: Join join-244 with SERVER2 failed. NACK'd
Dec 31 00:08:13 SERVER1 crmd: [11492]: ERROR: do_log: [[FSA]] Input I_ERROR from do_cl_join_finalize_respond() received in state (S_FINALIZE_JOIN)
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: do_state_transition: State transition S_FINALIZE_JOIN -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=do_cl_join_finalize_respond ]
Dec 31 00:08:14 SERVER1 crmd: [11492]: ERROR: do_recover: Action A_RECOVER (0000000001000000) not supported
Dec 31 00:08:14 SERVER1 crmd: [11492]: WARN: do_election_vote: Not voting in election, we're in state S_RECOVERY
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: do_dc_release: DC role released
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: stop_subsystem: Sent -TERM to pengine: [16623]
Dec 31 00:08:14 SERVER1 pengine: [16623]: info: pengine_shutdown: Exiting PEngine (SIGTERM)
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: stop_subsystem: Sent -TERM to tengine: [16622]
Dec 31 00:08:14 SERVER1 tengine: [16622]: info: update_abort_priority: Abort action 2 superceeded by 3
Dec 31 00:08:14 SERVER1 tengine: [16622]: info: notify_crmd: Exiting after transition
Dec 31 00:08:14 SERVER1 tengine: [16622]: info: te_init: Exiting tengine
Dec 31 00:08:14 SERVER1 crmd: [11492]: ERROR: do_log: [[FSA]] Input I_TERMINATE from do_recover() received in state (S_RECOVERY)
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: do_state_transition: State transition S_RECOVERY -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_recover ]
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: do_shutdown: Terminating the pengine
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: stop_subsystem: Sent -TERM to pengine: [16623]
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: do_shutdown: Terminating the tengine
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: stop_subsystem: Sent -TERM to tengine: [16622]
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: do_shutdown: Waiting for subsystems to exit
Dec 31 00:08:14 SERVER1 crmd: [11492]: WARN: register_fsa_input_adv: do_shutdown stalled the FSA with pending inputs
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: do_shutdown: All subsystems stopped, continuing
Dec 31 00:08:14 SERVER1 crmd: [11492]: WARN: do_log: [[FSA]] Input I_PENDING from do_election_vote() received in state (S_TERMINATE)
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: do_shutdown: Terminating the pengine
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: stop_subsystem: Sent -TERM to pengine: [16623]
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: do_shutdown: Terminating the tengine
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: stop_subsystem: Sent -TERM to tengine: [16622]
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: do_shutdown: Waiting for subsystems to exit
Dec 31 00:08:14 SERVER1 crmd: [11492]: WARN: register_fsa_input_adv: do_shutdown stalled the FSA with pending inputs
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: do_shutdown: All subsystems stopped, continuing
Dec 31 00:08:14 SERVER1 crmd: [11492]: WARN: G_SIG_dispatch: Dispatch function for SIGCHLD was delayed 640 ms (> 100 ms) before being called (GSource: 0x9022b70)
Dec 31 00:08:14 SERVER1 cib: [11488]: WARN: cib_peer_callback: Discarding cib_replace message (2bd5b7) from SERVER2: not in our membership
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: G_SIG_dispatch: started at 1008580897 should have started at 1008580833
Dec 31 00:08:14 SERVER1 cib: [11488]: WARN: cib_peer_callback: Discarding cib_apply_diff message (2bd5b9) from SERVER2: not in our membership
Dec 31 00:08:14 SERVER1 crmd: [11492]: info: crmdManagedChildDied: Process tengine:[16622] exited (signal=0, exitcode=0)
Dec 31 00:08:15 SERVER1 cib: [11488]: WARN: cib_peer_callback: Discarding cib_apply_diff message (2bd5ba) from SERVER2: not in our membership
Dec 31 00:08:15 SERVER1 crmd: [11492]: info: crmdManagedChildDied: Process pengine:[16623] exited (signal=0, exitcode=0)
Dec 31 00:08:15 SERVER1 crmd: [11492]: WARN: G_SIG_dispatch: Dispatch function for SIGCHLD took too long to execute: 170 ms (> 30 ms) (GSource: 0x9022b70)
Dec 31 00:08:15 SERVER1 crmd: [11492]: WARN: do_log: [[FSA]] Input I_RELEASE_SUCCESS from do_dc_release() received in state (S_TERMINATE)
Dec 31 00:08:15 SERVER1 crmd: [11492]: info: do_shutdown: All subsystems stopped, continuing
Dec 31 00:08:15 SERVER1 crmd: [11492]: ERROR: verify_stopped: 6 pending LRM operations at shutdown
Dec 31 00:08:15 SERVER1 crmd: [11492]: ERROR: ghash_print_pending: Pending action: s0-fs:34
Dec 31 00:08:15 SERVER1 crmd: [11492]: ERROR: ghash_print_pending: Pending action: s0-drbd:27
Dec 31 00:08:15 SERVER1 crmd: [11492]: ERROR: ghash_print_pending: Pending action: r1-fs:40
Dec 31 00:08:15 SERVER1 crmd: [11492]: ERROR: ghash_print_pending: Pending action: r2-drbd:38
Dec 31 00:08:15 SERVER1 crmd: [11492]: ERROR: ghash_print_pending: Pending action: r2-fs:45
Dec 31 00:08:15 SERVER1 crmd: [11492]: ERROR: ghash_print_pending: Pending action: r1-drbd:31
Dec 31 00:08:15 SERVER1 crmd: [11492]: info: do_lrm_control: Disconnected from the LRM
Dec 31 00:08:15 SERVER1 ccm: [11487]: info: client (pid=11492) removed from ccm
Dec 31 00:08:15 SERVER1 crmd: [11492]: info: do_ha_control: Disconnected from Heartbeat
Dec 31 00:08:15 SERVER1 heartbeat: [11474]: WARN: G_CH_dispatch_int: Dispatch function for API client took too long to execute: 140 ms (> 100 ms) (GSource: 0x9bfd678)
Dec 31 00:08:15 SERVER1 crmd: [11492]: info: do_cib_control: Disconnecting CIB
Dec 31 00:08:15 SERVER1 crmd: [11492]: info: crmd_cib_connection_destroy: Connection to the CIB terminated...
Dec 31 00:08:15 SERVER1 cib: [11488]: info: cib_process_readwrite: We are now in R/O mode
Dec 31 00:08:15 SERVER1 cib: [11488]: WARN: send_via_callback_channel: Client b50a170d-fb93-4f5a-93f4-34f0b097dd19 has disconnected
Dec 31 00:08:15 SERVER1 cib: [11488]: WARN: do_local_notify: A-Sync reply to 11492 failed: client left before we could send reply
Dec 31 00:08:15 SERVER1 cib: [11488]: WARN: cib_peer_callback: Discarding cib_apply_diff message (2bd5bc) from SERVER2: not in our membership
Dec 31 00:08:15 SERVER1 crmd: [11492]: ERROR: verify_stopped: 6 pending LRM operations at shutdown
Dec 31 00:08:15 SERVER1 crmd: [11492]: ERROR: ghash_print_pending: Pending action: s0-fs:34
Dec 31 00:08:15 SERVER1 crmd: [11492]: ERROR: ghash_print_pending: Pending action: s0-drbd:27
Dec 31 00:08:15 SERVER1 crmd: [11492]: ERROR: ghash_print_pending: Pending action: r1-fs:40
Dec 31 00:08:16 SERVER1 crmd: [11492]: ERROR: ghash_print_pending: Pending action: r2-drbd:38
Dec 31 00:08:16 SERVER1 crmd: [11492]: ERROR: ghash_print_pending: Pending action: r2-fs:45
Dec 31 00:08:16 SERVER1 crmd: [11492]: ERROR: ghash_print_pending: Pending action: r1-drbd:31
Dec 31 00:08:16 SERVER1 crmd: [11492]: info: do_exit: Performing A_EXIT_0 - gracefully exiting the CRMd
Dec 31 00:08:16 SERVER1 crmd: [11492]: ERROR: do_exit: Could not recover from internal error
Dec 31 00:08:16 SERVER1 crmd: [11492]: info: free_mem: Dropping I_JOIN_RESULT: [ state=S_TERMINATE cause=C_HA_MESSAGE origin=route_message ]
Dec 31 00:08:16 SERVER1 crmd: [11492]: info: free_mem: Dropping I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_stop ]
Dec 31 00:08:16 SERVER1 crmd: [11492]: info: do_exit: [crmd] stopped (2)
Dec 31 00:08:16 SERVER1 heartbeat: [11474]: WARN: Managed /usr/lib/heartbeat/crmd process 11492 exited with return code 2.
Dec 31 00:08:16 SERVER1 heartbeat: [11474]: ERROR: Client /usr/lib/heartbeat/crmd exited with return code 2.
Dec 31 00:08:16 SERVER1 heartbeat: [11474]: EMERG: Rebooting system. Reason: /usr/lib/heartbeat/crmd

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
