I'm at my wits' end trying to work out why random nodes in my cluster sometimes
reboot (an EMERG shutdown followed by a hard reboot) while others start fine.
 
I've attached my ha.cf, cib.xml and the latest heartbeat_debug file.
 
These are the crmd and mgmtd logs from just before a node reboots.  crmd seems 
to be unable to complete CIB registration.  What could cause this?  Is there a 
timer somewhere that I need to increase?
 
crmd[3027]: 2008/06/25_22:57:25 info: crm_timer_popped: Wait Timer (I_NULL) 
just popped!
mgmtd[3077]: 2008/06/25_22:57:26 info: init_crm
crmd[3027]: 2008/06/25_22:57:26 WARN: do_cib_control: Couldn't complete CIB 
registration 27 times.... pause and retry
mgmtd[3077]: 2008/06/25_22:57:27 info: login to cib: 0, ret:-10
mgmtd[3077]: 2008/06/25_22:57:28 info: login to cib: 1, ret:-10
crmd[3027]: 2008/06/25_22:57:29 info: crm_timer_popped: Wait Timer (I_NULL) 
just popped!
mgmtd[3077]: 2008/06/25_22:57:30 info: login to cib: 2, ret:-10
crmd[3027]: 2008/06/25_22:57:30 WARN: do_cib_control: Couldn't complete CIB 
registration 28 times... pause and retry
mgmtd[3077]: 2008/06/25_22:57:31 info: login to cib: 3, ret:-10
mgmtd[3077]: 2008/06/25_22:57:32 info: login to cib: 4, ret:-10
crmd[3027]: 2008/06/25_22:57:32 info: crm_timer_popped: Wait Timer (I_NULL) 
just popped!
mgmtd[3077]: 2008/06/25_22:57:33 info: login to cib failed
crmd[3027]: 2008/06/25_22:57:34 WARN: do_cib_control: Couldn't complete CIB 
registration 29 times... pause and retry
mgmtd[3077]: 2008/06/25_22:57:34 ERROR: Can't initialize management 
library.Shutting down.(-1)
heartbeat[2798]: 2008/06/25_22:57:35 WARN: Managed /usr/lib/heartbeat/mgmtd -v 
process 3077 exited with return code 1.
heartbeat[2798]: 2008/06/25_22:57:36 ERROR: Respawning client 
"/usr/lib/heartbeat/mgmtd -v":
crmd[3027]: 2008/06/25_22:57:37 info: crm_timer_popped: Wait Timer (I_NULL) 
just popped!
heartbeat[2798]: 2008/06/25_22:57:37 info: Starting child client 
"/usr/lib/heartbeat/mgmtd -v" (0,0)
crmd[3027]: 2008/06/25_22:57:38 WARN: do_cib_control: Couldn't complete CIB 
registration 30 times... pause and retry
heartbeat[3079]: 2008/06/25_22:57:38 info: Starting "/usr/lib/heartbeat/mgmtd 
-v" as uid 0  gid 0 (pid 3079)
crmd[3027]: 2008/06/25_22:57:39 ERROR: do_cib_control: Could not complete CIB 
registration  30 times... hard error
mgmtd[3079]: 2008/06/25_22:57:39 info: G_main_add_SignalHandler: Added signal 
handler for signal 15
crmd[3027]: 2008/06/25_22:57:40 ERROR: do_log: [[FSA]] Input I_ERROR from 
do_cib_control() received in state (S_STARTING)
mgmtd[3079]: 2008/06/25_22:57:40 debug: Enabling coredumps
crmd[3027]: 2008/06/25_22:57:41 info: do_state_transition: State transition 
S_STARTING -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL 
origin=do_cib_control ]
mgmtd[3079]: 2008/06/25_22:57:42 info: G_main_add_SignalHandler: Added signal 
handler for signal 10
crmd[3027]: 2008/06/25_22:57:42 ERROR: do_recover: Action A_RECOVER 
(0000000001000000) not supported
mgmtd[3079]: 2008/06/25_22:57:42 info: G_main_add_SignalHandler: Added signal 
handler for signal 12
crmd[3027]: 2008/06/25_22:57:44 info: register_heartbeat_conn: Hostname: 
dal-xcp-21.prodea-lo.net
mgmtd[3079]: 2008/06/25_22:57:44 info: init_crm
crmd[3027]: 2008/06/25_22:57:45 info: register_heartbeat_conn: UUID: 
971ee6cd-9cf4-429c-b157-b8b1a75a1346
mgmtd[3079]: 2008/06/25_22:57:45 info: login to cib: 0, ret:-10
crmd[3027]: 2008/06/25_22:57:46 info: populate_cib_nodes_ha: Requesting the 
list of configured nodes
mgmtd[3079]: 2008/06/25_22:57:46 info: login to cib: 1, ret:-10
crmd[3027]: 2008/06/25_22:57:47 notice: populate_cib_nodes_ha: Node: 
dal-xcp-12.prodea-lo.net (uuid: a9439efd-141c-4c85-b109-94800d8d18f2)
crmd[3027]: 2008/06/25_22:57:47 notice: populate_cib_nodes_ha: Node: 
dal-xcp-21.prodea-lo.net (uuid: 971ee6cd-9cf4-429c-b157-b8b1a75a1346)
crmd[3027]: 2008/06/25_22:57:48 notice: populate_cib_nodes_ha: Node: 
dal-xcp-11.prodea-lo.net (uuid: 490a3ace-9bd2-448f-afc7-8792ccd0c598)
mgmtd[3079]: 2008/06/25_22:57:48 info: login to cib: 2, ret:-10
crmd[3027]: 2008/06/25_22:57:48 WARN: add_cib_op_callback: CIB call failed: not 
connected
crmd[3027]: 2008/06/25_22:57:49 ERROR: default_cib_update_callback: CIB Update 
failed: not connected
crmd[3027]: 2008/06/25_22:57:50 WARN: print_xml_formatted: 
default_cib_update_callback: update:failed: NULL
mgmtd[3079]: 2008/06/25_22:57:50 info: login to cib: 3, ret:-10
crmd[3027]: 2008/06/25_22:57:51 info: do_ha_control: Connected to Heartbeat
crmd[3027]: 2008/06/25_22:57:51 WARN: do_fsa_action: Action A_HA_CONNECT took 
8540ms to complete
mgmtd[3079]: 2008/06/25_22:57:51 info: login to cib: 4, ret:-10
crmd[3027]: 2008/06/25_22:57:52 ERROR: do_log: [[FSA]] Input I_ERROR from 
default_cib_update_callback() received in state (S_RECOVERY)
crmd[3027]: 2008/06/25_22:57:52 info: do_dc_release: DC role released
crmd[3027]: 2008/06/25_22:57:53 WARN: add_cib_op_callback: CIB call failed: not 
connected
mgmtd[3079]: 2008/06/25_22:57:53 info: login to cib failed
mgmtd[3079]: 2008/06/25_22:57:53 ERROR: Can't initialize management 
library.Shutting down.(-1)
crmd[3027]: 2008/06/25_22:57:54 ERROR: config_query_callback: Local CIB query 
resulted in an error: not connected
heartbeat[2798]: 2008/06/25_22:57:54 WARN: Managed /usr/lib/heartbeat/mgmtd -v 
process 3079 exited with return code 1.
crmd[3027]: 2008/06/25_22:57:54 ERROR: do_log: [[FSA]] Input I_ERROR from 
config_query_callback() received in state (S_RECOVERY)
heartbeat[2798]: 2008/06/25_22:57:54 ERROR: Respawning client 
"/usr/lib/heartbeat/mgmtd -v":
crmd[3027]: 2008/06/25_22:57:55 info: do_dc_release: DC role released
heartbeat[2798]: 2008/06/25_22:57:56 info: Starting child client 
"/usr/lib/heartbeat/mgmtd -v" (0,0)
crmd[3027]: 2008/06/25_22:57:56 info: do_ccm_control: CCM connection 
established... waiting for first callback
crmd[3027]: 2008/06/25_22:57:57 ERROR: do_started: Start cancelled... S_RECOVERY
crmd[3027]: 2008/06/25_22:57:57 ERROR: do_log: [[FSA]] Input I_TERMINATE from 
do_recover() received in state (S_RECOVERY)
heartbeat[3081]: 2008/06/25_22:57:57 info: Starting "/usr/lib/heartbeat/mgmtd 
-v" as uid 0  gid 0 (pid 3081)
crmd[3027]: 2008/06/25_22:57:58 info: do_state_transition: State transition 
S_RECOVERY -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL 
origin=do_recover ]
mgmtd[3081]: 2008/06/25_22:57:58 info: G_main_add_SignalHandler: Added signal 
handler for signal 15
crmd[3027]: 2008/06/25_22:57:58 info: do_shutdown: All subsystems stopped, 
continuing
mgmtd[3081]: 2008/06/25_22:57:58 debug: Enabling coredumps
crmd[3027]: 2008/06/25_22:57:59 info: do_lrm_control: Disconnected from the LRM
mgmtd[3081]: 2008/06/25_22:57:59 info: G_main_add_SignalHandler: Added signal 
handler for signal 10
crmd[3027]: 2008/06/25_22:57:59 info: do_ha_control: Disconnected from Heartbeat
mgmtd[3081]: 2008/06/25_22:58:00 info: G_main_add_SignalHandler: Added signal 
handler for signal 12
crmd[3027]: 2008/06/25_22:58:00 info: do_cib_control: Disconnecting CIB
crmd[3027]: 2008/06/25_22:58:01 info: do_exit: Performing A_EXIT_0 - gracefully 
exiting the CRMd
mgmtd[3081]: 2008/06/25_22:58:01 info: init_crm
crmd[3027]: 2008/06/25_22:58:02 ERROR: do_exit: Could not recover from internal 
error
mgmtd[3081]: 2008/06/25_22:58:02 info: login to cib: 0, ret:-10
crmd[3027]: 2008/06/25_22:58:02 info: free_mem: Dropping I_RELEASE_SUCCESS: [ 
state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_dc_release ]
crmd[3027]: 2008/06/25_22:58:03 info: free_mem: Dropping I_RELEASE_SUCCESS: [ 
state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_dc_release ]
crmd[3027]: 2008/06/25_22:58:03 info: free_mem: Dropping I_TERMINATE: [ 
state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_stop ]
mgmtd[3081]: 2008/06/25_22:58:03 info: login to cib: 1, ret:-10
crmd[3027]: 2008/06/25_22:58:04 info: do_exit: [crmd] stopped (2)
heartbeat[2798]: 2008/06/25_22:58:05 WARN: Managed /usr/lib/heartbeat/crmd 
process 3027 exited with return code 2.
heartbeat[2798]: 2008/06/25_22:58:05 EMERG: Rebooting system..  Reason: 
/usr/lib/heartbeat/crmd
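
Since the crmd, mgmtd and heartbeat entries above are interleaved and wrapped by the mail client, here is a minimal Python sketch for pulling them apart. It is not part of heartbeat. It assumes every entry starts with `daemon[pid]: YYYY/MM/DD_HH:MM:SS level:` as in the excerpt, and that any line without that prefix is a wrapped continuation of the previous entry. The function names `parse_entries` and `by_daemon` are made up for illustration.

```python
import re
from collections import defaultdict

# Start of a log entry as it appears in the excerpt, e.g.
# "crmd[3027]: 2008/06/25_22:57:25 info: crm_timer_popped: ..."
ENTRY = re.compile(
    r"^(?P<daemon>\w+)\[(?P<pid>\d+)\]: "
    r"(?P<ts>\d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2}) "
    r"(?P<level>\w+): (?P<msg>.*)$"
)


def parse_entries(lines):
    """Return a list of entry dicts, re-joining mail-wrapped lines."""
    entries = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = ENTRY.match(line)
        if match:
            entries.append(match.groupdict())
        elif entries and line.strip():
            # Assumption: no "daemon[pid]:" prefix means a wrapped
            # continuation of the previous entry.
            prev = entries[-1]
            prev["msg"] = prev["msg"].rstrip() + " " + line.strip()
    return entries


def by_daemon(entries, levels=("WARN", "ERROR", "EMERG")):
    """Group entries at the given levels by daemon name, in log order."""
    grouped = defaultdict(list)
    for e in entries:
        if e["level"] in levels:
            grouped[e["daemon"]].append(f'{e["ts"]} {e["level"]}: {e["msg"]}')
    return dict(grouped)
```

Feeding the lines of the debug log to `parse_entries` and then calling `by_daemon` on the result gives one list of WARN/ERROR/EMERG messages per daemon, which makes it easier to follow the crmd retries separately from the mgmtd respawns.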

Attachment: HB-Rebooting.rar
Description: Binary data

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
