I'm at my wits' end trying to work out why random nodes in my cluster sometimes reboot (an EMERG shutdown and a hard reboot) while others start fine. I've attached my ha.cf, cib.xml and the latest heartbeat_debug file. Below are the crmd and mgmtd logs from just before a reboot. crmd seems to be unable to complete CIB registration. What could cause this? Is there a timer somewhere I need to increase?

crmd[3027]: 2008/06/25_22:57:25 info: crm_timer_popped: Wait Timer (I_NULL) just popped!
mgmtd[3077]: 2008/06/25_22:57:26 info: init_crm
crmd[3027]: 2008/06/25_22:57:26 WARN: do_cib_control: Couldn't complete CIB registration 27 times.... pause and retry
mgmtd[3077]: 2008/06/25_22:57:27 info: login to cib: 0, ret:-10
mgmtd[3077]: 2008/06/25_22:57:28 info: login to cib: 1, ret:-10
crmd[3027]: 2008/06/25_22:57:29 info: crm_timer_popped: Wait Timer (I_NULL) just popped!
mgmtd[3077]: 2008/06/25_22:57:30 info: login to cib: 2, ret:-10
crmd[3027]: 2008/06/25_22:57:30 WARN: do_cib_control: Couldn't complete CIB registration 28 times... pause and retry
mgmtd[3077]: 2008/06/25_22:57:31 info: login to cib: 3, ret:-10
mgmtd[3077]: 2008/06/25_22:57:32 info: login to cib: 4, ret:-10
crmd[3027]: 2008/06/25_22:57:32 info: crm_timer_popped: Wait Timer (I_NULL) just popped!
mgmtd[3077]: 2008/06/25_22:57:33 info: login to cib failed
crmd[3027]: 2008/06/25_22:57:34 WARN: do_cib_control: Couldn't complete CIB registration 29 times... pause and retry
mgmtd[3077]: 2008/06/25_22:57:34 ERROR: Can't initialize management library.Shutting down.(-1)
heartbeat[2798]: 2008/06/25_22:57:35 WARN: Managed /usr/lib/heartbeat/mgmtd -v process 3077 exited with return code 1.
heartbeat[2798]: 2008/06/25_22:57:36 ERROR: Respawning client "/usr/lib/heartbeat/mgmtd -v":
crmd[3027]: 2008/06/25_22:57:37 info: crm_timer_popped: Wait Timer (I_NULL) just popped!
heartbeat[2798]: 2008/06/25_22:57:37 info: Starting child client "/usr/lib/heartbeat/mgmtd -v" (0,0)
crmd[3027]: 2008/06/25_22:57:38 WARN: do_cib_control: Couldn't complete CIB registration 30 times... pause and retry
heartbeat[3079]: 2008/06/25_22:57:38 info: Starting "/usr/lib/heartbeat/mgmtd -v" as uid 0 gid 0 (pid 3079)
crmd[3027]: 2008/06/25_22:57:39 ERROR: do_cib_control: Could not complete CIB registration 30 times... hard error
mgmtd[3079]: 2008/06/25_22:57:39 info: G_main_add_SignalHandler: Added signal handler for signal 15
crmd[3027]: 2008/06/25_22:57:40 ERROR: do_log: [[FSA]] Input I_ERROR from do_cib_control() received in state (S_STARTING)
mgmtd[3079]: 2008/06/25_22:57:40 debug: Enabling coredumps
crmd[3027]: 2008/06/25_22:57:41 info: do_state_transition: State transition S_STARTING -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=do_cib_control ]
mgmtd[3079]: 2008/06/25_22:57:42 info: G_main_add_SignalHandler: Added signal handler for signal 10
crmd[3027]: 2008/06/25_22:57:42 ERROR: do_recover: Action A_RECOVER (0000000001000000) not supported
mgmtd[3079]: 2008/06/25_22:57:42 info: G_main_add_SignalHandler: Added signal handler for signal 12
crmd[3027]: 2008/06/25_22:57:44 info: register_heartbeat_conn: Hostname: dal-xcp-21.prodea-lo.net
mgmtd[3079]: 2008/06/25_22:57:44 info: init_crm
crmd[3027]: 2008/06/25_22:57:45 info: register_heartbeat_conn: UUID: 971ee6cd-9cf4-429c-b157-b8b1a75a1346
mgmtd[3079]: 2008/06/25_22:57:45 info: login to cib: 0, ret:-10
crmd[3027]: 2008/06/25_22:57:46 info: populate_cib_nodes_ha: Requesting the list of configured nodes
mgmtd[3079]: 2008/06/25_22:57:46 info: login to cib: 1, ret:-10
crmd[3027]: 2008/06/25_22:57:47 notice: populate_cib_nodes_ha: Node: dal-xcp-12.prodea-lo.net (uuid: a9439efd-141c-4c85-b109-94800d8d18f2)
crmd[3027]: 2008/06/25_22:57:47 notice: populate_cib_nodes_ha: Node: dal-xcp-21.prodea-lo.net (uuid: 971ee6cd-9cf4-429c-b157-b8b1a75a1346)
crmd[3027]: 2008/06/25_22:57:48 notice: populate_cib_nodes_ha: Node: dal-xcp-11.prodea-lo.net (uuid: 490a3ace-9bd2-448f-afc7-8792ccd0c598)
mgmtd[3079]: 2008/06/25_22:57:48 info: login to cib: 2, ret:-10
crmd[3027]: 2008/06/25_22:57:48 WARN: add_cib_op_callback: CIB call failed: not connected
crmd[3027]: 2008/06/25_22:57:49 ERROR: default_cib_update_callback: CIB Update failed: not connected
crmd[3027]: 2008/06/25_22:57:50 WARN: print_xml_formatted: default_cib_update_callback: update:failed: NULL
mgmtd[3079]: 2008/06/25_22:57:50 info: login to cib: 3, ret:-10
crmd[3027]: 2008/06/25_22:57:51 info: do_ha_control: Connected to Heartbeat
crmd[3027]: 2008/06/25_22:57:51 WARN: do_fsa_action: Action A_HA_CONNECT took 8540ms to complete
mgmtd[3079]: 2008/06/25_22:57:51 info: login to cib: 4, ret:-10
crmd[3027]: 2008/06/25_22:57:52 ERROR: do_log: [[FSA]] Input I_ERROR from default_cib_update_callback() received in state (S_RECOVERY)
crmd[3027]: 2008/06/25_22:57:52 info: do_dc_release: DC role released
crmd[3027]: 2008/06/25_22:57:53 WARN: add_cib_op_callback: CIB call failed: not connected
mgmtd[3079]: 2008/06/25_22:57:53 info: login to cib failed
mgmtd[3079]: 2008/06/25_22:57:53 ERROR: Can't initialize management library.Shutting down.(-1)
crmd[3027]: 2008/06/25_22:57:54 ERROR: config_query_callback: Local CIB query resulted in an error: not connected
heartbeat[2798]: 2008/06/25_22:57:54 WARN: Managed /usr/lib/heartbeat/mgmtd -v process 3079 exited with return code 1.
crmd[3027]: 2008/06/25_22:57:54 ERROR: do_log: [[FSA]] Input I_ERROR from config_query_callback() received in state (S_RECOVERY)
heartbeat[2798]: 2008/06/25_22:57:54 ERROR: Respawning client "/usr/lib/heartbeat/mgmtd -v":
crmd[3027]: 2008/06/25_22:57:55 info: do_dc_release: DC role released
heartbeat[2798]: 2008/06/25_22:57:56 info: Starting child client "/usr/lib/heartbeat/mgmtd -v" (0,0)
crmd[3027]: 2008/06/25_22:57:56 info: do_ccm_control: CCM connection established... waiting for first callback
crmd[3027]: 2008/06/25_22:57:57 ERROR: do_started: Start cancelled... S_RECOVERY
crmd[3027]: 2008/06/25_22:57:57 ERROR: do_log: [[FSA]] Input I_TERMINATE from do_recover() received in state (S_RECOVERY)
heartbeat[3081]: 2008/06/25_22:57:57 info: Starting "/usr/lib/heartbeat/mgmtd -v" as uid 0 gid 0 (pid 3081)
crmd[3027]: 2008/06/25_22:57:58 info: do_state_transition: State transition S_RECOVERY -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_recover ]
mgmtd[3081]: 2008/06/25_22:57:58 info: G_main_add_SignalHandler: Added signal handler for signal 15
crmd[3027]: 2008/06/25_22:57:58 info: do_shutdown: All subsystems stopped, continuing
mgmtd[3081]: 2008/06/25_22:57:58 debug: Enabling coredumps
crmd[3027]: 2008/06/25_22:57:59 info: do_lrm_control: Disconnected from the LRM
mgmtd[3081]: 2008/06/25_22:57:59 info: G_main_add_SignalHandler: Added signal handler for signal 10
crmd[3027]: 2008/06/25_22:57:59 info: do_ha_control: Disconnected from Heartbeat
mgmtd[3081]: 2008/06/25_22:58:00 info: G_main_add_SignalHandler: Added signal handler for signal 12
crmd[3027]: 2008/06/25_22:58:00 info: do_cib_control: Disconnecting CIB
crmd[3027]: 2008/06/25_22:58:01 info: do_exit: Performing A_EXIT_0 - gracefully exiting the CRMd
mgmtd[3081]: 2008/06/25_22:58:01 info: init_crm
crmd[3027]: 2008/06/25_22:58:02 ERROR: do_exit: Could not recover from internal error
mgmtd[3081]: 2008/06/25_22:58:02 info: login to cib: 0, ret:-10
crmd[3027]: 2008/06/25_22:58:02 info: free_mem: Dropping I_RELEASE_SUCCESS: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_dc_release ]
crmd[3027]: 2008/06/25_22:58:03 info: free_mem: Dropping I_RELEASE_SUCCESS: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_dc_release ]
crmd[3027]: 2008/06/25_22:58:03 info: free_mem: Dropping I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_stop ]
mgmtd[3081]: 2008/06/25_22:58:03 info: login to cib: 1, ret:-10
crmd[3027]: 2008/06/25_22:58:04 info: do_exit: [crmd] stopped (2)
heartbeat[2798]: 2008/06/25_22:58:05 WARN: Managed /usr/lib/heartbeat/crmd process 3027 exited with return code 2.
heartbeat[2798]: 2008/06/25_22:58:05 EMERG: Rebooting system.. Reason: /usr/lib/heartbeat/crmd
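
Because the crmd, mgmtd and heartbeat entries are interleaved, it can help to pull out only the WARN/ERROR/EMERG entries per daemon when reading a log like the one above. The following is a minimal sketch, not part of Heartbeat: the function names (parse_line, severe_by_daemon) and the SAMPLE list are made up for illustration, and the regular expression assumes every entry has the "daemon[pid]: date_time LEVEL: message" layout seen in this excerpt.

```python
import re
from collections import defaultdict

# Assumed entry layout, taken from the excerpt above:
#   daemon[pid]: YYYY/MM/DD_HH:MM:SS LEVEL: message
LINE_RE = re.compile(
    r"^(?P<daemon>\w+)\[(?P<pid>\d+)\]:\s+"
    r"(?P<ts>\d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>\w+):\s+(?P<msg>.*)$"
)

# Levels treated as "severe" here; this choice is an assumption.
SEVERE = {"WARN", "ERROR", "EMERG"}


def parse_line(line):
    """Return a dict of fields for one log entry, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def severe_by_daemon(lines):
    """Group (timestamp, level, message) tuples of severe entries by daemon name."""
    grouped = defaultdict(list)
    for line in lines:
        entry = parse_line(line)
        if entry and entry["level"] in SEVERE:
            grouped[entry["daemon"]].append(
                (entry["ts"], entry["level"], entry["msg"])
            )
    return dict(grouped)


# A few entries copied from the log above, used only as sample input.
SAMPLE = [
    "crmd[3027]: 2008/06/25_22:57:25 info: crm_timer_popped: Wait Timer (I_NULL) just popped!",
    "crmd[3027]: 2008/06/25_22:57:39 ERROR: do_cib_control: Could not complete CIB registration 30 times... hard error",
    "mgmtd[3077]: 2008/06/25_22:57:34 ERROR: Can't initialize management library.Shutting down.(-1)",
    "heartbeat[2798]: 2008/06/25_22:58:05 EMERG: Rebooting system.. Reason: /usr/lib/heartbeat/crmd",
]

if __name__ == "__main__":
    for daemon, entries in severe_by_daemon(SAMPLE).items():
        print(daemon)
        for ts, level, msg in entries:
            print(f"  {ts} {level}: {msg}")
```

To run it against a real file, pass an open file object (for example the heartbeat_debug file) to severe_by_daemon in place of SAMPLE. Entries that do not match the assumed layout are silently skipped.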
HB-Rebooting.rar
Description: Binary data
_______________________________________________ Linux-HA mailing list [email protected] http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
