please use hb_report for things like this. we'd need logs from both nodes (and trying to read line-wrapped logs is near impossible)
On Wed, Aug 12, 2009 at 1:30 PM, Alain.Moulle<[email protected]> wrote: > Hi , > I'm facing a pb at first start on my two-nodes ha cluster: server2 and > server3 (both with fedora11) > The start on server2 seems to work fine, and on hb_gui I can see server2 > OnLine and > server3 Offline > But the start on server3 seems to not connect to the cluster, and at the > end crm_mon displays > only server3, no mention of server2 (in the log, seems that it fails to > calculate server2 uuid)... > A ping between both heartbeat interfaces works fine ... so I have no > idea why it does not > work ... > Any idea ? (Log is below) > Thanks for help. > Alain > > > cib[3137]: 2009/08/12_13:14:17 info: register_heartbeat_conn: Hostname: > server3 > cib[3137]: 2009/08/12_13:14:17 info: register_heartbeat_conn: UUID: > 7c6eb9f6-c938-4012-8ba0-4d5c21dd1315 > cib[3137]: 2009/08/12_13:14:17 info: ccm_connect: Registering with CCM... > cib[3137]: 2009/08/12_13:14:17 WARN: ccm_connect: CCM Activation failed > cib[3137]: 2009/08/12_13:14:17 WARN: ccm_connect: CCM Connection failed > 1 times (30 max) > stonithd[3139]: 2009/08/12_13:14:17 notice: > /usr/lib64/heartbeat/stonithd start up successfully. > stonithd[3139]: 2009/08/12_13:14:17 info: G_main_add_SignalHandler: > Added signal handler for signal 17 > mgmtd[3135]: 2009/08/12_13:14:18 info: init_crm: live > mgmtd[3135]: 2009/08/12_13:14:18 info: login to cib live: 0, ret:-10 > crmd[3142]: 2009/08/12_13:14:18 info: do_cib_control: Could not connect > to the CIB service: connection failed > crmd[3142]: 2009/08/12_13:14:18 WARN: do_cib_control: Couldn't complete > CIB registration 1 times... pause and retry > crmd[3142]: 2009/08/12_13:14:18 info: crmd_init: Starting crmd's mainloop > mgmtd[3135]: 2009/08/12_13:14:19 info: login to cib live: 1, ret:-10 > mgmtd[3135]: 2009/08/12_13:14:20 info: login to cib live: 2, ret:-10 > crmd[3142]: 2009/08/12_13:14:20 info: crm_timer_popped: Wait Timer > (I_NULL) just popped! > cib[3137]: 2009/08/12_13:14:20 info: ccm_connect: Registering with CCM... > cib[3137]: 2009/08/12_13:14:20 WARN: ccm_connect: CCM Activation failed > cib[3137]: 2009/08/12_13:14:20 WARN: ccm_connect: CCM Connection failed > 2 times (30 max) > mgmtd[3135]: 2009/08/12_13:14:21 info: login to cib live: 3, ret:-10 > crmd[3142]: 2009/08/12_13:14:21 info: do_cib_control: Could not connect > to the CIB service: connection failed > crmd[3142]: 2009/08/12_13:14:21 WARN: do_cib_control: Couldn't complete > CIB registration 2 times... pause and retry > ccm[3136]: 2009/08/12_13:14:21 debug: quorum plugin: majority > ccm[3136]: 2009/08/12_13:14:21 debug: cluster:linux-ha, member_count=1, > member_quorum_votes=100 > ccm[3136]: 2009/08/12_13:14:21 debug: total_node_count=2, > total_quorum_votes=200 > ccm[3136]: 2009/08/12_13:14:21 debug: quorum plugin: twonodes > ccm[3136]: 2009/08/12_13:14:21 debug: cluster:linux-ha, member_count=1, > member_quorum_votes=100 > ccm[3136]: 2009/08/12_13:14:21 debug: total_node_count=2, > total_quorum_votes=200 > ccm[3136]: 2009/08/12_13:14:21 info: Break tie for 2 nodes cluster > ccm[3136]: 2009/08/12_13:14:21 info: G_main_add_SignalHandler: Added > signal handler for signal 15 > mgmtd[3135]: 2009/08/12_13:14:22 info: login to cib live: 4, ret:-10 > mgmtd[3135]: 2009/08/12_13:14:23 info: login to cib failed: live > mgmtd[3135]: 2009/08/12_13:14:23 ERROR: Can't initialize management > library.Shutting down.(-1) > heartbeat[3127]: 2009/08/12_13:14:23 WARN: Managed > /usr/lib64/heartbeat/mgmtd -v -t process 3135 exited with return code 1. > heartbeat[3127]: 2009/08/12_13:14:23 ERROR: Respawning client > "/usr/lib64/heartbeat/mgmtd -v -t": > heartbeat[3127]: 2009/08/12_13:14:23 info: Starting child client > "/usr/lib64/heartbeat/mgmtd -v -t" (0,0) > crmd[3142]: 2009/08/12_13:14:23 info: crm_timer_popped: Wait Timer > (I_NULL) just popped! > cib[3137]: 2009/08/12_13:14:23 info: ccm_connect: Registering with CCM... > cib[3137]: 2009/08/12_13:14:23 info: cib_init: Requesting the list of > configured nodes > cib[3137]: 2009/08/12_13:14:23 info: cib_init: Starting cib mainloop > cib[3137]: 2009/08/12_13:14:23 info: cib_client_status_callback: Status > update: Client server3/cib now has status [join] > cib[3137]: 2009/08/12_13:14:23 info: crm_new_peer: Node 0 is now known > as server3 > cib[3137]: 2009/08/12_13:14:23 info: crm_update_peer_proc: server3.cib > is now online > cib[3137]: 2009/08/12_13:14:23 info: mem_handle_event: Got an event > OC_EV_MS_NEW_MEMBERSHIP from ccm > cib[3137]: 2009/08/12_13:14:23 info: mem_handle_event: instance=1, > nodes=1, new=1, lost=0, n_idx=0, new_idx=0, old_idx=3 > cib[3137]: 2009/08/12_13:14:23 info: cib_ccm_msg_callback: Processing > CCM event=NEW MEMBERSHIP (id=1) > cib[3137]: 2009/08/12_13:14:23 info: crm_get_peer: Node server3 now has > id: 1 > cib[3137]: 2009/08/12_13:14:23 info: crm_update_peer: Node server3: id=1 > state=member (new) addr=(null) votes=-1 born=1 seen=1 > proc=00000000000000000000000000000100 > cib[3137]: 2009/08/12_13:14:23 info: crm_update_peer_proc: server3.ais > is now online > cib[3137]: 2009/08/12_13:14:23 info: crm_update_peer_proc: server3.crmd > is now online > cib[3144]: 2009/08/12_13:14:23 info: write_cib_contents: Archived > previous version as /var/lib/heartbeat/crm/cib-9.raw > cib[3144]: 2009/08/12_13:14:23 info: write_cib_contents: Wrote version > 0.7.0 of the CIB to disk (digest: c41e9acb00ec2c131a3375b2b6b0faec) > cib[3144]: 2009/08/12_13:14:23 info: retrieveCib: Reading cluster > configuration from: /var/lib/heartbeat/crm/cib.Mx3kVH (digest: > /var/lib/heartbeat/crm/cib.hewZvu) > cib[3137]: 2009/08/12_13:14:24 info: cib_client_status_callback: Status > update: Client server3/cib now has status [online] > heartbeat[3143]: 2009/08/12_13:14:24 info: Starting > "/usr/lib64/heartbeat/mgmtd -v -t" as uid 0 gid 0 (pid 3143) > mgmtd[3143]: 2009/08/12_13:14:24 info: Pacemaker-mgmt Hg Version: > 1f7d1be39d34c38d64d971ac79a21339917fba4a > mgmtd[3143]: 2009/08/12_13:14:24 info: G_main_add_SignalHandler: Added > signal handler for signal 15 > mgmtd[3143]: 2009/08/12_13:14:24 debug: Enabling coredumps > mgmtd[3143]: 2009/08/12_13:14:24 info: G_main_add_SignalHandler: Added > signal handler for signal 10 > mgmtd[3143]: 2009/08/12_13:14:24 info: G_main_add_SignalHandler: Added > signal handler for signal 12 > mgmtd[3143]: 2009/08/12_13:14:24 info: init_crm: live > attrd[3141]: 2009/08/12_13:14:24 info: cib_connect: Connected to the CIB > after 7 signon attempts > attrd[3141]: 2009/08/12_13:14:24 info: cib_connect: Sending full refresh > crmd[3142]: 2009/08/12_13:14:24 info: do_cib_control: CIB connection > established > crmd[3142]: 2009/08/12_13:14:24 info: crm_cluster_connect: Connecting to > Heartbeat > crmd[3142]: 2009/08/12_13:14:24 info: register_heartbeat_conn: Hostname: > server3 > crmd[3142]: 2009/08/12_13:14:24 info: register_heartbeat_conn: UUID: > 7c6eb9f6-c938-4012-8ba0-4d5c21dd1315 > mgmtd[3143]: 2009/08/12_13:14:24 debug: main: run the loop... > mgmtd[3143]: 2009/08/12_13:14:24 info: Started. > crmd[3142]: 2009/08/12_13:14:25 info: do_ha_control: Connected to the > cluster > crmd[3142]: 2009/08/12_13:14:25 info: do_ccm_control: CCM connection > established... waiting for first callback > crmd[3142]: 2009/08/12_13:14:25 info: do_started: Delaying start, CCM > (0000000000100000) not connected > crmd[3142]: 2009/08/12_13:14:25 info: config_query_callback: Checking > for expired actions every 900000ms > crmd[3142]: 2009/08/12_13:14:25 notice: crmd_client_status_callback: > Status update: Client server3/crmd now has status [online] (DC=false) > crmd[3142]: 2009/08/12_13:14:25 info: crm_new_peer: Node 0 is now known > as server3 > crmd[3142]: 2009/08/12_13:14:25 info: crm_update_peer_proc: server3.crmd > is now online > crmd[3142]: 2009/08/12_13:14:25 info: crmd_client_status_callback: Not > the DC > crmd[3142]: 2009/08/12_13:14:25 notice: crmd_client_status_callback: > Status update: Client server3/crmd now has status [online] (DC=false) > crmd[3142]: 2009/08/12_13:14:25 info: crmd_client_status_callback: Not > the DC > crmd[3142]: 2009/08/12_13:14:25 info: mem_handle_event: Got an event > OC_EV_MS_NEW_MEMBERSHIP from ccm > crmd[3142]: 2009/08/12_13:14:25 info: mem_handle_event: instance=1, > nodes=1, new=1, lost=0, n_idx=0, new_idx=0, old_idx=3 > crmd[3142]: 2009/08/12_13:14:25 info: crmd_ccm_msg_callback: Quorum > (re)attained after event=NEW MEMBERSHIP (id=1) > crmd[3142]: 2009/08/12_13:14:25 info: ccm_event_detail: NEW MEMBERSHIP: > trans=1, nodes=1, new=1, lost=0 n_idx=0, new_idx=0, old_idx=3 > crmd[3142]: 2009/08/12_13:14:25 info: ccm_event_detail: CURRENT: > server3 [nodeid=1, born=1] > crmd[3142]: 2009/08/12_13:14:25 info: ccm_event_detail: NEW: > server3 [nodeid=1, born=1] > crmd[3142]: 2009/08/12_13:14:25 info: crm_get_peer: Node server3 now has > id: 1 > crmd[3142]: 2009/08/12_13:14:25 info: crm_update_peer: Node server3: > id=1 state=member (new) addr=(null) votes=-1 born=1 seen=1 > proc=00000000000000000000000000000200 > crmd[3142]: 2009/08/12_13:14:25 info: crm_update_peer_proc: server3.ais > is now online > crmd[3142]: 2009/08/12_13:14:25 info: do_started: The local CRM is > operational > crmd[3142]: 2009/08/12_13:14:25 info: do_state_transition: State > transition S_STARTING -> S_PENDING [ input=I_PENDING > cause=C_FSA_INTERNAL origin=do_started ] > crmd[3142]: 2009/08/12_13:14:36 info: crm_timer_popped: Election Trigger > (I_DC_TIMEOUT) just popped! > crmd[3142]: 2009/08/12_13:14:36 WARN: do_log: FSA: Input I_DC_TIMEOUT > from crm_timer_popped() received in state S_PENDING > crmd[3142]: 2009/08/12_13:14:36 info: do_state_transition: State > transition S_PENDING -> S_ELECTION [ input=I_DC_TIMEOUT > cause=C_TIMER_POPPED origin=crm_timer_popped > crmd[3142]: 2009/08/12_13:14:36 info: do_state_transition: State > transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC > cause=C_FSA_INTERNAL origin=do_election_check ] > crmd[3142]: 2009/08/12_13:14:36 info: do_te_control: Registering TE > UUID: ecf08a7c-6fb8-43e0-a8b6-011ab83ad5f7 > crmd[3142]: 2009/08/12_13:14:40 info: update_dc: Set DC to server3 (3.0.1) > crmd[3142]: 2009/08/12_13:14:41 info: do_state_transition: State > transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED > cause=C_FSA_INTERNAL origin=check_join_state ] > crmd[3142]: 2009/08/12_13:14:41 info: do_state_transition: All 1 cluster > nodes responded to the join offer. > crmd[3142]: 2009/08/12_13:14:41 info: do_dc_join_finalize: join-1: > Syncing the CIB from server3 to the rest of the cluster > cib[3137]: 2009/08/12_13:14:41 info: cib_process_request: Operation > complete: op cib_sync for section 'all' (origin=local/crmd/14, > version=0.7.0): ok (rc=0) > cib[3137]: 2009/08/12_13:14:41 info: cib_process_request: Operation > complete: op cib_modify for section nodes (origin=local/crmd/15, > version=0.7.0): ok (rc=0) > crmd[3142]: 2009/08/12_13:14:41 info: update_attrd: Connecting to attrd... > crmd[3142]: 2009/08/12_13:14:41 info: attrd_update: Sent update: > terminate=(null) for server3 > crmd[3142]: 2009/08/12_13:14:41 info: attrd_update: Sent update: > shutdown=(null) for server3 > attrd[3141]: 2009/08/12_13:14:41 info: find_hash_entry: Creating hash > entry for terminate > attrd[3141]: 2009/08/12_13:14:41 info: find_hash_entry: Creating hash > entry for shutdown > crmd[3142]: 2009/08/12_13:14:41 info: do_dc_join_ack: join-1: Updating > node state to member for server3 > cib[3137]: 2009/08/12_13:14:41 info: cib_process_request: Operation > complete: op cib_delete for section > //node_sta...@uname='server3']/transient_attributes > (origin=local/crmd/16, version=0.7.0): ok (rc=0) > crmd[3142]: 2009/08/12_13:14:41 info: erase_xpath_callback: Deletion of > "//node_sta...@uname='server3']/transient_attributes": ok (rc=0) > cib[3137]: 2009/08/12_13:14:41 info: cib_process_request: Operation > complete: op cib_delete for section //node_sta...@uname='server3']/lrm > (origin=local/crmd/17, version=0.7.0): ok (rc=0) > crmd[3142]: 2009/08/12_13:14:41 info: erase_xpath_callback: Deletion of > "//node_sta...@uname='server3']/lrm": ok (rc=0) > cib[3137]: 2009/08/12_13:14:41 info: cib_process_request: Operation > complete: op cib_delete for section //node_sta...@uname='server3']/lrm > (origin=local/crmd/18, version=0.7.0): ok (rc=0) > crmd[3142]: 2009/08/12_13:14:41 info: erase_xpath_callback: Deletion of > "//node_sta...@uname='server3']/lrm": ok (rc=0) > crmd[3142]: 2009/08/12_13:14:41 info: do_state_transition: State > transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED > cause=C_FSA_INTERNAL origin=check_join_state ] > crmd[3142]: 2009/08/12_13:14:41 info: populate_cib_nodes_ha: Requesting > the list of configured nodes > crmd[3142]: 2009/08/12_13:14:43 WARN: get_uuid: Could not calculate UUID > for server2 > crmd[3142]: 2009/08/12_13:14:43 WARN: populate_cib_nodes_ha: Node > server2: no uuid found > crmd[3142]: 2009/08/12_13:14:43 info: do_state_transition: All 1 cluster > nodes are eligible to run resources. > crmd[3142]: 2009/08/12_13:14:43 info: do_dc_join_final: Ensuring DC, > quorum and node attributes are up-to-date > crmd[3142]: 2009/08/12_13:14:43 info: attrd_update: Sent update: > (null)=(null) for localhost > attrd[3141]: 2009/08/12_13:14:43 info: attrd_local_callback: Sending > full refresh (origin=crmd) > crmd[3142]: 2009/08/12_13:14:43 info: crm_update_quorum: Updating quorum > status to true (call=22) > attrd[3141]: 2009/08/12_13:14:43 info: attrd_trigger_update: Sending > flush op to all hosts for: shutdown (<null>) > crmd[3142]: 2009/08/12_13:14:43 info: abort_transition_graph: > do_te_invoke:190 - Triggered transition abort (complete=1) : Peer Cancelled > crmd[3142]: 2009/08/12_13:14:43 info: do_pe_invoke: Query 23: Requesting > the current CIB: S_POLICY_ENGINE > attrd[3141]: 2009/08/12_13:14:43 info: attrd_trigger_update: Sending > flush op to all hosts for: terminate (<null>) > cib[3137]: 2009/08/12_13:14:43 info: cib_process_request: Operation > complete: op cib_modify for section nodes (origin=local/crmd/20, > version=0.7.1): ok (rc=0) > crmd[3142]: 2009/08/12_13:14:43 info: abort_transition_graph: > need_abort:59 - Triggered transition abort (complete=1) : Non-status change > crmd[3142]: 2009/08/12_13:14:43 info: need_abort: Aborting on change to > admin_epoch > crmd[3142]: 2009/08/12_13:14:43 info: do_pe_invoke: Query 24: Requesting > the current CIB: S_POLICY_ENGINE > cib[3137]: 2009/08/12_13:14:43 info: log_data_element: cib:diff: - <cib > admin_epoch="0" epoch="7" num_updates="1" /> > cib[3137]: 2009/08/12_13:14:43 info: log_data_element: cib:diff: + <cib > dc-uuid="7c6eb9f6-c938-4012-8ba0-4d5c21dd1315" admin_epoch="0" epoch="8" > num_updates="1" /> > cib[3137]: 2009/08/12_13:14:43 info: cib_process_request: Operation > complete: op cib_modify for section cib (origin=local/crmd/22, > version=0.8.1): ok (rc=0) > crmd[3142]: 2009/08/12_13:14:43 info: do_pe_invoke_callback: Invoking > the PE: ref=pe_calc-dc-1250075683-7, seq=1, quorate=1 > pengine[3145]: 2009/08/12_13:14:43 info: unpack_config: Node scores: > 'red' = -INFINITY, 'yellow' = 0, 'green' = 0 > pengine[3145]: 2009/08/12_13:14:43 ERROR: unpack_resources: No STONITH > resources have been defined > pengine[3145]: 2009/08/12_13:14:43 ERROR: unpack_resources: Either > configure some or disable STONITH with the stonith-enabled option > pengine[3145]: 2009/08/12_13:14:43 ERROR: unpack_resources: NOTE: > Clusters with shared data need STONITH to ensure data integrity > pengine[3145]: 2009/08/12_13:14:43 info: determine_online_status: Node > server3 is online > pengine[3145]: 2009/08/12_13:14:43 info: stage6: Delaying fencing > operations until there are resources to manage > cib[3146]: 2009/08/12_13:14:43 info: write_cib_contents: Archived > previous version as /var/lib/heartbeat/crm/cib-10.raw > cib[3146]: 2009/08/12_13:14:43 info: write_cib_contents: Wrote version > 0.8.0 of the CIB to disk (digest: a037845ebd38e720357a1dbf3e1faf68) > > crmd[3142]: 2009/08/12_13:14:36 WARN: cib_client_add_notify_callback: > Callback already present > crmd[3142]: 2009/08/12_13:14:36 info: set_graph_functions: Setting > custom graph functions > crmd[3142]: 2009/08/12_13:14:36 info: unpack_graph: Unpacked transition > -1: 0 actions in 0 synapses > crmd[3142]: 2009/08/12_13:14:36 info: start_subsystem: Starting > sub-system "pengine" > pengine[3145]: 2009/08/12_13:14:36 info: main: Starting pengine > crmd[3142]: 2009/08/12_13:14:39 info: do_dc_takeover: Taking over DC > status for this partition > cib[3137]: 2009/08/12_13:14:39 info: cib_process_readwrite: We are now > in R/W mode > cib[3137]: 2009/08/12_13:14:39 info: cib_process_request: Operation > complete: op cib_master for section 'all' (origin=local/crmd/6, > version=0.7.0): ok (rc=0) > cib[3137]: 2009/08/12_13:14:39 info: cib_process_request: Operation > complete: op cib_modify for section cib (origin=local/crmd/7, > version=0.7.0): ok (rc=0) > cib[3137]: 2009/08/12_13:14:39 info: cib_process_request: Operation > complete: op cib_modify for section crm_config (origin=local/crmd/9, > version=0.7.0): ok (rc=0) > crmd[3142]: 2009/08/12_13:14:39 info: join_make_offer: Making join > offers based on membership 1 > crmd[3142]: 2009/08/12_13:14:39 info: do_dc_join_offer_all: join-1: > Waiting on 1 outstanding join acks > crmd[3142]: 2009/08/12_13:14:39 info: te_connect_stonith: Attempting > connection to fencing daemon... > cib[3137]: 2009/08/12_13:14:39 info: cib_process_request: Operation > complete: op cib_modify for section crm_config (origin=local/crmd/11, > version=0.7.0): ok (rc=0) > crmd[3142]: 2009/08/12_13:14:40 info: te_connect_stonith: Connected > crmd[3142]: 2009/08/12_13:14:40 info: config_query_callback: Checking > for expired actions every 900000ms > > _______________________________________________ > Linux-HA mailing list > [email protected] > http://lists.linux-ha.org/mailman/listinfo/linux-ha > See also: http://linux-ha.org/ReportingProblems > _______________________________________________ Linux-HA mailing list [email protected] http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
