Hi again! I haven't fully understood the problem, but it looks as if Pacemaker likes to shoot itself in the foot and then go crazy when it feels the pain:
Shortly after maintenance mode was turned on, there was a communication problem (my guess: due to the massive communication triggered by the cluster). I see many "old event" messages:

crmd: [7285]: info: process_graph_event: Detected action prm_O2CB:0_monitor_0 from a different transition: 1 vs. 1301
crmd: [7285]: info: abort_transition_graph: process_graph_event:476 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=prm_O2CB:0_last_0, magic=0:7;8:1:7:d6d69f10-a1da-4fde-9156-c2c3ea8be821, cib=0.1587.25) : Old event
[...]

Then bad things start:

crmd: [7285]: ERROR: lrm_get_rsc(673): failed to receive a reply message of getrsc.
crmd: [7285]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
[...]
crmd: [7285]: ERROR: get_lrm_resource: Could not add resource prm_t11_db_raid1 to LRM
crmd: [7285]: ERROR: do_lrm_invoke: Invalid resource definition

Hey folks: if crmd cannot talk to lrmd, how does it know the resource definition is invalid?

[Crazy things go on, until it changes to:]

crmd: [7285]: ERROR: verify_stopped: Resource prm_ping_gw1-v582:1 was active at shutdown. You may ignore this error if it is unmanaged.

Hey folks: why emit ERRORs and then say they may be ignored if the resource is unmanaged? Doesn't the cluster know whether it is unmanaged? These error messages are of little help!

crmd: [7285]: info: lrm_connection_destroy: LRM Connection disconnected
crmd: [7285]: info: do_lrm_control: Disconnected from the LRM

[Peace, hopefully?] Unfortunately not:

crmd: [7285]: WARN: register_fsa_input_adv: do_te_invoke stalled the FSA with pending inputs
[...]
crmd: [7285]: info: ais_dispatch_message: Membership 1764: quorum retained
crmd: [7285]: notice: crmd_peer_update: Status update: Client h06/crmd now has status [offline] (DC=true)

Is the cluster joking? The two-node cluster retains quorum while the DC is offline???
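(For context, "maintenance mode" here means the cluster-wide property set via the CRM shell. A sketch, assuming the standard crmsh commands on SLES11, not the exact invocation used:)

```shell
# Cluster-wide maintenance mode: resources keep running, but Pacemaker
# stops monitoring and managing them until the property is cleared.
crm configure property maintenance-mode=true

# One-shot status view; resources should now be flagged as unmanaged.
crm_mon -1

# Back to normal (managed) operation.
crm configure property maintenance-mode=false
```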
The bad fantasies continue:

crmd: [7285]: WARN: fail_incompletable_actions: Node h06 shutdown resulted in un-runnable actions
crmd: [7285]: info: ais_dispatch_message: Membership 1764: quorum retained
crmd: [7285]: notice: crmd_peer_update: Status update: Client h06/crmd now has status [online] (DC=true)

So the node went offline and back online in less than one second? Amazing: the node needs five minutes to boot.

crmd: [7285]: WARN: do_log: FSA: Input I_FAIL from get_lrm_resource() received in state S_FINALIZE_JOIN
crmd: [7285]: notice: do_state_transition: State transition S_FINALIZE_JOIN -> S_INTEGRATION [ input=I_FAIL cause=C_FSA_INTERNAL origin=get_lrm_resource ]
[...]
crmd: [7285]: info: handle_request: Current ping state: S_INTEGRATION
crmd: [7285]: WARN: do_log: FSA: Input I_FAIL from get_lrm_resource() received in state S_INTEGRATION
crmd: [7285]: info: abort_transition_graph: do_te_invoke:162 - Triggered transition abort (complete=0) : Peer Cancelled
[...]
crmd: [7285]: WARN: crmd_ha_msg_filter: Another DC detected: h06 (op=join_offer)
crmd: [7285]: notice: do_state_transition: State transition S_INTEGRATION -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=crmd_ha_msg_filter ]
[...]
crmd: [7285]: WARN: do_log: FSA: Input I_FAIL from get_lrm_resource() received in state S_ELECTION
crmd: [7285]: notice: do_state_transition: State transition S_ELECTION -> S_RELEASE_DC [ input=I_FAIL cause=C_FSA_INTERNAL origin=get_lrm_resource ]
[...]
crmd: [7285]: WARN: do_log: FSA: Input I_NODE_JOIN from route_message() received in state S_RELEASE_DC
[...]
crmd: [7285]: WARN: do_log: FSA: Input I_ELECTION_DC from do_dc_join_finalize() received in state S_RELEASE_DC
crmd: [7285]: ERROR: do_pe_invoke: Attempted to invoke the PE without a consistent copy of the CIB!
crmd: [7285]: WARN: do_log: FSA: Input I_JOIN_REQUEST from route_message() received in state S_RELEASE_DC
crmd: [7285]: WARN: do_log: FSA: Input I_JOIN_OFFER from route_message() received in state S_RELEASE_DC
crmd: [7285]: info: update_dc: Set DC to h02 (3.0.6)
crmd: [7285]: WARN: do_log: FSA: Input I_JOIN_OFFER from route_message() received in state S_RELEASE_DC
crmd: [7285]: WARN: do_log: FSA: Input I_JOIN_REQUEST from route_message() received in state S_RELEASE_DC
crmd: [7285]: WARN: do_log: FSA: Input I_JOIN_REQUEST from route_message() received in state S_RELEASE_DC
crmd: [7285]: info: do_election_count_vote: Election 3 (owner: h06) pass: vote from h06 (Uptime)
crmd: [7285]: ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped in state S_RELEASE_DC! (120000ms)
crmd: [7285]: WARN: do_log: FSA: Input I_ELECTION_DC from crm_timer_popped() received in state S_RELEASE_DC
[...]

The last messages repeat for days, even after the cluster was switched back to managed mode:

crmd: [7285]: ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped in state S_RELEASE_DC! (120000ms)
crmd: [7285]: WARN: do_log: FSA: Input I_ELECTION_DC from crm_timer_popped() received in state S_RELEASE_DC
crmd: [7285]: info: do_election_count_vote: Election 3226 (owner: rksaph06) pass: vote from rksaph06 (Uptime)

Regards,
Ulrich

>>> "Ulrich Windl" <[email protected]> wrote on 2013-06-05 at 14:25 in message <[email protected]>:
> Hi!
>
> When the cluster (SLES11 SP2, current) was put in maintenance mode, two
> messages repeat periodically, filling the log:
> ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped in
> state S_RELEASE_DC! (120000ms)
> WARN: do_log: FSA: Input I_ELECTION_DC from crm_timer_popped() received in
> state S_RELEASE_DC
>
> crmd: [7285]: info: handle_request: Current ping state: S_RELEASE_DC
>
> Is this expected, or is it yet another bug (pacemaker-1.1.7-0.13.9)?
>
> Regards,
> Ulrich
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
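P.S.: The scale of the flood is easy to check by counting the two repeating lines. A minimal sketch; the sample file below stands in for the real log, which on SLES is typically /var/log/messages:

```shell
# Two of each repeating message, standing in for days' worth of log.
cat > /tmp/crmd-sample.log <<'EOF'
crmd: [7285]: ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped in state S_RELEASE_DC! (120000ms)
crmd: [7285]: WARN: do_log: FSA: Input I_ELECTION_DC from crm_timer_popped() received in state S_RELEASE_DC
crmd: [7285]: ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped in state S_RELEASE_DC! (120000ms)
crmd: [7285]: WARN: do_log: FSA: Input I_ELECTION_DC from crm_timer_popped() received in state S_RELEASE_DC
EOF

# Count each repeating message (prints 2 and 2 for this sample).
grep -c 'Election Timeout (I_ELECTION_DC)' /tmp/crmd-sample.log
grep -c 'Input I_ELECTION_DC from crm_timer_popped' /tmp/crmd-sample.log
```

The numbers above are at least internally consistent: with the 120000 ms election timeout, the jump from election 3 to election 3226 works out to roughly 3223 × 2 min ≈ 4.5 days of repetition.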
