On 05/06/2013, at 11:22 PM, Ulrich Windl <[email protected]> wrote:
> Hi again!
>
> I haven't fully understood the problem, but it looks as if pacemaker likes to
> shoot himself in the foot, and then go crazy when it feels the pain:
>
> Shortly after maintenance mode was turned on, there was a communication
> problem (my guess is due to the massive communication triggered by the
> cluster).

An interesting assumption, given that crmd <-> lrmd comms is local IPC and
therefore unaffected by how much traffic is happening between nodes.

> I see many "old event" messages:
> crmd: [7285]: info: process_graph_event: Detected action prm_O2CB:0_monitor_0
> from a different transition: 1 vs. 1301
> crmd: [7285]: info: abort_transition_graph: process_graph_event:476 -
> Triggered transition abort (complete=0, tag=lrm_rsc_op, id=prm_O2CB:0_last_0,
> magic=0:7;8:1:7:d6d69f10-a1da-4fde-9156-c2c3ea8be821, cib=0.1587.25) : Old
> event
> [...]
> Then bad things start:
> crmd: [7285]: ERROR: lrm_get_rsc(673): failed to receive a reply message of
> getrsc.
> crmd: [7285]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to
> lrmd via ch_cmd channel.
> [...]
> crmd: [7285]: ERROR: get_lrm_resource: Could not add resource
> prm_t11_db_raid1 to LRM
> crmd: [7285]: ERROR: do_lrm_invoke: Invalid resource definition
>
> Hey folks!: If crm cannot talk to lrm, how does it know the resource
> definition is invalid?

It doesn't, so it (must) take the paranoid option.

> [Crazy things go on, until it changes to:]
> crmd: [7285]: ERROR: verify_stopped: Resource prm_ping_gw1-v582:1 was active
> at shutdown. You may ignore this error if it is unmanaged.
>
> Hey folks!: Why emitting Errors and saying you can ignore it if the cluster
> is unmanaged?

The reason it's an error is that:
a) if the resources are not unmanaged, it's a pretty bad situation, and
b) no-one would take any notice otherwise.

> Doesn't the cluster know it's unmanaged?

The cluster is not monolithic, so the PE does but the crmd does not.

> These error messages are of little help!
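To illustrate the "paranoid option": when a local IPC query gets no reply, the caller cannot distinguish "resource is fine" from "resource is broken", so the only safe answer is the pessimistic one. A minimal sketch of that pattern (hypothetical Python, not Pacemaker's actual C code; `query_lrmd` and `resource_definition_ok` are made-up stand-ins for lrm_get_rsc()/get_lrm_resource()):

```python
from typing import Optional

def query_lrmd(rsc_id: str, ipc_up: bool) -> Optional[dict]:
    """Stand-in for lrm_get_rsc(): returns the resource record,
    or None when the local IPC channel is down."""
    if not ipc_up:
        return None  # "failed to receive a reply message of getrsc"
    return {"id": rsc_id, "class": "ocf"}

def resource_definition_ok(rsc_id: str, ipc_up: bool) -> bool:
    reply = query_lrmd(rsc_id, ipc_up)
    if reply is None:
        # Cannot verify anything, so take the paranoid option and
        # report the definition as invalid rather than guess it is fine.
        return False
    return reply.get("id") == rsc_id

print(resource_definition_ok("prm_t11_db_raid1", ipc_up=True))   # True
print(resource_definition_ok("prm_t11_db_raid1", ipc_up=False))  # False
```

The point is that "Invalid resource definition" after an IPC failure is not a contradiction; it is the conservative default when verification is impossible.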
>
> crmd: [7285]: info: lrm_connection_destroy: LRM Connection disconnected
> crmd: [7285]: info: do_lrm_control: Disconnected from the LRM
> [peace, hopefully?]
>
> Unfortunately not:
> crmd: [7285]: WARN: register_fsa_input_adv: do_te_invoke stalled the FSA with
> pending inputs
> [...]
> crmd: [7285]: info: ais_dispatch_message: Membership 1764: quorum retained
> crmd: [7285]: notice: crmd_peer_update: Status update: Client h06/crmd now
> has status [offline] (DC=true)
>
> The cluster is joking? The two-node cluster has a quorum with the DC being
> offline???

"DC=true" refers to _this_ node being the DC. So no, the DC is not offline.

> Bad phantasies continue:

Thank you for your assessment.

> crmd: [7285]: WARN: fail_incompletable_actions: Node h06 shutdown resulted in
> un-runnable actions
> crmd: [7285]: info: ais_dispatch_message: Membership 1764: quorum retained
> crmd: [7285]: notice: crmd_peer_update: Status update: Client h06/crmd now
> has status [online] (DC=true)
>
> So the node went offline and back online in less than one second? Amazing:
> The node needs 5 minutes to boot.

And the cluster knows how long your servers take to reboot... just because?
Plenty of people use VMs which come back in a couple of seconds.

Also, your point is irrelevant: it's not the _node_ bouncing, it's the crmd
_daemon_ quitting and being respawned, which can easily happen that fast.
And I would remind you that at the point the crmd dies, there is absolutely
no way to know that it will come back a second later.

Everything beyond this point is unnecessary. Although some timestamps would
be useful to understand the scale of the problem, the crmd has no business
doing any of the below: it is in a recovery state and should be trying to
exit ASAP.
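A sub-second offline/online bounce is entirely plausible for a respawned daemon. A toy supervisor loop (an illustration only; this is not how Pacemaker or corosync actually respawn the crmd) shows the gap between a child's death and its replacement starting is typically milliseconds, nothing like a node reboot:

```python
import subprocess
import sys
import time

def respawn_once(cmd):
    """Start a child, wait for it to exit, immediately start a
    replacement; return the gap in seconds between death and respawn."""
    child = subprocess.Popen(cmd)
    child.wait()                          # the "daemon" dies
    died_at = time.monotonic()
    replacement = subprocess.Popen(cmd)   # supervisor restarts it
    respawned_at = time.monotonic()
    replacement.wait()
    return respawned_at - died_at

# A trivial no-op child stands in for the daemon.
gap = respawn_once([sys.executable, "-c", "pass"])
print(f"respawn gap: {gap:.4f}s")  # typically far below one second
```

So a peer observing [offline] then [online] within a second is reporting the crmd process bouncing, not the host rebooting.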
> crmd: [7285]: WARN: do_log: FSA: Input I_FAIL from get_lrm_resource()
> received in state S_FINALIZE_JOIN
> crmd: [7285]: notice: do_state_transition: State transition S_FINALIZE_JOIN
> -> S_INTEGRATION [ input=I_FAIL cause=C_FSA_INTERNAL origin=get_lrm_resource ]
> [...]
> crmd: [7285]: info: handle_request: Current ping state: S_INTEGRATION
> crmd: [7285]: WARN: do_log: FSA: Input I_FAIL from get_lrm_resource()
> received in state S_INTEGRATION
> crmd: [7285]: info: abort_transition_graph: do_te_invoke:162 - Triggered
> transition abort (complete=0) : Peer Cancelled
> [...]
> crmd: [7285]: WARN: crmd_ha_msg_filter: Another DC detected: h06
> (op=join_offer)
> crmd: [7285]: notice: do_state_transition: State transition S_INTEGRATION ->
> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=crmd_ha_msg_filter ]
> [...]
> crmd: [7285]: WARN: do_log: FSA: Input I_FAIL from get_lrm_resource()
> received in state S_ELECTION
> crmd: [7285]: notice: do_state_transition: State transition S_ELECTION ->
> S_RELEASE_DC [ input=I_FAIL cause=C_FSA_INTERNAL origin=get_lrm_resource ]
> [...]
> crmd: [7285]: WARN: do_log: FSA: Input I_NODE_JOIN from route_message()
> received in state S_RELEASE_DC
> [...]
> crmd: [7285]: WARN: do_log: FSA: Input I_ELECTION_DC from
> do_dc_join_finalize() received in state S_RELEASE_DC
> crmd: [7285]: ERROR: do_pe_invoke: Attempted to invoke the PE without a
> consistent copy of the CIB!
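The repeating "Input X received in state Y" warnings follow a common finite-state-machine pattern: inputs that are not legal in the current state are logged and dropped, so a stuck state plus a repeating election timer produces the same WARN line over and over. A minimal sketch (hypothetical; the `VALID` transition table here is invented for illustration and is not crmd's real FSA table):

```python
# Assumed, simplified transition table: which inputs each state accepts.
VALID = {
    "S_RELEASE_DC": {"I_RELEASE_SUCCESS"},
    "S_ELECTION":   {"I_ELECTION_DC", "I_FAIL"},
}

def deliver(state: str, fsa_input: str, log: list) -> str:
    """Deliver one input to the FSA; illegal inputs are warned and dropped."""
    if fsa_input not in VALID.get(state, set()):
        log.append(f"WARN: Input {fsa_input} received in state {state}")
        return state          # input dropped, state unchanged
    return "S_IDLE"           # placeholder for a real "handled" transition

log = []
state = "S_RELEASE_DC"
for _ in range(3):            # the election timeout pops repeatedly
    state = deliver(state, "I_ELECTION_DC", log)

for line in log:
    print(line)               # the same WARN line, three times
```

Because nothing in S_RELEASE_DC ever accepts I_ELECTION_DC, the timer can pop for days and the log fills with identical warnings while the state never changes.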
> crmd: [7285]: WARN: do_log: FSA: Input I_JOIN_REQUEST from route_message()
> received in state S_RELEASE_DC
> crmd: [7285]: WARN: do_log: FSA: Input I_JOIN_OFFER from route_message()
> received in state S_RELEASE_DC
> crmd: [7285]: info: update_dc: Set DC to h02 (3.0.6)
> crmd: [7285]: WARN: do_log: FSA: Input I_JOIN_OFFER from route_message()
> received in state S_RELEASE_DC
> crmd: [7285]: WARN: do_log: FSA: Input I_JOIN_REQUEST from route_message()
> received in state S_RELEASE_DC
> crmd: [7285]: WARN: do_log: FSA: Input I_JOIN_REQUEST from route_message()
> received in state S_RELEASE_DC
> crmd: [7285]: info: do_election_count_vote: Election 3 (owner: h06) pass:
> vote from h06 (Uptime)
> crmd: [7285]: ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just
> popped in state S_RELEASE_DC! (120000ms)
> crmd: [7285]: WARN: do_log: FSA: Input I_ELECTION_DC from crm_timer_popped()
> received in state S_RELEASE_DC
> [...]
> last messages repeat for days, even after the cluster was switched back to
> managed mode...!!!
> crmd: [7285]: ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just
> popped in state S_RELEASE_DC! (120000ms)
> crmd: [7285]: WARN: do_log: FSA: Input I_ELECTION_DC from crm_timer_popped()
> received in state S_RELEASE_DC
> crmd: [7285]: info: do_election_count_vote: Election 3226 (owner: rksaph06)
> pass: vote from rksaph06 (Uptime)
>
> Regards,
> Ulrich
>
>>>> "Ulrich Windl" <[email protected]> wrote on 05.06.2013 at
> 14:25 in message <[email protected]>:
>> Hi!
>>
>> When the cluster (SLES11 SP2, current) was put in maintenance mode, two
>> messages repeat periodically, filling the log:
>> ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped in
>> state S_RELEASE_DC! (120000ms)
>> WARN: do_log: FSA: Input I_ELECTION_DC from crm_timer_popped() received in
>> state S_RELEASE_DC
>>
>> crmd: [7285]: info: handle_request: Current ping state: S_RELEASE_DC
>>
>> Is this expected, or is it yet another bug (pacemaker-1.1.7-0.13.9)?
>>
>> Regards,
>> Ulrich

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
