Hi again! I haven't fully understood the problem, but it looks as if Pacemaker likes to shoot itself in the foot and then go crazy when it feels the pain:
Shortly after maintenance mode was turned on, there was a communication problem (my guess: due to the massive communication triggered by the cluster). I see many "old event" messages:

crmd: [7285]: info: process_graph_event: Detected action prm_O2CB:0_monitor_0 from a different transition: 1 vs. 1301
crmd: [7285]: info: abort_transition_graph: process_graph_event:476 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=prm_O2CB:0_last_0, magic=0:7;8:1:7:d6d69f10-a1da-4fde-9156-c2c3ea8be821, cib=0.1587.25) : Old event
[...]

Then bad things start:

crmd: [7285]: ERROR: lrm_get_rsc(673): failed to receive a reply message of getrsc.
crmd: [7285]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
[...]
crmd: [7285]: ERROR: get_lrm_resource: Could not add resource prm_t11_db_raid1 to LRM
crmd: [7285]: ERROR: do_lrm_invoke: Invalid resource definition

Hey folks: if crmd cannot talk to lrmd, how does it know the resource definition is invalid?

[Crazy things go on, until it changes to:]

crmd: [7285]: ERROR: verify_stopped: Resource prm_ping_gw1-v582:1 was active at shutdown. You may ignore this error if it is unmanaged.

Hey folks: why emit ERRORs and then say they may be ignored if the resource is unmanaged? Doesn't the cluster know whether it is unmanaged? These error messages are of little help!

crmd: [7285]: info: lrm_connection_destroy: LRM Connection disconnected
crmd: [7285]: info: do_lrm_control: Disconnected from the LRM

[Peace, hopefully?] Unfortunately not:

crmd: [7285]: WARN: register_fsa_input_adv: do_te_invoke stalled the FSA with pending inputs
[...]
crmd: [7285]: info: ais_dispatch_message: Membership 1764: quorum retained
crmd: [7285]: notice: crmd_peer_update: Status update: Client h06/crmd now has status [offline] (DC=true)

Is the cluster joking? The two-node cluster retains quorum while the DC is offline???
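(For context, "maintenance mode" here means the cluster-wide property set via the CRM shell. A sketch, assuming the standard crmsh commands on SLES11, not the exact invocation used:)

```shell
# Cluster-wide maintenance mode: resources keep running, but Pacemaker
# stops monitoring and managing them until the property is cleared.
crm configure property maintenance-mode=true

# One-shot status view; resources should now be flagged as unmanaged.
crm_mon -1

# Back to normal (managed) operation.
crm configure property maintenance-mode=false
```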
The bad fantasies continue:

crmd: [7285]: WARN: fail_incompletable_actions: Node h06 shutdown resulted in un-runnable actions
crmd: [7285]: info: ais_dispatch_message: Membership 1764: quorum retained
crmd: [7285]: notice: crmd_peer_update: Status update: Client h06/crmd now has status [online] (DC=true)

So the node went offline and back online in less than one second? Amazing: the node needs five minutes to boot.

crmd: [7285]: WARN: do_log: FSA: Input I_FAIL from get_lrm_resource() received in state S_FINALIZE_JOIN
crmd: [7285]: notice: do_state_transition: State transition S_FINALIZE_JOIN -> S_INTEGRATION [ input=I_FAIL cause=C_FSA_INTERNAL origin=get_lrm_resource ]
[...]
crmd: [7285]: info: handle_request: Current ping state: S_INTEGRATION
crmd: [7285]: WARN: do_log: FSA: Input I_FAIL from get_lrm_resource() received in state S_INTEGRATION
crmd: [7285]: info: abort_transition_graph: do_te_invoke:162 - Triggered transition abort (complete=0) : Peer Cancelled
[...]
crmd: [7285]: WARN: crmd_ha_msg_filter: Another DC detected: h06 (op=join_offer)
crmd: [7285]: notice: do_state_transition: State transition S_INTEGRATION -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=crmd_ha_msg_filter ]
[...]
crmd: [7285]: WARN: do_log: FSA: Input I_FAIL from get_lrm_resource() received in state S_ELECTION
crmd: [7285]: notice: do_state_transition: State transition S_ELECTION -> S_RELEASE_DC [ input=I_FAIL cause=C_FSA_INTERNAL origin=get_lrm_resource ]
[...]
crmd: [7285]: WARN: do_log: FSA: Input I_NODE_JOIN from route_message() received in state S_RELEASE_DC
[...]
crmd: [7285]: WARN: do_log: FSA: Input I_ELECTION_DC from do_dc_join_finalize() received in state S_RELEASE_DC
crmd: [7285]: ERROR: do_pe_invoke: Attempted to invoke the PE without a consistent copy of the CIB!
crmd: [7285]: WARN: do_log: FSA: Input I_JOIN_REQUEST from route_message() received in state S_RELEASE_DC
crmd: [7285]: WARN: do_log: FSA: Input I_JOIN_OFFER from route_message() received in state S_RELEASE_DC
crmd: [7285]: info: update_dc: Set DC to h02 (3.0.6)
crmd: [7285]: WARN: do_log: FSA: Input I_JOIN_OFFER from route_message() received in state S_RELEASE_DC
crmd: [7285]: WARN: do_log: FSA: Input I_JOIN_REQUEST from route_message() received in state S_RELEASE_DC
crmd: [7285]: WARN: do_log: FSA: Input I_JOIN_REQUEST from route_message() received in state S_RELEASE_DC
crmd: [7285]: info: do_election_count_vote: Election 3 (owner: h06) pass: vote from h06 (Uptime)
crmd: [7285]: ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped in state S_RELEASE_DC! (120000ms)
crmd: [7285]: WARN: do_log: FSA: Input I_ELECTION_DC from crm_timer_popped() received in state S_RELEASE_DC
[...]

The last messages repeat for days, even after the cluster was switched back to managed mode:

crmd: [7285]: ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped in state S_RELEASE_DC! (120000ms)
crmd: [7285]: WARN: do_log: FSA: Input I_ELECTION_DC from crm_timer_popped() received in state S_RELEASE_DC
crmd: [7285]: info: do_election_count_vote: Election 3226 (owner: rksaph06) pass: vote from rksaph06 (Uptime)

Regards,
Ulrich

>>> "Ulrich Windl" <[email protected]> wrote on 2013-06-05 at 14:25 in message <[email protected]>:
> Hi!
>
> When the cluster (SLES11 SP2, current) was put in maintenance mode, two
> messages repeat periodically, filling the log:
> ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped in
> state S_RELEASE_DC! (120000ms)
> WARN: do_log: FSA: Input I_ELECTION_DC from crm_timer_popped() received in
> state S_RELEASE_DC
>
> crmd: [7285]: info: handle_request: Current ping state: S_RELEASE_DC
>
> Is this expected, or is it yet another bug (pacemaker-1.1.7-0.13.9)?
>
> Regards,
> Ulrich
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
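P.S.: The scale of the flood is easy to check by counting the two repeating lines. A minimal sketch; the sample file below stands in for the real log, which on SLES is typically /var/log/messages:

```shell
# Two of each repeating message, standing in for days' worth of log.
cat > /tmp/crmd-sample.log <<'EOF'
crmd: [7285]: ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped in state S_RELEASE_DC! (120000ms)
crmd: [7285]: WARN: do_log: FSA: Input I_ELECTION_DC from crm_timer_popped() received in state S_RELEASE_DC
crmd: [7285]: ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped in state S_RELEASE_DC! (120000ms)
crmd: [7285]: WARN: do_log: FSA: Input I_ELECTION_DC from crm_timer_popped() received in state S_RELEASE_DC
EOF

# Count each repeating message (prints 2 and 2 for this sample).
grep -c 'Election Timeout (I_ELECTION_DC)' /tmp/crmd-sample.log
grep -c 'Input I_ELECTION_DC from crm_timer_popped' /tmp/crmd-sample.log
```

The numbers above are at least internally consistent: with the 120000 ms election timeout, the jump from election 3 to election 3226 works out to roughly 3223 × 2 min ≈ 4.5 days of repetition.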
