> On 21 Apr 2015, at 6:24 am, Rolf Weber <rolf.we...@asamnet.de> wrote:
>
> Hi all!
>
> I encountered some strange behaviour of the PEngine Recheck Timer while in
> maintenance mode, ending in a reboot.
>
> The setup is a 2-node cluster with one resource group consisting of several
> DRBD devices and filesystems that are exported via NFS and iSCSI on a
> roaming IP (see the attached crm_report output for details).
>
> I put the cluster in maintenance mode using:
>
>   crm configure property maintenance-mode=true
>
> The cluster ended up in unmanaged mode, as expected:
>
> 109 Apr 19 23:01:50 astorage1 crmd: [4150]: info: do_te_invoke: Processing graph 473 (ref=pe_calc-dc-1429477310-1155) derived from /var/lib/pengine/pe-input-688.bz2
> ...
> 112 Apr 19 23:01:50 astorage1 lrmd: [4147]: info: cancel_op: operation monitor[43] on astorage2-fencing for client 4150, its parameters: hostname=[astorage2] userid=[...] CRM_meta_timeout=[20000] CRM_meta_name=[monitor] passwd=[...] crm_feature_set=[3.0.6] ipaddr=[10.10.30.22] CRM_meta_interval=[60000] cancelled
> ...
> 168 Apr 19 23:01:51 astorage1 crmd: [4150]: notice: run_graph: ==== Transition 473 (Complete=31, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-688.bz2): Complete
>
> At that point I started to do some work on the resources.
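(Side note for anyone hitting this thread later: before doing that kind of work it is worth double-checking that maintenance mode actually took effect. A quick sanity check, illustrative only, and the exact output differs between 1.1.x releases:

  # query the cluster property, should report "true"
  crm_attribute -n maintenance-mode -G
  # one-shot status, every resource should be flagged as unmanaged
  crm_mon -1

)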
What exactly were you doing at that point? I ask because:

Apr 19 23:16:51 astorage1 crmd: [4150]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
Apr 19 23:16:51 astorage1 crmd: [4150]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
Apr 19 23:16:51 astorage1 crmd: [4150]: ERROR: lrm_add_rsc(870): failed to send a addrsc message to lrmd via ch_cmd channel.
Apr 19 23:16:51 astorage1 crmd: [4150]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.

and

Apr 19 23:16:52 astorage2 crmd: [4018]: CRIT: lrm_connection_destroy: LRM Connection failed
Apr 19 23:16:52 astorage2 crmd: [4018]: info: lrm_connection_destroy: LRM Connection disconnected

suggest that the lrmd processes on both machines crashed or failed.

I would definitely recommend an upgrade from 1.1.7.

>
> However, about 15 minutes later:
>
> 191 Apr 19 23:16:51 astorage1 crmd: [4150]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)
> 192 Apr 19 23:16:51 astorage1 crmd: [4150]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
> 193 Apr 19 23:16:51 astorage1 crmd: [4150]: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
>
> The PEngine Recheck Timer triggered and tried to cancel the monitors again; this appears to have worked for all monitors except:
>
> 207 Apr 19 23:16:51 astorage1 crmd: [4150]: info: te_rsc_command: Initiating action 4: cancel astorage2-fencing_monitor_60000 on astorage1 (local)
> 208 Apr 19 23:16:51 astorage1 crmd: [4150]: info: cancel_op: No pending op found for astorage2-fencing:43
> 209 Apr 19 23:16:51 astorage1 lrmd: [4147]: info: on_msg_cancel_op: no operation with id 43
> ...
> 227 Apr 19 23:16:51 astorage1 crmd: [4150]: ERROR: do_lrm_invoke: Invalid resource definition
>
> On the other node:
>
> 108 Apr 19 23:16:52 astorage2 lrmd: [4015]: info: on_msg_cancel_op: no operation with id 43
> 109 Apr 19 23:16:52 astorage2 crmd: [4018]: CRIT: lrm_connection_destroy: LRM Connection failed
> 110 Apr 19 23:16:52 astorage2 crmd: [4018]: info: lrm_connection_destroy: LRM Connection disconnected
> 111 Apr 19 23:16:52 astorage2 crmd: [4018]: ERROR: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_NOT_DC
>
> crmd was not too happy about that; it cleaned up and exited with rc=2.
> As I have 'crm on' in ha.cf, heartbeat rebooted the system (both nodes).
>
> As a workaround I have (for now) disabled the PEngine Recheck Timer:
>
>   crm_attribute -n cluster-recheck-interval -v 0
>
> IMHO the PEngine Recheck Timer should not take any action while in
> maintenance mode, the same way regular action timers' timeouts are ignored.
> Additionally, I don't quite understand why crmd got so confused by the
> missing op id 43 that it decided to quit (lrmd appears not to have crashed).
>
> Was my assumption about that timer wrong, or is this indeed unexpected
> behaviour?
>
> Please CC replies to my email address; I am not subscribed to the list.
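As for the workaround: zeroing cluster-recheck-interval works as a stop-gap, but note that the same timer also drives re-evaluation of failure-timeout and time-based rules, so you lose those until it is re-enabled. Once you are on a fixed version, something like this should restore it (untested here; 15min is the documented default):

  # put the recheck interval back to the shipped default
  crm_attribute -n cluster-recheck-interval -v 15min
  # ...or delete the override so the built-in default applies again
  crm_attribute -n cluster-recheck-interval -D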
>
> Greets,
> Rolf Weber
>
> <reboot-while-in-maintenance-mode.tbz2>