> On 21 Apr 2015, at 6:24 am, Rolf Weber <rolf.we...@asamnet.de> wrote:
>
> Hi all!
>
> I encountered some strange behaviour of the PEngine Recheck Timer while in
> maintenance mode, ending in a reboot.
>
> The setup is a 2-node cluster with one resource group consisting of several
> DRBD devices and filesystems that are exported via NFS and iSCSI on a
> roaming IP (see the attached crm_report output for details).
>
> I put the cluster in maintenance mode using:
>
>   crm configure property maintenance-mode=true
>
> The cluster ended up in unmanaged mode, as expected:
>
> 109 Apr 19 23:01:50 astorage1 crmd: [4150]: info: do_te_invoke: Processing graph 473 (ref=pe_calc-dc-1429477310-1155) derived from /var/lib/pengine/pe-input-688.bz2
> ...
> 112 Apr 19 23:01:50 astorage1 lrmd: [4147]: info: cancel_op: operation monitor[43] on astorage2-fencing for client 4150, its parameters: hostname=[astorage2] userid=[...] CRM_meta_timeout=[20000] CRM_meta_name=[monitor] passwd=[...] crm_feature_set=[3.0.6] ipaddr=[10.10.30.22] CRM_meta_interval=[60000] cancelled
> ...
> 168 Apr 19 23:01:51 astorage1 crmd: [4150]: notice: run_graph: ==== Transition 473 (Complete=31, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-688.bz2): Complete
>
> At that point I started to do some work on the resources.
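(Side note for anyone hitting this thread later: before doing that kind of work it is worth double-checking that maintenance mode actually took effect. A quick sanity check, illustrative only, and the exact output differs between 1.1.x releases:

  # query the cluster property, should report "true"
  crm_attribute -n maintenance-mode -G
  # one-shot status, every resource should be flagged as unmanaged
  crm_mon -1

)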
What exactly were you doing at that point? I ask because:

Apr 19 23:16:51 astorage1 crmd: [4150]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
Apr 19 23:16:51 astorage1 crmd: [4150]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
Apr 19 23:16:51 astorage1 crmd: [4150]: ERROR: lrm_add_rsc(870): failed to send a addrsc message to lrmd via ch_cmd channel.
Apr 19 23:16:51 astorage1 crmd: [4150]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.

and

Apr 19 23:16:52 astorage2 crmd: [4018]: CRIT: lrm_connection_destroy: LRM Connection failed
Apr 19 23:16:52 astorage2 crmd: [4018]: info: lrm_connection_destroy: LRM Connection disconnected

suggest that the lrmd processes on both machines crashed or failed.

I would definitely recommend an upgrade from 1.1.7.

>
> However, about 15 minutes later:
>
> 191 Apr 19 23:16:51 astorage1 crmd: [4150]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)
> 192 Apr 19 23:16:51 astorage1 crmd: [4150]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
> 193 Apr 19 23:16:51 astorage1 crmd: [4150]: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
>
> The PEngine Recheck Timer triggered and tried to cancel the monitors again; this appears to have worked for all monitors except:
>
> 207 Apr 19 23:16:51 astorage1 crmd: [4150]: info: te_rsc_command: Initiating action 4: cancel astorage2-fencing_monitor_60000 on astorage1 (local)
> 208 Apr 19 23:16:51 astorage1 crmd: [4150]: info: cancel_op: No pending op found for astorage2-fencing:43
> 209 Apr 19 23:16:51 astorage1 lrmd: [4147]: info: on_msg_cancel_op: no operation with id 43
> ...
> 227 Apr 19 23:16:51 astorage1 crmd: [4150]: ERROR: do_lrm_invoke: Invalid resource definition
>
> On the other node:
>
> 108 Apr 19 23:16:52 astorage2 lrmd: [4015]: info: on_msg_cancel_op: no operation with id 43
> 109 Apr 19 23:16:52 astorage2 crmd: [4018]: CRIT: lrm_connection_destroy: LRM Connection failed
> 110 Apr 19 23:16:52 astorage2 crmd: [4018]: info: lrm_connection_destroy: LRM Connection disconnected
> 111 Apr 19 23:16:52 astorage2 crmd: [4018]: ERROR: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_NOT_DC
>
> crmd was not too happy about that; it cleaned up and exited with rc=2.
> As I have 'crm on' in ha.cf, heartbeat rebooted the system (both nodes).
>
> As a workaround I have (for now) disabled the PEngine Recheck Timer:
>
>   crm_attribute -n cluster-recheck-interval -v 0
>
> IMHO the PEngine Recheck Timer should not take any action while in
> maintenance mode, the same way regular action timers' timeouts are ignored.
> Additionally, I don't quite understand why crmd got so confused by the
> missing op id 43 that it decided to quit (lrmd appears not to have crashed).
>
> Was my assumption about that timer wrong, or is this indeed unexpected
> behaviour?
>
> Please CC replies to my email address; I am not subscribed to the list.
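As for the workaround: zeroing cluster-recheck-interval works as a stop-gap, but note that the same timer also drives re-evaluation of failure-timeout and time-based rules, so you lose those until it is re-enabled. Once you are on a fixed version, something like this should restore it (untested here; 15min is the documented default):

  # put the recheck interval back to the shipped default
  crm_attribute -n cluster-recheck-interval -v 15min
  # ...or delete the override so the built-in default applies again
  crm_attribute -n cluster-recheck-interval -D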
>
> Greets,
> Rolf Weber
>
> <reboot-while-in-maintenance-mode.tbz2>