Re: [ClusterLabs] Pacemaker reload Master/Slave resource

2016-06-06 Thread Felix Zachlod (Lists)
> -----Original Message-----
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: Monday, June 6, 2016 23:31
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Pacemaker reload Master/Slave resource
> 
> I think it depends on your point of view :)
> 
> Reload is implemented as an alternative to stop-then-start. For m/s
> clones, start leaves the resource in slave state.

I actually thought it should reconfigure the resource WITHOUT restarting it, and in 
my opinion this should be agnostic of the slave/master state of the resource: just 
pass the new parameters so that it is reconfigured. But ACTUALLY this works just 
fine. The weird behavior I observed and described here was caused by my resource 
agent setting the master scores wrongly, or rather always in the same way (no matter 
whether it was running as slave or master). I had understood the master score as a 
score that merely helps the cluster manager decide WHERE to run the resource, so I 
did not see why it necessarily has to differ between the two states. But obviously 
it does a bit more than that. After adjusting the master scores so that the current 
master gets a higher preference, the problem went away. I can now reload my resource 
no matter whether it is slave or master, and it will NEITHER stop/start NOR 
demote/stop/start/promote, but just call reload() (whatever that actually does 
internally), which is just what I imagined and, I think, the most meaningful 
behavior. Reload returns rc 0 and the cluster manager happily assumes the resource 
is still master (if it was before).
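
For anyone reading this later, here is a minimal sketch of what "giving the current
master a higher preference" can look like in the monitor action. This is an
illustration, not my actual agent: resource_is_active and resource_is_master are
hypothetical placeholders for the agent's own state checks, and the 100/50 values
are arbitrary.

: ${OCF_FUNCTIONS_DIR:=${OCF_ROOT:-/usr/lib/ocf}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

my_monitor() {
    if ! resource_is_active; then
        # Not running on this node: remove our master score entirely.
        crm_master -l reboot -D
        return $OCF_NOT_RUNNING
    fi
    if resource_is_master; then
        # Currently master: report a higher score so this node keeps the role.
        crm_master -l reboot -v 100
        return $OCF_RUNNING_MASTER
    fi
    # Healthy slave: eligible for promotion, but with a lower score than the master.
    crm_master -l reboot -v 50
    return $OCF_SUCCESS
}

The exact numbers do not matter; the point is only that the instance currently
running as master reports a higher score than the slaves.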

regards, Felix

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker reload Master/Slave resource

2016-06-06 Thread Ken Gaillot
On 05/20/2016 06:20 AM, Felix Zachlod (Lists) wrote:
> version 1.1.13-10.el7_2.2-44eb2dd
> 
> Hello!
> 
> I am currently developing a master/slave resource agent. So far it is working 
> just fine, but this resource agent implements reload(), and that does not work 
> as expected when running as master: the reload action is invoked and succeeds, 
> returning 0; the resource is still master, and monitor will return 
> $OCF_RUNNING_MASTER.
> 
> But Pacemaker considers the instance to be a slave afterwards. Only reload is 
> actually invoked, no monitor, no demote, etc.
> 
> I first thought that reload should perhaps return $OCF_RUNNING_MASTER too, but 
> that leads to the resource failing on reload. It seems 0 is the only valid 
> return code.
> 
> I can recover the cluster state by running "resource $resourcename promote", 
> which will call
> 
> notify
> promote
> notify
> 
> Afterwards my resource is considered master again. Alternatively, once the 
> PEngine Recheck Timer (I_PE_CALC) pops (90ms), the cluster manager will 
> promote the resource itself.
> But this can lead to unexpected results: it could promote the resource on the 
> wrong node, so that both sides are actually running as master, and the cluster 
> will not even notice, since it does not call monitor either.
> 
> Is this a bug?
> 
> regards, Felix

I think it depends on your point of view :)

Reload is implemented as an alternative to stop-then-start. For m/s
clones, start leaves the resource in slave state.

So on the one hand, it makes sense that Pacemaker would expect a m/s
reload to end up in slave state, regardless of the initial state, since
it should be equivalent to stop-then-start.

On the other hand, you could argue that a reload for a master should
logically be an alternative to demote-stop-start-promote.

On the third hand ;) you could argue that reload is ambiguous for master
resources and thus shouldn't be supported at all.
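
In agent terms, the second interpretation would mean a reload that re-applies the
changed parameters to the running service and leaves the current role untouched.
A purely hypothetical sketch (apply_runtime_config stands in for whatever the agent
would actually do):

my_reload() {
    # Re-apply the changed Pacemaker-side parameter(s) in place;
    # no demote/stop/start/promote, the role stays whatever it was.
    apply_runtime_config "${OCF_RESKEY_example_param}" || return $OCF_ERR_GENERIC
    # Reload has to return 0 (OCF_SUCCESS) even when running as master;
    # as noted above, returning OCF_RUNNING_MASTER is treated as a failure.
    return $OCF_SUCCESS
}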

Feel free to open a feature request at http://bugs.clusterlabs.org/ to
say how you think it should work.

As an aside, I think the current implementation of reload in pacemaker
is unsatisfactory for two reasons:

* Using the "unique" attribute to determine whether a parameter is
reloadable was a bad idea (a meta-data sketch of the convention follows
the second point below). For example, the location of a daemon binary
is generally set to unique=0, which is sensible in that multiple RA
instances can use the same binary, but a reload could not actually handle
a change to it. In practice it is not a problem only because no one ever
changes that.

* There is a fundamental misunderstanding between pacemaker and most RA
developers as to what reload means. Pacemaker uses the reload action to
make parameter changes in the resource's *pacemaker* configuration take
effect, but RA developers tend to use it to reload the service's own
configuration files (a more natural interpretation, but completely
different from how pacemaker uses it).
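
To make the first point above concrete, this is roughly what the current convention
looks like in an agent's meta-data; a hypothetical excerpt, not taken from any real
agent. Because the agent advertises a reload action, a change to any parameter
marked unique="0" is treated as reloadable, including a binary path that a running
daemon could never actually pick up:

meta_data() {
    cat <<END
<?xml version="1.0"?>
<resource-agent name="demo" version="1.0">
  <version>1.0</version>
  <shortdesc lang="en">Demo master/slave agent</shortdesc>
  <parameters>
    <!-- unique="0": a change here is considered reloadable, even though
         a running daemon cannot switch to a different binary -->
    <parameter name="binary" unique="0">
      <content type="string" default="/usr/sbin/demod"/>
    </parameter>
  </parameters>
  <actions>
    <action name="start"     timeout="20s"/>
    <action name="stop"      timeout="20s"/>
    <action name="monitor"   timeout="20s" interval="10s" role="Slave"/>
    <action name="monitor"   timeout="20s" interval="11s" role="Master"/>
    <action name="promote"   timeout="20s"/>
    <action name="demote"    timeout="20s"/>
    <action name="reload"    timeout="20s"/>
    <action name="meta-data" timeout="5s"/>
  </actions>
</resource-agent>
END
}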

> trace   May 20 12:58:31 cib_create_op(609):0: Sending call options: 0010, 
> 1048576
> trace   May 20 12:58:31 cib_native_perform_op_delegate(384):0: Sending 
> cib_modify message to CIB service (timeout=120s)
> trace   May 20 12:58:31 crm_ipc_send(1175):0: Sending from client: cib_shm 
> request id: 745 bytes: 1070 timeout:12 msg...
> trace   May 20 12:58:31 crm_ipc_send(1188):0: Message sent, not waiting for 
> reply to 745 from cib_shm to 1070 bytes...
> trace   May 20 12:58:31 cib_native_perform_op_delegate(395):0: Reply: No data 
> to dump as XML
> trace   May 20 12:58:31 cib_native_perform_op_delegate(398):0: Async call, 
> returning 268
> trace   May 20 12:58:31 do_update_resource(2274):0: Sent resource state 
> update message: 268 for reload=0 on scst_dg_ssd
> trace   May 20 12:58:31 cib_client_register_callback_full(606):0: Adding 
> callback cib_rsc_callback for call 268
> trace   May 20 12:58:31 process_lrm_event(2374):0: Op scst_dg_ssd_reload_0 
> (call=449, stop-id=scst_dg_ssd:449, remaining=3): Confirmed
> notice  May 20 12:58:31 process_lrm_event(2392):0: Operation 
> scst_dg_ssd_reload_0: ok (node=alpha, call=449, rc=0, cib-update=268, 
> confirmed=true)
> debug   May 20 12:58:31 update_history_cache(196):0: Updating history for 
> 'scst_dg_ssd' with reload op
> trace   May 20 12:58:31 crm_ipc_read(992):0: No message from lrmd received: 
> Resource temporarily unavailable
> trace   May 20 12:58:31 mainloop_gio_callback(654):0: Message acquisition 
> from lrmd[0x22b0ec0] failed: No message of desired type (-42)
> trace   May 20 12:58:31 crm_fsa_trigger(293):0: Invoked (queue len: 0)
> trace   May 20 12:58:31 s_crmd_fsa(159):0: FSA invoked with Cause: 
> C_FSA_INTERNAL   State: S_NOT_DC
> trace   May 20 12:58:31 s_crmd_fsa(246):0: Exiting the FSA
> trace   May 20 12:58:31 crm_fsa_trigger(295):0: Exited  (queue len: 0)
> trace   May 20 12:58:31 crm_ipc_read(989):0: Received cib_shm event 2108, 
> size=183, rc=183, text:  cib_callid="268" cib_clientid="60010689-7350-4916-a7bd-bd85ff
> trace   May 20 12:58:31 mainloop_gio_callback(659):0: New message from 
> cib_shm[0x23b7ab0] 

[ClusterLabs] Pacemaker reload Master/Slave resource

2016-05-20 Thread Felix Zachlod (Lists)
version 1.1.13-10.el7_2.2-44eb2dd

Hello!

I am currently developing a master/slave resource agent. So far it is working 
just fine, but this resource agent implements reload(), and that does not work 
as expected when running as master: the reload action is invoked and succeeds, 
returning 0; the resource is still master, and monitor will return 
$OCF_RUNNING_MASTER.

But Pacemaker considers the instance to be a slave afterwards. Only reload is 
actually invoked, no monitor, no demote, etc.

I first thought that reload should perhaps return $OCF_RUNNING_MASTER too, but 
that leads to the resource failing on reload. It seems 0 is the only valid 
return code.

I can recover the cluster state by running "resource $resourcename promote", 
which will call

notify
promote
notify

Afterwards my resource is considered master again. Alternatively, once the 
PEngine Recheck Timer (I_PE_CALC) pops (90ms), the cluster manager will 
promote the resource itself.
But this can lead to unexpected results: it could promote the resource on the 
wrong node, so that both sides are actually running as master, and the cluster 
will not even notice, since it does not call monitor either.

Is this a bug?

regards, Felix


trace   May 20 12:58:31 cib_create_op(609):0: Sending call options: 0010, 
1048576
trace   May 20 12:58:31 cib_native_perform_op_delegate(384):0: Sending 
cib_modify message to CIB service (timeout=120s)
trace   May 20 12:58:31 crm_ipc_send(1175):0: Sending from client: cib_shm 
request id: 745 bytes: 1070 timeout:12 msg...
trace   May 20 12:58:31 crm_ipc_send(1188):0: Message sent, not waiting for 
reply to 745 from cib_shm to 1070 bytes...
trace   May 20 12:58:31 cib_native_perform_op_delegate(395):0: Reply: No data 
to dump as XML
trace   May 20 12:58:31 cib_native_perform_op_delegate(398):0: Async call, 
returning 268
trace   May 20 12:58:31 do_update_resource(2274):0: Sent resource state update 
message: 268 for reload=0 on scst_dg_ssd
trace   May 20 12:58:31 cib_client_register_callback_full(606):0: Adding 
callback cib_rsc_callback for call 268
trace   May 20 12:58:31 process_lrm_event(2374):0: Op scst_dg_ssd_reload_0 
(call=449, stop-id=scst_dg_ssd:449, remaining=3): Confirmed
notice  May 20 12:58:31 process_lrm_event(2392):0: Operation 
scst_dg_ssd_reload_0: ok (node=alpha, call=449, rc=0, cib-update=268, 
confirmed=true)
debug   May 20 12:58:31 update_history_cache(196):0: Updating history for 
'scst_dg_ssd' with reload op
trace   May 20 12:58:31 crm_ipc_read(992):0: No message from lrmd received: 
Resource temporarily unavailable
trace   May 20 12:58:31 mainloop_gio_callback(654):0: Message acquisition from 
lrmd[0x22b0ec0] failed: No message of desired type (-42)
trace   May 20 12:58:31 crm_fsa_trigger(293):0: Invoked (queue len: 0)
trace   May 20 12:58:31 s_crmd_fsa(159):0: FSA invoked with Cause: 
C_FSA_INTERNAL   State: S_NOT_DC
trace   May 20 12:58:31 s_crmd_fsa(246):0: Exiting the FSA
trace   May 20 12:58:31 crm_fsa_trigger(295):0: Exited  (queue len: 0)
trace   May 20 12:58:31 crm_ipc_read(989):0: Received cib_shm event 2108, 
size=183, rc=183, text:  cib_callid="268" cib_clientid="60010689-7350-4916-a7bd-bd85ff
trace   May 20 12:58:31 mainloop_gio_callback(659):0: New message from 
cib_shm[0x23b7ab0]

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org