Thank you, Andrew. Sorry for misspelling your name in the previous email.

On 2015/05/05 14:25, Andrew Beekhof wrote:
>> On 5 May 2015, at 2:31 pm, Zhou Zheng Sheng / 周征晟 <[email protected]> 
>> wrote:
>>
>> Thank you Bogdan for clarifying the Pacemaker promotion process for me.
>>
>> On 2015/05/05 10:32, Andrew Beekhof wrote:
>>>> On 29 Apr 2015, at 5:38 pm, Zhou Zheng Sheng / 周征晟 
>>>> <[email protected]> wrote:
>>> [snip]
>>>
>>>> Batch is a pacemaker concept I found when I was reading its
>>>> documentation and code. There is a "batch-limit: 30" in the output of
>>>> "pcs property list --all". The pacemaker official documentation
>>>> explanation is that it's "The number of jobs that the TE is allowed to
>>>> execute in parallel." From my understanding, pacemaker maintains cluster
>>>> states, and when we start/stop/promote/demote a resource, it triggers a
>>>> state transition. Pacemaker puts as many as possible transition jobs
>>>> into a batch, and process them in parallel.
>>> Technically it calculates an ordered graph of actions that need to be 
>>> performed for a set of related resources.
>>> You can see an example of the kinds of graphs it produces at:
>>>
>>>   
>>> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/s-config-testing-changes.html
>>>
>>> There is a more complex one which includes promotion and demotion on the 
>>> next page.
>>>
>>> The number of actions that can run at any one time is therefore limited by
>>> - the value of batch-limit (the total number of in-flight actions)
>>> - the number of resources that do not have ordering constraints between 
>>> them (eg. rsc{1,2,3} in the above example)  
>>>
>>> So in the above example, if batch-limit >= 3, the monitor_0 actions will 
>>> still all execute in parallel.
>>> If batch-limit == 2, one of them will be deferred until the others complete.
>>>
>>> Processing of the graph stops the moment any action returns a value that 
>>> was not expected.
>>> If that happens, we wait for currently in-flight actions to complete, 
>>> re-calculate a new graph based on the new information and start again.
>> So can I infer the following statement? In a big cluster with many
>> resources, chances are some resource agent actions return unexpected
>> values,
> The size of the cluster shouldn’t increase the chance of this happening 
> unless you’ve set the timeouts too aggressively.

If there are many types of resource agents and any one of them is not
well written, it might cause trouble, right?
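
For example, I guess we could at least exercise each customized agent
with ocf-tester before putting it under Pacemaker control. A rough
sketch (the agent path and the parameter below are just placeholders
for our environment, not the real Fuel agent invocation):

    # Run the standard OCF compliance checks against a single agent.
    # Adjust the name, parameters and path to the agent under test.
    ocf-tester -n test-mysql \
        -o socket=/var/run/mysqld/mysqld.sock \
        /usr/lib/ocf/resource.d/fuel/mysql-wss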

>> and if any of the in-flight actions has a long timeout, it would
>> block pacemaker from re-calculating a new transition graph?
> Yes, but it's actually an argument for making the timeouts longer, not shorter.
> Setting the timeouts too aggressively actually increases downtime because of 
> all the extra delays and recovery it induces.
> So set them to be long enough that there is unquestionably a problem if you 
> hit them.
>
> But we absolutely recognise that starting/stopping a database can take a very 
> long time comparatively and that it shouldn’t block recovery of other 
> unrelated services.
> I would expect to see this land in Pacemaker 1.1.14

It will be great to see this in Pacemaker 1.1.14. In my experience with
Pacemaker, customized resource agents are possibly the weakest part, so
this feature should improve the handling of resource action timeouts.
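
Following this advice, I will review our operation timeouts and make
them generously long rather than short. If I have the pcs syntax right,
something like the following should do it (the resource name and the
values are only placeholders for our deployment):

    # Give the database ample time to start and stop, so that hitting a
    # timeout really means something is wrong rather than merely slow.
    pcs resource update p_mysql \
        op start timeout=450s \
        op stop timeout=300s

    # Cluster-wide fallback for operations that do not set their own timeout.
    pcs resource op defaults timeout=120s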

>> I see the
>> current batch-limit is 30 and I tried to increase it to 100, but did not
>> help.
> Correct.  It only puts an upper limit on the number of in-flight actions;
> actions still need to wait for all their dependants to complete before
> executing.
>
>> I'm sure that the cloned MySQL Galera resource is not related to
>> master-slave RabbitMQ resource. I don't find any dependency, order or
>> rule connecting them in the cluster deployed by Fuel [1].
> In general it should not have needed to wait, but if you send me a crm_report 
> covering the period you’re talking about I’ll be able to comment specifically 
> about the behaviour you saw.

That is very kind of you, thank you. I have uploaded the file generated
by crm_report to Google Drive.

https://drive.google.com/file/d/0B_vDkYRYHPSIZ29NdzV3NXotYU0/view?usp=sharing
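
In case it helps, the archive was collected with crm_report over the
time window of the reassembly, roughly like this (the exact times below
are placeholders, not the precise window I used):

    # Gather logs, the CIB and PE inputs from all nodes for the window
    # that covers the full reassembly after the power failure.
    crm_report -f "2015-04-29 17:00" -t "2015-04-29 18:00" \
        /tmp/crm_report_reassembly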

>> Is there anything I can do to make sure all the resource actions return
>> expected values during a full reassembly?
> In general, if we say ‘start’, do your best to start or return ‘0’ if you 
> already were started.
> Likewise for stop.
>
> Otherwise its really specific to your agent.
> For example, an IP resource just needs to add itself to an interface - it
> can't do much differently; if it times out then the system must be very,
> very busy.
>
> The only other thing I would say is:
> - avoid blocking calls where possible
> - have empathy for the machine (do as little as is needed)
>

+1 for the empathy :)
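
To make the "return 0 if you were already started" rule concrete for
myself, here is roughly the shape I think a well behaved start action
should have. This is a simplified sketch, not taken from any real
agent; the daemon, pidfile and helper names are placeholders, and
OCF_SUCCESS / OCF_ERR_GENERIC / ocf_log come from ocf-shellfuncs:

    #!/bin/sh
    # Minimal sketch of an idempotent OCF "start" action (illustrative only).
    : ${OCF_FUNCTIONS_DIR:=${OCF_ROOT:-/usr/lib/ocf}/lib/heartbeat}
    . ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

    PIDFILE="/var/run/myservice.pid"    # placeholder path
    DAEMON="/usr/sbin/myservice"        # placeholder daemon

    my_monitor() {
        # "Running" here simply means the pidfile points at a live process.
        [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null
    }

    my_start() {
        # If we are already started, just report success.
        if my_monitor; then
            ocf_log info "myservice already running"
            return $OCF_SUCCESS
        fi
        # Do as little as needed and avoid blocking calls; the daemon
        # backgrounds itself and writes its own pidfile.
        "$DAEMON" --pidfile "$PIDFILE" || return $OCF_ERR_GENERIC
        # Wait a bounded time for it to come up, then let the recurring
        # monitor take over.
        for i in 1 2 3 4 5 6 7 8 9 10; do
            my_monitor && return $OCF_SUCCESS
            sleep 1
        done
        return $OCF_ERR_GENERIC
    }
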
>> Is it because node-1 and node-2
>> happen to boot faster than node-3 and form a cluster, so when node-3
>> joins, it triggers a new state transition? Or maybe it is because some
>> resources are already started, so pacemaker needs to stop them first?
> We only stop them if they shouldn’t yet be running (i.e. a colocation or
> ordering dependency is not yet started also).
>
>
>> Does setting
>> default-resource-stickiness to 1 help?
> From 0 or INFINITY?

From 0 to 1. Is that enough to prevent resources from being moved when
some nodes recover from a power failure?
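
Concretely, I was thinking of something like this, if I have the pcs
syntax right (the value is just an example):

    # Give every resource a small preference for staying where it is, so
    # a node that recovers does not pull resources back just to rebalance.
    pcs resource defaults resource-stickiness=1

    # Show the resulting resource defaults.
    pcs resource defaults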

>> I also tried "crm history XXX" commands in a live and correct cluster,
> I’m not familiar with that tool anymore.
>
>> but didn't find much information. I can see there are many log entries
>> like "run_graph: Transition 7108 ...". Next I'll inspect the pacemaker
>> log to see which resource action returns an unexpected value or what
>> triggers a new state transition.
>>
>> [1] http://paste.openstack.org/show/214919/
> I’d not recommend mixing the two CLI tools.
>
>>>> The problem is that pacemaker can only promote a resource after it
>>>> detects the resource is started.
>>> First we do a non-recurring monitor (*_monitor_0) to check what state the 
>>> resource is in.
>>> We can’t assume it’s off because a) we might have crashed, b) the admin
>>> might have accidentally configured it to start at boot or c) the admin may 
>>> have asked us to re-check everything.
>>>
>>>> During a full reassembly, in the first
>>>> transition batch, pacemaker starts all the resources including MySQL and
>>>> RabbitMQ. Pacemaker issues the resource agent "start" invocations in
>>>> parallel and reaps the results.
>>>>
>>>> For a multi-state resource agent like RabbitMQ, pacemaker needs the
>>>> start result reported in the first batch, then the transition engine and
>>>> policy engine decide whether it has to retry starting or promote, and
>>>> put this new transition job into a new batch.
>>> Also important to know, the order of actions is:
>>>
>>> 1. any necessary demotions
>>> 2. any necessary stops
>>> 3. any necessary starts
>>> 4. any necessary promotions
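
That ordering makes sense. If I read the OCF conventions correctly, the
probe/monitor of a multi-state agent such as our RabbitMQ one reports
its role through the return code, roughly like the sketch below. The
rabbitmqctl check and the is_promoted_node helper are placeholders for
illustration, not the real Fuel agent:

    # Return codes the cluster expects from "monitor" on a multi-state
    # resource (values provided by ocf-shellfuncs):
    #   OCF_SUCCESS (0)        - running as slave
    #   OCF_RUNNING_MASTER (8) - running as master
    #   OCF_NOT_RUNNING (7)    - cleanly stopped
    my_monitor() {
        if ! rabbitmqctl status >/dev/null 2>&1; then
            return $OCF_NOT_RUNNING
        fi
        if is_promoted_node; then   # placeholder for the real master check
            return $OCF_RUNNING_MASTER
        fi
        return $OCF_SUCCESS
    }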
>>>
>>>
>>>
>> -- 
>> Best wishes!
>> Zhou Zheng Sheng / 周征晟  Software Engineer
>> Beijing AWcloud Software Co., Ltd.
>>
>>
>>
>>
>



__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [email protected]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
