> On 5 May 2015, at 9:30 pm, Zhou Zheng Sheng / 周征晟 <[email protected]> 
> wrote:
> 
> Thank you Andrew. Sorry for misspelling your name in the previous email.
> 
> on 2015/05/05 14:25, Andrew Beekhof wrote:
>>> On 5 May 2015, at 2:31 pm, Zhou Zheng Sheng / 周征晟 <[email protected]> 
>>> wrote:
>>> 
>>> Thank you Bogdan for clarifying the pacemaker promotion process for me.
>>> 
>>> on 2015/05/05 10:32, Andrew Beekhof wrote:
>>>>> On 29 Apr 2015, at 5:38 pm, Zhou Zheng Sheng / 周征晟 
>>>>> <[email protected]> wrote:
>>>> [snip]
>>>> 
>>>>> Batch is a pacemaker concept I found when I was reading its
>>>>> documentation and code. There is a "batch-limit: 30" in the output of
>>>>> "pcs property list --all". The official pacemaker documentation
>>>>> explains that it is "The number of jobs that the TE is allowed to
>>>>> execute in parallel." From my understanding, pacemaker maintains cluster
>>>>> states, and when we start/stop/promote/demote a resource, it triggers a
>>>>> state transition. Pacemaker puts as many transition jobs as possible
>>>>> into a batch and processes them in parallel.
>>>> Technically it calculates an ordered graph of actions that need to be 
>>>> performed for a set of related resources.
>>>> You can see an example of the kinds of graphs it produces at:
>>>> 
>>>>  
>>>> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/s-config-testing-changes.html
>>>> 
>>>> There is a more complex one which includes promotion and demotion on the 
>>>> next page.
>>>> 
>>>> The number of actions that can run at any one time is therefore limited by
>>>> - the value of batch-limit (the total number of in-flight actions)
>>>> - the number of resources that do not have ordering constraints between 
>>>> them (e.g. rsc{1,2,3} in the above example)
>>>> 
>>>> So in the above example, if batch-limit >= 3, the monitor_0 actions will 
>>>> still all execute in parallel.
>>>> If batch-limit == 2, one of them will be deferred until the others 
>>>> complete.
>>>> 
>>>> Processing of the graph stops the moment any action returns a value that 
>>>> was not expected.
>>>> If that happens, we wait for currently in-flight actions to complete, 
>>>> re-calculate a new graph based on the new information and start again.
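For reference, the limit being described can be inspected and adjusted with pcs; the value below is purely illustrative, and raising it only widens the parallelism cap:

```shell
# Show the current limit on in-flight actions
pcs property list --all | grep batch-limit

# Raise the cap; this does not remove ordering dependencies
# between actions in the transition graph
pcs property set batch-limit=50
```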
>>> So can I infer the following statement? In a big cluster with many
>>> resources, chances are some resource agent actions return unexpected
>>> values,
>> The size of the cluster shouldn’t increase the chance of this happening 
>> unless you’ve set the timeouts too aggressively.
> 
> If there are many types of resource agents, and any one of them is not
> well written, it might cause trouble, right?

Yes, but really only for the things that depend on it.

For example if resources B, C, D, E all depend (in some way) on A, then their 
startup is going to be delayed.
But F, G, H and J will be able to start while we wait around for A to time out.
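A dependency chain like that is typically expressed with ordering (and often colocation) constraints. As a sketch with hypothetical resource IDs:

```shell
# B and C each start only after A; F has no constraint on A,
# so it can proceed while A's start is still pending
pcs constraint order start A then B
pcs constraint order start A then C
# ...and so on for D and E
```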

> 
>>> and if any in-flight action's timeout is long, it would
>>> block pacemaker from re-calculating a new transition graph?
>> Yes, but it’s actually an argument for making the timeouts longer, not 
>> shorter.
>> Setting the timeouts too aggressively actually increases downtime because of 
>> all the extra delays and recovery it induces.
>> So set them to be long enough that there is unquestionably a problem if you 
>> hit them.
>> 
>> But we absolutely recognise that starting/stopping a database can take a 
>> very long time comparatively and that it shouldn’t block recovery of other 
>> unrelated services.
>> I would expect to see this land in Pacemaker 1.1.14
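The "set them long enough that hitting them unquestionably means a problem" advice translates to per-operation timeouts; a hedged sketch with a hypothetical resource name and illustrative values:

```shell
# Give a database-like resource generous, honest timeouts so that
# hitting one indicates a real failure rather than a slow start
pcs resource update p_mysql op start timeout=300s op stop timeout=300s
```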
> 
> It will be great to see this in Pacemaker 1.1.14. From my experience
> using Pacemaker, I think customized resource agents are possibly the
> weakest part.

This is why we encourage people wanting new agents to get involved with the 
upstream resource-agents project :-)

> This feature should improve the handling for resource
> action timeouts.
> 
>>> I see the
>>> current batch-limit is 30 and I tried increasing it to 100, but it did
>>> not help.
>> Correct.  It only puts an upper limit on the number of in-flight actions; 
>> actions still need to wait for all their dependencies to complete before 
>> executing.
>> 
>>> I'm sure that the cloned MySQL Galera resource is not related to
>>> master-slave RabbitMQ resource. I don't find any dependency, order or
>>> rule connecting them in the cluster deployed by Fuel [1].
>> In general it should not have needed to wait, but if you send me a 
>> crm_report covering the period you’re talking about I’ll be able to comment 
>> specifically about the behaviour you saw.
> 
> You are very nice, thank you. I uploaded the file generated by
> crm_report to google drive.
> 
> https://drive.google.com/file/d/0B_vDkYRYHPSIZ29NdzV3NXotYU0/view?usp=sharing

Hmmm... there are no logs included here for some reason.
I suspect it’s a bug on my part. Can you apply this patch to report.collector on 
the machine you’re running crm_report from and retry?

   https://github.com/ClusterLabs/pacemaker/commit/96427ec


> 
>>> Is there anything I can do to make sure all the resource actions return
>>> expected values in a full reassembling?
>> In general, if we say ‘start’, do your best to start, or return ‘0’ if you 
>> were already started.
>> Likewise for stop.
>> 
>> Otherwise it’s really specific to your agent.
>> For example an IP resource just needs to add itself to an interface - it 
>> can’t do much differently; if it times out then the system must be very, very 
>> busy.
>> 
>> The only other thing I would say is:
>> - avoid blocking calls where possible
>> - have empathy for the machine (do as little as is needed)
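The idempotent start/stop advice above can be sketched as a minimal OCF-style agent fragment. This is only an illustration: the pidfile check stands in for a real service probe, and all names are hypothetical.

```shell
# Minimal sketch of idempotent start/stop for an OCF-style agent.
OCF_SUCCESS=0

PIDFILE="${PIDFILE:-/tmp/dummy_service.pid}"

service_is_running() {
    [ -f "$PIDFILE" ]
}

service_start() {
    # Already running? Report success rather than an error, so pacemaker
    # does not treat this as an unexpected result and recompute the graph.
    if service_is_running; then
        return $OCF_SUCCESS
    fi
    echo $$ > "$PIDFILE"    # stand-in for launching the real service
    return $OCF_SUCCESS
}

service_stop() {
    # Likewise, stopping an already-stopped service is a success.
    if service_is_running; then
        rm -f "$PIDFILE"
    fi
    return $OCF_SUCCESS
}
```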
>> 
> 
> +1 for the empathy :)
>>> Is it because node-1 and node-2
>>> happen to boot faster than node-3 and form a cluster, so when node-3 joins,
>>> it triggers a new state transition? Or maybe because some resources are
>>> already started, so pacemaker needs to stop them first?
>> We only stop them if they shouldn’t yet be running (i.e. a colocation or 
>> ordering dependency is not yet started also).
>> 
>> 
>>> Does setting
>>> default-resource-stickiness to 1 help?
>> From 0 or INFINITY?
> 
> From 0 to 1. Is it enough to prevent the resource from being moved
> when some nodes recover from a power failure?

From 0 it would help.
But potentially consider INFINITY if the only circumstance in which you want 
something moved is when the node is unavailable (either because it’s dead or in 
standby mode).
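As a sketch, assuming pcs syntax for cluster-wide resource defaults:

```shell
# Keep resources where they are unless the node becomes unavailable
pcs resource defaults resource-stickiness=INFINITY
```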

> 
>>> I also tried "crm history XXX" commands in a live and correct cluster,
>> I’m not familiar with that tool anymore.
>> 
>>> but didn't find much information. I can see there are many log entries
>>> like "run_graph: Transition 7108 ...". Next I'll inspect the pacemaker
>>> log to see which resource action returns the unexpected value or which
>>> thing triggers new state transition.
>>> 
>>> [1] http://paste.openstack.org/show/214919/
>> I’d not recommend mixing the two CLI tools.
>> 
>>>>> The problem is that pacemaker can only promote a resource after it
>>>>> detects the resource is started.
>>>> First we do a non-recurring monitor (*_monitor_0) to check what state the 
>>>> resource is in.
>>>> We can’t assume it’s off because a) we might have crashed, b) the admin 
>>>> might have accidentally configured it to start at boot, or c) the admin may 
>>>> have asked us to re-check everything.
>>>> 
>>>>> During a full reassemble, in the first
>>>>> transition batch, pacemaker starts all the resources including MySQL and
>>>>> RabbitMQ. Pacemaker issues resource agent "start" invocations in parallel
>>>>> and reaps the results.
>>>>> 
>>>>> For a multi-state resource agent like RabbitMQ, pacemaker needs the
>>>>> start result reported in the first batch; then the transition engine and
>>>>> policy engine decide whether it has to retry starting or promote, and put
>>>>> this new transition job into a new batch.
>>>> Also important to know, the order of actions is:
>>>> 
>>>> 1. any necessary demotions
>>>> 2. any necessary stops
>>>> 3. any necessary starts
>>>> 4. any necessary promotions
>>>> 
>>>> 
>>>> 
>>>> __________________________________________________________________________
>>>> OpenStack Development Mailing List (not for usage questions)
>>>> Unsubscribe: [email protected]?subject:unsubscribe
>>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>> -- 
>>> Best wishes!
>>> Zhou Zheng Sheng / 周征晟  Software Engineer
>>> Beijing AWcloud Software Co., Ltd.
>>> 
>>> 
>>> 
>>> 

