> On 5 May 2015, at 9:30 pm, Zhou Zheng Sheng / 周征晟 <[email protected]> wrote:
>
> Thank you Andrew. Sorry for misspelling your name in the previous email.
>
> on 2015/05/05 14:25, Andrew Beekhof wrote:
>>> On 5 May 2015, at 2:31 pm, Zhou Zheng Sheng / 周征晟 <[email protected]> wrote:
>>>
>>> Thank you Bogdan for clarifying the pacemaker promotion process for me.
>>>
>>> on 2015/05/05 10:32, Andrew Beekhof wrote:
>>>>> On 29 Apr 2015, at 5:38 pm, Zhou Zheng Sheng / 周征晟 <[email protected]> wrote:
>>>> [snip]
>>>>
>>>>> Batch is a pacemaker concept I found when I was reading its
>>>>> documentation and code. There is a "batch-limit: 30" in the output of
>>>>> "pcs property list --all". The pacemaker official documentation
>>>>> explains that it is "The number of jobs that the TE is allowed to
>>>>> execute in parallel." From my understanding, pacemaker maintains
>>>>> cluster states, and when we start/stop/promote/demote a resource, it
>>>>> triggers a state transition. Pacemaker puts as many transition jobs
>>>>> as possible into a batch and processes them in parallel.
>>>> Technically it calculates an ordered graph of actions that need to be
>>>> performed for a set of related resources.
>>>> You can see an example of the kinds of graphs it produces at:
>>>>
>>>> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/s-config-testing-changes.html
>>>>
>>>> There is a more complex one which includes promotion and demotion on the
>>>> next page.
>>>>
>>>> The number of actions that can run at any one time is therefore limited by
>>>> - the value of batch-limit (the total number of in-flight actions)
>>>> - the number of resources that do not have ordering constraints between
>>>>   them (eg. rsc{1,2,3} in the above example)
>>>>
>>>> So in the above example, if batch-limit >= 3, the monitor_0 actions will
>>>> still all execute in parallel.
>>>> If batch-limit == 2, one of them will be deferred until the others
>>>> complete.
>>>>
>>>> Processing of the graph stops the moment any action returns a value that
>>>> was not expected.
>>>> If that happens, we wait for currently in-flight actions to complete,
>>>> re-calculate a new graph based on the new information and start again.
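
If you want to experiment with that limit, batch-limit is an ordinary
cluster property, so it can be checked and changed at runtime. A quick
sketch with pcs (the value 100 is only an example):

    # current value (it appears in the full property list)
    $ pcs property list --all | grep batch-limit

    # raise the cap on concurrently in-flight actions
    $ pcs property set batch-limit=100

Note that this only raises the ceiling; actions that depend on one another
are still serialised by the graph.
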
>>> So can I infer the following statement? In a big cluster with many
>>> resources, chances are that some resource agent actions return
>>> unexpected values,
>> The size of the cluster shouldn't increase the chance of this happening
>> unless you've set the timeouts too aggressively.
>
> If there are many types of resource agents, and any one of them is not
> well written, it might cause trouble, right?

Yes, but really only for the things that depend on it. For example if
resources B, C, D, E all depend (in some way) on A, then their startup is
going to be delayed. But F, G, H and J will be able to start while we wait
around for B to time out.

>
>>> and if any of the in-flight action timeouts is long, it would
>>> block pacemaker from re-calculating a new transition graph?
>> Yes, but it's actually an argument for making the timeouts longer, not
>> shorter.
>> Setting the timeouts too aggressively actually increases downtime because
>> of all the extra delays and recovery it induces.
>> So set them to be long enough that there is unquestionably a problem if
>> you hit them.
>>
>> But we absolutely recognise that starting/stopping a database can take a
>> very long time comparatively and that it shouldn't block recovery of other
>> unrelated services.
>> I would expect to see this land in Pacemaker 1.1.14
>
> It will be great to see this in Pacemaker 1.1.14. From my experience
> using Pacemaker, I think customized resource agents are possibly the
> weakest part.

This is why we encourage people wanting new agents to get involved with the
upstream resource-agents project :-)

> This feature should improve the handling of resource
> action timeouts.
>
>>> I see the
>>> current batch-limit is 30 and I tried to increase it to 100, but it did
>>> not help.
>> Correct. It only puts an upper limit on the number of in-flight actions;
>> actions still need to wait for their dependencies to complete before
>> executing.
>>
>>> I'm sure that the cloned MySQL Galera resource is not related to the
>>> master-slave RabbitMQ resource. I don't find any dependency, order or
>>> rule connecting them in the cluster deployed by Fuel [1].
>> In general it should not have needed to wait, but if you send me a
>> crm_report covering the period you're talking about I'll be able to
>> comment specifically on the behaviour you saw.
>
> You are very kind, thank you. I uploaded the file generated by
> crm_report to Google Drive.
>
> https://drive.google.com/file/d/0B_vDkYRYHPSIZ29NdzV3NXotYU0/view?usp=sharing

Hmmm... there are no logs included here for some reason.
I suspect it's a bug on my part; can you apply this patch to report.collector
on the machine you're running crm_report from and retry?

https://github.com/ClusterLabs/pacemaker/commit/96427ec

>
>>> Is there anything I can do to make sure all the resource actions return
>>> expected values in a full reassembly?
>> In general, if we say 'start', do your best to start, or return '0' if you
>> already were started.
>> Likewise for stop.
>>
>> Otherwise it's really specific to your agent.
>> For example an IP resource just needs to add itself to an interface - it
>> can't do much differently; if it times out then the system must be very,
>> very busy.
>>
>> The only other things I would say are:
>> - avoid blocking calls where possible
>> - have empathy for the machine (do as little as is needed)
>>
>
> +1 for the empathy :)
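
To make the start/stop advice above concrete, here is a rough, untested
sketch of an OCF agent that follows those rules. "myapp", its binary path
and the pidof/killall checks are all placeholders; a real agent also needs
meta-data, validate-all and proper logging via ocf-shellfuncs.

    #!/bin/sh
    # Standard OCF exit codes.
    OCF_SUCCESS=0; OCF_ERR_GENERIC=1; OCF_ERR_UNIMPLEMENTED=3; OCF_NOT_RUNNING=7

    myapp_monitor() {
        # Running -> OCF_SUCCESS, cleanly stopped -> OCF_NOT_RUNNING.
        pidof myapp >/dev/null 2>&1 && return $OCF_SUCCESS
        return $OCF_NOT_RUNNING
    }

    myapp_start() {
        # "start" on an already-running resource must still return 0.
        myapp_monitor && return $OCF_SUCCESS
        /usr/sbin/myapp --daemon || return $OCF_ERR_GENERIC
        # Block until monitor confirms it; the start timeout bounds this loop.
        while ! myapp_monitor; do sleep 1; done
        return $OCF_SUCCESS
    }

    myapp_stop() {
        # "stop" on an already-stopped resource must still return 0.
        myapp_monitor || return $OCF_SUCCESS
        killall myapp
        while myapp_monitor; do sleep 1; done
        return $OCF_SUCCESS
    }

    case "$1" in
        start)   myapp_start ;;
        stop)    myapp_stop ;;
        monitor) myapp_monitor ;;
        *)       exit $OCF_ERR_UNIMPLEMENTED ;;
    esac
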
>>> Is it because node-1 and node-2
>>> happen to boot faster than node-3 and form a cluster, and when node-3
>>> joins, it triggers a new state transition? Or maybe because some
>>> resources are already started, so pacemaker needs to stop them first?
>> We only stop them if they shouldn't yet be running (i.e. a colocation or
>> ordering dependency has not been started yet either).
>>
>>
>>> Does setting
>>> default-resource-stickiness to 1 help?
>> From 0 or INFINITY?
>
> From 0 to 1. Is it enough to prevent the resource from being moved
> when some nodes recover from power failure?

From 0 it would help. But potentially consider INFINITY if the only
circumstance in which you want something moved is when the node is
unavailable (either because it's dead or in standby mode).

>
>>> I also tried "crm history XXX" commands in a live and healthy cluster,
>> I'm not familiar with that tool anymore.
>>
>>> but didn't find much information. I can see there are many log entries
>>> like "run_graph: Transition 7108 ...". Next I'll inspect the pacemaker
>>> log to see which resource action returns the unexpected value or which
>>> thing triggers a new state transition.
>>>
>>> [1] http://paste.openstack.org/show/214919/
>> I'd not recommend mixing the two CLI tools.
>>
>>>>> The problem is that pacemaker can only promote a resource after it
>>>>> detects the resource is started.
>>>> First we do a non-recurring monitor (*_monitor_0) to check what state
>>>> the resource is in.
>>>> We can't assume it's off because a) we might have crashed, b) the admin
>>>> might have accidentally configured it to start at boot or c) the admin
>>>> may have asked us to re-check everything.
>>>>
>>>>> During a full reassembly, in the first
>>>>> transition batch, pacemaker starts all the resources including MySQL
>>>>> and RabbitMQ. Pacemaker issues resource agent "start" invocations in
>>>>> parallel and reaps the results.
>>>>>
>>>>> For a multi-state resource agent like RabbitMQ, pacemaker needs the
>>>>> start result reported in the first batch, then the transition engine
>>>>> and policy engine decide whether it has to retry starting or promote,
>>>>> and put this new transition job into a new batch.
>>>> Also important to know, the order of actions is:
>>>>
>>>> 1. any necessary demotions
>>>> 2. any necessary stops
>>>> 3. any necessary starts
>>>> 4. any necessary promotions
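
By the way, if you want to see the graph (and that ordering) that the
policy engine calculated for your own cluster, crm_simulate can dump it as
a Graphviz file, roughly like this (option names may differ slightly
between versions, and rendering needs graphviz installed):

    # compute a transition from the live CIB and save it in DOT format
    $ crm_simulate --live-check --save-dotfile transition.dot
    $ dot -Tsvg transition.dot -o transition.svg

The "testing your configuration changes" page linked above walks through
much the same approach in more detail.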

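And on the stickiness question: if you do decide to go with INFINITY, it is
a one-liner as a resource default, e.g. with pcs (a sketch; exact syntax may
vary between pcs versions):

    $ pcs resource defaults resource-stickiness=INFINITY

With that, a healthy resource stays where it is, and only a dead or standby
node will cause it to move.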