On Thu, May 7, 2015 at 5:01 PM Andrew Beekhof <[email protected]> wrote:
>
> > On 5 May 2015, at 1:19 pm, Zhou Zheng Sheng / 周征晟 <[email protected]> wrote:
> >
> > Thank you Andrew.
> >
> > on 2015/05/05 08:03, Andrew Beekhof wrote:
> >>> On 28 Apr 2015, at 11:15 pm, Bogdan Dobrelya <[email protected]> wrote:
> >>>
> >>>> Hello,
> >>> Hello, Zhou
> >>>
> >>>> I am using Fuel 6.0.1 and find that the RabbitMQ recovery time is long after a power failure. I have a running HA environment; then I reset the power of all the machines at the same time. I observe that after reboot it usually takes 10 minutes for the RabbitMQ cluster to appear running in master-slave mode in pacemaker. If I power off all 3 controllers and only start 2 of them, the downtime can sometimes be as long as 20 minutes.
> >>> Yes, this is a known issue [0]. Note, there were many bugfixes, like [1], [2], [3], merged for the MQ OCF script, so you may want to try to backport them as well, following the guide [4].
> >>>
> >>> [0] https://bugs.launchpad.net/fuel/+bug/1432603
> >>> [1] https://review.openstack.org/#/c/175460/
> >>> [2] https://review.openstack.org/#/c/175457/
> >>> [3] https://review.openstack.org/#/c/175371/
> >>> [4] https://review.openstack.org/#/c/170476/
> >> Is there a reason you're using a custom OCF script instead of the upstream [a] one?
> >> Please have a chat with David (the maintainer, in CC) if there is something you believe is wrong with it.
> >>
> >> [a] https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/rabbitmq-cluster
> >
> > I'm using the OCF script from the Fuel project, specifically from the "6.0" stable branch [alpha].
>
> Ah, I'm still learning who is who... I thought you were part of that project :-)
>
> > Comparing with the upstream OCF code, the main difference is that the Fuel RabbitMQ OCF is a master-slave resource. The Fuel RabbitMQ OCF does more bookkeeping, for example, blocking client access when the RabbitMQ cluster is not ready. Having read the code, I believe the upstream OCF should be OK to use as well, but it might not fit into the Fuel project. As far as I have tested, the Fuel OCF script is good, except that sometimes the full reassemble time is long, and as I found out, that is mostly because the Fuel MySQL Galera OCF script keeps pacemaker from promoting the RabbitMQ resource, as I mentioned in the previous emails.
> >
> > Maybe Vladimir and Sergey can give us more insight on why Fuel needs a master-slave RabbitMQ.
>
> That would be good to know.
> Browsing the agent, promote seems to be a no-op if rabbit is already running.

As to the reason for master/slave: it's due to how the OCF script is structured to deal with rabbit's poor ability to handle itself in some scenarios. Hopefully the state transition diagram [5] is enough to clarify what's going on.

[5] http://goo.gl/PPNrw7

> > I see Vladimir and Sergey worked on the original Fuel blueprint "RabbitMQ cluster" [beta].
> >
> > [alpha] https://github.com/stackforge/fuel-library/blob/stable/6.0/deployment/puppet/nova/files/ocf/rabbitmq
> > [beta] https://blueprints.launchpad.net/fuel/+spec/rabbitmq-cluster-controlled-by-pacemaker
> >
> >>>> I did a little investigation and found some possible causes.
> >>>>
> >>>> 1. MySQL Recovery Takes Too Long [1] and Blocks RabbitMQ Clustering in Pacemaker
> >>>>
> >>>> The pacemaker resource p_mysql start timeout is set to 475s. Sometimes MySQL-wss fails to start after a power failure, and pacemaker would wait 475s before retrying the start.
> >>>> The problem is that pacemaker divides resource state transitions into batches. Since RabbitMQ is a master-slave resource, I assume that starting all the slaves and promoting the master are put into two different batches. If, unfortunately, starting all the RabbitMQ slaves is put in the same batch as starting MySQL, then even if the RabbitMQ slaves and all other resources are ready, pacemaker will not continue but just waits for the MySQL timeout.
> >>> Could you please elaborate on what the same/different batches are for MQ and DB? Note, there are MQ clustering logic flow charts available here [5], and we're planning to release a dedicated technical bulletin for this.
> >>>
> >>> [5] http://goo.gl/PPNrw7
> >>>
> >>>> I can reproduce this by hard powering off all the controllers and starting them again. It's more likely to trigger a MySQL failure in this way. Then I observe that if there is one cloned mysql instance not starting, the whole pacemaker cluster gets stuck and does not emit any log. On the host of the failed instance, I can see a mysql resource agent process calling the sleep command. If I kill that process, pacemaker comes back alive and the RabbitMQ master gets promoted. In fact this long timeout blocks every resource from state transitions in pacemaker.
> >>>>
> >>>> This may be a known problem of pacemaker and there are some discussions on the Linux-HA mailing list [2]. It might not be fixed in the near future. It seems that in general it is bad to have a long timeout in state transition actions (start/stop/promote/demote). There may be another way to implement the MySQL-wss resource agent: use a short start timeout and monitor the wss cluster state in the monitor action.
> >>> This is very interesting, thank you! I believe all commands in the MySQL RA OCF script should be wrapped with timeout -SIGTERM or -SIGKILL as well, as we did for the MQ RA OCF. And there should not be any sleep calls. I created a bug for this [6].
> >>>
> >>> [6] https://bugs.launchpad.net/fuel/+bug/1449542
> >>>
> >>>> I also found a fix to improve the MySQL start timeout [3]. It shortens the timeout to 300s. At the time of sending this email, I cannot find it in the stable/6.0 branch. Maybe the maintainer needs to cherry-pick it to stable/6.0?
> >>>>
> >>>> [1] https://bugs.launchpad.net/fuel/+bug/1441885
> >>>> [2] http://lists.linux-ha.org/pipermail/linux-ha/2014-March/047989.html
> >>>> [3] https://review.openstack.org/#/c/171333/
> >>>>
> >>>> 2. RabbitMQ Resource Agent Breaks Existing Cluster
> >>>>
> >>>> Reading the code of the RabbitMQ resource agent, I find it does the following to start the RabbitMQ master-slave cluster.
> >>>> On all the controllers:
> >>>> (1) Start the Erlang beam process
> >>>> (2) Start the RabbitMQ app (if this fails, reset the mnesia DB and cluster state)
> >>>> (3) Stop the RabbitMQ app but do not stop the beam process
> >>>>
> >>>> Then in pacemaker, all the RabbitMQ instances are in slave state. After pacemaker determines the master, it does the following.
> >>>> On the to-be-master host:
> >>>> (4) Start the RabbitMQ app (if this fails, reset the mnesia DB and cluster state)
> >>>> On the slave hosts:
> >>>> (5) Start the RabbitMQ app (if this fails, reset the mnesia DB and cluster state)
> >>>> (6) Join the RabbitMQ cluster of the master host
> >>>>
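To make the above concrete for anyone following along: steps (1)-(6) correspond roughly to the plain rabbitmqctl calls below. This is only a simplified sketch -- the node name is an example, rabbitmqctl join_cluster needs the app to be stopped on the joining node, and the real agent wraps each step in the timeouts and reset-on-failure handling discussed earlier, which I'm glossing over here.

  # on every controller: start the Erlang VM and the app, then stop only the app
  rabbitmq-server -detached
  rabbitmqctl stop_app

  # on the node pacemaker picks as "master": just start the app again
  rabbitmqctl start_app

  # on each remaining node: join the chosen node, then start the app
  rabbitmqctl join_cluster rabbit@node-1
  rabbitmqctl start_app
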
> >>> Yes, something like that. As I mentioned, there were several bug fixes in the 6.1 dev, and you can also check the MQ clustering flow charts.
> >>>
> >>>> As far as I can understand, this process is to make sure the master determined by pacemaker is the same as the master determined in the RabbitMQ cluster. If there is no existing cluster, it's fine. If it is run
> >>>
> >>> Not exactly. There is no master in a mirrored MQ cluster. We define the rabbit_hosts configuration option from Oslo.messaging, which ensures all queue masters will be spread around all of the MQ nodes in the long run. And we use a master abstraction only for the Pacemaker RA clustering layer. Here, a "master" is the MQ node that joins the rest of the MQ nodes.
> >>>
> >>>> after power failure and recovery, it introduces a new problem.
> >>> We do erase the node master attribute in the CIB for such cases. This should not bring problems into the master election logic.
> >>>
> >>>> After power recovery, if some of the RabbitMQ instances reach step (2) roughly at the same time (within 30s, which is hard coded in RabbitMQ) as the original RabbitMQ master instance, they form the original cluster again and then shut down. The other instances would have to wait for 30s before reporting a failure waiting for tables, and be reset to a standalone cluster.
> >>>>
> >>>> In the RabbitMQ documentation [4], it is also mentioned that if we shut down the RabbitMQ master, a new master is elected from the rest of the slaves. If we
> >>> (Note, the RabbitMQ documentation mentions *queue* masters and slaves, which is not the case for the Pacemaker RA clustering abstraction layer.)
> >>>
> >>>> continue to shut down nodes in step (3), we reach a point where the last node is the RabbitMQ master, and pacemaker is not aware of it. I can see there is code bookkeeping a "rabbit-start-time" attribute in pacemaker to record the longest-lived instance and help pacemaker determine the master, but it does not cover the case mentioned above.
> >>> We made an assumption that the node with the highest MQ uptime should know the most about the recent cluster state, so other nodes must join it. The RA OCF does not work with queue masters directly.
> >>>
> >>>> A recent patch [5] checks the existing "rabbit-master" attribute, but it does not cover the above case either.
> >>>>
> >>>> So in step (4), pacemaker determines a different master, which was a RabbitMQ slave last time. It would wait for its original RabbitMQ master for 30s and fail, then get reset to a standalone cluster. Here we get several different clusters, so in steps (5) and (6) it is likely to report an error in the log saying timeout waiting for tables, or fail to merge the mnesia database schema, and then those instances get reset. You can easily reproduce the case by hard resetting the power of all the controllers.
> >>>>
> >>>> As you can see, if you are unlucky, there would be several "30s timeout and reset" cycles before you finally get a healthy RabbitMQ cluster.
> >>> The full MQ cluster reassemble logic is far from perfect, indeed. This might erase all mnesia files, hence any custom entities, like users or vhosts, would be removed as well. Note, we do not configure durable queues for OpenStack, so there is nothing to care about here - a full cluster downtime assumes there will be no AMQP messages stored at all.
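A small aside that may help when debugging the "30s timeout and reset" cycles described above: the "rabbit-start-time" bookkeeping is just a pacemaker node attribute, so it can be inspected with standard tools. A sketch only -- the node name is an example, and depending on how the agent stores the attribute you may need to add "--lifetime reboot" to the query:

  # dump all node attributes, including rabbit-start-time, per node
  crm_mon -A -1

  # or query a single node's value directly
  crm_attribute --node node-1 --name rabbit-start-time --query
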
> >>>
> >>>> I find three possible solutions.
> >>>> A. Using the rabbitmqctl force_boot option [6]
> >>>> It skips waiting for 30s and resetting the cluster, and just assumes the current node is the master and continues to operate. This is feasible because the original RabbitMQ master would discard its local state and sync with the new master after it joins a new cluster [7]. So we can be sure that after steps (4) and (6), the pacemaker-determined master instance is started unconditionally, it will be the same as the RabbitMQ master, and all operations run without the 30s timeout. I find this option is only available in newer RabbitMQ releases, and updating RabbitMQ might introduce other compatibility problems.
> >>> Yes, this option is only supported by the newest RabbitMQ versions. But we definitely should look at how this could help.
> >>>
> >>>> B. Turn RabbitMQ into a cloned instance and use pause_minority instead of autoheal [8]
> >>> Indeed, there are cases when MQ's autoheal can do nothing with existing partitions and the cluster remains partitioned forever, for example:
> >>>
> >>> Masters: [ node-1 ]
> >>> Slaves: [ node-2 node-3 ]
> >>> root@node-1:~# rabbitmqctl cluster_status
> >>> Cluster status of node 'rabbit@node-1' ...
> >>> [{nodes,[{disc,['rabbit@node-1','rabbit@node-2']}]},
> >>>  {running_nodes,['rabbit@node-1']},
> >>>  {cluster_name,<<"rabbit@node-2">>},
> >>>  {partitions,[]}]
> >>> ...done.
> >>> root@node-2:~# rabbitmqctl cluster_status
> >>> Cluster status of node 'rabbit@node-2' ...
> >>> [{nodes,[{disc,['rabbit@node-2']}]}]
> >>> ...done.
> >>> root@node-3:~# rabbitmqctl cluster_status
> >>> Cluster status of node 'rabbit@node-3' ...
> >>> [{nodes,[{disc,['rabbit@node-1','rabbit@node-2','rabbit@node-3']}]},
> >>>  {running_nodes,['rabbit@node-3']},
> >>>  {cluster_name,<<"rabbit@node-2">>},
> >>>  {partitions,[]}]
> >>>
> >>> So we should test the pause_minority value as well.
> >>> But I strongly believe we should make MQ a multi-state clone to support many masters, related bp [7].
> >>>
> >>> [7] https://blueprints.launchpad.net/fuel/+spec/rabbitmq-pacemaker-multimaster-clone
> >>>
> >>>> This works like MySQL-wss. It lets the RabbitMQ cluster itself deal with partitions in a manner similar to the pacemaker quorum mechanism. When there is a network partition, instances in the minority partition pause themselves automatically. Pacemaker does not have to track who is the RabbitMQ master, who lives longest, who to promote... It just starts all the clones, done. This leads to a huge change in the RabbitMQ resource agent, and the stability and other impacts are to be tested.
> >>> Well, we should not mix up queue masters and a multi-clone master for the MQ resource in pacemaker.
> >>> As I said, the pacemaker RA has nothing to do with queue masters. And we introduced this "master" mostly in order to support the full cluster reassemble case - there must be one node promoted, and the other nodes should join it.
> >>>
> >>>> C. Creating a "force_load" file
> >>>> After reading the RabbitMQ source code, I find that the actual thing it does for solution A is just creating an empty file named "force_load" in the mnesia database dir; mnesia then thinks it was the last node shut down last time and boots itself as the master. This implementation has stayed the same from v3.1.4 to the latest RabbitMQ master branch.
> >>>> I think we can make use of this little trick. The change is adding just one line in the "try_to_start_rmq_app()" function.
> >>>>
> >>>> touch "${MNESIA_FILES}/force_load" && \
> >>>>   chown rabbitmq:rabbitmq "${MNESIA_FILES}/force_load"
> >>> This is a very good point, thank you.
> >>>
> >>>> [4] http://www.rabbitmq.com/ha.html
> >>>> [5] https://review.openstack.org/#/c/169291/
> >>>> [6] https://www.rabbitmq.com/clustering.html
> >>>> [7] http://www.rabbitmq.com/partitions.html#recovering
> >>>> [8] http://www.rabbitmq.com/partitions.html#automatic-handling
> >>>>
> >>>> Maybe you have better ideas on this. Please share your thoughts.
> >>> Thank you for the thorough feedback! This was a really great job.
> >>>
> >>>> ----
> >>>> Best wishes!
> >>>> Zhou Zheng Sheng / 周征晟  Software Engineer
> >>>> Beijing AWcloud Software Co., Ltd.
> >>>>
> >>>
> >>> --
> >>> Best regards,
> >>> Bogdan Dobrelya,
> >>> Skype #bogdando_at_yahoo.com
> >>> Irc #bogdando
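Coming back to options B and C above, for reference: switching the partition handling mode is a one-line change in rabbitmq.config (a sketch -- the file usually lives at /etc/rabbitmq/rabbitmq.config, and since Fuel manages that file via puppet the exact place to set it may differ):

  [{rabbit, [{cluster_partition_handling, pause_minority}]}].

And the force_boot behaviour Zhou describes is exposed on newer releases as a plain command, "rabbitmqctl force_boot", run while the node is stopped; per Zhou's analysis it does essentially the same thing as touching the force_load file.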
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [email protected]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
