Hello,

I am using Fuel 6.0.1 and find that RabbitMQ takes a long time to
recover after a power failure. I have a running HA environment, and I
reset the power of all the machines at the same time. After the reboot,
it usually takes 10 minutes for the RabbitMQ cluster to appear running
in master-slave mode in pacemaker. If I power off all 3 controllers and
start only 2 of them, the downtime can sometimes be as long as 20 minutes.

I did a little investigation and found some possible causes.

1. MySQL Recovery Takes Too Long [1] and Blocks RabbitMQ Clustering in
Pacemaker

The pacemaker resource p_mysql start timeout is set to 475s. Sometimes
MySQL-wss fails to start after a power failure, and pacemaker waits
475s before retrying the start. The problem is that pacemaker divides
resource state transitions into batches. Since RabbitMQ is a master-slave
resource, I assume that starting all the slaves and promoting the master
are put into two different batches. If, unfortunately, starting all the
RabbitMQ slaves is put in the same batch as starting MySQL, then even if
the RabbitMQ slaves and all other resources are ready, pacemaker does not
continue but just waits for the MySQL timeout.

I can reproduce this by hard powering off all the controllers and
starting them again. A MySQL failure is more likely to be triggered this
way. Then I observe that if one cloned mysql instance does not start, the
whole pacemaker cluster gets stuck and does not emit any log. On the
host of the failed instance, I can see a mysql resource agent process
calling the sleep command. If I kill that process, pacemaker comes
back alive and the RabbitMQ master gets promoted. In fact, this long
timeout blocks every resource in pacemaker from state transition.

This may be a known problem of pacemaker, and there are some discussions
on the Linux-HA mailing list [2]. It might not be fixed in the near future.
In general, it seems bad to have long timeouts in state transition
actions (start/stop/promote/demote). There may be another way to
implement the MySQL-wss resource agent: use a short start timeout and
monitor the wss cluster state using the monitor action.
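
The idea of a short start plus a state-aware monitor could look roughly
like the sketch below. It is not the MySQL-wss agent's code: the helper
names (wsrep_state, launch_mysqld, mysql_start, mysql_monitor) are
hypothetical, and treating a node that is still joining as "running" is
an assumption that would need to be checked against pacemaker's recovery
behaviour. The numeric values are the standard OCF return codes.

```shell
#!/bin/sh
# Sketch only: start returns quickly, monitor reports the cluster state.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# Hypothetical probe: prints the Galera state (e.g. "Synced", "Joining"),
# or nothing when mysqld cannot be reached.
wsrep_state() {
    mysql -N -B -e "SHOW STATUS LIKE 'wsrep_local_state_comment'" \
        2>/dev/null | awk '{print $2}'
}

# Hypothetical helper: launch the daemon in the background.
launch_mysqld() {
    mysqld_safe >/dev/null 2>&1 &
}

# start: launch and return at once instead of sleeping until synced.
mysql_start() {
    launch_mysqld || return "$OCF_ERR_GENERIC"
    return "$OCF_SUCCESS"
}

# monitor: no state means not running; any other state means the daemon
# is up, and a state other than "Synced" is only logged (assumption).
mysql_monitor() {
    state="$(wsrep_state)"
    case "$state" in
        "")     return "$OCF_NOT_RUNNING" ;;
        Synced) return "$OCF_SUCCESS" ;;
        *)      echo "mysql is up but not synced yet: $state" >&2
                return "$OCF_SUCCESS" ;;
    esac
}
```

With this shape, the start action never holds a transition batch for
minutes; whether a node has finished syncing is observed by later
monitor runs.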

I also found a fix that improves the MySQL start timeout [3]. It shortens
the timeout to 300s. At the time of sending this email, I cannot find it
in the stable/6.0 branch. Maybe the maintainer needs to cherry-pick it to
stable/6.0?

[1] https://bugs.launchpad.net/fuel/+bug/1441885
[2] http://lists.linux-ha.org/pipermail/linux-ha/2014-March/047989.html
[3] https://review.openstack.org/#/c/171333/


2. RabbitMQ Resource Agent Breaks Existing Cluster

Reading the code of the RabbitMQ resource agent, I find it does the
following to start the RabbitMQ master-slave cluster.
On all the controllers:
(1) Start Erlang beam process
(2) Start RabbitMQ App (If failed, reset mnesia DB and cluster state)
(3) Stop RabbitMQ App but do not stop the beam process

Then in pacemaker, all the RabbitMQ instances are in slave state. After
pacemaker determines the master, it does the following.
On the to-be-master host:
(4) Start RabbitMQ App (If failed, reset mnesia DB and cluster state)
On the slave hosts:
(5) Start RabbitMQ App (If failed, reset mnesia DB and cluster state)
(6) Join RabbitMQ cluster of the master host
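
The steps above can be sketched with plain rabbitmqctl commands. This is
an illustration, not the resource agent's actual code: the function
names are hypothetical, force_reset stands in for whatever the agent
does to "reset mnesia DB and cluster state", and step (1) is left out.
RABBITMQCTL can be set to "echo" for a dry run.

```shell
#!/bin/sh
# Sketch of steps (2) to (6); names and reset method are assumptions.
RABBITMQCTL="${RABBITMQCTL:-rabbitmqctl}"

# Steps (2), (4), (5): start the app; on failure wipe local state, retry.
start_app_or_reset() {
    $RABBITMQCTL start_app && return 0
    $RABBITMQCTL force_reset   # assumption: stands in for the agent's reset
    $RABBITMQCTL start_app
}

# Step (3): stop the RabbitMQ app but keep the Erlang beam process alive.
stop_app_keep_beam() {
    $RABBITMQCTL stop_app
}

# Step (6): join the cluster of the master picked by pacemaker.
# join_cluster only works while the local app is stopped.
join_master() {
    master_node="$1"           # e.g. rabbit@node-1 (hypothetical name)
    $RABBITMQCTL stop_app &&
    $RABBITMQCTL join_cluster "$master_node" &&
    $RABBITMQCTL start_app
}
```

For example, RABBITMQCTL=echo join_master rabbit@node-1 prints the three
commands that would be run on a slave host.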

As far as I can understand, this process makes sure the master
determined by pacemaker is the same as the master determined in the
RabbitMQ cluster. If there is no existing cluster, it's fine. If it is
run after power failure and recovery, it introduces a new problem.

After power recovery, if some of the RabbitMQ instances reach step (2)
at roughly the same time (within 30s, which is hard coded in RabbitMQ) as
the original RabbitMQ master instance, they form the original cluster
again and then shut down. The other instances have to wait 30s
before they report a failure waiting for tables, and are reset to
standalone clusters.

The RabbitMQ documentation [4] also mentions that if we shut down the
RabbitMQ master, a new master is elected from the remaining slaves. If we
continue to shut down nodes in step (3), we reach a point where the last
node is the RabbitMQ master, and pacemaker is not aware of it. I can see
there is code that keeps a "rabbit-start-time" attribute in
pacemaker to record the longest-lived instance and help pacemaker
determine the master, but it does not cover the case mentioned above. A
recent patch [5] checks the existing "rabbit-master" attribute, but it
does not cover the above case either.

So in step (4), pacemaker picks a different master, one which was a
RabbitMQ slave last time. It waits for its original RabbitMQ master
for 30s and fails, then it gets reset to a standalone cluster. Now we
have several different clusters, so in steps (5) and (6) the slaves are
likely to log errors about timing out waiting for tables or failing to
merge the mnesia database schema, and then those instances get reset. You
can easily reproduce the case by hard resetting the power of all the
controllers.

As you can see, if you are unlucky, there would be several "30s timeout
and reset" before you finally get a healthy RabbitMQ cluster.

I find three possible solutions.
A. Using the rabbitmqctl force_boot option [6]
It skips the 30s wait and the cluster reset, and just assumes the
current node is the master and continues to operate. This is feasible
because the original RabbitMQ master discards its local state and
syncs with the new master after it joins the new cluster [7]. So we can be
sure that after steps (4) and (6), the pacemaker-determined master
instance is started unconditionally, it is the same as the RabbitMQ
master, and all operations run without the 30s timeout. I find this option
is only available in newer RabbitMQ releases, and updating RabbitMQ might
introduce other compatibility problems.
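
A minimal sketch of how step (4) might use this option, assuming a
RabbitMQ release that ships the force_boot command and assuming the
RabbitMQ app is stopped at that point (as it is after step (3)). The
function name start_as_master is hypothetical. Setting RABBITMQCTL to
"echo" gives a dry run.

```shell
#!/bin/sh
# Sketch only: force the pacemaker-chosen master to boot without waiting
# for the node that was the RabbitMQ master before the power failure.
RABBITMQCTL="${RABBITMQCTL:-rabbitmqctl}"

start_as_master() {
    # Mark this node as allowed to boot even if it was not the last one
    # shut down, then start the app. Stops at the first failing command.
    $RABBITMQCTL force_boot || return 1
    $RABBITMQCTL start_app
}
```

The slave hosts would then join this node in step (6) and sync from it.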

B. Turn RabbitMQ into a cloned instance and use pause_minority instead of
autoheal [8]
This works like MySQL-wss. It lets the RabbitMQ cluster itself deal with
partitions in a manner similar to the pacemaker quorum mechanism. When
there is a network partition, instances in the minority partition pause
themselves automatically. Pacemaker does not have to track who is the
RabbitMQ master, who has lived longest, who to promote... It just starts
all the clones, done. This leads to a huge change in the RabbitMQ resource
agent, and the stability and other impacts are yet to be tested.
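
For reference, switching the partition handling mode is a one-line
setting in the classic Erlang-term rabbitmq.config. The helper below is
a hypothetical sketch that writes a minimal file with only that setting;
it overwrites the target, so a real deployment would merge the key into
its existing configuration instead.

```shell
#!/bin/sh
# Sketch only: write a minimal rabbitmq.config that selects
# pause_minority. The target path is chosen by the caller.
write_partition_config() {
    config_path="$1"
    # Overwrites the file; existing settings are not preserved.
    cat > "$config_path" <<'EOF'
[
  {rabbit, [
    {cluster_partition_handling, pause_minority}
  ]}
].
EOF
}
```

Usage: write_partition_config /tmp/rabbitmq.config.example, then review
the result before copying it into place.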

C. Creating a "force_load" file
After reading the RabbitMQ source code, I find that what solution A
actually does is just create an empty file named "force_load" in the
mnesia database dir; mnesia then thinks it was the last node shut down
last time and boots itself as the master. This implementation has stayed
the same from v3.1.4 to the latest RabbitMQ master branch. I think we
can make use of this little trick. The change is just one added command
in the "try_to_start_rmq_app()" function.

touch "${MNESIA_FILES}/force_load" && \
  chown rabbitmq:rabbitmq "${MNESIA_FILES}/force_load"
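
The same change could be wrapped in a small helper, sketched below. The
function name mark_force_load and its optional owner argument are
hypothetical, and the chown is skipped when the account does not exist
so the sketch can be tried outside a controller node.

```shell
#!/bin/sh
# Sketch only: create the "force_load" marker in a mnesia directory.
mark_force_load() {
    mnesia_dir="$1"            # e.g. the agent's ${MNESIA_FILES}
    owner="${2:-rabbitmq}"     # hypothetical optional argument
    [ -d "$mnesia_dir" ] || return 1
    touch "${mnesia_dir}/force_load" || return 1
    # Hand the file to the RabbitMQ account only if that account exists.
    if id -u "$owner" >/dev/null 2>&1; then
        chown "${owner}:${owner}" "${mnesia_dir}/force_load"
    fi
}
```

Inside the agent this would be called as mark_force_load
"${MNESIA_FILES}" right before the RabbitMQ app is started.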

[4] http://www.rabbitmq.com/ha.html
[5] https://review.openstack.org/#/c/169291/
[6] https://www.rabbitmq.com/clustering.html
[7] http://www.rabbitmq.com/partitions.html#recovering
[8] http://www.rabbitmq.com/partitions.html#automatic-handling

Maybe you have better ideas on this. Please share your thoughts.

----
Best wishes!
Zhou Zheng Sheng / 周征晟  Software Engineer
Beijing AWcloud Software Co., Ltd.
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev