On 11/29/2013 06:37 PM, David Koo wrote:
On Nov 29, 02:22:17 PM (Friday), Chris Friesen wrote:
We're currently running Grizzly (going to Havana soon) and we're
running into an issue where if the active controller is ungracefully
killed then nova-compute on the compute node doesn't properly
connect to the new rabbitmq server on the newly-active controller
node.

Interestingly, killing and restarting nova-compute on the compute
node seems to work, which implies that the retry code is doing
something less effective than the initial startup.

Has anyone doing HA controller setups run into something similar?

As a follow-up, it looks like if I wait for 9 minutes or so, I see this message in the compute logs:

2013-11-30 00:02:14.756 1246 ERROR nova.openstack.common.rpc.common [-] Failed to consume message from queue: Socket closed

It then reconnects to the AMQP server and everything is fine after that. However, any instances that I tried to boot during those 9 minutes stay stuck in the "BUILD" status.



     So the rabbitmq server and the controller are on the same node?

Yes, they are.

My guess is that it's related to this bug 856764 (RabbitMQ connections
lack heartbeat or TCP keepalives). The gist of it is that since there
are no heartbeats between the MQ and nova-compute, if the MQ goes down
ungracefully then nova-compute has no way of knowing. If the MQ goes
down gracefully then the MQ clients are notified and so the problem
doesn't arise.
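
For illustration only, here is a minimal sketch of what TCP keepalive looks like at the socket level in Python. It is not what nova or kombu does; the helper name `enable_keepalive` and the timing values are assumptions, and the `TCP_KEEP*` options are only set on platforms that expose them (Linux does). With keepalive probes enabled, the kernel reports a silently dead peer as a socket error, so the client does not block indefinitely on a half-open connection.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Turn on TCP keepalive probes for an existing socket.

    idle, interval and count are illustrative values, not recommendations.
    """
    # Ask the kernel to probe the peer when the connection is idle.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Fine-grained timers are platform specific; only set them if available.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idle time before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        # Seconds between successive probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        # Unanswered probes before the connection is declared dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return sock
```

With these example values, an ungracefully killed peer would be noticed after roughly idle + interval * count seconds of silence.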

Sounds about right.

     We got bitten by the same bug a while ago when our controller node
got hard reset without any warning! It came down to this bug (which,
unfortunately, doesn't have a fix yet). We worked around this bug by
implementing our own crude fix - we wrote a simple app to periodically
check if the MQ was alive (write a short message into the MQ, then
read it out again). When this fails n times in a row we restart
nova-compute. Very ugly, but it worked!
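
A workaround of this kind could be sketched roughly as follows. This is not the actual app described above, just a minimal illustration of the idea. All names (`mq_round_trip`, `restart_nova_compute`, `watch`, the `mq-watchdog` queue) are hypothetical, the probe assumes kombu is installed, and the service name passed to `service` is an assumption that differs between distributions.

```python
import subprocess
import time


def mq_round_trip(url, queue_name="mq-watchdog", timeout=5):
    """Return True if a short message can be written to the MQ and read back."""
    from kombu import Connection  # imported lazily; only this probe needs kombu

    try:
        with Connection(url, connect_timeout=timeout) as conn:
            queue = conn.SimpleQueue(queue_name)
            try:
                queue.put("ping")
                message = queue.get(block=True, timeout=timeout)
                message.ack()
            finally:
                queue.close()
        return True
    except Exception:
        # Any failure (connect, publish, consume, timeout) counts as "MQ down".
        return False


def restart_nova_compute():
    # Assumed service name; adjust for the distribution in use.
    subprocess.call(["service", "nova-compute", "restart"])


def watch(probe, restart, max_failures=3, interval=10,
          sleep=time.sleep, max_checks=None):
    """Call probe() periodically; call restart() after max_failures
    consecutive failures. Returns the number of restarts triggered."""
    failures = 0
    restarts = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if probe():
            failures = 0  # a single success resets the streak
        else:
            failures += 1
            if failures >= max_failures:
                restart()
                restarts += 1
                failures = 0  # start counting again after the restart
        sleep(interval)
    return restarts
```

It might be run as `watch(lambda: mq_round_trip("amqp://guest:guest@controller//"), restart_nova_compute)`, where the URL is a placeholder. Requiring several consecutive failures keeps a single slow round trip from triggering a restart.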

Sounds reasonable.

I did notice a kombu heartbeat change that was submitted and then backed out again because it was buggy. I guess we're still waiting on the real fix?

Chris

