Re: [openstack-dev] [nova] nova-compute not re-establishing connectivity after controller switchover

Chris Friesen Mon, 24 Mar 2014 10:37:07 -0700

On 03/24/2014 10:59 AM, Dan Smith wrote:

Any ideas on what might be going on would be appreciated.


This looks like something that should be filed as a bug. I don't have
any ideas off hand, bit I will note that the reconnection logic works
fine for us in the upstream upgrade tests. That scenario includes
starting up a full stack, then taking down everything except compute and
rebuilding a new one on master. After the several minutes it takes to
upgrade the controller services, the compute host reconnects and is
ready to go before tempest runs.

I suspect your case wedged itself somehow other than that, which
definitely looks nasty and is worth tracking in a bug.

We've got an HA controller setup using pacemaker and were stress-testingit by doing multiple controlled switchovers while doing other activity.Generally this works okay, but last night we ran into this problem.

I'll file a bug, but in the meantime I've found something that looks abit suspicious. The "Unexpected exception occurred 61 time(s)...retrying." message comes from forever_retry_uncaught_exceptions() inexcutils.py. It looks like we're raising


RecoverableConnectionError: connection already closed

down in /usr/lib64/python2.7/site-packages/amqp/abstract_channel.py, butnothing handles it.

It looks like the most likely place that should be handling it isnova.openstack.common.rpc.impl_kombu.Connection.ensure().

In the current oslo.messaging code the ensure() routine explicitlyhandles connection errors (which RecoverableConnectionError is) andsocket timeouts--the ensure() routine in Havana doesn't do this.



Chris

_______________________________________________
OpenStack-dev mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [openstack-dev] [nova] nova-compute not re-establishing connectivity after controller switchover

Reply via email to