On 03/24/2014 10:59 AM, Dan Smith wrote:
>> Any ideas on what might be going on would be appreciated.
> This looks like something that should be filed as a bug. I don't have
> any ideas off hand, but I will note that the reconnection logic works
> fine for us in the upstream upgrade tests. That scenario includes
> starting up a full stack, then taking down everything except compute and
> rebuilding a new one on master. After the several minutes it takes to
> upgrade the controller services, the compute host reconnects and is
> ready to go before tempest runs.
> I suspect your case wedged itself in some other way, which definitely
> looks nasty and is worth tracking in a bug.
We've got an HA controller setup using pacemaker and were stress-testing
it by doing multiple controlled switchovers while running other activity.
Generally this works okay, but last night we ran into this problem.
I'll file a bug, but in the meantime I've found something that looks a
bit suspicious. The "Unexpected exception occurred 61 time(s)...
retrying." message comes from forever_retry_uncaught_exceptions() in
excutils.py. It looks like we're raising
RecoverableConnectionError: connection already closed
down in /usr/lib64/python2.7/site-packages/amqp/abstract_channel.py, but
nothing handles it.
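To make the failure mode concrete, here is a rough sketch (not the actual
oslo code) of how a forever_retry_uncaught_exceptions()-style wrapper
behaves: every exception the inner function raises is logged and
swallowed, and the function is simply called again. If the underlying
connection is permanently dead, the same error repeats on every attempt,
which is exactly the "Unexpected exception occurred 61 time(s)...
retrying." pattern we're seeing. The flaky() demo function below is
purely illustrative.

```python
import time


def forever_retry_uncaught_exceptions(func):
    """Call func in a loop, logging and swallowing every exception."""
    def wrapper(*args, **kwargs):
        last_msg, count = None, 0
        while True:
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                count = count + 1 if str(exc) == last_msg else 1
                last_msg = str(exc)
                # (the real wrapper rate-limits its logging; this sketch
                # logs every attempt and sleeps only briefly)
                print('Unexpected exception occurred %d time(s)... '
                      'retrying.' % count)
                time.sleep(0.01)
    return wrapper


# Demo: fails three times, then succeeds. A truly dead connection would
# fail forever, so the retry counter just keeps climbing.
attempts = {'n': 0}


@forever_retry_uncaught_exceptions
def flaky():
    attempts['n'] += 1
    if attempts['n'] < 4:
        raise RuntimeError('connection already closed')
    return 'recovered'


result = flaky()
print(result)
```

The point is that the wrapper guarantees the thread never dies, but it
can't repair the connection; something lower down has to catch the
recoverable error and reconnect.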
It looks like the most likely place that should be handling it is
nova.openstack.common.rpc.impl_kombu.Connection.ensure().
In the current oslo.messaging code the ensure() routine explicitly
handles connection errors (which RecoverableConnectionError is) and
socket timeouts; the ensure() routine in Havana doesn't do this.
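For illustration, here is a rough sketch of the ensure() pattern as it
exists in current oslo.messaging, not the actual code: recoverable
connection errors and socket timeouts trigger a reconnect-and-retry
instead of escaping to the caller. The RecoverableConnectionError class
and the reconnect()/FakeConnection names below are stand-ins for the
real kombu/amqp types.

```python
import socket


class RecoverableConnectionError(IOError):
    """Stand-in for amqp's 'connection already closed' error."""


def ensure(connection, method, retry_limit=3):
    """Run method(), reconnecting and retrying on recoverable errors."""
    attempt = 0
    while True:
        try:
            return method()
        except (RecoverableConnectionError, socket.timeout):
            attempt += 1
            if attempt > retry_limit:
                raise
            # Re-establish the channel/socket before trying again,
            # instead of letting the error bubble up to the retry
            # wrapper, which would just log and loop forever.
            connection.reconnect()


# Demo with a fake connection whose first publish hits a closed socket.
class FakeConnection(object):
    def __init__(self):
        self.reconnects = 0
        self.closed = True

    def reconnect(self):
        self.reconnects += 1
        self.closed = False


conn = FakeConnection()


def publish():
    if conn.closed:
        raise RecoverableConnectionError('connection already closed')
    return 'sent'


result = ensure(conn, publish)
print(result)
```

If the Havana ensure() doesn't catch these error types, the exception
propagates up to the retry wrapper instead, which matches the wedged
behavior we observed.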
Chris
_______________________________________________
OpenStack-dev mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev