On 03/24/2014 10:59 AM, Dan Smith wrote:
Any ideas on what might be going on would be appreciated.

This looks like something that should be filed as a bug. I don't have
any ideas off hand, bit I will note that the reconnection logic works
fine for us in the upstream upgrade tests. That scenario includes
starting up a full stack, then taking down everything except compute and
rebuilding a new one on master. After the several minutes it takes to
upgrade the controller services, the compute host reconnects and is
ready to go before tempest runs.

I suspect your case wedged itself somehow other than that, which
definitely looks nasty and is worth tracking in a bug.

We've got an HA controller setup using pacemaker and were stress-testing it by doing multiple controlled switchovers while doing other activity. Generally this works okay, but last night we ran into this problem.

I'll file a bug, but in the meantime I've found something that looks a bit suspicious. The "Unexpected exception occurred 61 time(s)... retrying." message comes from forever_retry_uncaught_exceptions() in excutils.py. It looks like we're raising

RecoverableConnectionError: connection already closed

down in /usr/lib64/python2.7/site-packages/amqp/abstract_channel.py, but nothing handles it.

It looks like the most likely place that should be handling it is nova.openstack.common.rpc.impl_kombu.Connection.ensure().


In the current oslo.messaging code the ensure() routine explicitly handles connection errors (which RecoverableConnectionError is) and socket timeouts--the ensure() routine in Havana doesn't do this.


Chris

_______________________________________________
OpenStack-dev mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Reply via email to