On Nov 29, 2013, at 9:24 PM, Chris Friesen <chris.frie...@windriver.com> wrote:
> On 11/29/2013 06:37 PM, David Koo wrote:
>> On Nov 29, 02:22:17 PM (Friday), Chris Friesen wrote:
>>> We're currently running Grizzly (going to Havana soon) and we're
>>> running into an issue where if the active controller is ungracefully
>>> killed then nova-compute on the compute node doesn't properly
>>> connect to the new rabbitmq server on the newly-active controller
>>> node.
>>>
>>> Interestingly, killing and restarting nova-compute on the compute
>>> node seems to work, which implies that the retry code is doing
>>> something less effective than the initial startup.
>>>
>>> Has anyone doing HA controller setups run into something similar?
>
> As a followup, it looks like if I wait for 9 minutes or so I see a message in
> the compute logs:
>
> 2013-11-30 00:02:14.756 1246 ERROR nova.openstack.common.rpc.common [-]
> Failed to consume message from queue: Socket closed
>
> It then reconnects to the AMQP server and everything is fine after that.
> However, any instances that I tried to boot during those 9 minutes stay stuck
> in the "BUILD" status.
>
>>
>> So the rabbitmq server and the controller are on the same node?
>
> Yes, they are.
>
>> My
>> guess is that it's related to this bug 856764 (RabbitMQ connections
>> lack heartbeat or TCP keepalives). The gist of it is that since there
>> are no heartbeats between the MQ and nova-compute, if the MQ goes down
>> ungracefully then nova-compute has no way of knowing. If the MQ goes
>> down gracefully then the MQ clients are notified and so the problem
>> doesn't arise.
>
> Sounds about right.
>
>> We got bitten by the same bug a while ago when our controller node
>> got hard reset without any warning!. It came down to this bug (which,
>> unfortunately, doesn't have a fix yet). We worked around this bug by
>> implementing our own crude fix - we wrote a simple app to periodically
>> check if the MQ was alive (write a short message into the MQ, then
>> read it out again). When this fails n-times in a row we restart
>> nova-compute. Very ugly, but it worked!
>
> Sounds reasonable.
>
> I did notice a kombu heartbeat change that was submitted and then backed out
> again because it was buggy. I guess we're still waiting on the real fix?

Hi Chris,

This general problem comes up a lot, and one fix is to use keepalives. Note
that more is needed if you are using multi-master rabbitmq, but for failover I
have had great success with the following (also posted to the bug):

When a connection to a socket is cut off completely, the receiving side
doesn't know that the connection has dropped, so you can end up with a
half-open connection. The general solution for this in linux is to turn on
TCP_KEEPALIVES. Kombu will enable keepalives if the version number is high
enough (>1.0 iirc), but rabbit needs to be specially configured to send
keepalives on the connections that it creates.

So solving the HA issue generally involves a rabbit config with a section like
the following:

    [
     {rabbit, [{tcp_listen_options, [binary,
                                     {packet, raw},
                                     {reuseaddr, true},
                                     {backlog, 128},
                                     {nodelay, true},
                                     {exit_on_close, false},
                                     {keepalive, true}]}
     ]}
    ].

Then you should also shorten the keepalive sysctl settings or it will still
take ~2 hrs to terminate the connections:

    echo "5" > /proc/sys/net/ipv4/tcp_keepalive_time
    echo "5" > /proc/sys/net/ipv4/tcp_keepalive_probes
    echo "1" > /proc/sys/net/ipv4/tcp_keepalive_intvl

Obviously this should be done in a sysctl config file instead of at the
command line. Note that if you only want to shorten the rabbit keepalives but
keep everything else as a default, you can use an LD_PRELOAD library to do so.
For example you could use:

https://github.com/meebey/force_bind/blob/master/README

Vish

>
> Chris
>
>
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
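
For reference, the per-socket alternative mentioned above (shortening keepalives for one connection without changing the system-wide sysctls) can be sketched in Python. This is only a minimal illustration, not what kombu or nova-compute actually does: `enable_keepalive` is a hypothetical helper name, the default values simply mirror the sysctl numbers quoted above, and the `TCP_KEEPIDLE`, `TCP_KEEPINTVL` and `TCP_KEEPCNT` options are Linux-specific, so the sketch applies them only when the platform exposes them.

```python
import socket


def enable_keepalive(sock, idle=5, interval=1, probes=5):
    """Hypothetical helper: turn on TCP keepalives for one socket and,
    where the platform exposes the options, shorten its timers."""
    # SO_KEEPALIVE switches keepalive probing on for this socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket counterparts of the tcp_keepalive_time, tcp_keepalive_intvl
    # and tcp_keepalive_probes sysctls. Skipped if the constant is missing.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", probes)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock


if __name__ == "__main__":
    s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    # Non-zero means keepalive is enabled on this socket.
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

With these values a dead peer would be noticed after roughly idle + interval * probes seconds of silence on that one socket, while other connections on the host keep the kernel defaults.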