We had the same problem in our deployment. Here is a brief description of what we saw and how we fixed it: http://l4tol7.blogspot.com/2013/12/openstack-rabbitmq-issues.html
On Mon, Dec 2, 2013 at 10:37 AM, Vishvananda Ishaya <[email protected]> wrote:

>
> On Nov 29, 2013, at 9:24 PM, Chris Friesen <[email protected]> wrote:
>
> > On 11/29/2013 06:37 PM, David Koo wrote:
> >> On Nov 29, 02:22:17 PM (Friday), Chris Friesen wrote:
> >>> We're currently running Grizzly (going to Havana soon) and we're
> >>> running into an issue where if the active controller is ungracefully
> >>> killed then nova-compute on the compute node doesn't properly
> >>> connect to the new rabbitmq server on the newly-active controller
> >>> node.
> >>>
> >>> Interestingly, killing and restarting nova-compute on the compute
> >>> node seems to work, which implies that the retry code is doing
> >>> something less effective than the initial startup.
> >>>
> >>> Has anyone doing HA controller setups run into something similar?
> >
> > As a followup, it looks like if I wait for 9 minutes or so I see a
> > message in the compute logs:
> >
> > 2013-11-30 00:02:14.756 1246 ERROR nova.openstack.common.rpc.common [-]
> > Failed to consume message from queue: Socket closed
> >
> > It then reconnects to the AMQP server and everything is fine after that.
> > However, any instances that I tried to boot during those 9 minutes stay
> > stuck in the "BUILD" status.
> >
> >> So the rabbitmq server and the controller are on the same node?
> >
> > Yes, they are.
> >
> >> My guess is that it's related to bug 856764 (RabbitMQ connections
> >> lack heartbeat or TCP keepalives). The gist of it is that since there
> >> are no heartbeats between the MQ and nova-compute, if the MQ goes down
> >> ungracefully then nova-compute has no way of knowing. If the MQ goes
> >> down gracefully then the MQ clients are notified and so the problem
> >> doesn't arise.
> >
> > Sounds about right.
> >
> >> We got bitten by the same bug a while ago when our controller node
> >> got hard reset without any warning. It came down to this bug (which,
> >> unfortunately, doesn't have a fix yet). We worked around it with a
> >> crude fix of our own - we wrote a simple app that periodically
> >> checks whether the MQ is alive (write a short message into the MQ,
> >> then read it back out). When this fails n times in a row we restart
> >> nova-compute. Very ugly, but it worked!
> >
> > Sounds reasonable.
> >
> > I did notice a kombu heartbeat change that was submitted and then backed
> > out again because it was buggy. I guess we're still waiting on the real
> > fix?
>
> Hi Chris,
>
> This general problem comes up a lot, and one fix is to use keepalives.
> Note that more is needed if you are using multi-master rabbitmq, but for
> failover I have had great success with the following (also posted to the
> bug):
>
> When a connection to a socket is cut off completely, the receiving side
> doesn't know that the connection has dropped, so you can end up with a
> half-open connection. The general solution for this in linux is to turn on
> TCP_KEEPALIVES. Kombu will enable keepalives if the version number is high
> enough (>1.0 iirc), but rabbit needs to be specially configured to send
> keepalives on the connections that it creates.
>
> So solving the HA issue generally involves a rabbit config with a section
> like the following:
>
> [
>   {rabbit, [{tcp_listen_options, [binary,
>                                   {packet, raw},
>                                   {reuseaddr, true},
>                                   {backlog, 128},
>                                   {nodelay, true},
>                                   {exit_on_close, false},
>                                   {keepalive, true}]}
>   ]}
> ].
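Commenting inline: for anyone who wants to reproduce David Koo's watchdog
workaround described above, a minimal sketch using kombu's SimpleQueue might
look like the following. The mq_alive name, the health_check queue name, and
the timeout are placeholders of mine, not what Koo's app actually used:

    import uuid

    from kombu import Connection

    def mq_alive(url, timeout=5):
        """Round-trip one message through the broker; False on any failure."""
        try:
            with Connection(url, connect_timeout=timeout) as conn:
                queue = conn.SimpleQueue('health_check')  # placeholder name
                token = uuid.uuid4().hex
                queue.put(token)
                msg = queue.get(block=True, timeout=timeout)
                msg.ack()
                queue.close()
                return msg.payload == token
        except Exception:
            return False

A supervisor loop or cron job that restarts nova-compute after n consecutive
False results completes the workaround.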
>
> Then you should also shorten the keepalive sysctl settings or it will
> still take ~2 hrs to terminate the connections:
>
> echo "5" > /proc/sys/net/ipv4/tcp_keepalive_time
> echo "5" > /proc/sys/net/ipv4/tcp_keepalive_probes
> echo "1" > /proc/sys/net/ipv4/tcp_keepalive_intvl
>
> Obviously this should be done in a sysctl config file instead of at the
> command line. Note that if you only want to shorten the rabbit keepalives
> but keep everything else at the defaults, you can use an LD_PRELOAD library
> to do so. For example you could use:
>
> https://github.com/meebey/force_bind/blob/master/README
>
> Vish
>
> > Chris
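For the "sysctl config file" Vish mentions, the persistent equivalent of
those echo commands would be something like this (the path and file name are
just a convention, not prescribed anywhere in the thread):

    # /etc/sysctl.d/90-tcp-keepalive.conf
    net.ipv4.tcp_keepalive_time = 5
    net.ipv4.tcp_keepalive_probes = 5
    net.ipv4.tcp_keepalive_intvl = 1

Applied with "sysctl -p /etc/sysctl.d/90-tcp-keepalive.conf" or at the next
boot.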
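And if patching the client is an option, the per-socket equivalent of the
LD_PRELOAD trick - shortening keepalives only for the AMQP connection while
leaving the system-wide defaults alone - looks roughly like this sketch
(tune_keepalive is a hypothetical helper of mine; the socket options are
Linux-only and the numbers mirror the sysctls above):

    import socket

    def tune_keepalive(sock):
        """Enable and shorten TCP keepalives on one socket (Linux-only)."""
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
        # Idle seconds before the first probe, seconds between probes,
        # and failed probes to send before declaring the peer dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 5)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 1)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)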
--
Ravi

_______________________________________________
OpenStack-dev mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
