We had the same problem in our deployment. Here is a brief description of what we saw and how we fixed it: http://l4tol7.blogspot.com/2013/12/openstack-rabbitmq-issues.html
On Mon, Dec 2, 2013 at 10:37 AM, Vishvananda Ishaya <[email protected]> wrote:

>
> On Nov 29, 2013, at 9:24 PM, Chris Friesen <[email protected]> wrote:
>
> > On 11/29/2013 06:37 PM, David Koo wrote:
> >> On Nov 29, 02:22:17 PM (Friday), Chris Friesen wrote:
> >>> We're currently running Grizzly (going to Havana soon) and we're
> >>> running into an issue where if the active controller is ungracefully
> >>> killed then nova-compute on the compute node doesn't properly
> >>> connect to the new rabbitmq server on the newly-active controller
> >>> node.
> >>>
> >>> Interestingly, killing and restarting nova-compute on the compute
> >>> node seems to work, which implies that the retry code is doing
> >>> something less effective than the initial startup.
> >>>
> >>> Has anyone doing HA controller setups run into something similar?
> >
> > As a followup, it looks like if I wait for 9 minutes or so I see a
> > message in the compute logs:
> >
> > 2013-11-30 00:02:14.756 1246 ERROR nova.openstack.common.rpc.common [-]
> > Failed to consume message from queue: Socket closed
> >
> > It then reconnects to the AMQP server and everything is fine after that.
> > However, any instances that I tried to boot during those 9 minutes stay
> > stuck in the "BUILD" status.
> >
> >> So the rabbitmq server and the controller are on the same node?
> >
> > Yes, they are.
> >
> >> My guess is that it's related to bug 856764 (RabbitMQ connections
> >> lack heartbeat or TCP keepalives). The gist of it is that since there
> >> are no heartbeats between the MQ and nova-compute, if the MQ goes down
> >> ungracefully then nova-compute has no way of knowing. If the MQ goes
> >> down gracefully then the MQ clients are notified and so the problem
> >> doesn't arise.
> >
> > Sounds about right.
> >
> >> We got bitten by the same bug a while ago when our controller node
> >> got hard reset without any warning. It came down to this bug (which,
> >> unfortunately, doesn't have a fix yet). We worked around it with a
> >> crude fix of our own - we wrote a simple app that periodically
> >> checks whether the MQ is alive (write a short message into the MQ,
> >> then read it back out). When this fails n times in a row we restart
> >> nova-compute. Very ugly, but it worked!
> >
> > Sounds reasonable.
> >
> > I did notice a kombu heartbeat change that was submitted and then backed
> > out again because it was buggy. I guess we're still waiting on the real
> > fix?
>
> Hi Chris,
>
> This general problem comes up a lot, and one fix is to use keepalives.
> Note that more is needed if you are using multi-master rabbitmq, but for
> failover I have had great success with the following (also posted to the
> bug):
>
> When a connection to a socket is cut off completely, the receiving side
> doesn't know that the connection has dropped, so you can end up with a
> half-open connection. The general solution for this in linux is to turn on
> TCP_KEEPALIVES. Kombu will enable keepalives if the version number is high
> enough (>1.0 iirc), but rabbit needs to be specially configured to send
> keepalives on the connections that it creates.
>
> So solving the HA issue generally involves a rabbit config with a section
> like the following:
>
> [
>   {rabbit, [{tcp_listen_options, [binary,
>                                   {packet, raw},
>                                   {reuseaddr, true},
>                                   {backlog, 128},
>                                   {nodelay, true},
>                                   {exit_on_close, false},
>                                   {keepalive, true}]}
>   ]}
> ].
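Commenting inline: for anyone who wants to reproduce David Koo's watchdog
workaround described above, a minimal sketch using kombu's SimpleQueue might
look like the following. The mq_alive name, the health_check queue name, and
the timeout are placeholders of mine, not what Koo's app actually used:

    import uuid

    from kombu import Connection

    def mq_alive(url, timeout=5):
        """Round-trip one message through the broker; False on any failure."""
        try:
            with Connection(url, connect_timeout=timeout) as conn:
                queue = conn.SimpleQueue('health_check')  # placeholder name
                token = uuid.uuid4().hex
                queue.put(token)
                msg = queue.get(block=True, timeout=timeout)
                msg.ack()
                queue.close()
                return msg.payload == token
        except Exception:
            return False

A supervisor loop or cron job that restarts nova-compute after n consecutive
False results completes the workaround.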
>
> Then you should also shorten the keepalive sysctl settings or it will
> still take ~2 hrs to terminate the connections:
>
> echo "5" > /proc/sys/net/ipv4/tcp_keepalive_time
> echo "5" > /proc/sys/net/ipv4/tcp_keepalive_probes
> echo "1" > /proc/sys/net/ipv4/tcp_keepalive_intvl
>
> Obviously this should be done in a sysctl config file instead of at the
> command line. Note that if you only want to shorten the rabbit keepalives
> but keep everything else at the defaults, you can use an LD_PRELOAD library
> to do so. For example you could use:
>
> https://github.com/meebey/force_bind/blob/master/README
>
> Vish
>
> > Chris
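For the "sysctl config file" Vish mentions, the persistent equivalent of
those echo commands would be something like this (the path and file name are
just a convention, not prescribed anywhere in the thread):

    # /etc/sysctl.d/90-tcp-keepalive.conf
    net.ipv4.tcp_keepalive_time = 5
    net.ipv4.tcp_keepalive_probes = 5
    net.ipv4.tcp_keepalive_intvl = 1

Applied with "sysctl -p /etc/sysctl.d/90-tcp-keepalive.conf" or at the next
boot.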
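And if patching the client is an option, the per-socket equivalent of the
LD_PRELOAD trick - shortening keepalives only for the AMQP connection while
leaving the system-wide defaults alone - looks roughly like this sketch
(tune_keepalive is a hypothetical helper of mine; the socket options are
Linux-only and the numbers mirror the sysctls above):

    import socket

    def tune_keepalive(sock):
        """Enable and shorten TCP keepalives on one socket (Linux-only)."""
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
        # Idle seconds before the first probe, seconds between probes,
        # and failed probes to send before declaring the peer dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 5)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 1)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)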
--
Ravi

_______________________________________________
OpenStack-dev mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
