Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo

Ajay Kalambur (akalambu) Thu, 21 Apr 2016 12:09:15 -0700

Do you recommend both or can I do away with the system timers and just keep the 
heartbeat?
Ajay

From: "Kris G. Lindgren" <[email protected]<mailto:[email protected]>>
Date: Thursday, April 21, 2016 at 11:54 AM
To: Ajay Kalambur <[email protected]<mailto:[email protected]>>, 
"[email protected]<mailto:[email protected]>"

<[email protected]<mailto:[email protected]>>
Subject: Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo

Yeah, that only fixes part of the issue.  The other part is getting the 
OpenStack messaging code itself to figure out that the connection it's using 
is no longer valid.  Heartbeats by themselves solved 90%+ of our issues with 
rabbitmq and nodes being disconnected and never reconnecting.

___________________________________________________________________
Kris Lindgren
Senior Linux Systems Engineer
GoDaddy

From: "Ajay Kalambur (akalambu)" <[email protected]<mailto:[email protected]>>
Date: Thursday, April 21, 2016 at 12:51 PM
To: "Kris G. Lindgren" <[email protected]<mailto:[email protected]>>, 
"[email protected]<mailto:[email protected]>"

<[email protected]<mailto:[email protected]>>
Subject: Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo

Trying that now. I had aggressive system keepalive timers before:

net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 5
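
For reference, a rough worked calculation of what these three values imply. This is a minimal sketch, assuming the usual Linux behaviour: the first probe goes out after tcp_keepalive_time seconds of idle, and the connection is dropped after tcp_keepalive_probes unanswered probes sent tcp_keepalive_intvl seconds apart. The function name is made up for illustration. Note that these sysctls only affect sockets that have SO_KEEPALIVE set.

```python
def keepalive_detection_seconds(time_s, intvl_s, probes):
    """Approximate worst-case seconds before the kernel drops a dead idle peer.

    time_s  -> net.ipv4.tcp_keepalive_time   (idle seconds before first probe)
    intvl_s -> net.ipv4.tcp_keepalive_intvl  (seconds between probes)
    probes  -> net.ipv4.tcp_keepalive_probes (unanswered probes before giving up)
    """
    return time_s + intvl_s * probes


# Values quoted in this message.
print(keepalive_detection_seconds(time_s=5, intvl_s=10, probes=9))  # 5 + 10 * 9 = 95
```

So even with these timers, an idle dead connection can sit for roughly a minute and a half before the kernel reports it, and only if keepalive is enabled on that socket.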

From: "Kris G. Lindgren" <[email protected]<mailto:[email protected]>>
Date: Thursday, April 21, 2016 at 11:50 AM
To: Ajay Kalambur <[email protected]<mailto:[email protected]>>, 
"[email protected]<mailto:[email protected]>"

<[email protected]<mailto:[email protected]>>
Subject: Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo

Do you have rabbitmq/oslo messaging heartbeats enabled?

If you aren't using heartbeats, it will take a long time for the nova-compute 
agent to figure out that it's actually no longer attached to anything.  
Heartbeat does periodic checks against rabbitmq and will catch this state and 
reconnect.
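
A quick way to check whether heartbeats are switched on in a given config file is sketched below. It assumes the options live in the [oslo_messaging_rabbit] section under the names heartbeat_timeout_threshold and heartbeat_rate, and that a threshold of 0 (which I believe is the Kilo default) means disabled; please verify both against your oslo.messaging version. The sample text and function names are hypothetical, and the values 60 and 2 are only example settings, not a recommendation from this thread.

```python
import configparser

# Hypothetical nova.conf fragment; the values are illustrative only.
SAMPLE_NOVA_CONF = """
[oslo_messaging_rabbit]
heartbeat_timeout_threshold = 60
heartbeat_rate = 2
"""


def heartbeat_settings(conf_text):
    """Return (threshold, rate) parsed from nova.conf-style text.

    Missing options fall back to 0 and 2, which are assumed defaults.
    """
    # strict=False tolerates repeated sections/options; no interpolation
    # so that '%' characters in other values do not raise errors.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    section = "oslo_messaging_rabbit"
    threshold = parser.getint(section, "heartbeat_timeout_threshold", fallback=0)
    rate = parser.getint(section, "heartbeat_rate", fallback=2)
    return threshold, rate


def heartbeats_enabled(conf_text):
    """Heartbeats are treated as enabled when the threshold is above zero."""
    threshold, _rate = heartbeat_settings(conf_text)
    return threshold > 0


print(heartbeats_enabled(SAMPLE_NOVA_CONF))
```

To run it against a real file, read the contents of nova.conf on the compute node and pass the text to heartbeats_enabled().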

___________________________________________________________________
Kris Lindgren
Senior Linux Systems Engineer
GoDaddy

From: "Ajay Kalambur (akalambu)" <[email protected]<mailto:[email protected]>>
Date: Thursday, April 21, 2016 at 11:43 AM
To: 
"[email protected]<mailto:[email protected]>"

<[email protected]<mailto:[email protected]>>
Subject: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo

Hi,
On Kilo, if I bring down one controller node, some computes sometimes 
report down forever.
I need to restart the compute service on the compute node to recover. It looks 
like oslo is not reconnecting in nova-compute.
Here is the trace from nova-compute:
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 156, in call
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     retry=self.retry)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 90, in _send
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     timeout=timeout, retry=retry)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 350, in send
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     retry=retry)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 339, in _send
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     result = self._waiter.wait(msg_id, timeout)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 243, in wait
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     message = self.waiters.get(msg_id, timeout=timeout)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 149, in get
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     'to message ID %s' % msg_id)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db MessagingTimeout: Timed out waiting for a reply to message ID e064b5f6c8244818afdc5e91fff8ebf1

Any thoughts? I am on stable/kilo for oslo.

Ajay

_______________________________________________
OpenStack-operators mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
