Do you recommend both, or can I do away with the system timers and just keep the heartbeat?

Ajay

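For context on the system timers in question: with Linux TCP keepalive, an idle connection to a dead peer is dropped after roughly tcp_keepalive_time + tcp_keepalive_intvl * tcp_keepalive_probes seconds. A minimal sketch of that arithmetic follows, using the sysctl values quoted further down in this thread. The function name is only illustrative, and the estimate assumes the socket has SO_KEEPALIVE set and is idle.

```python
def keepalive_detection_seconds(time_s, intvl_s, probes):
    """Approximate time for Linux TCP keepalive to give up on an idle, dead peer.

    time_s  -> net.ipv4.tcp_keepalive_time   (idle seconds before the first probe)
    intvl_s -> net.ipv4.tcp_keepalive_intvl  (seconds between unanswered probes)
    probes  -> net.ipv4.tcp_keepalive_probes (unanswered probes before giving up)
    """
    return time_s + intvl_s * probes


# Values from the sysctl settings quoted in the thread.
print(keepalive_detection_seconds(time_s=5, intvl_s=10, probes=9))  # 95
```

Keepalive probes are only sent on idle connections. A connection with unacknowledged data in flight is governed by the TCP retransmission timers instead, so these settings alone do not bound how long a stuck RPC call can wait.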
From: "Kris G. Lindgren" <[email protected]>
Date: Thursday, April 21, 2016 at 11:54 AM
To: Ajay Kalambur <[email protected]>, "[email protected]" <[email protected]>
Subject: Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo

Yea, that only fixes part of the issue. The other part is getting the openstack messaging code itself to figure out that the connection it is using is no longer valid. Heartbeats by themselves solved 90%+ of our issues with rabbitmq and nodes being disconnected and never reconnecting.
___________________________________________________________________
Kris Lindgren
Senior Linux Systems Engineer
GoDaddy

From: "Ajay Kalambur (akalambu)" <[email protected]>
Date: Thursday, April 21, 2016 at 12:51 PM
To: "Kris G. Lindgren" <[email protected]>, "[email protected]" <[email protected]>
Subject: Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo

Trying that now. I had aggressive system keepalive timers before:

net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 5

From: "Kris G. Lindgren" <[email protected]>
Date: Thursday, April 21, 2016 at 11:50 AM
To: Ajay Kalambur <[email protected]>, "[email protected]" <[email protected]>
Subject: Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo

Do you have rabbitmq/oslo messaging heartbeats enabled? If you aren't using heartbeats, it will take a long time for the nova-compute agent to figure out that it's actually no longer attached to anything. Heartbeat does periodic checks against rabbitmq and will catch this state and reconnect.
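One way to answer the "heartbeats enabled?" question on a compute node is to inspect the service configuration. The sketch below assumes the rabbit driver reads `heartbeat_timeout_threshold` and `heartbeat_rate` from the `[oslo_messaging_rabbit]` section, and that a threshold of 0 or a missing option means heartbeats are off. Both assumptions should be checked against the oslo.messaging release actually installed. The helper name and the sample values are hypothetical.

```python
import configparser

# Example fragment only; the values are illustrative, not recommendations.
SAMPLE_CONF = """
[oslo_messaging_rabbit]
heartbeat_timeout_threshold = 60
heartbeat_rate = 2
"""


def heartbeat_enabled(conf_text):
    """Return True if the config text sets a positive heartbeat threshold.

    Assumption: an absent option or a value of 0 means heartbeats are disabled.
    """
    # strict=False tolerates repeated keys; interpolation=None leaves '%' alone.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    threshold = parser.getint(
        "oslo_messaging_rabbit", "heartbeat_timeout_threshold", fallback=0
    )
    return threshold > 0


print(heartbeat_enabled(SAMPLE_CONF))
```

To check a real node, the same function could be fed the contents of the service's configuration file, for example the text read from nova.conf.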
___________________________________________________________________
Kris Lindgren
Senior Linux Systems Engineer
GoDaddy

From: "Ajay Kalambur (akalambu)" <[email protected]>
Date: Thursday, April 21, 2016 at 11:43 AM
To: "[email protected]" <[email protected]>
Subject: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo

Hi,

On Kilo, I am seeing that if I bring down one controller node, some computes sometimes report down forever. I need to restart the compute service on the compute node to recover. It looks like oslo is not reconnecting in nova-compute.

Here is the trace from nova-compute:

2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 156, in call
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     retry=self.retry)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 90, in _send
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     timeout=timeout, retry=retry)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 350, in send
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     retry=retry)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 339, in _send
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     result = self._waiter.wait(msg_id, timeout)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 243, in wait
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     message = self.waiters.get(msg_id, timeout=timeout)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 149, in get
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     'to message ID %s' % msg_id)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db MessagingTimeout: Timed out waiting for a reply to message ID e064b5f6c8244818afdc5e91fff8ebf1

Any thoughts? I am at stable/kilo for oslo.

Ajay

_______________________________________________
OpenStack-operators mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
