Hello Marun,
Please find below the details of my setup and the tests I have done so far.
Setup
- One physical box with 16 cores and 256 GB memory. Two VMs were created on this
box - one for the controller and one for the network node
- 16x compute nodes (each with 16 cores and 256 GB memory)
- All systems are installed with Ubuntu Precise + Havana bits from the Ubuntu
Cloud Archive
Steps to simulate the issue
1) Concurrently create 30 instances (m1.small) using the REST API with min_count=30
2) Sleep for 20 minutes and repeat step (1)
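The two steps above boil down to one POST per batch; here is a minimal driver sketch (server name, image/flavor refs, endpoint, and the commented-out request call are placeholders, not taken from my runs):

```python
import time

NUM_BATCHES = 8          # 8 x 30 = 240 instances in total
BATCH_SIZE = 30
SLEEP_BETWEEN = 20 * 60  # 20 minutes between batches

def boot_request(image_ref, flavor_ref, count):
    # Body for POST /v2/{tenant_id}/servers; min_count/max_count make Nova
    # create the whole batch of instances from a single API call.
    return {"server": {
        "name": "scale-test",          # placeholder name
        "imageRef": image_ref,
        "flavorRef": flavor_ref,       # m1.small in the tests
        "min_count": count,
        "max_count": count,
    }}

def run_batches():
    # Not invoked here; sketch of the test loop only.
    for _ in range(NUM_BATCHES):
        body = boot_request("IMAGE-UUID", "FLAVOR-ID", BATCH_SIZE)
        # e.g. requests.post(NOVA_URL + "/servers", json=body,
        #                    headers={"X-Auth-Token": TOKEN})
        time.sleep(SLEEP_BETWEEN)
```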
Issue 1
In Havana, once we cross 150 instances (5 batches x 30), during the 6th batch some
instances go into the ERROR state
because their network ports could not be created, and some instances get duplicate
IP addresses.
Per Maru Newby, this issue might be related to this bug:
https://bugs.launchpad.net/bugs/1192381
I did a similar test with Grizzly on the same environment 2 months back,
where I was able to deploy close to 240 instances without any errors.
Initially on Grizzly I saw the same behavior too, but with these tunings based
on this bug
https://bugs.launchpad.net/neutron/+bug/1160442, I never had issues (tested more
than 10 times):
sqlalchemy_pool_size = 60
sqlalchemy_max_overflow = 120
sqlalchemy_pool_timeout = 2
agent_down_time = 60
report_interval = 20
In Havana, I applied the same tunables but could never get past 150
instances. Without the tunables I could not get past
100 instances. We are getting many timeout errors from the DHCP agent and
Neutron clients.
NOTE: After tuning agent_down_time to 60 and report_interval to 20, we are no
longer getting these error messages:
2013-12-02 11:44:43.421 28201 WARNING neutron.scheduler.dhcp_agent_scheduler
[-] No more DHCP agents
2013-12-02 11:44:43.439 28201 WARNING neutron.scheduler.dhcp_agent_scheduler
[-] No more DHCP agents
2013-12-02 11:44:43.452 28201 WARNING neutron.scheduler.dhcp_agent_scheduler
[-] No more DHCP agents
In the compute nodes' openvswitch agent logs, we see these errors repeating
continuously:
2013-12-04 06:46:02.081 3546 TRACE
neutron.plugins.openvswitch.agent.ovs_neutron_agent Timeout: Timeout while
waiting on RPC response - topic: "q-plugin", RPC method:
"security_group_rules_for_devices" info: "<unknown>"
and WARNING neutron.openstack.common.rpc.amqp [-] No calling threads waiting
for msg_id
The DHCP agent has the errors below:
2013-12-02 15:35:19.557 22125 ERROR neutron.agent.dhcp_agent [-] Unable to
reload_allocations dhcp.
2013-12-02 15:35:19.557 22125 TRACE neutron.agent.dhcp_agent Timeout: Timeout
while waiting on RPC response - topic: "q-plugin", RPC method: "get_dhcp_port"
info: "<unknown>"
2013-12-02 15:35:34.266 22125 ERROR neutron.agent.dhcp_agent [-] Unable to sync
network state.
2013-12-02 15:35:34.266 22125 TRACE neutron.agent.dhcp_agent Timeout: Timeout
while waiting on RPC response - topic: "q-plugin", RPC method:
"get_active_networks_info" info: "<unknown>"
In Havana, I merged the code from this patch and set api_workers to 8 (my
controller VM has 8 cores / 16 hyperthreads):
https://review.openstack.org/#/c/37131/
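With that patch merged, the worker count is just a neutron.conf setting (a sketch; the option name is per the patch above):

```
[DEFAULT]
# Number of neutron-server API worker processes
api_workers = 8
```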
After applying this patch and starting 8 neutron-server worker threads, during the
batch creation of 240 instances with 30 concurrent requests in each batch,
238 instances became active and 2 instances went into ERROR. Interestingly, the
2 instances which went into the ERROR state are from the same compute node.
Unlike earlier, this time the errors are due to 'Too many connections' to the
MySQL database.
2013-12-04 17:07:59.877 21286 AUDIT nova.compute.manager
[req-26d64693-d1ef-40f3-8350-659e34d5b1d7 c4d609870d4447c684858216da2f8041
9b073211dd5c4988993341cc955e200b] [instance:
c14596fd-13d5-482b-85af-e87077d4ed9b] Terminating instance
2013-12-04 17:08:00.578 21286 ERROR nova.compute.manager
[req-26d64693-d1ef-40f3-8350-659e34d5b1d7 c4d609870d4447c684858216da2f8041
9b073211dd5c4988993341cc955e200b] [instance:
c14596fd-13d5-482b-85af-e87077d4ed9b] Error: Remote error: OperationalError
(OperationalError) (1040, 'Too many connections') None None
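A plausible explanation for the 1040 errors (my reading, not confirmed from the logs): each neutron-server worker process keeps its own SQLAlchemy connection pool, so the tuned pool limits multiply by the worker count:

```python
# Worst-case server-side connection count with the tunables used above.
api_workers = 8
sqlalchemy_pool_size = 60
sqlalchemy_max_overflow = 120

worst_case = api_workers * (sqlalchemy_pool_size + sqlalchemy_max_overflow)
print(worst_case)  # 1440 possible connections from neutron-server alone

# MySQL's default max_connections is only 151 (MySQL 5.1/5.5), far below
# this, so error 1040 is expected under load unless max_connections is raised.
```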
The patch 'https://review.openstack.org/#/c/37131/' needs to be backported to
address the Neutron scaling issues in Havana.
Carl is already backporting this patch to Havana
(https://review.openstack.org/#/c/60082/), which is good.
Issue 2
Grizzly:
During concurrent instance creation in Grizzly, once we crossed 210
instances, during the subsequent 30-instance batches some of
the instances could not get their IP address within the first few minutes of
the first boot. Instance MAC and IP address details
were updated in the dnsmasq hosts file, but with a delay. Instances were
eventually able to get their IP address.
If we rebooted an instance using 'nova reboot', it would get its IP address.
* The amount of delay depends on the number of network ports and is in the
range of 8 seconds to 2 minutes
Havana:
In Havana, only 81 instances could get an IP during the first boot. Ports are
created and IP addresses allocated
very fast, but it takes quite a lot of time for the port to come UP. Once the
port is UP, instances are able to send the DHCP request
and get the IP address.
During network port create and network port update, there are a lot of
'security_group_rules_for_devices' messages. The OVS agents on the
compute nodes are getting timeouts on "security_group_rules_for_devices".
This issue exists in Grizzly too, but there we observed it only after
200+ active instances (200 network ports), whereas in Havana
we are having these issues with fewer than 100 active ports.
In Havana, if we reboot the instance, it is not able to get its IP address, even
though its network port entry already exists in the
dnsmasq hosts file. We can't even ping the IP address now, which we were able to
ping before the instance reboot. Only after restarting the
'neutron-dhcp-agent' service (which restarts 'dnsmasq') and rebooting the
instance could it get its IP.
This clearly shows we have a performance regression in neutron/Havana compared
to quantum/Grizzly.
FYI, I have attached with this email my notes for one of the instances which
could not get an IP during first boot.
I am happy to share the results of my Grizzly tests and the logs from my recent
Havana tests.
Thanks & Regards,
Sreedhar Nathani
For example
- Instance 2ba403d2-71f8-4ee6-a746-2e398bda9404 could not get an IP address in
time, so cloud-init could not finish its configuration.
The instance was created at 2013-12-04 15:23:20 and launched at 2013-12-04 15:23:45
instance id,created at,booted at,boot (seconds)
2ba403d2-71f8-4ee6-a746-2e398bda9404,2013-12-04T15:23:20,??,??
Console log of this instance (2ba403d2-71f8-4ee6-a746-2e398bda9404):
[ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu
[ 0.000000] Linux version 3.2.0-53-virtual (buildd@allspice) (gcc version
4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #81-Ubuntu SMP Thu Aug 22 21:21:26 UTC
2013 (Ubuntu 3.2.0-53.81-virtual 3.2.50)
...
cloud-init start-local running: Wed, 04 Dec 2013 15:23:52 +0000. up 1.59 seconds
no instance data found in start-local
cloud-init-nonet waiting 120 seconds for a network device.
cloud-init-nonet gave up waiting for a network device.
ci-info: lo : 1 127.0.0.1 255.0.0.0 .
ci-info: eth0 : 1 . . fa:16:3e:39:68:82
route_info failed
// From the console log, cloud-init waited until Wed, 04 Dec 2013 15:26:02
(cloud-init actually waits for 130 seconds here),
but eth0 could not get an IP
# grep 2ba403d2-71f8-4ee6-a746-2e398bda9404 neutronportlist_frommysql
9b073211dd5c4988993341cc955e200b 92a508d4-d664-47d7-8933-5a9826eabe3b
fb4fd94f-9d44-4f22-a347-ffdf8476c148 fa:16:3e:39:68:82 1
ACTIVE2ba403d2-71f8-4ee6-a746-2e398bda9404 compute:None
# grep fa:16:3e:39:68:82 neutronportlist_fromcommand
| 92a508d4-d664-47d7-8933-5a9826eabe3b | | fa:16:3e:39:68:82 |
{"subnet_id": "50918af5-70eb-4cf7-a50e-276334738793", "ip_address":
"192.168.32.64"} |
// But the IP was allocated from DHCP only at this time (a delay of close to 7
seconds after the instance's cloud-init gave up on networking)
Dec 4 15:26:09 osnetworknode dnsmasq-dhcp[10597]: DHCPDISCOVER(tap0f63c330-de)
fa:16:3e:39:68:82
Dec 4 15:26:09 osnetworknode dnsmasq-dhcp[10597]: DHCPOFFER(tap0f63c330-de)
192.168.32.64 fa:16:3e:39:68:82
Dec 4 15:26:09 osnetworknode dnsmasq-dhcp[10597]: DHCPREQUEST(tap0f63c330-de)
192.168.32.64 fa:16:3e:39:68:82
Dec 4 15:26:09 osnetworknode dnsmasq-dhcp[10597]: DHCPACK(tap0f63c330-de)
192.168.32.64 fa:16:3e:39:68:82 host-192-168-32-64
NOTE: This delay increases when the number of active ports is high
Even though the instance could not get an IP during the first boot, we could ping
the IP after some time without making any changes
# ip netns exec qdhcp-fb4fd94f-9d44-4f22-a347-ffdf8476c148 ping 192.168.32.64
PING 192.168.32.64 (192.168.32.64) 56(84) bytes of data.
64 bytes from 192.168.32.64: icmp_req=1 ttl=64 time=3.16 ms
64 bytes from 192.168.32.64: icmp_req=2 ttl=64 time=0.808 ms
64 bytes from 192.168.32.64: icmp_req=3 ttl=64 time=0.651 ms
64 bytes from 192.168.32.64: icmp_req=4 ttl=64 time=0.848 ms
64 bytes from 192.168.32.64: icmp_req=5 ttl=64 time=0.688 ms
64 bytes from 192.168.32.64: icmp_req=6 ttl=64 time=0.824 ms
^C
--- 192.168.32.64 ping statistics ---
6 packets transmitted, 6 received, 0% packet loss, time 5001ms
rtt min/avg/max/mdev = 0.651/1.164/3.167/0.898 ms
We can't connect via SSH since the SSH server was not started during the first
boot due to the unavailability of the network
# ip netns exec qdhcp-fb4fd94f-9d44-4f22-a347-ffdf8476c148 ssh 192.168.32.64
Read from socket failed: Connection reset by peer
From the server log: the port is created and its IP allocated at 2013-12-04
15:23:31, but the port is UP only at 2013-12-04 15:26:03 (a gap of 152 seconds)
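The 152-second figure can be checked directly from the two log timestamps; it is longer than the 120 seconds cloud-init-nonet waits for a network device, which is why the first boot misses DHCP:

```python
from datetime import datetime

# Timestamps from the neutron server log for port 92a508d4-...
allocated = datetime.strptime("2013-12-04 15:23:31", "%Y-%m-%d %H:%M:%S")  # IP allocated
port_up = datetime.strptime("2013-12-04 15:26:03", "%Y-%m-%d %H:%M:%S")    # update_device_up

gap = (port_up - allocated).total_seconds()
print(gap)  # 152.0 seconds, versus cloud-init-nonet's 120-second wait
```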
2013-12-04 15:23:31.982 17798 DEBUG neutron.db.db_base_plugin_v2 [-] Allocated
IP 192.168.32.64
(fb4fd94f-9d44-4f22-a347-ffdf8476c148/50918af5-70eb-4cf7-a50e-276334738793/92a508d4-d664-47d7-8933-5a9826eabe3b)
create_port
/usr/lib/python2.7/dist-packages/neutron/db/db_base_plugin_v2.py:1360
2013-12-04 15:25:38.592 17784 DEBUG neutron.openstack.common.rpc.amqp [-]
received {u'_context_roles': [u'admin'], u'_msg_id':
u'e18b6849e0bf4ba38350c3e470bb089e', u'_context_read_deleted': u'no',
u'_reply_q': u'reply_e9aab6015394494792574c917fb89490', u'_context_tenant_id':
None, u'args': {u'devices': [u'0ecc4e80-b071-4695-95ca-4382e9d9e281',
u'92a508d4-d664-47d7-8933-5a9826eabe3b']}, u'namespace': None, u'_unique_id':
u'8ffdfc64acb241d094ff3f94c18490db', u'_context_is_admin': True, u'version':
u'1.1', u'_context_project_id': None, u'_context_timestamp': u'2013-12-04
09:57:50.353613', u'_context_user_id': None, u'method':
u'security_group_rules_for_devices'} _safe_log
/usr/lib/python2.7/dist-packages/neutron/openstack/common/rpc/common.py:276
2013-12-04 15:25:39.865 17784 DEBUG neutron.plugins.openvswitch.ovs_db_v2 [-]
get_port_with_securitygroups()
called:port_id=92a508d4-d664-47d7-8933-5a9826eabe3b get_port_from_device
/usr/lib/python2.7/dist-packages/neutron/plugins/openvswitch/ovs_db_v2.py:322
2013-12-04 15:25:58.739 17784 DEBUG neutron.openstack.common.rpc.amqp [-]
received {u'_context_roles': [u'admin'], u'_msg_id':
u'73f2ecc5f46b4516b3fe3fba58413096', u'_context_read_deleted': u'no',
u'_reply_q': u'reply_e9aab6015394494792574c917fb89490', u'_context_tenant_id':
None, u'args': {u'devices': [u'5c7ba334-4581-480c-ad91-5b95ff299a82',
u'0ecc4e80-b071-4695-95ca-4382e9d9e281',
u'92a508d4-d664-47d7-8933-5a9826eabe3b',
u'26a5efd1-0855-40c7-8730-f1098518055d',
u'2eb1f35e-873a-460a-806b-700cc1969289']}, u'namespace': None, u'_unique_id':
u'a7755a2207b84379941aaa68f0258384', u'_context_is_admin': True, u'version':
u'1.1', u'_context_project_id': None, u'_context_timestamp': u'2013-12-04
09:57:50.353613', u'_context_user_id': None, u'method':
u'security_group_rules_for_devices'} _safe_log
/usr/lib/python2.7/dist-packages/neutron/openstack/common/rpc/common.py:276
2013-12-04 15:25:58.804 17784 DEBUG neutron.plugins.openvswitch.ovs_db_v2 [-]
get_port_with_securitygroups()
called:port_id=92a508d4-d664-47d7-8933-5a9826eabe3b get_port_from_device
/usr/lib/python2.7/dist-packages/neutron/plugins/openvswitch/ovs_db_v2.py:322
2013-12-04 15:26:02.942 17784 DEBUG neutron.openstack.common.rpc.amqp [-]
received {u'_context_roles': [u'admin'], u'_msg_id':
u'b6b1639a8304448db0cf6a7c585cfc65', u'_context_read_deleted': u'no',
u'_reply_q': u'reply_e9aab6015394494792574c917fb89490', u'_context_tenant_id':
None, u'args': {u'device': u'92a508d4-d664-47d7-8933-5a9826eabe3b',
u'agent_id': u'ovs4a5403c4074e'}, u'namespace': None, u'_unique_id':
u'e3f3078be0974f83aa8c58b726bfbd5e', u'_context_is_admin': True, u'version':
u'1.1', u'_context_project_id': None, u'_context_timestamp': u'2013-12-04
09:57:50.353613', u'_context_user_id': None, u'method': u'get_device_details'}
_safe_log
/usr/lib/python2.7/dist-packages/neutron/openstack/common/rpc/common.py:276
2013-12-04 15:26:02.943 17784 DEBUG
neutron.plugins.openvswitch.ovs_neutron_plugin [-] Device
92a508d4-d664-47d7-8933-5a9826eabe3b details requested from ovs4a5403c4074e
get_device_details
/usr/lib/python2.7/dist-packages/neutron/plugins/openvswitch/ovs_neutron_plugin.py:105
2013-12-04 15:26:03.207 17784 DEBUG neutron.openstack.common.rpc.amqp [-]
received {u'_context_roles': [u'admin'], u'_msg_id':
u'034caefaa4d0414a912cfb2a35c8c6b0', u'_context_read_deleted': u'no',
u'_reply_q': u'reply_e9aab6015394494792574c917fb89490', u'_context_tenant_id':
None, u'args': {u'device': u'92a508d4-d664-47d7-8933-5a9826eabe3b', u'host':
u'oscompute11', u'agent_id': u'ovs4a5403c4074e'}, u'namespace': None,
u'_unique_id': u'21dfb81aa7234320a7bdcfedef593e25', u'_context_is_admin': True,
u'version': u'1.1', u'_context_project_id': None, u'_context_timestamp':
u'2013-12-04 09:57:50.353613', u'_context_user_id': None, u'method':
u'update_device_up'} _safe_log
/usr/lib/python2.7/dist-packages/neutron/openstack/common/rpc/common.py:276
2013-12-04 15:26:03.253 17784 DEBUG
neutron.plugins.openvswitch.ovs_neutron_plugin [-] Device
92a508d4-d664-47d7-8933-5a9826eabe3b up on ovs4a5403c4074e update_device_up
/usr/lib/python2.7/dist-packages/neutron/plugins/openvswitch/ovs_neutron_plugin.py:159
DHCP Agent log
2013-12-04 15:23:32.054 30756 DEBUG neutron.openstack.common.rpc.amqp [-]
received {u'_context_roles': [u'_member_', u'admin'], u'_context_read_deleted':
u'no', u'_context_tenant_id': u'9b073211dd5c4988993341cc955e200b', u'args':
{u'payload': {u'port': {u'status': u'DOWN', u'binding:host_id': u'oscompute11',
u'name': u'', u'allowed_address_pairs': [], u'admin_state_up': True,
u'network_id': u'fb4fd94f-9d44-4f22-a347-ffdf8476c148', u'tenant_id':
u'9b073211dd5c4988993341cc955e200b', u'binding:vif_type': u'ovs',
u'device_owner': u'compute:None', u'binding:capabilities': {u'port_filter':
True}, u'mac_address': u'fa:16:3e:39:68:82', u'fixed_ips': [{u'subnet_id':
u'50918af5-70eb-4cf7-a50e-276334738793', u'ip_address': u'192.168.32.64'}],
u'id': u'92a508d4-d664-47d7-8933-5a9826eabe3b', u'security_groups':
[u'7823d4f4-24fa-4434-816d-ed839d69a5b6'], u'device_id':
u'2ba403d2-71f8-4ee6-a746-2e398bda9404'}}}, u'namespace': None, u'_unique_id':
u'7b2ad87e6e8440408a5c506467c2f4ca', u'_context_is_admin': True, u'version':
u'1.0', u'_context_project_id': u'9b073211dd5c4988993341cc955e200b',
u'_context_timestamp': u'2013-12-04 15:23:31.922971', u'_context_user_id':
u'add03b5664694d99b151392a615c602b', u'method': u'port_create_end'} _safe_log
/usr/lib/python2.7/dist-packages/neutron/openstack/common/rpc/common.py:276
We are still getting the errors below in the DHCP agent log:
2013-12-04 17:13:58.479 30756 ERROR neutron.agent.dhcp_agent [-] Unable to
reload_allocations dhcp.
2013-12-04 17:13:58.479 30756 TRACE neutron.agent.dhcp_agent Traceback (most
recent call last):
..
2013-12-04 17:13:58.479 30756 TRACE neutron.agent.dhcp_agent File
"/usr/lib/python2.7/dist-packages/neutron/openstack/common/rpc/proxy.py", line
130, in call
2013-12-04 17:13:58.479 30756 TRACE neutron.agent.dhcp_agent exc.info,
real_topic, msg.get('method'))
2013-12-04 17:13:58.479 30756 TRACE neutron.agent.dhcp_agent Timeout: Timeout
while waiting on RPC response - topic: "q-plugin", RPC method: "get_dhcp_port"
info: "<unknown>"
2013-12-04 17:14:22.166 30756 ERROR neutron.agent.dhcp_agent [-] Unable to sync
network state.
2013-12-04 17:14:22.166 30756 TRACE neutron.agent.dhcp_agent Traceback (most
recent call last):
...
2013-12-04 17:14:22.166 30756 TRACE neutron.agent.dhcp_agent File
"/usr/lib/python2.7/dist-packages/neutron/openstack/common/rpc/proxy.py", line
130, in call
2013-12-04 17:14:22.166 30756 TRACE neutron.agent.dhcp_agent exc.info,
real_topic, msg.get('method'))
2013-12-04 17:14:22.166 30756 TRACE neutron.agent.dhcp_agent Timeout: Timeout
while waiting on RPC response - topic: "q-plugin", RPC method:
"get_active_networks_info" info: "<unknown>"
In the compute node openvswitch agent log, we are getting many errors of this
kind:
2013-12-04 15:24:47.996 34067 TRACE
neutron.plugins.openvswitch.agent.ovs_neutron_agent Timeout: Timeout while
waiting on RPC response - topic: "q-plugin", RPC method:
"security_group_rules_for_devices" info: "<unknown>"
_______________________________________________
OpenStack-dev mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev