Hello Sreedhar,

I am focusing only on the OVS agent at the moment. Armando recently fixed a few issues with the DHCP agent; those issues were triggering a perennial resync, and with his fixes I reckon DHCP agent response times should be better.

I reckon Maru is also working on architectural improvements for the DHCP agent (see the thread on DHCP agent reliability).

Regards,
Salvatore
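If you want to check whether the perennial resync is really gone after picking up those fixes, one rough test (hedged: it assumes the default log location and that your DHCP agent logs "Synchronizing state" on each resync, as the Havana-era agent does) is to watch how often that message accumulates while the cloud is otherwise idle:

    # Count resync events in the DHCP agent log; a count that keeps growing
    # on an idle system suggests the resync loop is still being triggered.
    grep -c 'Synchronizing state' /var/log/neutron/dhcp-agent.log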
On 13 December 2013 20:26, Nathani, Sreedhar (APS) <[email protected]> wrote:

> Hello All,
>
> An update on my testing.
>
> I have installed one more VM as a neutron-server host and configured it
> under the load balancer. Currently I have 2 VMs running the neutron-server
> process (one is the controller, the other a dedicated neutron-server VM).
>
> With this configuration, during batch instance deployment with a batch
> size of 30 and a sleep time of 20 min, 180 instances could get an IP
> during the first boot. During the creation of instances 181-210, some
> instances could not get an IP.
>
> This is much better than with a single neutron server, where only 120
> instances could get an IP during the first boot in Havana.
>
> While the instances are being created, the parent neutron-server process
> spends close to 90% of the CPU time on both servers, while the rest of
> the neutron-server processes (the API workers) show very low CPU
> utilization.
>
> I think it's a good idea to extend the current multi-process
> neutron-server API model to support RPC messages as well.
>
> Even with the current setup (multiple neutron-server hosts), we still see
> RPC timeouts in the DHCP and L2 agents, and the dnsmasq process is still
> being restarted by SIGKILL, though.
>
> Thanks & Regards,
> Sreedhar Nathani
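The parent/worker CPU split described above is straightforward to confirm. A minimal sketch, assuming neutron-server runs with API workers so there is one parent plus several children (process name as on a typical Havana install):

    # %CPU per neutron-server process. The parent (lowest PID) is the one
    # consuming RPC messages, so expect it near 90% CPU while the API
    # worker children stay almost idle.
    ps -o pid,ppid,pcpu,etime,cmd -p "$(pgrep -d, -f neutron-server)"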
> From: Nathani, Sreedhar (APS)
> Sent: Friday, December 13, 2013 12:08 AM
> To: OpenStack Development Mailing List (not for usage questions)
> Subject: RE: [openstack-dev] Performance Regression in Neutron/Havana
> compared to Quantum/Grizzly
>
> Hello Salvatore,
>
> Thanks for your feedback. Will the patch
> https://review.openstack.org/#/c/57420/, which you are working on for bug
> https://bugs.launchpad.net/neutron/+bug/1253993, help correct the OVS
> agent loop slowdown issue?
>
> Does this patch also address the DHCP agent updating the hosts file only
> once a minute and eventually sending SIGKILL to the dnsmasq process?
>
> I have tested with Maru's patch https://review.openstack.org/#/c/61168/
> ('Send DHCP notifications regardless of agent status'), but observed the
> same behavior with it.
>
> Thanks & Regards,
> Sreedhar Nathani
>
> From: Salvatore Orlando [mailto:[email protected]]
> Sent: Thursday, December 12, 2013 6:21 PM
> To: OpenStack Development Mailing List (not for usage questions)
> Subject: Re: [openstack-dev] Performance Regression in Neutron/Havana
> compared to Quantum/Grizzly
>
> I believe your analysis is correct and in line with the findings reported
> in the bug concerning the OVS agent loop slowdown.
>
> The issue has become even more prominent with the ML2 plugin due to an
> increased number of notifications sent.
>
> Another issue which makes delays on the DHCP agent worse is that
> instances send a discover message only once a minute.
>
> Salvatore
>
> On 11 Dec 2013 11:50, "Nathani, Sreedhar (APS)" <[email protected]>
> wrote:
>
> Hello Peter,
>
> Here are the tests I have done. I already have 240 instances active
> across all 16 compute nodes. To make the tests and data collection easy,
> I ran the tests against a single compute node.
>
> First test:
> * 240 instances already active, 16 of them on the compute node under
>   test.
> * Deploy 10 instances concurrently on that compute node, using the nova
>   boot command with the num-instances option.
> * All of the instances got an IP during boot.
>
> - Instances were created at 2013-12-10 13:41:01.
> - From the compute host, DHCP requests were sent from 13:41:20, but they
>   did not reach the DHCP server. The reply from the DHCP server arrived
>   at 13:43:08 (a delay of 108 seconds).
> - The DHCP agent updated the hosts file from 13:41:06 until 13:42:54; the
>   dnsmasq process got a SIGHUP each time the hosts file was updated.
> - On the compute node, tap devices were created between 13:41:08 and
>   13:41:18, security group rules were received between 13:41:45 and
>   13:42:56, and iptables rules were updated between 13:41:50 and
>   13:43:04.
>
> Second test:
> * Deleted the newly created 10 instances.
> * 240 instances already active, 16 of them on the compute node under
>   test.
> * Deploy 30 instances concurrently on that compute node, using the nova
>   boot command with the num-instances option.
> * None of the instances got an IP during boot.
>
> - Instances were created at 2013-12-10 14:13:50.
> - From the compute host, DHCP requests were sent from 14:14:14, but they
>   did not reach the DHCP server (tcpdump on the network node shows no
>   DHCP requests arriving). The reply from the DHCP server only arrived at
>   14:22:10 (a delay of 636 seconds).
> - From the strace of the DHCP agent process: it first updated the hosts
>   file at 14:14:05; after that there is a gap of close to 60 seconds
>   before the next instance's address is added. This repeated up to the
>   7th instance, updated at 14:19:50; the 30th instance was updated at
>   14:20:00.
> - During the 30-instance creation, the dnsmasq process got a SIGHUP after
>   each hosts-file update, but at 14:19:52 it got a SIGKILL and a new
>   process was created: 14:19:52.881088 +++ killed by SIGKILL +++
> - On the compute node, tap devices were created between 14:14:03 and
>   14:14:38. From the strace of the L2 agent, security-group-related
>   messages were received from 14:14:27 until 14:20:02. During this period
>   the L2 agent log shows many RPC timeout messages like:
>   Timeout: Timeout while waiting on RPC response - topic: "q-plugin",
>   RPC method: "security_group_rules_for_devices" info: "<unknown>"
>   Because the security group messages reached this compute node late, it
>   took a very long time to update the iptables rules (updates are visible
>   until 14:20), which caused the DHCP packets to be dropped on the
>   compute node itself, never reaching the DHCP server.
>
> Here is my understanding based on the tests: instances, and their tap
> devices, are created quickly, but there is a considerable delay both in
> updating the port details in the dnsmasq hosts file and in delivering the
> security group information to the compute nodes. As a result, the compute
> nodes cannot update their iptables rules fast enough, and the instances
> fail to get an IP.
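For anyone who wants to reconstruct this kind of timeline themselves, a hedged sketch of the instrumentation involved (the network ID, tap device, and process-match patterns are illustrative and will differ per deployment):

    # On the network node: watch the DHCP agent's hosts-file writes and the
    # kill() calls it issues to dnsmasq (SIGHUP on update, SIGKILL on restart).
    strace -f -tt -e trace=signal,write -p "$(pgrep -f dhcp-agent | head -1)"

    # Still on the network node: capture DHCP traffic inside the DHCP
    # namespace to see when DHCPDISCOVER actually reaches the server side.
    ip netns exec qdhcp-<network-id> tcpdump -ni <tap-device> port 67 or port 68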
> I have collected tcpdump output from the controller and compute nodes,
> plus strace output for the DHCP agent, dnsmasq, and the OVS L2 agents, in
> case you are interested in looking at them.
>
> Thanks & Regards,
> Sreedhar Nathani
>
> -----Original Message-----
> From: Peter Feiner [mailto:[email protected]]
> Sent: Tuesday, December 10, 2013 10:32 PM
> To: OpenStack Development Mailing List (not for usage questions)
> Subject: Re: [openstack-dev] Performance Regression in Neutron/Havana
> compared to Quantum/Grizzly
>
> On Tue, Dec 10, 2013 at 7:48 AM, Nathani, Sreedhar (APS)
> <[email protected]> wrote:
> > My setup has 17 L2 agents (16 compute nodes, one network node). Setting
> > minimize_polling helped to reduce the CPU utilization of the L2 agents,
> > but it did not help instances get an IP during first boot.
> >
> > With minimize_polling enabled, fewer instances could get an IP than
> > without the minimize_polling fix.
> >
> > Once we reach a certain number of ports (in my case, 120), during
> > subsequent concurrent instance deployment (30 instances), updating the
> > port details in the dnsmasq hosts file takes a long time, which delays
> > instances getting an IP address.
>
> To figure out what the next problem is, I recommend that you determine
> precisely what "port details in the dnsmasq host [are] taking [a] long
> time" to update. Is the DHCPDISCOVER packet from the VM arriving before
> the dnsmasq process's hostsfile is updated and dnsmasq is SIGHUP'd? Is
> the VM sending the DHCPDISCOVER request before its tap device is wired to
> the dnsmasq process (i.e., determine the status of the chain of bridges
> at the time the guest sends the DHCPDISCOVER packet)? Perhaps the
> DHCPDISCOVER packet is being dropped because the iptables rules for the
> VM's port haven't been instantiated when the DHCPDISCOVER packet is sent.
> Or perhaps something else, such as the replies being dropped. These are
> my only theories at the moment.
>
> Anyhow, once you determine where the DHCP packets are being lost, you'll
> have a much better idea of what needs to be fixed.
>
> One suggestion I have to make your debugging less onerous is to
> reconfigure your guest image's networking init script to retry DHCP
> requests indefinitely. That way, you'll see the guests' DHCP traffic when
> neutron eventually gets everything in order. On CirrOS, add the following
> line to the eth0 stanza in /etc/network/interfaces to retry DHCP requests
> 100 times every 3 seconds:
>
> udhcpc_opts -t 100 -T 3
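Spelled out, Peter's suggestion amounts to an interfaces file like the one below (a sketch: the exact stanza layout varies by image, and it assumes CirrOS's busybox ifupdown honors udhcpc_opts):

    # Inside the guest image: retry DHCP 100 times, 3 seconds apart,
    # instead of giving up after the default handful of attempts.
    cat <<'EOF' > /etc/network/interfaces
    auto lo
    iface lo inet loopback

    auto eth0
    iface eth0 inet dhcp
        udhcpc_opts -t 100 -T 3
    EOF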
> > When I deployed only 5 instances concurrently (with 211 instances
> > already active) instead of 30, all of the instances were able to get an
> > IP. But when I deployed 10 instances concurrently (with 216 instances
> > already active) instead of 30, none of the instances were able to get
> > an IP.
>
> This is reminiscent of yet another problem I saw at scale. If you're
> using the security group rule "VMs in this group can talk to everybody
> else in this group", which is one of the defaults in devstack, you get
> O(N^2) iptables rules for N VMs running on a particular host. With more
> VMs running, the openvswitch agent, which is responsible for
> instantiating the iptables rules and does so somewhat laboriously with
> respect to their number, could take too long to configure ports before
> the VMs' DHCP clients time out.
>
> However, considering that you're seeing low CPU utilization by the
> openvswitch agent, I don't think you're having this problem; since you're
> distributing your VMs across numerous compute hosts, N is quite small in
> your case. I only saw problems when N was > 100.
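To gauge whether the O(N^2) effect Peter describes is in play on a given compute node, a rough check (hedged: it assumes the OVS agent's iptables hybrid firewall driver, whose chain names carry the neutron-openvswi- prefix):

    # Count iptables rules instantiated by the OVS agent, and watch how the
    # count grows as more VMs in the same security group land on this host.
    iptables-save | grep -c 'neutron-openvswi'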
_______________________________________________
OpenStack-dev mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
