Sorry to have taken the discussion on a slight tangent. I meant only to offer the solution as a stop-gap. I agree that the fundamental problem should still be addressed.

On Tue, Dec 3, 2013 at 8:01 PM, Maru Newby <ma...@redhat.com> wrote:
>
> On Dec 4, 2013, at 1:47 AM, Stephen Gran <stephen.g...@theguardian.com> wrote:
>
>> On 03/12/13 16:08, Maru Newby wrote:
>>> I've been investigating a bug that is preventing VM's from receiving IP
>>> addresses when a Neutron service is under high load:
>>>
>>> https://bugs.launchpad.net/neutron/+bug/1192381
>>>
>>> High load causes the DHCP agent's status updates to be delayed, causing the
>>> Neutron service to assume that the agent is down. This results in the
>>> Neutron service not sending notifications of port addition to the DHCP
>>> agent. At present, the notifications are simply dropped. A simple fix is
>>> to send notifications regardless of agent status. Does anybody have any
>>> objections to this stop-gap approach? I'm not clear on the implications of
>>> sending notifications to agents that are down, but I'm hoping for a simple
>>> fix that can be backported to both havana and grizzly (yes, this bug has
>>> been with us that long).
>>>
>>> Fixing this problem for real, though, will likely be more involved. The
>>> proposal to replace the current wsgi framework with Pecan may increase the
>>> Neutron service's scalability, but should we continue to use a 'fire and
>>> forget' approach to notification? Being able to track the success or
>>> failure of a given action outside of the logs would seem pretty important,
>>> and allow for more effective coordination with Nova than is currently
>>> possible.
>>
>> It strikes me that we ask an awful lot of a single neutron-server instance -
>> it has to take state updates from all the agents, it has to do scheduling,
>> it has to respond to API requests, and it has to communicate about actual
>> changes with the agents.
>>
>> Maybe breaking some of these out the way nova has a scheduler and a
>> conductor and so on might be a good model (I know there are things people
>> are unhappy about with nova-scheduler, but imagine how much worse it would
>> be if it was built into the API).
>>
>> Doing all of those tasks, and doing it largely single threaded, is just
>> asking for overload.
>
> I'm sorry if it wasn't clear in my original message, but my primary concern
> lies with the reliability rather than the scalability of the Neutron service.
> Carl's addition of multiple workers is a good stop-gap to minimize the
> impact of blocking IO calls in the current architecture, and we already have
> consensus on the need to separate RPC and WSGI functions as part of the Pecan
> rewrite. I am worried, though, that we are not being sufficiently diligent
> in how we manage state transitions through notifications. Managing
> transitions and their associate error states is needlessly complicated by the
> current ad-hoc approach, and I'd appreciate input on the part of distributed
> systems experts as to how we could do better.
>
>
> m.
>
>
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev