On Sep 4, 2014, at 3:24 AM, Daniel P. Berrange <berra...@redhat.com> wrote:
> Position statement
> ==================
>
> Over the past year I've increasingly come to the conclusion that
> Nova is heading for (or probably already at) a major crisis. If
> steps are not taken to avert this, the project is likely to lose
> a non-trivial amount of talent, both regular code contributors and
> core team members. That includes myself. This is not good for
> Nova's long term health and so should be of concern to anyone
> involved in Nova and OpenStack.
>
> For those who don't want to read the whole mail, the executive
> summary is that the nova-core team is an unfixable bottleneck
> in our development process with our current project structure.
> The only way I see to remove the bottleneck is to split the virt
> drivers out of tree and let them all have their own core teams
> in their area of code, leaving current nova core to focus on
> all the common code outside the virt driver impls. I nonetheless
> urge people to read the whole mail.

I am highly in favor of this approach (and have been for at least a year). Every time we have brought this up in the past there has been concern about the shared code, but we have to make a change. We have tried various other approaches and none of them have made a dent.

+1000

Vish

>
>
> Background information
> ======================
>
> I see many factors coming together to form the crisis
>
> - Burn out of core team members from over work
> - Difficulty bringing new talent into the core team
> - Long delay in getting code reviewed & merged
> - Marginalization of code areas which aren't popular
> - Increasing size of nova code through new drivers
> - Exclusion of developers without corporate backing
>
> Each item on its own may not seem too bad, but combined they
> add up to a big problem.
>
> Core team burn out
> ------------------
>
> Having been involved in Nova for several dev cycles now, it is clear
> that the backlog of code up for review never goes away.
> Even intensive code review efforts at various points in the dev
> cycle make only a small impact on the backlog. This has a pretty
> significant impact on core team members, as their work is never
> done. At best, the dial is sometimes set to 10, instead of 11.
>
> Many people, myself included, have built tools to help deal with
> the reviews in a more efficient manner than plain gerrit allows
> for. These certainly help, but they can't ever solve the problem
> on their own - just make it slightly more bearable. And this is
> not even considering that core team members might have useful
> contributions to make in ways beyond just code review. Ultimately
> the workload is just too high to sustain the levels of review
> required, so core team members will eventually burn out (as they
> have done many times already).
>
> Even if one person attempts to take the initiative to heavily
> invest in review of certain features it is often to no avail.
> Unless a second dedicated core reviewer can be found to 'tag
> team' it is hard for one person to make a difference. The end
> result is that a patch is +2d and then sits idle for weeks or
> more until a merge conflict requires it to be reposted at which
> point even that one +2 is lost. This is a pretty demotivating
> outcome for both reviewers & the patch contributor.
>
>
> New core team talent
> --------------------
>
> It can't escape attention that the Nova core team does not grow
> in size very often. When Nova was younger and its code base was
> smaller, it was easier for contributors to get onto core because
> the base level of knowledge required was that much smaller. To
> get onto core today requires a major investment in learning Nova
> over a year or more. Even people who potentially have the latent
> skills may not have the time available to invest in learning the
> entirety of Nova.
>
> With the number of reviews proposed to Nova, the core team should
> probably be at least double its current size[1].
> There is plenty of expertise in the project as a whole but it is
> typically focused into specific areas of the codebase. There is
> nowhere we can find 20 more people with broad knowledge of the
> codebase who could be promoted even over the next year, let alone
> today. This is ignoring that many existing members of core are
> relatively inactive due to burnout and so need replacing. That means
> we really need another 25-30 people for core. That's not going to
> happen.
>
>
> Code review delays
> ------------------
>
> The obvious result of having too much work for too few reviewers
> is that code contributors face major delays in getting their work
> reviewed and merged. From personal experience, during Juno, I've
> probably spent 1 week in aggregate on actual code development vs
> 8 weeks on waiting on code review. You have to constantly be on
> alert for review comments because unless you can respond quickly
> (and repost) while you still have the attention of the reviewer,
> they may not look again for days/weeks.
>
> The length of time to get work merged serves as a demotivator to
> actually do work in the first place. I've personally avoided doing
> a lot of code refactoring & cleanup work that would improve the
> maintainability of the libvirt driver in the long term, because
> I can't face the battle to get it reviewed & merged. Other people
> have told me much the same. It is not uncommon to see changes that
> have been pending for 2 dev cycles, not because the code was bad
> but because they couldn't get people to review it. Contributors
> will simply walk away from nova if that happens too often.
>
> Even when fate is on your side and code is reviewed, the chances
> of it getting a success result from the CI systems first time
> around is slim due to false failures. This really compounds the
> already poor experience of submitting code to Nova.
>
>
> Marginalization of areas
> ------------------------
>
> Since the core team has far more work to do than it can manage, it
> has to prioritize what it looks at. The core team figures out what
> the overall project priorities are and will focus more effort
> into those areas. Individual members will also focus their attention
> in areas where they have personal interest. Unfortunately the core
> team is not representative of the entire Nova codebase. The
> inevitable result is that the HyperV and VMWare drivers can often
> lose out in the battle for attention. In the past we've said that
> it is the responsibility of people in those teams to invest in
> learning the entirety of Nova so that they have the knowledge
> required to be promoted to core. I used to support that approach,
> but now consider it to be flawed due to the increased difficulty of
> *anyone* getting onto core. The time investment required is simply
> too great to expect people to undertake it. The marginalized areas
> have no freedom to self-organize to solve their own problems because
> they are forever dependent on the core team bottleneck.
>
>
> Increasing size
> ---------------
>
> There is a long standing policy that the Nova virt driver API is
> considered unstable and thus all virt driver implementations should
> ultimately be part of the Nova codebase. In Juno it is likely that
> the Ironic driver will be merged into Nova. In a future release we
> may yet see the Docker driver return to the Nova tree.
>
> The result of merging yet more drivers is that there will be yet
> more work for nova reviewers to do. It is far from obvious that
> merging new drivers will be accompanied by new members on the core
> team. So it is likely that the workload is going to get worse over
> future releases.
>
> Splitting out the scheduler will be beneficial in reducing the
> review backlog, but probably not enough to counter the growth from
> virt drivers.
> Killing off nova-network is unlikely to help at all, since that
> consumes little-to-no review time currently [2].
>
>
> Exclusion of non-corporate devs
> -------------------------------
>
> There is a strong push from nova core for everything that is merged
> into Nova to be accompanied by CI testing. This certainly makes sense
> from the POV of overall product quality and reducing the burden on
> the core reviewers to catch all mistakes through code review. What
> we don't take into account is that setting up and maintaining such
> testing infrastructure requires a major investment in terms of both
> hardware costs and manpower. It has already been seen that this is
> too much to bear for some companies who contribute to Nova, eg with
> the Docker driver [3]. Developers who are not affiliated with any
> company do not stand any realistic chance of meeting the CI testing
> needs unless they're lucky that their feature can be covered by an
> existing running CI system. This looks like it could effectively
> prevent support for a community submitted FreeBSD BHyve driver from
> being merged, no matter how useful it might be to users who want it.
> NB, now a FreeBSD BHyve driver would probably be done as part of the
> libvirt driver, which complicates this particular point I'm trying
> to make, since I don't suggest reducing testing of the libvirt driver
> compared to what it has today.
>
> I don't want to get into a detailed testing discussion here really,
> since that's somewhat of a tangent to the question of our dev and
> review process. I am, however, concerned when our testing policy
> forces maintainers of some virt drivers into the position of being
> treated as second class citizens within the project as a whole, with
> a different development structure to the in-tree approved drivers.
> That said, Docker probably benefits from being out of tree, since it
> thus avoids the painful nova core bottleneck entirely.
>
>
> Problem summary
> ---------------
>
> The common thread through most of these problems is that the nova
> core team is a massive bottleneck in the development process.
> Processes adopted (or under discussion) by the core team are
> fundamentally not helping to remove the bottleneck. Rather they are
> introducing new layers of bureaucracy so that we can feel justified
> in telling contributors that we are going to ignore or reject their
> work. At best this is going to result in far less useful work taking
> place in Nova. At worst this is further reducing the ability of
> people to self organize to solve the problems, will cause our
> contributors to leave the community and possibly even force some
> virt drivers to go out of tree to get their work done. Death by a
> thousand cuts.
>
> A sub-thread is around the idea that our current structure of one big
> repo also has other negative consequences for drivers who may not be
> able to meet the same high standards as the rest of the drivers. A
> driver is either in or out of the club, and if it's out of the club
> life is made comparatively harder for its developers & users. By all
> means have rules around the requirements for a release to use the
> openstack trademarks based on CI testing coverage, but don't let that
> penalize the actual development process itself.
>
> Overall Nova is being increasingly hostile to its community of
> contributors. I don't mean this as a result of any sense of malice
> or ill-will. What we're seeing is merely a symptom of a hard worked
> team struggling to survive with a burden they can no longer be
> reasonably expected to cope with. Nova core has done an amazing job
> at surviving for so long as the project grew much larger & more
> quickly than anyone probably expected. The time has come for some
> radical changes to let nova adapt & evolve to the next level.
>
> This is a crisis. A large crisis.
> In fact, if you've got a moment, it's a twelve-storey crisis with a
> magnificent entrance hall, carpeting throughout, 24-hour portage,
> and an enormous sign on the roof, saying 'This Is a Large Crisis'.
> A large crisis requires a large plan.
>
>
> Proposal / solution
> ===================
>
> In the past Nova has spun out its volume layer to form the cinder
> project. The Neutron project started as an attempt to solve the
> networking space, and ultimately replace the nova-network. It
> is likely that the scheduler will be spun out to a separate project.
>
> Now Neutron itself has grown so large and successful that it is
> considering going one step further and spinning its actual drivers
> out of tree into standalone add-on projects [4]. I've heard on the
> grapevine that Ironic is considering similar steps for hardware
> drivers.
>
> The radical (?) solution to the nova core team bottleneck is thus to
> follow this lead and split the nova virt drivers out into separate
> projects and delegate their maintenance to new dedicated teams.
>
> - Nova becomes the home for the public APIs, RPC system, database
>   persistence and the glue that ties all this together with the
>   virt driver API.
>
> - Each virt driver project gets its own core team and is responsible
>   for dealing with review, merge & release of their codebase.
>
> Note, I really do mean *all* virt drivers should be separate. I do
> not want to see some virt drivers split out and others remain in tree
> because I feel that signifies that the out of tree ones are second
> class citizens. It is important to set up our dev structure so that
> every virt driver is treated equally & so has equal chance to achieve
> success. As long as one driver remains in tree there will always be
> pressure for others to join it, which is exactly what we're trying
> to get away from here.
> By everyone being out of tree, drivers (like Docker) can take a
> decision about whether it is the right time for them to be investing
> in gating CI systems, without being penalized in their dev process
> if they make a decision to not have gate tests right now.
>
> This has quite a few implications for the way development would
> operate.
>
> - The Nova core team, at least, would be voluntarily giving up a
>   large amount of responsibility over the evolution of virt drivers.
>   Due to human nature, people are not good at giving up power, so
>   this may be painful to swallow. Realistically current nova core are
>   not experts in most of the virt drivers to start with, and more
>   importantly we clearly do not have sufficient time to do a good job
>   of review with everything submitted. Much of the current need
>   for core review of virt drivers is to prevent the mis-use of a
>   poorly defined virt driver API...which can be mitigated - See
>   later point(s)
>
> - Nova core would/should not have automatic +2 over the virt driver
>   repositories since it is unreasonable to assume they have the
>   suitable domain knowledge for all virt drivers out there. People
>   would of course be able to be members of multiple core teams. For
>   example John G would naturally be nova-core and nova-xen-core. I
>   would aim for nova-core and nova-libvirt-core, and so on. I do not
>   want any +2 responsibility over VMWare/HyperV/Docker drivers since
>   they're not my area of expertise - I only look at them today
>   because they have no other nova-core representation.
>
> - Not sure if it implies the Nova PTL would be solely focused on
>   Nova common. eg would there continue to be one PTL over all virt
>   driver implementation projects, or would each project have its
>   own PTL. Maybe this is irrelevant if a Czars approach is chosen
>   by virt driver projects for their work.
>   I'd be inclined to say that a single PTL should stay as a
>   figurehead to represent all the virt driver projects, acting as a
>   point of contact to ensure we keep communication / co-operation
>   between the drivers in sync.
>
> - A fairly significant amount of nova code would need to be
>   considered semi-stable API. Certainly everything under nova/virt
>   and any object which is passed in/out of the virt driver API.
>   Changes to such APIs would have to be done in a backwards
>   compatible manner, since it is no longer possible to lock-step
>   change all the virt driver impls. In some ways I think this would
>   be a good thing as it will encourage people to put more thought
>   into the long term maintainability of nova internal code instead
>   of relying on being able to rip it apart later, at will.
>
> - The nova/virt/driver.py class would need to be much better
>   specified. All parameters / return values which are opaque dicts
>   must be replaced with objects + attributes. Completion of the
>   objectification work is mandatory, so there is cleaner separation
>   between virt driver impls & the rest of Nova.
>
> - If changes are required to common code, the virt driver developer
>   would first have to get the necessary pieces merged into Nova
>   common. Then the follow up virt driver specific changes could be
>   proposed to their repo. This implies that some changes to virt
>   drivers will still contend for resource in the common nova repo
>   and team. This contention should be lower than it is today though
>   since the current nova core team should have less code to look
>   after per-person on aggregate.
>
> - Changes submitted to nova common code would trigger running of CI
>   tests against the external virt drivers. Each virt driver core team
>   would decide whether they want their driver to be tested upon Nova
>   common changes. Expect that all would choose to be included to the
>   same extent that they are today.
>   So the level of validation of nova code would remain at least at
>   the current level. I don't want to reduce the amount of code
>   testing here since that's contrary to the direction we're taking
>   wrt testing.
>
> - Changes submitted to virt drivers would trigger running CI tests
>   that are applicable. eg changes to libvirt driver repo would not
>   involve running database migration tests, since all database code
>   is isolated in nova. libvirt changes would not trigger vmware,
>   xenserver, ironic, etc CI systems. Virt driver changes should
>   see fewer false positives in the tests as a result, and those
>   that do occur should be more explicitly related to the code being
>   proposed. eg a change to vmware is not going to trigger a tempest
>   run that uses libvirt, so non-deterministic failures in libvirt
>   will no longer plague vmware developers' reviews. This would also
>   make it possible for VMWare CI to be made gating for changes to
>   the VMWare virt driver repository, without negatively impacting
>   other virt drivers. So this change should increase testing quality
>   for non-libvirt virt drivers and reduce pain of false failures
>   for everyone.
>
> - Virt drivers shouldn't use oslo incubator code from nova, since
>   that can be replaced any time and isn't upgrade safe. Ideally most
>   of the incubator stuff virt drivers need should turn into stable
>   oslo APIs. Failing that, virt drivers would need their own copy
>   of the incubated code in their module namespace, to avoid clash
>   or the need to lock-step upgrade code across separate git repos.
>
> Overall the outcome is that
>
> - Far larger pool of people able to approve changes for merge
>   across nova core and the virt driver core teams.
>
> - Faster review & merge for virt driver patches that don't involve
>   changes to common nova code, with less CI system testing pain.
>
> - Ability to set priority of work in virt drivers without a 3rd
>   party being a bottleneck, where the work doesn't involve changes
>   to common nova code.
>
> - Each virt driver team can accept as many features as they feel
>   able to deal with, without it negatively impacting the number of
>   features that other virt driver teams can accept.
>
> - Virt drivers have flexibility to set their own policies on testing
>   without being penalized in the way they then develop their code.
>
>
> The migration
> -------------
>
> Obviously a proposal such as this is a pretty major undertaking. It
> should be clear that it could not be done in a short amount of time.
> It is suggested that it be phased in over two dev cycles. In the Kilo
> release the focus would be on prep work:
>
> - Formalizing the separation between the virt driver impls and the
>   rest of the nova codebase. Figure out exactly which areas of
>   Nova internal code will need to be marked as 'semi-stable' for
>   use by virt drivers, and ensure their APIs are sufficiently
>   future proof.
>
> - Discussions with the infrastructure, docs, release, etc teams to
>   identify impacts on them and do any required prep work.
>
> - Identify the teams which will lead the new virt driver projects.
>   eg core reviewers, PTL or Czars for each job if applicable
>
> - Probably more things I can't think of right now
>
> Then at the start of the Lxxxx release, the virt drivers would
> actually be split out into separate git repos and start their dev
> process for the future. So for the bulk of Lxxxx the drivers would be
> on their own. The two Lxxxx rc milestones would allow us to ensure
> our release processes were working well with the split drivers
> before the Lxxxx final release.
>
>
> Final thought
> -------------
>
> Overall consider this a vote of no confidence in nova continuing to
> operate as it does today. As mentioned above this is not intended to
> be disrespectful to the effort every nova core member has put in,
> just a reflection on the changed environment we find ourselves in.
> Fiddling with our processes for the prioritization of work cannot
> fix the fundamental fact that nova core today is a massive single
> point of failure & bottleneck, increasingly crippling the project.
> The only way to address this is by a radical re-organization of our
> project to remove the bottlenecks by modularization of the project &
> leaders. Keeping a single team and adding more/changing process is
> simply akin to shifting deckchairs on the Titanic and not a viable
> option to continue with long term.
>
> Now, I'm realistic. Even with every driver separated out, I expect
> that each of them will individually still have more work proposed
> than their respective core teams have time to review. The new
> structure will, however, make it easier for the individual core
> teams to grow & adapt in ways that suit their specific needs. For
> self-contained virt driver changes it will mean that acceptance of
> work by one team will not take away capacity from another team.
> Further the burden of knowledge required to make it onto a virt
> driver core team would be greatly reduced due to the narrower focus
> of each core team, so we'll be able to promote good talent onto virt
> driver core teams more quickly.
>
> Thanks for reading so far. Now let's make some real change to prepare
> us for future sustainability & even growth.
>
> Regards,
> Daniel
>
> [1] http://lists.openstack.org/pipermail/openstack-dev/2014-August/044459.html
> [2] There was a ban on changes to nova-network for much of the past two
>     cycles. It was relaxed primarily to allow full conversion of nova
>     codebase to use objects, not for major new feature development.
> [3] http://lists.openstack.org/pipermail/openstack-dev/2014-July/040443.html
> [4] http://lists.openstack.org/pipermail/openstack-dev/2014-August/043036.html
>
> --
> |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
> |: http://libvirt.org -o- http://virt-manager.org :|
> |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
> |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
>
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev