Increasing size
---------------
There is a long standing policy that the Nova virt driver API is considered
unstable and thus all virt driver implementations should ultimately be part of
the Nova codebase. In Juno it is likely that the Ironic driver will be merged
into Nova. In a future release we may yet see the Docker driver return to the
Nova tree.
The result of merging yet more drivers is that there will be yet more work for
nova reviewers to do. It is far from obvious that merging new drivers will be
accompanied by new members on the core team. So it is likely that the workload
is going to get worse over future releases.
Splitting out the scheduler will be beneficial in reducing the review backlog,
but probably not enough to counter the growth from virt drivers. Killing off
nova-network is unlikely to help at all, since that consumes little-to-no
review time currently [2].
Exclusion of non-corporate devs
-------------------------------
There is a strong push from nova core for everything that is merged into Nova
to be accompanied by CI testing. This certainly makes sense from the POV of
overall product quality and reducing the burden on the core reviewers to catch
all mistakes through code review. What we don't take into account is that
setting up and maintaining such testing infrastructure requires a major
investment in terms of both hardware costs and manpower. It has already been
seen that this is too much to bear for some companies who contribute to Nova,
eg with the Docker driver [3]. Developers who are not affiliated with any
company do not stand any realistic chance of meeting the CI testing needs
unless they're lucky that their feature can be covered by an existing running
CI system. This looks like it could effectively prevent support for a community
submitted FreeBSD BHyve driver from being merged, no matter how useful it might
be to users who want it.
NB, now a FreeBSD BHyve driver would probably be done as part of the libvirt
driver, which complicates this particular point I'm trying to make, since I
don't suggest reducing testing of the libvirt driver compared to what it has
today.
I don't want to get into a detailed testing discussion here really, since
that's somewhat of a tangent to the question of our dev and review process. I
am, however, concerned when our testing policy forces maintainers of some virt
drivers into the position of being treated as second class citizens within the
project as a whole, with a different development structure to the in-tree
approved drivers.
That said, Docker probably benefits from being out of tree, since it thus
avoids the painful nova core bottleneck entirely.
Problem summary
---------------
The common thread through most of these problems is that the nova core team is
a massive bottleneck in the development process.
Processes adopted (or under discussion) by the core team are fundamentally not
helping to remove the bottleneck. Rather they are introducing new layers of
bureaucracy so that we can feel justified in telling contributors that we are
going to ignore or reject their work. At best this is going to result in far
less useful work taking place in Nova. At worst it further reduces the ability
of people to self-organize to solve the problems, causes our contributors to
leave the community, and possibly even forces some virt drivers to go out of
tree to get their work done. Death by a thousand cuts.
A sub-thread is around the idea that our current structure of one big repo also has
other negative consequences for drivers that may not be able to meet the same high
standards as the rest of the drivers. A driver is either in or out of the club, and
if it's out of the club life is made comparatively harder for its developers &
users. By all means have rules around the requirements for a release to use the
OpenStack trademarks based on CI testing coverage, but don't let that penalize the
actual development process itself.
Overall Nova is being increasingly hostile to its community of contributors. I don't
mean this as a result of any sense of malice or ill-will. What we're seeing is merely a
symptom of a hard worked team struggling to survive with a burden they can no longer be
reasonably expected to cope with. Nova core has done an amazing job at surviving for so
long as the project grew much larger & more quickly than anyone probably expected.
The time has come for some radical changes to let nova adapt & evolve to the next
level.
This is a crisis. A large crisis. In fact, if you've got a moment, it's a
twelve-storey crisis with a magnificent entrance hall, carpeting throughout,
24-hour portage, and an enormous sign on the roof, saying 'This Is a Large
Crisis'. A large crisis requires a large plan.
Proposal / solution
===================
In the past Nova has spun out its volume layer to form the Cinder project. The
Neutron project started as an attempt to solve the networking space, and
ultimately replace nova-network. It is likely that the scheduler will be
spun out to a separate project.
Now Neutron itself has grown so large and successful that it is considering
going one step further and spinning its actual drivers out of tree into
standalone add-on projects [4]. I've heard on the grapevine that Ironic is
considering similar steps for hardware drivers.
The radical (?) solution to the nova core team bottleneck is thus to follow
this lead and split the nova virt drivers out into separate projects and
delegate their maintenance to new dedicated teams.
- Nova becomes the home for the public APIs, RPC system, database
persistence and the glue that ties all this together with the
virt driver API.
- Each virt driver project gets its own core team and is responsible
for dealing with review, merge & release of their codebase.
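
As a rough illustration of the glue side of this split, Nova common could locate a driver that lives in a separate project purely by its dotted class path. This is only a sketch under that assumption: `load_driver_class` is a hypothetical helper, not Nova's actual loader, and a standard library class stands in for a real driver so the example runs on its own.

```python
import importlib


def load_driver_class(dotted_path):
    """Resolve 'package.module.ClassName' to the class object it names."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError("expected 'module.ClassName', got %r" % dotted_path)
    # The module may come from any installed package, so the driver code
    # does not have to live in the same repository as the caller.
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Stand-in for a driver class shipped by a separate project; a stdlib
# class is used here only so the sketch runs without extra packages.
driver_cls = load_driver_class("collections.OrderedDict")
```

With something along these lines, the common code only needs the virt driver API contract and a configured class path; it has no need to know which repository the class came from.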
Note, I really do mean *all* virt drivers should be separate. I do not want to see
some virt drivers split out and others remain in tree because I feel that signifies
that the out of tree ones are second class citizens. It is important to set up our
dev structure so that every virt driver is treated equally & so has equal
chance to achieve success. As long as one driver remains in tree there will always
be pressure for others to join it, which is exactly what we're trying to get away
from here. By everyone being out of tree, drivers (like
Docker) can take a decision about whether it is the right time for them to be
investing in gating CI systems, without being penalized in their dev process if
they make a decision to not have gate tests right now.
This has quite a few implications for the way development would operate.
- The Nova core team, at least, would be voluntarily giving up a big
amount of responsibility over the evolution of virt drivers. Due
to human nature, people are not good at giving up power, so this
may be painful to swallow. Realistically, current nova core are
not experts in most of the virt drivers to start with, and more
importantly we clearly do not have sufficient time to do a good job
of reviewing everything submitted. Much of the current need
for core review of virt drivers is to prevent the mis-use of a
poorly defined virt driver API, which can be mitigated; see the
later point(s).
- Nova core would/should not have automatic +2 over the virt driver
repositories since it is unreasonable to assume they have the
suitable domain knowledge for all virt drivers out there. People
would of course be able to be members of multiple core teams. For
example John G would naturally be nova-core and nova-xen-core. I
would aim for nova-core and nova-libvirt-core, and so on. I do not
want any +2 responsibility over VMWare/HyperV/Docker drivers since
they're not my area of expertise - I only look at them today because
they have no other nova-core representation.
- Not sure if it implies the Nova PTL would be solely focused on
Nova common. eg would there continue to be one PTL over all virt
driver implementation projects, or would each project have its
own PTL. Maybe this is irrelevant if a Czars approach is chosen
by virt driver projects for their work. I'd be inclined to say
that a single PTL should stay as a figurehead to represent all
the virt driver projects, acting as a point of contact to ensure
we keep communication / co-operation between the drivers in sync.
- A fairly significant amount of nova code would need to be
considered semi-stable API. Certainly everything under nova/virt
and any object which is passed in/out of the virt driver API.
Changes to such APIs would have to be done in a backwards
compatible manner, since it is no longer possible to lock-step
change all the virt driver impls. In some ways I think this would
be a good thing as it will encourage people to put more thought
into the long term maintainability of nova internal code instead
of relying on being able to rip it apart later, at will.
- The nova/virt/driver.py class would need to be much better
specified. All parameters / return values which are opaque dicts
must be replaced with objects + attributes. Completion of the
objectification work is mandatory, so there is cleaner separation
between virt driver impls & the rest of Nova.
- If changes are required to common code, the virt driver developer
would first have to get the necessary pieces merged into Nova
common. Then the follow up virt driver specific changes could be
proposed to their repo. This implies that some changes to virt
drivers will still contend for resource in the common nova repo
and team. This contention should be lower than it is today though
since the current nova core team should have less code to look
after per-person on aggregate.
- Changes submitted to nova common code would trigger running of CI
tests against the external virt drivers. Each virt driver core team
would decide whether they want their driver to be tested upon Nova
common changes. Expect that all would choose to be included to the
same extent that they are today. So level of validation of nova code
would remain at least at current level. I don't want to reduce the
amount of code testing here since that's contrary to the direction
we're taking wrt testing.
- Changes submitted to virt drivers would trigger running CI tests
that are applicable. eg changes to libvirt driver repo would not
involve running database migration tests, since all database code
is isolated in nova. libvirt changes would not trigger vmware,
xenserver, ironic, etc CI systems. Virt driver changes should
see fewer false positives in the tests as a result, and those
that do occur should be more explicitly related to the code being
proposed. eg a change to vmware is not going to trigger a tempest
run that uses libvirt, so non-deterministic failures in libvirt
will no longer plague vmware developers reviews. This would also
make it possible for VMWare CI to be made gating for changes to
the VMWare virt driver repository, without negatively impacting
other virt drivers. So this change should increase testing quality
for non-libvirt virt drivers and reduce pain of false failures
for everyone.
- Virt drivers shouldn't use oslo incubator code from nova, since
that can be replaced any time and isn't upgrade safe. Ideally most
of the incubator stuff virt drivers need should turn into stable
oslo APIs. Failing that, virt drivers would need their own copy
of the incubated code in their module namespace, to avoid clash
or the need to lock-step upgrade code across separate git repos.
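
To make the points above about a better specified, semi-stable driver API a little more concrete, here is a minimal sketch of swapping an opaque dict for an object with attributes, and then extending that object in a backwards compatible way. `DiskInfo`, its fields and `describe_disk` are invented for illustration; they are not existing Nova objects, and real Nova objects would carry versioning that this sketch leaves out.

```python
from dataclasses import dataclass
from typing import Optional


# Before: an opaque dict. Nothing tells a driver author which keys exist.
legacy_disk_info = {"path": "/var/lib/nova/disk0", "size_gb": 20}


@dataclass(frozen=True)
class DiskInfo:
    """Hypothetical typed replacement for the dict above."""
    path: str
    size_gb: int
    # Added in a later release. The default keeps older callers, which
    # do not pass this field, working unchanged.
    bus: Optional[str] = None


def describe_disk(disk: DiskInfo) -> str:
    """Example consumer that relies on attributes rather than dict keys."""
    bus = disk.bus or "default"
    return "%s (%d GB, bus=%s)" % (disk.path, disk.size_gb, bus)


# A caller written before 'bus' existed and one written afterwards.
old_style_call = DiskInfo(path="/var/lib/nova/disk0", size_gb=20)
new_style_call = DiskInfo(path="/var/lib/nova/disk0", size_gb=20, bus="virtio")
```

The design point is that an out-of-tree driver written against the older shape keeps working after the field is added, which is what removes the need for lock-step changes across repos.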
Overall the outcome is that
- Far larger pool of people able to approve changes for merge
across nova core and the virt driver core teams.
- Faster review & merge for virt driver patches that don't involve
changes to common nova code, with less CI system testing pain.
- Ability to set priority of work in virt drivers without a 3rd
party being a bottleneck, where the work doesn't involve changes
to common nova code.
- Each virt driver team can accept as many features as they feel
able to deal with, without it negatively impacting amount of
features that other virt driver teams can accept.
- Virt drivers have flexibility to set their own policies on testing
without being penalized in the way they then develop their code.
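
The CI triggering rule described earlier, which is where much of the reduced testing pain comes from, could be sketched roughly as follows. The repo names and the opt-in table are assumptions made up for this example (including which driver opts out); this is not a description of any real infrastructure configuration.

```python
NOVA_COMMON = "nova"

# Hypothetical driver repos, mapped to whether that driver's core team
# has opted in to being tested when nova common changes.
DRIVER_OPT_IN = {
    "nova-libvirt": True,
    "nova-vmware": True,
    "nova-xen": True,
    "nova-docker": False,
}


def ci_targets(changed_repo):
    """Return the driver repos whose CI should run for a change."""
    if changed_repo == NOVA_COMMON:
        # Common code changes fan out to every driver that opted in.
        return sorted(
            name for name, opted_in in DRIVER_OPT_IN.items() if opted_in
        )
    if changed_repo in DRIVER_OPT_IN:
        # A driver change only exercises that driver's own tests.
        return [changed_repo]
    raise KeyError(changed_repo)
```

So `ci_targets("nova")` gives the three opted-in drivers, while `ci_targets("nova-vmware")` gives only `["nova-vmware"]`, meaning a flaky libvirt job can no longer block a vmware change.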
The migration
-------------
Obviously a proposal such as this is a pretty major undertaking. It should be
clear that it could not be done in a short amount of time.
It is suggested that it be phased in over two dev cycles. In the Kilo release
the focus would be on prep work:
- Formalizing the separation between the virt driver impls and the
rest of the nova codebase. Figure out exactly which areas of
Nova internal code will need to be marked as 'semi-stable' for
use by virt drivers, and ensure their APIs are sufficiently
future proof.
- Discussions with the infrastructure, docs, release, etc teams to
identify impacts on them and do any required prep work.
- Identify the teams which will lead the new virt driver projects.
eg core reviewers, PTL or Czars for each job if applicable
- Probably more things I can't think of right now
Then at the start of the Lxxxx release, the virt drivers would actually be
split out into separate git repos and start their dev process for the future.
So for the bulk of Lxxxx the drivers would be on their own. The two Lxxxx rc
milestones would allow us to ensure our release processes were working well
with the split drivers before the Lxxxx final release.
Final thought
-------------
Overall consider this a vote of no confidence in nova continuing to operate as it does
today. As mentioned above this is not intended to be disrespectful to the effort every
nova core member has put in, just a reflection on the changed environment we find
ourselves in. Fiddling with our processes for the prioritization of work cannot fix the
fundamental fact that nova core today is a massive single point of failure &
bottleneck, increasingly crippling the project. The only way to address this is by a
radical re-organization of our project to remove the bottlenecks by modularization of
the project & leaders.
Keeping a single team and adding more/changing process is simply akin to
shifting deckchairs on the Titanic and not a viable option to continue with long
term.
Now, I'm realistic. Even with every driver separated out, I expect that each of
them will individually still have more work proposed than their respective core
teams have time to review. The new structure will, however, make it easier for the
individual core teams to grow & adapt in ways that suit their specific needs.
For self-contained virt driver changes it will mean that acceptance of work by one
team will not take away capacity from another team. Further the burden of knowledge
required to make it onto a virt driver core team would be greatly reduced due to
the narrower focus of each core team, so we'll be able to promote good talent onto
virt driver core teams more quickly.
Thanks for reading so far. Now let's make some real change to prepare us for future
sustainability & even growth.
Regards,
Daniel
[1] http://lists.openstack.org/pipermail/openstack-dev/2014-August/044459.html
[2] There was a ban on changes to nova-network for much of the past two
cycles. It was relaxed primarily to allow full conversion of nova
codebase to use objects, not for major new feature development.
[3] http://lists.openstack.org/pipermail/openstack-dev/2014-July/040443.html
[4] http://lists.openstack.org/pipermail/openstack-dev/2014-August/043036.html