On Tue, Feb 10, 2015 at 11:19:20AM +0100, Thierry Carrez wrote:
Joe, Matt & Matthew:
I hear your frustration with broken stable branches. Wearing my
vulnerability management team member hat, responsible for landing
patches there with a strict deadline, I can certainly relate to the
frustration of having to dive in to unbork the branch in the first
place, rather than concentrate on the work you initially planned on doing.
That said, wearing my stable team member hat, I think it's a bit unfair
to say that things are worse than they were and call for dramatic
action. The stable branch team put a structure in place to try to
continuously fix the stable branches rather than reactively fix them when
we need them to work. Those champions have been quite active unbreaking
them in the past months. I'd argue that the branches are broken much less
often than they used to be. That doesn't mean they're never broken, though, or
that those people are magicians.
I don't at all, for 2 reasons. The first is that in every discussion we had at 2
summits I raised the increased maintenance burden of a longer support window and
was told that people were going to stand up so it wouldn't be an issue. I have
yet to see that happen. I have not seen anything to date that would convince
me that we are at all ready to be maintaining 3 stable branches at once.
The second is that while I've seen that etherpad, I still see a
huge disconnect here about what actually maintaining the branches requires. The
issue I'm raising is about the gating infrastructure and
how to ensure that things stay working. There is a non-linear overhead involved
in making sure any gating job stays working (on stable or master). People need
to take ownership of jobs to make sure they keep working.
One issue in the current situation is that the two groups (you and the
stable maintainers) seem to work in parallel rather than collaborate.
It's quite telling that the two groups maintained separate etherpads to
keep track of the fixes that needed landing.
I don't actually view it as that. Just looking at the etherpad, it covers a very
small subset of the actual types of issues we're raising here.
For example, there was a week in late Nov. when 2 consecutive oslo project
releases broke the stable gates. After we unwound all of this and landed the
fixes in the branches, the next step was to make changes to make sure we didn't allow
breakages in the same way:
This also happened at the same time as a new testtools stack release which
broke every branch (including master). Another example is all of the setuptools
stack churn from the famed Christmas releases. That was another critical
infrastructure piece that fell apart and was mostly handled by the infra team.
All of these things are getting fixed because they have to be, to make sure
development on master can continue, not because those with a vested interest in
the stable branches working for 15 months are working on them.
The other aspect here is development efforts to make things more stable in this
space, things like the effort to pin the requirements on stable branches which
Joe is spearheading. These are critical to the long term success of the stable
branches, yet no one has stepped up to help with them.
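To make the pinning point concrete, here is a rough sketch of the kind of
check it implies: flag every requirement that has no upper bound, since
those are the ones a new upstream release can break. This is only an
illustration and not the actual work in progress; the helper name
find_uncapped, the simplified parsing, and the sample version numbers are
all made up for the example.

```python
import re

# Hypothetical helper for illustration; not part of any OpenStack tooling.
# Splits a requirement line into a package name and the rest of the specifier.
_SPEC = re.compile(r"^\s*([A-Za-z0-9._-]+)\s*(.*)$")


def find_uncapped(lines):
    """Return names of requirements with no '==' pin and no upper bound."""
    uncapped = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        match = _SPEC.match(line)
        if match is None:
            continue
        name, spec = match.groups()
        spec = spec.split(";", 1)[0]  # ignore environment markers
        clauses = [c.strip() for c in spec.split(",") if c.strip()]
        # A clause bounds the version from above if it pins or caps it.
        bounded = any(c.startswith(("==", "<", "~=")) for c in clauses)
        if not bounded:
            uncapped.append(name)
    return uncapped


# Sample lines with made-up version numbers.
sample = [
    "oslo.config>=1.4.0",
    "testtools>=0.9.34,<1.0",
    "six==1.9.0",
    "# a comment",
    "pbr",
]
print(find_uncapped(sample))
```

Anything the check reports is a dependency where a fresh release could land
in a stable job without anyone having touched the branch.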
I view this as a disconnect between what people think maintaining a stable
branch means and what it actually entails. Sure, the backporting of fixes to
intermittent failures is part of it. But most of the effort is spent on making
sure the gating machinery stays well oiled and doesn't break down.
Matthew Treinish wrote:
So I think it's time we called the icehouse branch and marked it EOL. We
originally conditioned the longer support window on extra people stepping
forward to keep things working. I believe this latest issue is just the latest
indication that this hasn't happened. Issue 1 listed above is being caused by
the icehouse branch during upgrades. The fact that a stable release was pushed
at the same time things were wedged on the juno branch is just the latest
evidence to me that things aren't being maintained as they should be. Looking at
the #openstack-qa irc log from today or the etherpad about trying to sort this
issue should be an indication that no one has stepped up to help with the
maintenance and it shows given the poor state of the branch.
I disagree with the assessment. People have stepped up. I think the
stable branches are less often broken than they were, and stable branch
champions (as their tracking etherpad shows) have made a difference.
There have just been more issues than usual recently and they probably
couldn't keep track. It's not a fun job to babysit stable branches, and
belittling the stable branch champions' results is not the best way to
encourage them to continue in this position. I agree that they could
work more with the QA team when they get overwhelmed, and raise more red
flags when they just can't keep up.
I actually don't see it that way. As one of the few people who has been doing
this stable debug stuff for some time, it's really the same story as always. The
pain points have just shifted. The difference now is that instead of everyone
panicking around stable release time that things don't work on the stable
branches, because we've moved to a branchless model for things like tempest,
certain people are seeing the pain constantly.
It's not about sitting around and babysitting necessarily, but about at least
starting to actually watch the jobs that run on the stable branch. The periodic jobs don't
give even close to a complete picture of the state of the world and don't run
frequently enough to catch everything. Part of the issue here is that because I work
on tempest, grenade, and devstack, I see these failures every time they happen,
since they'll inevitably block development on one of those projects because the
stable jobs are gating.
I don't mean to belittle anyone's efforts here. I personally know that I wouldn't
want or be able to do any of the traditional stable-maint backport work, and I
know it takes time to come up to speed on this work. But, it doesn't change the
position we're in right now.
I also disagree with the proposed solution. We announced a support
timeframe for Icehouse, our downstream users made plans around it, so we
should stick to it as much as we can. If we dropped stable branch
support every time a patch can't be landed there, there would just not
be any stable branch.
It's not just this latest issue which has caused me to raise this (we have a
fix plan in progress, although EOL would make that moot). It's the same story
almost every other week at this point. The longer window was always just an
experiment, and I was of the understanding that if we deemed it untenable from a
maintenance POV we wouldn't do it. I strongly feel that we need to just say
this isn't working right now and EOL, especially before we enter a period where
we're maintaining 3 stable branches at once.