Re: 3.0 and the Cassandra release process

Michael Kjellman Wed, 18 Mar 2015 15:53:06 -0700

For most of my life I’ve lived on the software bleeding edge both personally 
and professionally. Maybe it’s a personal weakness, but I guess I get a thrill 
out of the problem solving aspect?


Recently I came to a bit of an epiphany: the closer I keep to the daily build,
the happier I generally am on a daily basis. Bugs happen, but for the most
part (aside from show-stopper bugs), my pain points in a given daily build can
generally be debugged to 1 or maybe 2 root causes, fixed in ~24 hours, and
then life is better again the next day. In comparison, the old waterfall model
generally means taking an “official” release at some point and waiting for
some poor soul (or developer) to actually run the thing. No matter how good
the QA team is, most bugs aren’t found until the software is actually used in
the real world.

If you and your organization can wait 24 hours * number of bugs discovered 
after people actually started using the thing, you end up with a “usable build” 
around the holy-grail minor X.X.5 release of Cassandra.

I love the idea of the LTS model Jonathan describes because it means more code 
can get real testing and “bake” for longer instead of sitting largely unused on 
some git repository in a datacenter far far away. A lot of code has changed 
between 2.0 and trunk today. The code has diverged to the point that if you 
write something for 2.0 (as the most stable major branch currently available), 
merging it forward to 3.0 or later generally means rewriting it. If the only
thing that comes out of this is a smaller delta of LOC between the deployable
version/branch, what we can develop against, and what QA is focused on, I
think that’s a massive win.

Something like CASSANDRA-8099 will need 2x the baking time of even many of the
riskier changes the project has made. While I wouldn’t want to run a build
with CASSANDRA-8099 in it anytime soon, there are now hundreds of other changes
blocked behind it, many most likely containing new bugs of their own, that
have no exposure at all to even the most involved C* developers.

I really think this will be a huge win for the project and I’m super thankful
to Sylvain, Ariel, Jonathan, Aleksey, and Jake for guiding this change to a
much more sustainable release model for the entire community.

best,
kjellman

 
> On Mar 18, 2015, at 3:02 PM, Ariel Weisberg <[email protected]> 
> wrote:
> 
> Hi,
> 
> Keep in mind it is a bug fix release every month and a feature release every 
> two months.
> 
> For development that is really a two month cycle with all bug fixes being 
> backported one release. As a developer if you want to get something in a 
> release you have two months and you should be sizing pieces of large tasks so 
> they ship at least every two months.
> 
> Ariel
>> On Mar 18, 2015, at 5:58 PM, Terrance Shepherd <[email protected]> wrote:
>> 
>> I like the idea but I agree that every month is a bit aggressive. I have no
>> say but:
>> 
>> I would say 4 releases a year instead of 12, with 2 months of new features
>> and 1 month of bug squashing per release, and the 4th quarter just bugs.
>> 
>> I would also propose 2 year LTS releases for the releases after the 4th
>> quarter. So everyone could get a new feature release every quarter and the
>> stability of super major versions for 2 years.
>> 
>> On Wed, Mar 18, 2015 at 2:34 PM, Dave Brosius <[email protected]>
>> wrote:
>> 
>>> It would seem the practical implication of this is that there would be
>>> significantly more development on branches, with potentially more
>>> significant delays on merging these branches. This would imply to me that
>>> more Jenkins servers would need to be set up to handle auto-testing of more
>>> branches: if feature work spends more time on external branches, it is
>>> likely to be less tested (even if by accident), as fewer developers
>>> would be working on that branch. Only when a feature was blessed to make it
>>> to the release-tracked branch would it become exposed to the majority of
>>> developers/testers, etc. doing normal running/playing/testing.
>>> 
>>> This isn't to knock the idea in any way, just wanted to mention what I
>>> think the outcome would be.
>>> 
>>> dave
>>> 
>>> 
>>> 
>>>> 
>>>>>> On Tue, Mar 17, 2015 at 5:06 PM, Jonathan Ellis <[email protected]>
>>>>>> wrote:
>>>>>>> Cassandra 2.1 was released in September, which means that if we were
>>>>>>> on track with our stated goal of six month releases, 3.0 would be done
>>>>>>> about now.  Instead, we haven't even delivered a beta.  The immediate
>>>>>>> cause this time is blocking for 8099
>>>>>>> <https://issues.apache.org/jira/browse/CASSANDRA-8099>, but the
>>>>>>> reality is that nobody should really be surprised.  Something always
>>>>>>> comes up -- we've averaged about nine months since 1.0, with 2.1
>>>>>>> taking an entire year.
>>>>>>>
>>>>>>> We could make theory align with reality by acknowledging, "if nine
>>>>>>> months is our 'natural' release schedule, then so be it."  But I think
>>>>>>> we can do better.
>>>>>>> 
>>>>>>> Broadly speaking, we have two constituencies with Cassandra releases:
>>>>>>> 
>>>>>>> First, we have the users who are building or porting an application
>>>>>>> on Cassandra.  These users want the newest features to make their job
>>>>>>> easier.  If 2.1.0 has a few bugs, it's not the end of the world.  They
>>>>>>> have time to wait for 2.1.x to stabilize while they write their code.
>>>>>>> They would like to see us deliver on our six month schedule or even
>>>>>>> faster.
>>>>>>>
>>>>>>> Second, we have the users who have an application in production.
>>>>>>> These users, or their bosses, want Cassandra to be as stable as
>>>>>>> possible.  Assuming they deploy on a stable release like 2.0.12, they
>>>>>>> don't want to touch it.  They would like to see us release *less*
>>>>>>> often.  (Because that means they have to do less upgrades while
>>>>>>> remaining in our backwards compatibility window.)
>>>>>>>
>>>>>>> With our current "big release every X months" model, these users'
>>>>>>> needs are in tension.
>>>>>>> 
>>>>>>> We discussed this six months ago, and ended up with this:
>>>>>>> 
>>>>>>>> What if we tried a [four month] release cycle, BUT we would guarantee
>>>>>>>> that you could do a rolling upgrade until we bump the supermajor
>>>>>>>> version?  So 2.0 could upgrade to 3.0 without having to go through
>>>>>>>> 2.1.  (But to go to 3.1 or 4.0 you would have to go through 3.0.)
>>>>>>>
>>>>>>> Crucially, I added
>>>>>>>
>>>>>>>> Whether this is reasonable depends on how fast we can stabilize
>>>>>>>> releases.  2.1.0 will be a good test of this.
>>>>>>> 
>>>>>>> Unfortunately, even after DataStax hired half a dozen full-time test
>>>>>>> engineers, 2.1.0 continued the proud tradition of being unready for
>>>>>>> production use, with "wait for .5 before upgrading" once again looking
>>>>>>> like a good guideline.
>>>>>>>
>>>>>>> I’m starting to think that the entire model of “write a bunch of new
>>>>>>> features all at once and then try to stabilize it for release” is
>>>>>>> broken.  We’ve been trying that for years and empirically speaking the
>>>>>>> evidence is that it just doesn’t work, either from a stability
>>>>>>> standpoint or even just shipping on time.
>>>>>>>
>>>>>>> A big reason that it takes us so long to stabilize new releases now is
>>>>>>> that, because our major release cycle is so long, it’s super tempting
>>>>>>> to slip in “just one” new feature into bugfix releases, and I’m as
>>>>>>> guilty of that as anyone.
>>>>>>>
>>>>>>> For similar reasons, it’s difficult to do a meaningful freeze with big
>>>>>>> feature releases.  A look at 3.0 shows why: we have 8099 coming, but
>>>>>>> we also have significant work done (but not finished) on 6230, 7970,
>>>>>>> 6696, and 6477, all of which are meaningful improvements that address
>>>>>>> demonstrated user pain.  So if we keep doing what we’ve been doing,
>>>>>>> our choices are to either delay 3.0 further while we finish and
>>>>>>> stabilize these, or we wait nine months to a year for the next
>>>>>>> release.  Either way, one of our constituencies gets disappointed.
>>>>>>>
>>>>>>> So, I’d like to try something different.  I think we were on the right
>>>>>>> track with shorter releases with more compatibility.  But I’d like to
>>>>>>> throw in a twist.  Intel cuts down on risk with a “tick-tock” schedule
>>>>>>> for new architectures and process shrinks instead of trying to do both
>>>>>>> at once.  We can do something similar here:
>>>>>>> 
>>>>>>> One month releases.  Period.  If it’s not done, it can wait.
>>>>>>> *Every other release only accepts bug fixes.*
>>>>>>> 
>>>>>>> By itself, one-month releases are going to dramatically reduce the
>>>>>>> complexity of testing and debugging new releases -- and bugs that do
>>>>>>> slip past us will only affect a smaller percentage of users, avoiding
>>>>>>> the “big release has a bunch of bugs no one has seen before and pretty
>>>>>>> much everyone is hit by something” scenario.  But by adding in the
>>>>>>> second rule, I think we have a real chance to make a quantum leap
>>>>>>> here: stable, production-ready releases every two months.
>>>>>>> 
>>>>>>> So here is my proposal for 3.0:
>>>>>>> 
>>>>>>> We’re just about ready to start serious review of 8099.  When that’s
>>>>>>> done, we branch 3.0 and cut a beta and then release candidates.
>>>>>>> Whatever isn’t done by then, has to wait; unlike prior betas, we will
>>>>>>> only accept bug fixes into 3.0 after branching.
>>>>>>>
>>>>>>> One month after 3.0, we will ship 3.1 (with new features).  At the
>>>>>>> same time, we will branch 3.2.  New features in trunk will go into
>>>>>>> 3.3.  The 3.2 branch will only get bug fixes.  We will maintain
>>>>>>> backwards compatibility for all of 3.x; eventually (no less than a
>>>>>>> year) we will pick a release to be 4.0, and drop deprecated features
>>>>>>> and old backwards compatibilities.  Otherwise there will be nothing
>>>>>>> special about the 4.0 designation.  (Note that with an “odd releases
>>>>>>> have new features, even releases only have bug fixes” policy, 4.0 will
>>>>>>> actually be *more* stable than 3.11.)
>>>>>>>
>>>>>>> Larger features can continue to be developed in separate branches, the
>>>>>>> way 8099 is being worked on today, and committed to trunk when ready.
>>>>>>> So this is not saying that we are limited only to features we can
>>>>>>> build in a single month.
>>>>>>>
>>>>>>> Some things will have to change with our dev process, for the better.
>>>>>>> In particular, with one month to commit new features, we don’t have
>>>>>>> room for committing sloppy work and stabilizing it later.  Trunk has
>>>>>>> to be stable at all times.  I asked Ariel Weisberg to put together his
>>>>>>> thoughts separately on what worked for his team at VoltDB, and how we
>>>>>>> can apply that to Cassandra -- see his email from Friday
>>>>>>> <http://bit.ly/1MHaOKX>.  (TLDR: Redefine “done” to include automated
>>>>>>> tests.  Infrastructure to run tests against github branches before
>>>>>>> merging to trunk.  A new test harness for long-running regression
>>>>>>> tests.)
>>>>>>>
>>>>>>> I’m optimistic that as we improve our process this way, our even
>>>>>>> releases will become increasingly stable.  If so, we can skip
>>>>>>> sub-minor releases (3.2.x) entirely, and focus on keeping the release
>>>>>>> train moving.  In the meantime, we will continue delivering 2.1.x
>>>>>>> stability releases.
>>>>>>>
>>>>>>> This won’t be an entirely smooth transition.  In particular, you will
>>>>>>> have noticed that 3.1 will get more than a month’s worth of new
>>>>>>> features while we stabilize 3.0 as the last of the old way of doing
>>>>>>> things, so some patience is in order as we try this out.  By 3.4 and
>>>>>>> 3.6 later this year we should have a good idea if this is working, and
>>>>>>> we can make adjustments as warranted.
>>>>>>> 
>>>>>>> --
>>>>>>> Jonathan Ellis
>>>>>>> Project Chair, Apache Cassandra
>>>>>>> co-founder, http://www.datastax.com
>>>>>>> @spyced
>>>>> 
>>>> 
>>> 
>
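
For anyone trying to picture the cadence proposed in the quoted message (one
release per month after 3.0, odd minors carrying new features, even minors
carrying bug fixes only), here is a minimal Python sketch. The names Release
and tick_tock_schedule are made up for illustration and are not part of any
Cassandra tooling; the sketch also assumes releases land exactly one month
apart, which is the stated goal rather than a guarantee.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    version: str           # e.g. "3.1"
    months_after_x0: int   # months elapsed since the <major>.0 release
    kind: str              # "feature" (odd minor) or "bugfix" (even minor)


def tick_tock_schedule(count: int, major: int = 3) -> list[Release]:
    """Return the first `count` monthly releases that follow <major>.0.

    Assumption (from the proposal): minor N ships N months after <major>.0,
    odd minors accept new features, even minors accept bug fixes only.
    """
    releases = []
    for minor in range(1, count + 1):
        kind = "feature" if minor % 2 == 1 else "bugfix"
        releases.append(Release(f"{major}.{minor}", minor, kind))
    return releases


if __name__ == "__main__":
    for release in tick_tock_schedule(6):
        print(f"{release.version}: +{release.months_after_x0} month(s), "
              f"{release.kind}")
```

Under these assumptions a bug-fix-only release shows up every two months
(3.2, 3.4, 3.6), which matches the "stable, production-ready releases every
two months" goal in the proposal.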
