Re: 3.0 and the Cassandra release process

Josh McKenzie Wed, 18 Mar 2015 08:00:01 -0700

+1

On Wed, Mar 18, 2015 at 7:54 AM, Jake Luciani <jak...@gmail.com> wrote:


> +1
>
> On Tue, Mar 17, 2015 at 5:06 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
> > Cassandra 2.1 was released in September, which means that if we were on
> > track with our stated goal of six month releases, 3.0 would be done about
> > now.  Instead, we haven't even delivered a beta.  The immediate cause this
> > time is blocking for 8099
> > <https://issues.apache.org/jira/browse/CASSANDRA-8099>, but the reality is
> > that nobody should really be surprised.  Something always comes up -- we've
> > averaged about nine months since 1.0, with 2.1 taking an entire year.
> >
> > We could make theory align with reality by acknowledging, "if nine months
> > is our 'natural' release schedule, then so be it."  But I think we can do
> > better.
> >
> > Broadly speaking, we have two constituencies with Cassandra releases:
> >
> > First, we have the users who are building or porting an application on
> > Cassandra.  These users want the newest features to make their job easier.
> > If 2.1.0 has a few bugs, it's not the end of the world.  They have time to
> > wait for 2.1.x to stabilize while they write their code.  They would like
> > to see us deliver on our six month schedule or even faster.
> >
> > Second, we have the users who have an application in production.  These
> > users, or their bosses, want Cassandra to be as stable as possible.
> > Assuming they deploy on a stable release like 2.0.12, they don't want to
> > touch it.  They would like to see us release *less* often.  (Because that
> > means they have to do fewer upgrades while remaining in our backwards
> > compatibility window.)
> >
> > With our current "big release every X months" model, these users' needs are
> > in tension.
> >
> > We discussed this six months ago, and ended up with this:
> >
> >> What if we tried a [four month] release cycle, BUT we would guarantee that
> >> you could do a rolling upgrade until we bump the supermajor version? So 2.0
> >> could upgrade to 3.0 without having to go through 2.1.  (But to go to 3.1
> >> or 4.0 you would have to go through 3.0.)
> >>
> >
> > Crucially, I added
> >
> >> Whether this is reasonable depends on how fast we can stabilize releases.
> >> 2.1.0 will be a good test of this.
> >>
> >
> > Unfortunately, even after DataStax hired half a dozen full-time test
> > engineers, 2.1.0 continued the proud tradition of being unready for
> > production use, with "wait for .5 before upgrading" once again looking like
> > a good guideline.
> >
> > I’m starting to think that the entire model of “write a bunch of new
> > features all at once and then try to stabilize it for release” is broken.
> > We’ve been trying that for years and empirically speaking the evidence is
> > that it just doesn’t work, either from a stability standpoint or even just
> > shipping on time.
> >
> > A big reason that it takes us so long to stabilize new releases now is
> > that, because our major release cycle is so long, it’s super tempting to
> > slip in “just one” new feature into bugfix releases, and I’m as guilty of
> > that as anyone.
> >
> > For similar reasons, it’s difficult to do a meaningful freeze with big
> > feature releases.  A look at 3.0 shows why: we have 8099 coming, but we
> > also have significant work done (but not finished) on 6230, 7970, 6696, and
> > 6477, all of which are meaningful improvements that address demonstrated
> > user pain.  So if we keep doing what we’ve been doing, our choices are to
> > either delay 3.0 further while we finish and stabilize these, or we wait
> > nine months to a year for the next release.  Either way, one of our
> > constituencies gets disappointed.
> >
> > So, I’d like to try something different.  I think we were on the right
> > track with shorter releases with more compatibility.  But I’d like to throw
> > in a twist.  Intel cuts down on risk with a “tick-tock” schedule for new
> > architectures and process shrinks instead of trying to do both at once.  We
> > can do something similar here:
> >
> > One month releases.  Period.  If it’s not done, it can wait.
> > *Every other release only accepts bug fixes.*
> >
> > By itself, one-month releases are going to dramatically reduce the
> > complexity of testing and debugging new releases -- and bugs that do slip
> > past us will only affect a smaller percentage of users, avoiding the “big
> > release has a bunch of bugs no one has seen before and pretty much everyone
> > is hit by something” scenario.  But by adding in the second rule, I think
> > we have a real chance to make a quantum leap here: stable, production-ready
> > releases every two months.
> >
> > So here is my proposal for 3.0:
> >
> > We’re just about ready to start serious review of 8099.  When that’s done,
> > we branch 3.0 and cut a beta and then release candidates.  Whatever isn’t
> > done by then, has to wait; unlike prior betas, we will only accept bug
> > fixes into 3.0 after branching.
> >
> > One month after 3.0, we will ship 3.1 (with new features).  At the same
> > time, we will branch 3.2.  New features in trunk will go into 3.3.  The 3.2
> > branch will only get bug fixes.  We will maintain backwards compatibility
> > for all of 3.x; eventually (no less than a year) we will pick a release to
> > be 4.0, and drop deprecated features and old backwards compatibilities.
> > Otherwise there will be nothing special about the 4.0 designation.  (Note
> > that with an “odd releases have new features, even releases only have bug
> > fixes” policy, 4.0 will actually be *more* stable than 3.11.)
> >
> > Larger features can continue to be developed in separate branches, the way
> > 8099 is being worked on today, and committed to trunk when ready.  So this
> > is not saying that we are limited only to features we can build in a single
> > month.
> >
> > Some things will have to change with our dev process, for the better.  In
> > particular, with one month to commit new features, we don’t have room for
> > committing sloppy work and stabilizing it later.  Trunk has to be stable at
> > all times.  I asked Ariel Weisberg to put together his thoughts separately
> > on what worked for his team at VoltDB, and how we can apply that to
> > Cassandra -- see his email from Friday <http://bit.ly/1MHaOKX>.  (TLDR:
> > Redefine “done” to include automated tests.  Infrastructure to run tests
> > against github branches before merging to trunk.  A new test harness for
> > long-running regression tests.)
> >
> > I’m optimistic that as we improve our process this way, our even releases
> > will become increasingly stable.  If so, we can skip sub-minor releases
> > (3.2.x) entirely, and focus on keeping the release train moving.  In the
> > meantime, we will continue delivering 2.1.x stability releases.
> >
> > This won’t be an entirely smooth transition.  In particular, you will have
> > noticed that 3.1 will get more than a month’s worth of new features while
> > we stabilize 3.0 as the last of the old way of doing things, so some
> > patience is in order as we try this out.  By 3.4 and 3.6 later this year we
> > should have a good idea if this is working, and we can make adjustments as
> > warranted.
> >
> > --
> > Jonathan Ellis
> > Project Chair, Apache Cassandra
> > co-founder, http://www.datastax.com
> > @spyced
>
>
>
> --
> http://twitter.com/tjake
>



-- 
Joshua McKenzie
DataStax -- The Apache Cassandra Company
