Re: 3.0 and the Cassandra release process

Jason Brown Thu, 19 Mar 2015 04:26:31 -0700

+1 to this general proposal. I think the time has finally come for us to
try something new, and this sounds legit. Thanks!


On Thu, Mar 19, 2015 at 12:49 AM, Phil Yang <[email protected]> wrote:

> Can I regard the odd versions as "development previews" and the even
> versions as "production ready"?
>
> IMO, for a database infrastructure project, stability matters more than it
> does for other kinds of projects. LTS is a good idea, but if we don't support
> non-LTS releases long enough to fix their bugs, users on a non-LTS
> release may have to upgrade to a new major release to get the fixes, and may
> then have to handle new bugs introduced by the new features. I'm afraid that
> eventually people would only consider the LTS release.
>
>
> 2015-03-19 8:48 GMT+08:00 Pavel Yaskevich <[email protected]>:
>
> > +1
> >
> > On Wed, Mar 18, 2015 at 3:50 PM, Michael Kjellman <
> > [email protected]> wrote:
> >
> > > For most of my life I’ve lived on the software bleeding edge both
> > > personally and professionally. Maybe it’s a personal weakness, but I
> > guess
> > > I get a thrill out of the problem solving aspect?
> > >
> > > Recently I came to a bit of an epiphany — the closer I keep to the
> daily
> > > build — generally the happier I am on a daily basis. Bugs happen, but
> for
> > > the most part (aside from show stopper bugs), pain points for myself
> in a
> > > given daily build can generally be debugged to 1 or maybe 2 root
> > > causes, fixed in ~24 hours, and then life is better the next day again.
> > In
> > > comparison, the old waterfall model generally means taking an
> “official”
> > > release at some point and waiting for some poor soul (or developer) to
> > > actually run the thing. No matter how good the QA team is, until it’s
> > > actually used in the real world, most bugs aren’t found.
> > >
> > > If you and your organization can wait 24 hours * number of bugs
> > discovered
> > > after people actually started using the thing, you end up with a
> “usable
> > > build” around the holy-grail minor X.X.5 release of Cassandra.
> > >
> > > I love the idea of the LTS model Jonathan describes because it means
> more
> > > code can get real testing and “bake” for longer instead of sitting
> > largely
> > > unused on some git repository in a datacenter far far away. A lot of
> code
> > > has changed between 2.0 and trunk today. The code has diverged to the
> > point
> > > that if you write something for 2.0 (as the most stable major branch
> > > currently available), merging it forward to 3.0 or after generally
> means
> > > rewriting it. If the only thing that comes out of this is a smaller
> delta
> > > of LOC between the deployable version/branch and what we can develop
> > > against and what QA is focused on I think that’s a massive win.
> > >
> > > Something like CASSANDRA-8099 will need 2x the baking time of even many
> > of
> > > the more risky changes the project has made. While I wouldn’t want to
> > run a
> > > build with CASSANDRA-8099 in it anytime soon, there are now hundreds of
> > > other changes blocked, most likely many containing new bugs of their
> own,
> > > but with no exposure at all to even the most involved C* developers.
> > >
> > > I really think this will be a huge win for the project and I’m super
> > > thankful for Sylvain, Ariel, Jonathan, Aleksey, and Jake for guiding
> this
> > > change to a much more sustainable release model for the entire
> community.
> > >
> > > best,
> > > kjellman
> > >
> > >
> > > > On Mar 18, 2015, at 3:02 PM, Ariel Weisberg <[email protected]> wrote:
> > > >
> > > > Hi,
> > > >
> > > > Keep in mind it is a bug fix release every month and a feature
> release
> > > every two months.
> > > >
> > > > For development that is really a two month cycle with all bug fixes
> > > being backported one release. As a developer if you want to get
> something
> > > in a release you have two months and you should be sizing pieces of
> large
> > > tasks so they ship at least every two months.
> > > >
> > > > Ariel
> > > >> On Mar 18, 2015, at 5:58 PM, Terrance Shepherd <[email protected]> wrote:
> > > >>
> > > >> I like the idea but I agree that every month is a bit aggressive. I
> > > have no
> > > >> say but:
> > > >>
> > > >> I would say 4 releases a year instead of 12, with 2 months of new
> > > features
> > > >> and 1 month of bug squashing per release. With the 4th quarter
> just
> > > bugs.
> > > >>
> > > >> I would also propose 2 year LTS releases for the releases after the
> > 4th
> > > >> quarter. So everyone could get a new feature release every quarter
> and
> > > the
> > > >> stability of super major versions for 2 years.
> > > >>
> > > >> On Wed, Mar 18, 2015 at 2:34 PM, Dave Brosius <[email protected]> wrote:
> > > >>
> > > >>> It would seem the practical implication of this is that there
> would
> > be
> > > >>> significantly more development on branches, with potentially more
> > > >>> significant delays on merging these branches. This would imply to
> me
> > > that
> > > >>> more Jenkins servers would need to be set up to handle auto-testing
> > of
> > > more
> > > >>> branches, as if feature work spends more time on external branches,
> > it
> > > is
> > > >>> then likely to be less tested (even if by accident) as fewer
> > > developers
> > > >>> would be working on that branch. Only when a feature was blessed to
> > > make it
> > > >>> to the release-tracked branch, would it become exposed to the
> > majority
> > > of
> > > >>> developers/testers, etc doing normal running/playing/testing.
> > > >>>
> > > >>> This isn't to knock the idea in any way, just wanted to mention
> what i
> > > >>> think the outcome would be.
> > > >>>
> > > >>> dave
> > > >>>
> > > >>>
> > > >>>
> > > >>>>
> > > >>>>>> On Tue, Mar 17, 2015 at 5:06 PM, Jonathan Ellis <[email protected]> wrote:
> > > >>>>>>> Cassandra 2.1 was released in September, which means that if we
> > > were
> > > >>>>> on
> > > >>>>>>> track with our stated goal of six month releases, 3.0 would be
> > done
> > > >>>>> about
> > > >>>>>>> now.  Instead, we haven't even delivered a beta.  The immediate
> > > cause
> > > >>>>>> this
> > > >>>>>>> time is blocking for 8099
> > > >>>>>>> <https://issues.apache.org/jira/browse/CASSANDRA-8099>, but
> the
> > > >>>>> reality
> > > >>>>>> is
> > > >>>>>>> that nobody should really be surprised.  Something always comes
> > up
> > > --
> > > >>>>>> we've
> > > >>>>>>> averaged about nine months since 1.0, with 2.1 taking an entire
> > > year.
> > > >>>>>>>
> > > >>>>>>> We could make theory align with reality by acknowledging, "if
> > nine
> > > >>>>> months
> > > >>>>>>> is our 'natural' release schedule, then so be it."  But I think
> > we
> > > >>>>> can
> > > >>>>> do
> > > >>>>>>> better.
> > > >>>>>>>
> > > >>>>>>> Broadly speaking, we have two constituencies with Cassandra
> > > releases:
> > > >>>>>>>
> > > >>>>>>> First, we have the users who are building or porting an
> > application
> > > >>>>> on
> > > >>>>>>> Cassandra.  These users want the newest features to make their
> > job
> > > >>>>>> easier.
> > > >>>>>>> If 2.1.0 has a few bugs, it's not the end of the world.  They
> > have
> > > >>>>> time
> > > >>>>>> to
> > > >>>>>>> wait for 2.1.x to stabilize while they write their code.  They
> > > would
> > > >>>>> like
> > > >>>>>>> to see us deliver on our six month schedule or even faster.
> > > >>>>>>>
> > > >>>>>>> Second, we have the users who have an application in
> production.
> > > >>>>> These
> > > >>>>>>> users, or their bosses, want Cassandra to be as stable as
> > possible.
> > > >>>>>>> Assuming they deploy on a stable release like 2.0.12, they
> don't
> > > want
> > > >>>>> to
> > > >>>>>>> touch it.  They would like to see us release *less* often.
> > > (Because
> > > >>>>> that
> > > >>>>>>> means they have to do fewer upgrades while remaining in our
> > > backwards
> > > >>>>>>> compatibility window.)
> > > >>>>>>>
> > > >>>>>>> With our current "big release every X months" model, these
> users'
> > > >>>>> needs
> > > >>>>>> are
> > > >>>>>>> in tension.
> > > >>>>>>>
> > > >>>>>>> We discussed this six months ago, and ended up with this:
> > > >>>>>>>
> > > >>>>>>> What if we tried a [four month] release cycle, BUT we would
> > > guarantee
> > > >>>>>> that
> > > >>>>>>>> you could do a rolling upgrade until we bump the supermajor
> > > version?
> > > >>>>> So
> > > >>>>>> 2.0
> > > >>>>>>>> could upgrade to 3.0 without having to go through 2.1.  (But
> to
> > go
> > > >>>>> to
> > > >>>>>> 3.1
> > > >>>>>>>> or 4.0 you would have to go through 3.0.)
> > > >>>>>>>>
> > > >>>>>>>
> > > >>>>>>> Crucially, I added
> > > >>>>>>>
> > > >>>>>>> Whether this is reasonable depends on how fast we can stabilize
> > > >>>>> releases.
> > > >>>>>>>> 2.1.0 will be a good test of this.
> > > >>>>>>>>
> > > >>>>>>>
> > > >>>>>>> Unfortunately, even after DataStax hired half a dozen full-time
> > > test
> > > >>>>>>> engineers, 2.1.0 continued the proud tradition of being unready
> > for
> > > >>>>>>> production use, with "wait for .5 before upgrading" once again
> > > >>>>> looking
> > > >>>>>> like
> > > >>>>>>> a good guideline.
> > > >>>>>>>
> > > >>>>>>> I’m starting to think that the entire model of “write a bunch
> of
> > > new
> > > >>>>>>> features all at once and then try to stabilize it for release”
> is
> > > >>>>> broken.
> > > >>>>>>> We’ve been trying that for years and empirically speaking the
> > > >>>>> evidence
> > > >>>>> is
> > > >>>>>>> that it just doesn’t work, either from a stability standpoint
> or
> > > even
> > > >>>>>> just
> > > >>>>>>> shipping on time.
> > > >>>>>>>
> > > >>>>>>> A big reason that it takes us so long to stabilize new releases
> > now
> > > >>>>> is
> > > >>>>>>> that, because our major release cycle is so long, it’s super
> > > tempting
> > > >>>>> to
> > > >>>>>>> slip in “just one” new feature into bugfix releases, and I’m as
> > > >>>>> guilty
> > > >>>>> of
> > > >>>>>>> that as anyone.
> > > >>>>>>>
> > > >>>>>>> For similar reasons, it’s difficult to do a meaningful freeze
> > with
> > > >>>>> big
> > > >>>>>>> feature releases.  A look at 3.0 shows why: we have 8099
> coming,
> > > but
> > > >>>>> we
> > > >>>>>>> also have significant work done (but not finished) on 6230,
> 7970,
> > > >>>>> 6696,
> > > >>>>>> and
> > > >>>>>>> 6477, all of which are meaningful improvements that address
> > > >>>>> demonstrated
> > > >>>>>>> user pain.  So if we keep doing what we’ve been doing, our
> > choices
> > > >>>>> are
> > > >>>>> to
> > > >>>>>>> either delay 3.0 further while we finish and stabilize these,
> or
> > we
> > > >>>>> wait
> > > >>>>>>> nine months to a year for the next release.  Either way, one of
> > our
> > > >>>>>>> constituencies gets disappointed.
> > > >>>>>>>
> > > >>>>>>> So, I’d like to try something different.  I think we were on
> the
> > > >>>>> right
> > > >>>>>>> track with shorter releases with more compatibility.  But I’d
> > like
> > > to
> > > >>>>>> throw
> > > >>>>>>> in a twist.  Intel cuts down on risk with a “tick-tock”
> schedule
> > > for
> > > >>>>> new
> > > >>>>>>> architectures and process shrinks instead of trying to do both
> at
> > > >>>>> once.
> > > >>>>>> We
> > > >>>>>>> can do something similar here:
> > > >>>>>>>
> > > >>>>>>> One month releases.  Period.  If it’s not done, it can wait.
> > > >>>>>>> *Every other release only accepts bug fixes.*
> > > >>>>>>>
> > > >>>>>>> By itself, one-month releases are going to dramatically reduce
> > the
> > > >>>>>>> complexity of testing and debugging new releases -- and bugs
> that
> > > do
> > > >>>>> slip
> > > >>>>>>> past us will only affect a smaller percentage of users,
> avoiding
> > > the
> > > >>>>> “big
> > > >>>>>>> release has a bunch of bugs no one has seen before and pretty
> > much
> > > >>>>>> everyone
> > > >>>>>>> is hit by something” scenario.  But by adding in the second
> > rule, I
> > > >>>>> think
> > > >>>>>>> we have a real chance to make a quantum leap here: stable,
> > > >>>>>> production-ready
> > > >>>>>>> releases every two months.
> > > >>>>>>>
> > > >>>>>>> So here is my proposal for 3.0:
> > > >>>>>>>
> > > >>>>>>> We’re just about ready to start serious review of 8099.  When
> > > that’s
> > > >>>>>> done,
> > > >>>>>>> we branch 3.0 and cut a beta and then release candidates.
> > Whatever
> > > >>>>> isn’t
> > > >>>>>>> done by then, has to wait; unlike prior betas, we will only
> > accept
> > > >>>>> bug
> > > >>>>>>> fixes into 3.0 after branching.
> > > >>>>>>>
> > > >>>>>>> One month after 3.0, we will ship 3.1 (with new features).  At
> > the
> > > >>>>> same
> > > >>>>>>> time, we will branch 3.2.  New features in trunk will go into
> > 3.3.
> > > >>>>> The
> > > >>>>>> 3.2
> > > >>>>>>> branch will only get bug fixes.  We will maintain backwards
> > > >>>>> compatibility
> > > >>>>>>> for all of 3.x; eventually (no less than a year) we will pick a
> > > >>>>> release
> > > >>>>>> to
> > > >>>>>>> be 4.0, and drop deprecated features and old backwards
> > > >>>>> compatibilities.
> > > >>>>>>> Otherwise there will be nothing special about the 4.0
> > designation.
> > > >>>>> (Note
> > > >>>>>>> that with an “odd releases have new features, even releases
> only
> > > have
> > > >>>>> bug
> > > >>>>>>> fixes” policy, 4.0 will actually be *more* stable than 3.11.)
> > > >>>>>>>
> > > >>>>>>> Larger features can continue to be developed in separate
> > branches,
> > > >>>>> the
> > > >>>>>> way
> > > >>>>>>> 8099 is being worked on today, and committed to trunk when
> ready.
> > > So
> > > >>>>>> this
> > > >>>>>>> is not saying that we are limited only to features we can build
> > in
> > > a
> > > >>>>>> single
> > > >>>>>>> month.
> > > >>>>>>>
> > > >>>>>>> Some things will have to change with our dev process, for the
> > > better.
> > > >>>>> In
> > > >>>>>>> particular, with one month to commit new features, we don’t
> have
> > > room
> > > >>>>> for
> > > >>>>>>> committing sloppy work and stabilizing it later.  Trunk has to
> be
> > > >>>>> stable
> > > >>>>>> at
> > > >>>>>>> all times.  I asked Ariel Weisberg to put together his thoughts
> > > >>>>>> separately
> > > >>>>>>> on what worked for his team at VoltDB, and how we can apply
> that
> > to
> > > >>>>>>> Cassandra -- see his email from Friday <http://bit.ly/1MHaOKX>.
> > > >>>>> (TLDR:
> > > >>>>>>> Redefine “done” to include automated tests.  Infrastructure to
> > run
> > > >>>>> tests
> > > >>>>>>> against github branches before merging to trunk.  A new test
> > > harness
> > > >>>>> for
> > > >>>>>>> long-running regression tests.)
> > > >>>>>>>
> > > >>>>>>> I’m optimistic that as we improve our process this way, our
> even
> > > >>>>> releases
> > > >>>>>>> will become increasingly stable.  If so, we can skip sub-minor
> > > >>>>> releases
> > > >>>>>>> (3.2.x) entirely, and focus on keeping the release train
> moving.
> > > In
> > > >>>>> the
> > > >>>>>>> meantime, we will continue delivering 2.1.x stability releases.
> > > >>>>>>>
> > > >>>>>>> This won’t be an entirely smooth transition.  In particular,
> you
> > > will
> > > >>>>>> have
> > > >>>>>>> noticed that 3.1 will get more than a month’s worth of new
> > features
> > > >>>>> while
> > > >>>>>>> we stabilize 3.0 as the last of the old way of doing things, so
> > > some
> > > >>>>>>> patience is in order as we try this out.  By 3.4 and 3.6 later
> > this
> > > >>>>> year
> > > >>>>>> we
> > > >>>>>>> should have a good idea if this is working, and we can make
> > > >>>>> adjustments
> > > >>>>>> as
> > > >>>>>>> warranted.
> > > >>>>>>>
> > > >>>>>>> --
> > > >>>>>>> Jonathan Ellis
> > > >>>>>>> Project Chair, Apache Cassandra
> > > >>>>>>> co-founder, http://www.datastax.com
> > > >>>>>>> @spyced
> > > >>>>>
> > > >>>>
> > > >>>
> > > >
> > >
> > >
> >
>
>
>
> --
> Thanks,
> Phil Yang
>
