Re: 3.0 and the Cassandra release process

Ariel Weisberg Thu, 19 Mar 2015 10:10:17 -0700

Hi,

I realized one of the documents we didn't send out was the infrastructure
side changes I am looking for. This one is maybe a little rougher as it was
the first one I wrote on the subject.


https://docs.google.com/document/d/1Seku0vPwChbnH3uYYxon0UO-b6LDtSqluZiH--sWWi0/edit?usp=sharing

The goal is to have infrastructure that gives developers as close to
immediate feedback as possible on their code before they merge. Feedback
that is delayed until after merging to trunk should come in a day or two,
and there is a product owner (Michael Shuler) responsible for making sure
that issues are addressed quickly.

QA is going to help by providing developers with better tools for writing
higher level functional tests that explore all of the functions together
along with the configuration space without developers having to do any work
other than plugging in functionality to exercise and then validate
something specific. This kind of harness is hard to get right and make
reliable and expressive so they have their work cut out for them.

It's going to be an iterative process where the tests improve as new work
introduces missing coverage and as bugs/regressions drive the introduction
of new tests. The monthly retrospective (planning on doing that first of
the month) is also going to help us refine the testing and development
process.

Ariel

On Thu, Mar 19, 2015 at 7:23 AM, Jason Brown <[email protected]> wrote:

> +1 to this general proposal. I think the time has finally come for us to
> try something new, and this sounds legit. Thanks!
>
> On Thu, Mar 19, 2015 at 12:49 AM, Phil Yang <[email protected]> wrote:
>
> > Can I regard the odd version as the "development preview" and the even
> > version as the "production ready"?
> >
> > IMO, as a database infrastructure project, "stable" is more important
> > than for other kinds of projects. LTS is a good idea, but if we don't
> > support non-LTS releases for enough time to fix their bugs, users on
> > non-LTS releases may have to upgrade to a new major release to fix the
> > bugs and may have to handle some new bugs introduced by the new
> > features. I'm afraid that eventually people would only think about the
> > LTS one.
> >
> >
> > 2015-03-19 8:48 GMT+08:00 Pavel Yaskevich <[email protected]>:
> >
> > > +1
> > >
> > > On Wed, Mar 18, 2015 at 3:50 PM, Michael Kjellman <
> > > [email protected]> wrote:
> > >
> > > > For most of my life I’ve lived on the software bleeding edge both
> > > > personally and professionally. Maybe it’s a personal weakness, but I
> > > guess
> > > > I get a thrill out of the problem solving aspect?
> > > >
> > > > Recently I came to a bit of an epiphany: the closer I keep to the
> > > > daily build, generally the happier I am on a daily basis. Bugs happen,
> > > > but for the most part (aside from show stopper bugs), pain points for
> > > > myself in a given daily build can generally be debugged to 1 or maybe
> > > > 2 root causes, fixed in ~24 hours, and then life is better the next
> > > > day again. In comparison, the old waterfall model generally means
> > > > taking an “official” release at some point and waiting for some poor
> > > > soul (or developer) to actually run the thing. No matter how good the
> > > > QA team is, until it’s actually used in the real world, most bugs
> > > > aren’t found.
> > > >
> > > > If you and your organization can wait 24 hours * number of bugs
> > > discovered
> > > > after people actually started using the thing, you end up with a
> > “usable
> > > > build” around the holy-grail minor X.X.5 release of Cassandra.
> > > >
> > > > I love the idea of the LTS model Jonathan describes because it means
> > more
> > > > code can get real testing and “bake” for longer instead of sitting
> > > largely
> > > > unused on some git repository in a datacenter far far away. A lot of
> > code
> > > > has changed between 2.0 and trunk today. The code has diverged to the
> > > point
> > > > that if you write something for 2.0 (as the most stable major branch
> > > > currently available), merging it forward to 3.0 or after generally
> > means
> > > > rewriting it. If the only thing that comes out of this is a smaller
> > delta
> > > > of LOC between the deployable version/branch and what we can develop
> > > > against and what QA is focused on I think that’s a massive win.
> > > >
> > > > Something like CASSANDRA-8099 will need 2x the baking time of even
> many
> > > of
> > > > the more risky changes the project has made. While I wouldn’t want to
> > > run a
> > > > build with CASSANDRA-8099 in it anytime soon, there are now hundreds
> of
> > > > other changes blocked, most likely many containing new bugs of their
> > own,
> > > > but have no exposure at all to even the most involved C* developers.
> > > >
> > > > I really think this will be a huge win for the project and I’m super
> > > > thankful for Sylvain, Ariel, Jonathan, Aleksey, and Jake for guiding
> > > > this change to a much more sustainable release model for the entire
> > > > community.
> > > >
> > > > best,
> > > > kjellman
> > > >
> > > >
> > > > > On Mar 18, 2015, at 3:02 PM, Ariel Weisberg <
> > > [email protected]>
> > > > wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > > Keep in mind it is a bug fix release every month and a feature
> > > > > > release every two months.
> > > > > >
> > > > > > For development that is really a two month cycle with all bug fixes
> > > > > > being backported one release. As a developer if you want to get
> > > > > > something in a release you have two months and you should be sizing
> > > > > > pieces of large tasks so they ship at least every two months.
> > > > >
> > > > > Ariel
> > > > >> On Mar 18, 2015, at 5:58 PM, Terrance Shepherd <
> [email protected]
> > >
> > > > wrote:
> > > > >>
> > > > > >> I like the idea but I agree that every month is a bit aggressive.
> > > > > >> I have no say but:
> > > > > >>
> > > > > >> I would say 4 releases a year instead of 12, with 2 months of new
> > > > > >> features and 1 month of bug squashing per release, and with the 4th
> > > > > >> quarter just bugs.
> > > > > >>
> > > > > >> I would also propose 2 year LTS releases for the releases after the
> > > > > >> 4th quarter. So everyone could get a new feature release every
> > > > > >> quarter and the stability of super major versions for 2 years.
> > > > >>
> > > > >> On Wed, Mar 18, 2015 at 2:34 PM, Dave Brosius <
> > > [email protected]
> > > > >
> > > > >> wrote:
> > > > >>
> > > > > >>> It would seem the practical implication of this is that there would
> > > > > >>> be significantly more development on branches, with potentially more
> > > > > >>> significant delays on merging these branches. This would imply to me
> > > > > >>> that more Jenkins servers would need to be set up to handle
> > > > > >>> auto-testing of more branches, because if feature work spends more
> > > > > >>> time on external branches, it is then likely to be less tested (even
> > > > > >>> if by accident) as fewer developers would be working on that branch.
> > > > > >>> Only when a feature was blessed to make it to the release-tracked
> > > > > >>> branch would it become exposed to the majority of developers/testers,
> > > > > >>> etc. doing normal running/playing/testing.
> > > > > >>>
> > > > > >>> This isn't to knock the idea in any way, just wanted to mention what
> > > > > >>> I think the outcome would be.
> > > > >>>
> > > > >>> dave
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>>
> > > > >>>>>> On Tue, Mar 17, 2015 at 5:06 PM, Jonathan Ellis <
> > > [email protected]>
> > > > >>>>> wrote:
> > > > > >>>>>>> Cassandra 2.1 was released in September, which means that if we
> > > > > >>>>>>> were on track with our stated goal of six month releases, 3.0 would
> > > > > >>>>>>> be done about now.  Instead, we haven't even delivered a beta.  The
> > > > > >>>>>>> immediate cause this time is blocking for 8099
> > > > > >>>>>>> <https://issues.apache.org/jira/browse/CASSANDRA-8099>, but the
> > > > > >>>>>>> reality is that nobody should really be surprised.  Something
> > > > > >>>>>>> always comes up -- we've averaged about nine months since 1.0, with
> > > > > >>>>>>> 2.1 taking an entire year.
> > > > >>>>>>>
> > > > >>>>>>> We could make theory align with reality by acknowledging, "if
> > > nine
> > > > >>>>> months
> > > > >>>>>>> is our 'natural' release schedule, then so be it."  But I
> think
> > > we
> > > > >>>>> can
> > > > >>>>> do
> > > > >>>>>>> better.
> > > > >>>>>>>
> > > > >>>>>>> Broadly speaking, we have two constituencies with Cassandra
> > > > releases:
> > > > >>>>>>>
> > > > >>>>>>> First, we have the users who are building or porting an
> > > application
> > > > >>>>> on
> > > > >>>>>>> Cassandra.  These users want the newest features to make
> their
> > > job
> > > > >>>>>> easier.
> > > > >>>>>>> If 2.1.0 has a few bugs, it's not the end of the world.  They
> > > have
> > > > >>>>> time
> > > > >>>>>> to
> > > > >>>>>>> wait for 2.1.x to stabilize while they write their code.
> They
> > > > would
> > > > >>>>> like
> > > > >>>>>>> to see us deliver on our six month schedule or even faster.
> > > > >>>>>>>
> > > > > >>>>>>> Second, we have the users who have an application in production.
> > > > > >>>>>>> These users, or their bosses, want Cassandra to be as stable as
> > > > > >>>>>>> possible.  Assuming they deploy on a stable release like 2.0.12,
> > > > > >>>>>>> they don't want to touch it.  They would like to see us release
> > > > > >>>>>>> *less* often.  (Because that means they have to do fewer upgrades
> > > > > >>>>>>> while remaining in our backwards compatibility window.)
> > > > >>>>>>>
> > > > >>>>>>> With our current "big release every X months" model, these
> > users'
> > > > >>>>> needs
> > > > >>>>>> are
> > > > >>>>>>> in tension.
> > > > >>>>>>>
> > > > >>>>>>> We discussed this six months ago, and ended up with this:
> > > > >>>>>>>
> > > > >>>>>>> What if we tried a [four month] release cycle, BUT we would
> > > > guarantee
> > > > >>>>>> that
> > > > >>>>>>>> you could do a rolling upgrade until we bump the supermajor
> > > > version?
> > > > >>>>> So
> > > > >>>>>> 2.0
> > > > >>>>>>>> could upgrade to 3.0 without having to go through 2.1.  (But
> > to
> > > go
> > > > >>>>> to
> > > > >>>>>> 3.1
> > > > >>>>>>>> or 4.0 you would have to go through 3.0.)
> > > > >>>>>>>>
> > > > >>>>>>>
> > > > >>>>>>> Crucially, I added
> > > > >>>>>>>
> > > > >>>>>>> Whether this is reasonable depends on how fast we can
> stabilize
> > > > >>>>> releases.
> > > > >>>>>>>> 2.1.0 will be a good test of this.
> > > > >>>>>>>>
> > > > >>>>>>>
> > > > >>>>>>> Unfortunately, even after DataStax hired half a dozen
> full-time
> > > > test
> > > > >>>>>>> engineers, 2.1.0 continued the proud tradition of being
> unready
> > > for
> > > > >>>>>>> production use, with "wait for .5 before upgrading" once
> again
> > > > >>>>> looking
> > > > >>>>>> like
> > > > >>>>>>> a good guideline.
> > > > >>>>>>>
> > > > >>>>>>> I’m starting to think that the entire model of “write a bunch
> > of
> > > > new
> > > > >>>>>>> features all at once and then try to stabilize it for
> release”
> > is
> > > > >>>>> broken.
> > > > >>>>>>> We’ve been trying that for years and empirically speaking the
> > > > >>>>> evidence
> > > > >>>>> is
> > > > >>>>>>> that it just doesn’t work, either from a stability standpoint
> > or
> > > > even
> > > > >>>>>> just
> > > > >>>>>>> shipping on time.
> > > > >>>>>>>
> > > > >>>>>>> A big reason that it takes us so long to stabilize new
> releases
> > > now
> > > > >>>>> is
> > > > >>>>>>> that, because our major release cycle is so long, it’s super
> > > > tempting
> > > > >>>>> to
> > > > >>>>>>> slip in “just one” new feature into bugfix releases, and I’m
> as
> > > > >>>>> guilty
> > > > >>>>> of
> > > > >>>>>>> that as anyone.
> > > > >>>>>>>
> > > > >>>>>>> For similar reasons, it’s difficult to do a meaningful freeze
> > > with
> > > > >>>>> big
> > > > >>>>>>> feature releases.  A look at 3.0 shows why: we have 8099
> > coming,
> > > > but
> > > > >>>>> we
> > > > >>>>>>> also have significant work done (but not finished) on 6230,
> > 7970,
> > > > >>>>> 6696,
> > > > >>>>>> and
> > > > >>>>>>> 6477, all of which are meaningful improvements that address
> > > > >>>>> demonstrated
> > > > >>>>>>> user pain.  So if we keep doing what we’ve been doing, our
> > > choices
> > > > >>>>> are
> > > > >>>>> to
> > > > >>>>>>> either delay 3.0 further while we finish and stabilize these,
> > or
> > > we
> > > > >>>>> wait
> > > > >>>>>>> nine months to a year for the next release.  Either way, one
> of
> > > our
> > > > >>>>>>> constituencies gets disappointed.
> > > > >>>>>>>
> > > > >>>>>>> So, I’d like to try something different.  I think we were on
> > the
> > > > >>>>> right
> > > > >>>>>>> track with shorter releases with more compatibility.  But I’d
> > > like
> > > > to
> > > > >>>>>> throw
> > > > >>>>>>> in a twist.  Intel cuts down on risk with a “tick-tock”
> > schedule
> > > > for
> > > > >>>>> new
> > > > >>>>>>> architectures and process shrinks instead of trying to do
> both
> > at
> > > > >>>>> once.
> > > > >>>>>> We
> > > > >>>>>>> can do something similar here:
> > > > >>>>>>>
> > > > >>>>>>> One month releases.  Period.  If it’s not done, it can wait.
> > > > >>>>>>> *Every other release only accepts bug fixes.*
> > > > >>>>>>>
> > > > >>>>>>> By itself, one-month releases are going to dramatically
> reduce
> > > the
> > > > >>>>>>> complexity of testing and debugging new releases -- and bugs
> > that
> > > > do
> > > > >>>>> slip
> > > > >>>>>>> past us will only affect a smaller percentage of users,
> > avoiding
> > > > the
> > > > >>>>> “big
> > > > >>>>>>> release has a bunch of bugs no one has seen before and pretty
> > > much
> > > > >>>>>> everyone
> > > > >>>>>>> is hit by something” scenario.  But by adding in the second
> > > rule, I
> > > > >>>>> think
> > > > >>>>>>> we have a real chance to make a quantum leap here: stable,
> > > > >>>>>> production-ready
> > > > >>>>>>> releases every two months.
> > > > >>>>>>>
> > > > >>>>>>> So here is my proposal for 3.0:
> > > > >>>>>>>
> > > > >>>>>>> We’re just about ready to start serious review of 8099.  When
> > > > that’s
> > > > >>>>>> done,
> > > > >>>>>>> we branch 3.0 and cut a beta and then release candidates.
> > > Whatever
> > > > >>>>> isn’t
> > > > >>>>>>> done by then, has to wait; unlike prior betas, we will only
> > > accept
> > > > >>>>> bug
> > > > >>>>>>> fixes into 3.0 after branching.
> > > > >>>>>>>
> > > > > >>>>>>> One month after 3.0, we will ship 3.1 (with new features).  At the
> > > > > >>>>>>> same time, we will branch 3.2.  New features in trunk will go into
> > > > > >>>>>>> 3.3.  The 3.2 branch will only get bug fixes.  We will maintain
> > > > > >>>>>>> backwards compatibility for all of 3.x; eventually (no less than a
> > > > > >>>>>>> year) we will pick a release to be 4.0, and drop deprecated
> > > > > >>>>>>> features and old backwards compatibilities.  Otherwise there will
> > > > > >>>>>>> be nothing special about the 4.0 designation.  (Note that with an
> > > > > >>>>>>> “odd releases have new features, even releases only have bug
> > > > > >>>>>>> fixes” policy, 4.0 will actually be *more* stable than 3.11.)
> > > > >>>>>>>
> > > > >>>>>>> Larger features can continue to be developed in separate
> > > branches,
> > > > >>>>> the
> > > > >>>>>> way
> > > > >>>>>>> 8099 is being worked on today, and committed to trunk when
> > ready.
> > > > So
> > > > >>>>>> this
> > > > >>>>>>> is not saying that we are limited only to features we can
> build
> > > in
> > > > a
> > > > >>>>>> single
> > > > >>>>>>> month.
> > > > >>>>>>>
> > > > > >>>>>>> Some things will have to change with our dev process, for the
> > > > > >>>>>>> better.  In particular, with one month to commit new features, we
> > > > > >>>>>>> don’t have room for committing sloppy work and stabilizing it
> > > > > >>>>>>> later.  Trunk has to be stable at all times.  I asked Ariel
> > > > > >>>>>>> Weisberg to put together his thoughts separately on what worked
> > > > > >>>>>>> for his team at VoltDB, and how we can apply that to Cassandra --
> > > > > >>>>>>> see his email from Friday <http://bit.ly/1MHaOKX>.  (TLDR:
> > > > > >>>>>>> Redefine “done” to include automated tests.  Infrastructure to run
> > > > > >>>>>>> tests against github branches before merging to trunk.  A new test
> > > > > >>>>>>> harness for long-running regression tests.)
> > > > >>>>>>>
> > > > >>>>>>> I’m optimistic that as we improve our process this way, our
> > even
> > > > >>>>> releases
> > > > >>>>>>> will become increasingly stable.  If so, we can skip
> sub-minor
> > > > >>>>> releases
> > > > >>>>>>> (3.2.x) entirely, and focus on keeping the release train
> > moving.
> > > > In
> > > > >>>>> the
> > > > >>>>>>> meantime, we will continue delivering 2.1.x stability
> releases.
> > > > >>>>>>>
> > > > >>>>>>> This won’t be an entirely smooth transition.  In particular,
> > you
> > > > will
> > > > >>>>>> have
> > > > >>>>>>> noticed that 3.1 will get more than a month’s worth of new
> > > features
> > > > >>>>> while
> > > > >>>>>>> we stabilize 3.0 as the last of the old way of doing things,
> so
> > > > some
> > > > >>>>>>> patience is in order as we try this out.  By 3.4 and 3.6
> later
> > > this
> > > > >>>>> year
> > > > >>>>>> we
> > > > >>>>>>> should have a good idea if this is working, and we can make
> > > > >>>>> adjustments
> > > > >>>>>> as
> > > > >>>>>>> warranted.
> > > > >>>>>>>
> > > > >>>>>>> --
> > > > >>>>>>> Jonathan Ellis
> > > > >>>>>>> Project Chair, Apache Cassandra
> > > > >>>>>>> co-founder, http://www.datastax.com
> > > > >>>>>>> @spyced
> > > > >>>>>
> > > > >>>>
> > > > >>>
> > > > >
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > Thanks,
> > Phil Yang
> >
>
