+1

-- 
AY

On March 17, 2015 at 14:07:03, Jonathan Ellis (jbel...@gmail.com) wrote:

Cassandra 2.1 was released in September, which means that if we were on  
track with our stated goal of six month releases, 3.0 would be done about  
now. Instead, we haven't even delivered a beta. The immediate cause this  
time is blocking for 8099  
<https://issues.apache.org/jira/browse/CASSANDRA-8099>, but the reality is  
that nobody should really be surprised. Something always comes up -- we've  
averaged about nine months since 1.0, with 2.1 taking an entire year.  

We could make theory align with reality by acknowledging, "if nine months  
is our 'natural' release schedule, then so be it." But I think we can do  
better.  

Broadly speaking, we have two constituencies with Cassandra releases:  

First, we have the users who are building or porting an application on  
Cassandra. These users want the newest features to make their job easier.  
If 2.1.0 has a few bugs, it's not the end of the world. They have time to  
wait for 2.1.x to stabilize while they write their code. They would like  
to see us deliver on our six month schedule or even faster.  

Second, we have the users who have an application in production. These  
users, or their bosses, want Cassandra to be as stable as possible.  
Assuming they deploy on a stable release like 2.0.12, they don't want to  
touch it. They would like to see us release *less* often. (Because that
means they have to do fewer upgrades while remaining in our backwards
compatibility window.)

With our current "big release every X months" model, these users' needs are  
in tension.  

We discussed this six months ago, and ended up with this:  

> What if we tried a [four month] release cycle, BUT we would guarantee that
> you could do a rolling upgrade until we bump the supermajor version? So 2.0
> could upgrade to 3.0 without having to go through 2.1. (But to go to 3.1
> or 4.0 you would have to go through 3.0.)
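
To make that guarantee concrete, here is a minimal sketch in Java (purely
illustrative -- the class and method names are invented for the example, this
is not actual Cassandra code) of the upgrade-path rule it implies:

    // Illustrative sketch of the proposed rolling-upgrade guarantee; not project code.
    // Rule: rolling upgrades work within a supermajor line, and from any X.y to (X+1).0,
    // but not past the .0 of the next supermajor and not across two supermajors.
    final class UpgradePolicy
    {
        static boolean rollingUpgradeSupported(int fromMajor, int fromMinor, int toMajor, int toMinor)
        {
            if (toMajor == fromMajor)
                return toMinor >= fromMinor;   // e.g. 2.0 -> 2.1, 3.0 -> 3.3
            if (toMajor == fromMajor + 1)
                return toMinor == 0;           // e.g. 2.0 -> 3.0 directly, skipping 2.1
            return false;                      // e.g. 2.0 -> 3.1 or 2.x -> 4.0: go through 3.0 first
        }

        public static void main(String[] args)
        {
            System.out.println(rollingUpgradeSupported(2, 0, 3, 0)); // true
            System.out.println(rollingUpgradeSupported(2, 0, 3, 1)); // false
            System.out.println(rollingUpgradeSupported(2, 1, 4, 0)); // false
        }
    }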

Crucially, I added  

> Whether this is reasonable depends on how fast we can stabilize releases.
> 2.1.0 will be a good test of this.

Unfortunately, even after DataStax hired half a dozen full-time test  
engineers, 2.1.0 continued the proud tradition of being unready for  
production use, with "wait for .5 before upgrading" once again looking like  
a good guideline.  

I’m starting to think that the entire model of “write a bunch of new  
features all at once and then try to stabilize it for release” is broken.  
We’ve been trying that for years, and empirically it just doesn’t work,
either for stability or for shipping on time.

A big reason that it takes us so long to stabilize new releases now is
that, because our major release cycle is so long, it’s super tempting to
slip “just one” new feature into bugfix releases, and I’m as guilty of
that as anyone.

For similar reasons, it’s difficult to do a meaningful freeze with big  
feature releases. A look at 3.0 shows why: we have 8099 coming, but we  
also have significant work done (but not finished) on 6230, 7970, 6696, and  
6477, all of which are meaningful improvements that address demonstrated  
user pain. So if we keep doing what we’ve been doing, our choices are to
either delay 3.0 further while we finish and stabilize these, or to wait
nine months to a year for the next release. Either way, one of our
constituencies gets disappointed.

So, I’d like to try something different. I think we were on the right  
track with shorter releases with more compatibility. But I’d like to throw  
in a twist. Intel cuts down on risk with a “tick-tock” schedule for new  
architectures and process shrinks instead of trying to do both at once. We  
can do something similar here:  

One month releases. Period. If it’s not done, it can wait.  
*Every other release only accepts bug fixes.*  

By themselves, one-month releases are going to dramatically reduce the
complexity of testing and debugging new releases -- and bugs that do slip  
past us will only affect a smaller percentage of users, avoiding the “big  
release has a bunch of bugs no one has seen before and pretty much everyone  
is hit by something” scenario. But by adding in the second rule, I think  
we have a real chance to make a quantum leap here: stable, production-ready  
releases every two months.  

So here is my proposal for 3.0:  

We’re just about ready to start serious review of 8099. When that’s done,  
we branch 3.0 and cut a beta and then release candidates. Whatever isn’t
done by then has to wait; unlike prior betas, we will only accept bug
fixes into 3.0 after branching.  

One month after 3.0, we will ship 3.1 (with new features). At the same  
time, we will branch 3.2. New features in trunk will go into 3.3. The 3.2  
branch will only get bug fixes. We will maintain backwards compatibility  
for all of 3.x; eventually (no less than a year) we will pick a release to  
be 4.0, and drop deprecated features and old backwards compatibilities.  
Otherwise there will be nothing special about the 4.0 designation. (Note  
that with an “odd releases have new features, even releases only have bug  
fixes” policy, 4.0 will actually be *more* stable than 3.11.)  
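
As a minimal illustration of that numbering convention (again not project
code -- the class and method names here are invented for the example), the
rule reduces to a parity check on the minor version:

    // Illustrative sketch of the proposed tick-tock rule for the 3.x line; not project code.
    // Odd minors (3.1, 3.3, ...) accept new features; even minors (3.2, 3.4, ...) are bug-fix only.
    final class TickTock
    {
        static boolean acceptsNewFeatures(int minor)
        {
            return minor % 2 == 1;
        }

        public static void main(String[] args)
        {
            System.out.println(acceptsNewFeatures(1)); // true:  3.1 ships new features
            System.out.println(acceptsNewFeatures(2)); // false: 3.2 only gets bug fixes
        }
    }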

Larger features can continue to be developed in separate branches, the way  
8099 is being worked on today, and committed to trunk when ready. So this  
is not saying that we are limited only to features we can build in a single  
month.  

Some things will have to change with our dev process, for the better. In  
particular, with one month to commit new features, we don’t have room for  
committing sloppy work and stabilizing it later. Trunk has to be stable at  
all times. I asked Ariel Weisberg to put together his thoughts separately  
on what worked for his team at VoltDB, and how we can apply that to  
Cassandra -- see his email from Friday <http://bit.ly/1MHaOKX>. (TLDR:  
Redefine “done” to include automated tests. Infrastructure to run tests  
against github branches before merging to trunk. A new test harness for  
long-running regression tests.)  

I’m optimistic that as we improve our process this way, our even releases  
will become increasingly stable. If so, we can skip sub-minor releases  
(3.2.x) entirely, and focus on keeping the release train moving. In the  
meantime, we will continue delivering 2.1.x stability releases.  

This won’t be an entirely smooth transition. In particular, you will have  
noticed that 3.1 will get more than a month’s worth of new features while  
we stabilize 3.0 as the last of the old way of doing things, so some  
patience is in order as we try this out. By 3.4 and 3.6 later this year we  
should have a good idea if this is working, and we can make adjustments as  
warranted.  

--  
Jonathan Ellis  
Project Chair, Apache Cassandra  
co-founder, http://www.datastax.com  
@spyced  
