> there was "some" reason that even major changes had to be
> squeezed into 3.0 before it was released
The TL;DR is: having One Version to Rule Them All forces a slew of
changes into majors only, since bumping the MessagingService Version
has far-reaching impacts. Reference:
https://issues.apache.org/jira/browse/CASSANDRA-12042

With this setup, it doesn't matter what arbitrary date we put on a
calendar for a release; there will always be a bunch of things in
flight that get cut in scope to squeeze them in, because anything
blocked behind a protocol bump otherwise faces a 12-15 month delay to
get out the door. In part, tick-tock was an effort to ease that
pressure ('we release infrequently, so people are pressured to cram
things into a release, since protocol version changes are
infrequent'), though we never got as far as ironing out 12042 and
fully closing the loop on that approach.
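
To make the impact concrete, here's a rough sketch of the pattern.
This is not lifted from the codebase (the Ping message and the version
constants are invented), but it mirrors the shape of our
IVersionedSerializer interface, where every internode serializer takes
the peer's messaging version so that mixed-version clusters keep
talking to each other during a rolling upgrade:

    import java.io.*;

    // Toy example only: the message and version constants are made up;
    // the real constants live in MessagingService. The point is that any
    // change to an on-wire format has to hide behind a version check.
    final class PingSerializer
    {
        static final int VERSION_30 = 10;
        static final int VERSION_40 = 12;

        // Write the message in a format the receiving node understands.
        void serialize(long ts, boolean newFlag, DataOutput out, int version)
            throws IOException
        {
            out.writeLong(ts);
            if (version >= VERSION_40)
                out.writeBoolean(newFlag); // field only on the new format
        }

        // Read either the old or the new format, depending on the sender.
        long deserialize(DataInput in, int version) throws IOException
        {
            long ts = in.readLong();
            if (version >= VERSION_40)
                in.readBoolean(); // tolerate the extra field from newer peers
            return ts;
        }

        public static void main(String[] args) throws IOException
        {
            PingSerializer s = new PingSerializer();
            // A 4.0 node talking to a 3.0 peer must still use the old
            // format, so the new field stays dark until the cluster moves.
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            s.serialize(42L, true, new DataOutputStream(buf), VERSION_30);
            DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buf.toByteArray()));
            System.out.println(s.deserialize(in, VERSION_30)); // prints 42
        }
    }

Until the cluster-wide messaging version is bumped, the VERSION_40
branch is effectively dead code, so anything that needs the new field
sits and waits for whichever release is allowed to bump the version.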

On Tue, Oct 3, 2017 at 5:56 AM, kurt greaves <k...@instaclustr.com> wrote:
> Well this is all terribly interesting. I was actually going to get some
> discussion going about this during my talk, which unfortunately didn't
> happen, but I'll take this opportunity to push my agenda. My 99 cents:
>
> *tl;dr: we should probably just focus on not releasing completely broken
> features in the first place, and we should do that through user
> engagement/testing wooo!*
>
> Some context to begin with, because I think this needs to be spelled out.
> Cassandra is a database. People treat databases as their prized possession.
> It stores all their sweet sweet data, and undoubtedly that data is the most
> important component in their system. Without it, there is no point in
> having a system. Users expect their database to be the most stable
> component of their system, and generally they won't upgrade it without
> being absolutely, positively sure that the new version will work at least
> exactly as the old one did. All our users treat their database in exactly
> the same way. Change happens slowly in the database world, and generally
> this is true both for the database and the users of the database. "C* 3.0.0
> is out tomorrow! let's upgrade!" - said no one ever.
>
> Anyway, with that out of the way, back to the crux of the issue. This may
> get long and unwieldy, and derail the actual thread, but in this case I
> think for good reason. Either way it's all relevant to the actual topic.
>
> I think it's worth taking a step back and looking at the actual situation
> and what brought us here, rather than just proposing a solution that's
> really a band-aid over the real issue. These are the problems I've seen
> that have caused a lot of the pain with new features, and an indication
> that we need to change the way we manage our releases and major changes.
>
>    1. We pushed out large feature sets with minimal testing of said
>    features. At that stage we had no requirement for clean passing tests on
>    commit, and overall we didn't have a strong commitment to writing tests
>    either. In 3.10 this changed: we required that dtests and utests pass,
>    and that new tests be written for each change. Any change prior to 3.10
>    was subject to many flaky tests with minimal coverage, and many features
>    were only partially tested and committed anyway.
>
>    2. We rushed features to meet deadlines, or simply didn't give them
>    enough time + thought in the conception phase because of deadlines.
>    I've never met an arbitrary deadline that made things better. From
>    looking at lots of old tickets, there was "some" reason that even major
>    changes had to be squeezed into 3.0 before it was released, which resulted
>    in a lack of attention and testing for these features. Rather than waiting
>    until things were ready before committing them, we cut scope so they
>    would fit. I honestly don't know how this could ever make sense for a
>    volunteer-driven project. In fact I don't really know how it works well
>    for any software project; it generally just ends in bad software. It might
>    make sense for a business pushing the feature agenda for $$, or where a
>    project's users don't care about stability (lol), but it still results in
>    bad software. It definitely doesn't make sense for an open source project.
>
>    3. We didn't do any system-wide verification/integration testing of
>    features; we essentially relied on dtests and unit tests. I touched on
>    this in 1, but we don't have much system testing. dtests kind of cover it,
>    but not really well. cstar is also used in some cases but is limited in
>    scope (performance only, really). We're lucky that we can cover a lot of
>    cases with dtests, but it seems to me that we don't capture many of the
>    cases where feature X affects feature Y. E.g. the effect of repairs on
>    everything ever, but mostly vnodes. We really need a proper testing
>    cluster for each version we put out, and to test new and existing features
>    extensively to measure their worth. Instaclustr is looking at this but
>    we're still a ways off from having something up and running.
>    On this note, we also changed defaults prematurely, but we couldn't have
>    known it was premature until we did, since if we hadn't changed the
>    defaults those features probably wouldn't have received much usage.
>
>    4. Our community is made up mostly of power users, and most of these are
>    still on older versions (2.0, 2.1). There is little reason for these users
>    to upgrade to newer versions, and little reason to use the new features
>    (even if they were the ones developing them). It's actually great that
>    the power users have been adding functionality to Cassandra for new
>    users; however, we haven't really engaged with these users to go and
>    verify that functionality, and we did a pretty half-arsed job of testing
>    it ourselves. We essentially just rolled it out and waited for the bug
>    reports.
>    IMO this is where the "experimental flag" comes in. We rolled out a
>    bunch of stuff; a year later some people started using it and realised it
>    didn't quite work, but they had already invested a lot of time in it, and
>    all of a sudden there's a world of issues and we realise we never should
>    have rolled it out in the first place. It's tempting to just say "let's
>    put in an experimental flag so this doesn't happen again and we'll be all
>    G", but that won't actually fix the problem; it's much like the
>    changing-the-defaults problem.
>
> Now, in a perfect world we would have the testing in place to not need an
> "experimental" flag, which I think is what we should actually aim for. In
> the meantime an experimental flag *may* be necessary, but so far I'm not
> really convinced. If we just mark a feature as experimental it will scare a
> lot of users off, and these new features will get a lot less coverage.
> Granted, there will be fewer problems, but only because fewer people are
> using it, especially since there's no indication of when it will actually be
> production ready. On that note, how do we even decide when it is production
> ready? It's bound to be something arbitrary like "we haven't seen a
> horrible bug in 6 months", which is no better than what we currently have.
> This sort of thing detracts from the usefulness of Cassandra, and gives
> nice big opportunities for someone to come along and do it better than us.
>
> I actually think a better solution here is more user engagement/testing in
> the release process. If there are users actually out there who want these
> features, they should be willing to help us test them prior to release. If
> each feature can get exposed to a few different use cases on real *staging*
> clusters, we could verify functionality a lot more easily. This would have
> been cake with MV's, as there are many users managing their own views who
> could have simply replaced them with MV's in their staging environment. This
> can be applied to a lot of other features as well (incremental repairs
> replacing full repairs, SASI replacing regular secondary indexes or even
> Solr); it just requires some buy-in from the userbase, which I'm sure we'd
> find, because if we didn't there would be no reason to write the feature in
> the first place. This would put us in a much better position than an
> experimental flag, which would essentially require us to do exactly the same
> thing in order to make a feature "production ready"; however, those
> experimental features may never end up getting the attention they need to
> become production ready. You could argue that if someone really wanted it
> then they'd push to get it out of an experimental state, but I think you'd
> find that most users will only consider what's readily available to them.
>
> And finally, back onto the original topic. I'm not convinced that MV's need
> this treatment now. Zhao and Paulo (and others+reviewers) have made quite a
> lot of fixes; granted, there are still some outstanding bugs, but the
> majority of the bad ones have been fixed in 3.11.1 and 3.0.15, and the
> remaining bugs mostly only affect views with a poor data model. Plus we
> already require the known broken components to be enabled via a flag. Also,
> at this point it's not worth making them experimental because a lot of
> users are already using them; it's a bit late to go and do that. We should
> just continue to try and fix them, or where that's not possible, clearly
> document the use cases that should be avoided.
>
> Frankly, marking features that loads of users have already invested in as
> experimental feels to me a bit like a kick in the teeth to said users.
> Almost like telling them "we're actually not going to support this,
> surprise". If it's a big deal, we should probably just fix the issues. If
> anyone knows of some really pressing issues I'm unaware of, feel free to
> fill me in. The only issue raised in this thread so far is the lack of a
> tool to repair consistency between a view and its base table. While I think
> this is necessary, it really shouldn't be a major problem on the latest
> releases, and really, if the view loses consistency with the base, waiting
> for some kind of repair to fix it isn't much better than just rebuilding it
> from scratch. This is one case where we should document the possible causes
> of an inconsistent view and the way to fix it (which is essentially: you had
> an outage, now you need to rebuild it), along with a prominent warning in
> the docs.
>
> And to bring it all back to my initial comment about slow-moving databases
> and change and things... We've only just got stricter w.r.t. testing in
> 3.10, and we've hardly given 3.11 a go before coming along and saying "we
> need to make everything experimental so no one gets hurt!". Change is and
> should be slow in the database world, and science should be applied. At the
> very least, before we get too crazy, we should see whether the changes to
> how we do testing have a positive effect on future features.
> This also comes back to the deadline situation I mentioned earlier. While
> we haven't formally changed how releases are scheduled/managed, we've
> informally moved to a strategy of "we'll have these problems solved before
> we do the next release". I think this will also be a huge improvement to
> the stability/production readiness of new features in 4.0. (ps: we should
> formalise that but that's a whole 'nother wall of text)
>
> Anyway, I have lots more to say on this and related topics but I see Josh
> is already raising one of my points against experimental flags now, and
> this is probably enough words for one email.

