I know we’ve got a lot of folks following the dev list without a lot of 
background, so let’s make sure we get some context here so everyone can be on 
the same page. 

Going to preface this wall of text by saying I’m +1 on a 3.5.1 (and 3.3.1, etc) 
if it’s done AFTER 3.9 (I think we need to get 3.9 out first before the RE 
manpower is spent on backporting fixes, even critical fixes, because 3.9 has 
multiple critical fixes for people running 3.7). 

Now some background: 

For many years, Cassandra had a dev process that kept 3 active branches: a 
“bleeding edge”, a “stable”, and an “old stable” branch. Developers would 
commit ALL new contributions to the bleeding edge, non-API-breaking changes to 
stable, and bugfixes only to old stable. While the API changed and major 
features were added, that bleeding edge would just be ‘trunk’, and it’d get cut 
into a major version when it was ready to ship. We saw that with 2.2 / 2.1 / 
2.0 (and before that, 2.1 / 2.0 / 1.2, and before that 2.0 / 1.2 / 1.1). When 
that bleeding edge got released as a major x.y.0, the third, oldest, most 
stable branch went EOL, and new features would go into trunk for the next major 
version. 

There were two big negatives observed with this:

The first big negative is that if multiple major new features were in flight, 
releases were prone to delay. Nobody wants to break an API on an x.y.1 release, 
and nobody wants to add a new feature to an x.y.2 release, so the project would 
delay the x.y releases if major features were close, and then there’d be 
pressure to slip them in before they were fully tested, or cut features to 
avoid delaying the release. This pressure was observed to be bad for the 
project – it forced technical compromises. 

The second downside observed was that nobody would try to run the new versions 
when they launched, because being filled with new features made them buggy. 
2.2, for example, introduced RBAC, commitlog compression, and user defined 
functions – major features that needed to be tested. Unfortunately, because 
there were few real-world testers, major bugs were still being found for 
months – the first production-ready version of 2.2 is probably in the 2.2.5 or 
2.2.6 range. 

For version 3, we moved to an alternate release model, based on Intel’s 
tick/tock: https://en.wikipedia.org/wiki/Tick-Tock_model

The intention was to allow new features into 3.even releases (3.0, 3.2, 3.4, 
3.6, and so on), with bugfixes in 3.odd releases (3.1, … ). The hope was to 
allow more frequent releases to address the first big negative (flood of new 
features that blocked releases), while also helping to address the second – 
with fewer major features in a release, each one should get more and better 
test coverage.
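
To make the numbering concrete, here is a minimal Python sketch of the 
even/odd rule as described above. It is illustrative only: the function names 
are hypothetical and not part of any Cassandra tooling, and it ignores patch 
releases such as 3.0.x.

```python
def release_kind(version: str) -> str:
    """Classify a 3.x tick/tock release by its minor number.

    Even minors (3.0, 3.2, ...) take new features; odd minors
    (3.1, 3.3, ...) are meant to carry bugfixes only.
    """
    major, minor = (int(part) for part in version.split(".")[:2])
    if major != 3:
        raise ValueError("this sketch only covers 3.x tick/tock releases")
    return "feature" if minor % 2 == 0 else "bugfix"


def next_bugfix_release(version: str) -> str:
    """Return the next odd 3.x release after `version`.

    Under tick/tock, this is where a user is expected to look for fixes.
    """
    major, minor = (int(part) for part in version.split(".")[:2])
    if major != 3:
        raise ValueError("this sketch only covers 3.x tick/tock releases")
    # From an even release the next odd is minor + 1; from an odd one, minor + 2.
    next_odd = minor + 1 if minor % 2 == 0 else minor + 2
    return f"{major}.{next_odd}"
```

For a user on 3.5, next_bugfix_release("3.5") gives "3.7", which means 
stepping over the feature release 3.6 to pick up the fixes.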

In the tick/tock model, anyone running 3.odd (like 3.5) should be looking for 
bugfixes in 3.7. It’s certainly true that 3.5 is horribly broken (as is 3.3, 
and 3.4, etc), but with this release model, the bugfix SHOULD BE in 3.7. As I 
mentioned previously, we have precedent for backporting critical fixes, but we 
don’t have a well defined bar (that I see) for what’s critical enough for a 
backport. 

What Jon is noting (and what many of us who run Cassandra in production have 
known for a very long time) is that nobody wants to run 3.newest (even or odd), 
because 3.newest is likely broken (because it’s a complex distributed database, 
and testing is hard, and it takes time and complex workloads to find bugs). In 
the tick/tock model, because new features went into 3.6, 3.7 carries new 
features that may not be adequately tested/validated. A user of 3.5 doesn’t 
want those features, and isn’t willing to accept the risk.

The bottom line here is that tick/tock is probably a well intentioned but 
failed attempt to bring stability to Cassandra’s releases. The problems 
tick/tock was meant to solve are real problems, but tick/tock doesn’t seem to 
be addressing them – new features invalidate old testing, which makes it 
difficult/impossible for real users to sit on the 3.odd versions.   

We’re due for cutting 3.9 and 3.0.9, and we have limited RE manpower to get 
those out. Only after those are out would I be +1 on a 3.5.1, and then only 
because if I were running 3.5, and I hit this bug, I wouldn’t want to spend the 
~$100k it would cost my organization to validate 3.7 prior to upgrading, and I 
don’t think it’s reasonable to ask users to recompile a release for a ~10 line 
fix for a very nasty bug. 

I also very strongly recommend we (committers/PMC) reconsider tick/tock for 
4.x releases, because this is exactly the type of problem that will continue to 
happen as we move forward. I suggest that we either need to go back to the old 
model and do a better job of dealing with feature creep and testing, or we need 
to better define what gets backported, because the community needs a stable 
version to run, and running the latest odd release of tick/tock isn’t it.

- Jeff


On 9/15/16, 10:31 AM, "dave_les...@apple.com on behalf of Dave Lester" 
<dave_les...@apple.com> wrote:

>How would cutting a 3.5.1 release possibly confuse users of the software? It 
>would be easy to document the change and to send release notes.
>
>Given the bug’s critical nature and that it's a minor fix, I’m +1 
>(non-binding) to a new release.
>
>Dave
>
>> On Sep 15, 2016, at 7:18 AM, Jeremiah D Jordan wrote:
>> 
>> I’m with Jeff on this, 3.7 (bug fixes on 3.6) has already been released with 
>> the fix.  Since the fix applies cleanly anyone is free to put it on top of 
>> 3.5 on their own if they like, but I see no reason to put out a 3.5.1 right 
>> now and confuse people further.
>> 
>> -Jeremiah
>> 
>> 
>>> On Sep 15, 2016, at 9:07 AM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>> 
>>> As a follow up, I suppose I'm only advocating for a fix to the odd
>>> releases.  Sadly, Tick Tock versioning is misleading.
>>> 
>>> If tick tock were to continue (and I'm very much against how it currently
>>> works) the whole even-features odd-fixes thing needs to stop ASAP, all it
>>> does is confuse people.
>>> 
>>> The follow up to 3.4 (3.5) should have been 3.4.1, following semver, so
>>> people know it's bug fixes only to 3.4.
>>> 
>>> Jon
>>> 
>>> On Wed, Sep 14, 2016 at 10:37 PM Jonathan Haddad <j...@jonhaddad.com> wrote:
>>> 
>>>> In this particular case, I'd say adding a bug fix release for every
>>>> version that's affected would be the right thing.  The issue is so easily
>>>> reproducible and will likely result in massive data loss for anyone on 3.X
>>>> WHERE X < 6 and uses the "date" type.
>>>> 
>>>> This is how easy it is to reproduce:
>>>> 
>>>> 1. Start Cassandra 3.5
>>>> 2. create KEYSPACE test WITH replication = {'class': 'SimpleStrategy',
>>>> 'replication_factor': 1};
>>>> 3. use test;
>>>> 4. create table fail (id int primary key, d date);
>>>> 5. delete d from fail where id = 1;
>>>> 6. Stop Cassandra
>>>> 7. Start Cassandra
>>>> 
>>>> You will get this, and startup will fail:
>>>> 
>>>> ERROR 05:32:09 Exiting due to error while processing commit log during
>>>> initialization.
>>>> org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException:
>>>> Unexpected error deserializing mutation; saved to
>>>> /var/folders/0l/g2p6cnyd5kx_1wkl83nd3y4r0000gn/T/mutation6313332720566971713dat.
>>>> This may be caused by replaying a mutation against a table with the same
>>>> name but incompatible schema.  Exception follows:
>>>> org.apache.cassandra.serializers.MarshalException: Expected 4 byte long for
>>>> date (0)
>>>> 
>>>> I mean.. come on.  It's an easy fix.  It cleanly merges against 3.5 (and
>>>> probably the other releases) and requires very little investment from
>>>> anyone.
>>>> 
>>>> 
>>>> On Wed, Sep 14, 2016 at 9:40 PM Jeff Jirsa <jeff.ji...@crowdstrike.com>
>>>> wrote:
>>>> 
>>>>> We did 3.1.1 and 3.2.1, so there’s SOME precedent for emergency fixes,
>>>>> but we certainly didn’t/won’t go back and cut new releases from every
>>>>> branch for every critical bug in future releases, so I think we need to
>>>>> draw the line somewhere. If it’s fixed in 3.7 and 3.0.x (x >= 6), it seems
>>>>> like you’ve got options (either stay on the tick and go up to 3.7, or bail
>>>>> down to 3.0.x)
>>>>> 
>>>>> Perhaps, though, this highlights the fact that tick/tock may not be the
>>>>> best option long term. We’ve tried it for a year, perhaps we should instead
>>>>> discuss whether or not it should continue, or if there’s another process
>>>>> that gives us a better way to get useful patches into versions people are
>>>>> willing to run in production.
>>>>> 
>>>>> 
>>>>> 
>>>>> On 9/14/16, 8:55 PM, "Jonathan Haddad" <j...@jonhaddad.com> wrote:
>>>>> 
>>>>>> Common sense is what prevents someone from upgrading to yet another
>>>>>> completely unknown version with new features which have probably broken
>>>>>> even more stuff that nobody is aware of.  The folks I'm helping right now
>>>>>> deployed 3.5 when they got started because http://cassandra.apache.org
>>>>>> suggests it's acceptable for production.  It turns out using 4 of the
>>>>>> built in
>>>>>> datatypes of the database result in the server being unable to restart
>>>>>> without clearing out the commit logs and running a repair.  That screams
>>>>>> critical to me.  You shouldn't even be able to install 3.5 without the
>>>>>> patch I've supplied - that bug is a ticking time bomb for anyone that
>>>>>> installs it.
>>>>>> 
>>>>>> On Wed, Sep 14, 2016 at 8:12 PM Michael Shuler <mich...@pbandjelly.org>
>>>>>> wrote:
>>>>>> 
>>>>>>> What's preventing the use of the 3.6 or 3.7 releases where this bug is
>>>>>>> already fixed? This is also fixed in the 3.0.6/7/8 releases.
>>>>>>> 
>>>>>>> Michael
>>>>>>> 
>>>>>>> On 09/14/2016 08:30 PM, Jonathan Haddad wrote:
>>>>>>>> Unfortunately CASSANDRA-11618 was fixed in 3.6 but was not back ported to
>>>>>>>> 3.5 as well, and it makes Cassandra effectively unusable if someone is
>>>>>>>> using any of the 4 types affected in any of their schema.
>>>>>>>> 
>>>>>>>> I have cherry picked & merged the patch back to here and will put it in a
>>>>>>>> JIRA as well tonight, I just wanted to get the ball rolling asap on this.
>>>>>>>> 
>>>>>>>> https://github.com/rustyrazorblade/cassandra/tree/fix_commitlog_exception
>>>>>>>> 
>>>>>>>> Jon
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> 
>
