I know we’ve got a lot of folks following the dev list without a lot of background, so let’s make sure we get some context here so everyone can be on the same page.

Going to preface this wall of text by saying I’m +1 on a 3.5.1 (and 3.3.1, etc.) if it’s done AFTER 3.9. I think we need to get 3.9 out first before the RE manpower is spent on backporting fixes, even critical fixes, because 3.9 has multiple critical fixes for people running 3.7.

Now some background:

For many years, Cassandra had a dev process that kept 3 active branches: a “bleeding edge”, a “stable”, and an “old stable” branch. Developers would commit ALL new contributions to the bleeding edge, non-api-breaking changes to stable, and bugfixes only to old stable. While the api changed and major features were added, that bleeding edge would just be ‘trunk’, and it’d get cut into a major version when it was ready to ship. We saw that with 2.2 / 2.1 / 2.0 (and before that, 2.1 / 2.0 / 1.2, and before that 2.0 / 1.2 / 1.1). When that bleeding edge got released as a major x.y.0, the third, oldest, most stable branch went EOL, and new features would go into trunk for the next major version.

There were two big negatives observed with this:

The first big negative is that if multiple major new features were in flight, releases were prone to delay. Nobody wants to break an API on an x.y.1 release, and nobody wants to add a new feature to an x.y.2 release, so the project would delay the x.y releases if major features were close, and then there’d be pressure to slip them in before they were fully tested, or cut features to avoid delaying the release. This pressure was observed to be bad for the project: it forced technical compromises.

The second downside was that nobody would try to run the new versions when they launched, because they were full of new features and therefore buggy. 2.2, for example, introduced RBAC, commitlog compression, and user defined functions, all major features that needed to be tested.
Unfortunately, because there were few real-world testers, major bugs were still being found for months: the first production-ready version of 2.2 is probably in the 2.2.5 or 2.2.6 range.

For version 3, we moved to an alternate release model, modeled on Intel’s tick/tock: https://en.wikipedia.org/wiki/Tick-Tock_model

The intention was to allow new features into 3.even releases (3.0, 3.2, 3.4, 3.6, and so on), with bugfixes in 3.odd releases (3.1, …). The hope was that more frequent releases would address the first big negative (a flood of new features that blocked releases), while also helping with the second: with fewer major features in each release, each feature should get better test coverage.

In the tick/tock model, anyone running 3.odd (like 3.5) should be looking for bugfixes in 3.7. It’s certainly true that 3.5 is horribly broken (as is 3.3, and 3.4, etc), but with this release model, the bugfix SHOULD BE in 3.7. As I mentioned previously, we have precedent for backporting critical fixes, but we don’t have a well defined bar (that I see) for what’s critical enough for a backport.

What Jon is noting (and what many of us who run Cassandra in production have known for a very long time) is that nobody wants to run 3.newest (even or odd), because 3.newest is likely broken (because it’s a complex distributed database, and testing is hard, and it takes time and complex workloads to find bugs). In the tick/tock model, because new features went into 3.6, 3.7 carries new features that may not be adequately tested/validated, which a user of 3.5 doesn’t want and whose risk they aren’t willing to accept.

The bottom line here is that tick/tock is probably a well intentioned but failed attempt to bring stability to Cassandra’s releases. The problems tick/tock was meant to solve are real problems, but tick/tock doesn’t seem to be addressing them: new features invalidate old testing, which makes it difficult/impossible for real users to sit on the 3.odd versions.

We’re due for cutting 3.9 and 3.0.9, and we have limited RE manpower to get those out. Only after those are out would I be +1 on a 3.5.1, and then only because if I were running 3.5, and I hit this bug, I wouldn’t want to spend the ~$100k it would cost my organization to validate 3.7 prior to upgrading, and I don’t think it’s reasonable to ask users to recompile a release for a ~10 line fix for a very nasty bug.

I also very strongly recommend we (committers/PMC) reconsider tick/tock for 4.x releases, because this is exactly the type of problem that will continue to happen as we move forward. I suggest that we either need to go back to the old model and do a better job of dealing with feature creep and testing, or we need to better define what gets backported, because the community needs a stable version to run, and running the latest odd release of tick/tock isn’t it.

- Jeff

On 9/15/16, 10:31 AM, "dave_les...@apple.com on behalf of Dave Lester" <dave_les...@apple.com> wrote:

>How would cutting a 3.5.1 release possibly confuse users of the software? It
>would be easy to document the change and to send release notes.
>
>Given the bug’s critical nature and that it's a minor fix, I’m +1
>(non-binding) to a new release.
>
>Dave
>
>> On Sep 15, 2016, at 7:18 AM, Jeremiah D Jordan
>> <jeremiah.jordan@gmail.com> wrote:
>>
>> I’m with Jeff on this, 3.7 (bug fixes on 3.6) has already been released with
>> the fix. Since the fix applies cleanly anyone is free to put it on top of
>> 3.5 on their own if they like, but I see no reason to put out a 3.5.1 right
>> now and confuse people further.
>>
>> -Jeremiah
>>
>>
>>> On Sep 15, 2016, at 9:07 AM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>
>>> As a follow up, I suppose I'm only advocating for a fix to the odd
>>> releases. Sadly, Tick Tock versioning is misleading.
>>>
>>> If tick tock were to continue (and I'm very much against how it currently
>>> works) the whole even-features odd-fixes thing needs to stop ASAP, all it
>>> does is confuse people.
>>>
>>> The follow up to 3.4 (3.5) should have been 3.4.1, following semver, so
>>> people know it's bug fixes only to 3.4.
>>>
>>> Jon
>>>
>>> On Wed, Sep 14, 2016 at 10:37 PM Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>
>>>> In this particular case, I'd say adding a bug fix release for every
>>>> version that's affected would be the right thing. The issue is so easily
>>>> reproducible and will likely result in massive data loss for anyone on 3.X
>>>> WHERE X < 6 and uses the "date" type.
>>>>
>>>> This is how easy it is to reproduce:
>>>>
>>>> 1. Start Cassandra 3.5
>>>> 2. create KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
>>>> 3. use test;
>>>> 4. create table fail (id int primary key, d date);
>>>> 5. delete d from fail where id = 1;
>>>> 6. Stop Cassandra
>>>> 7. Start Cassandra
>>>>
>>>> You will get this, and startup will fail:
>>>>
>>>> ERROR 05:32:09 Exiting due to error while processing commit log during
>>>> initialization.
>>>> org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException:
>>>> Unexpected error deserializing mutation; saved to
>>>> /var/folders/0l/g2p6cnyd5kx_1wkl83nd3y4r0000gn/T/mutation6313332720566971713dat.
>>>> This may be caused by replaying a mutation against a table with the same
>>>> name but incompatible schema. Exception follows:
>>>> org.apache.cassandra.serializers.MarshalException: Expected 4 byte long for
>>>> date (0)
>>>>
>>>> I mean.. come on. It's an easy fix.
>>>> It cleanly merges against 3.5 (and
>>>> probably the other releases) and requires very little investment from
>>>> anyone.
>>>>
>>>>
>>>> On Wed, Sep 14, 2016 at 9:40 PM Jeff Jirsa <jeff.ji...@crowdstrike.com>
>>>> wrote:
>>>>
>>>>> We did 3.1.1 and 3.2.1, so there’s SOME precedent for emergency fixes,
>>>>> but we certainly didn’t/won’t go back and cut new releases from every
>>>>> branch for every critical bug in future releases, so I think we need to
>>>>> draw the line somewhere. If it’s fixed in 3.7 and 3.0.x (x >= 6), it seems
>>>>> like you’ve got options (either stay on the tick and go up to 3.7, or bail
>>>>> down to 3.0.x)
>>>>>
>>>>> Perhaps, though, this highlights the fact that tick/tock may not be the
>>>>> best option long term. We’ve tried it for a year, perhaps we should instead
>>>>> discuss whether or not it should continue, or if there’s another process
>>>>> that gives us a better way to get useful patches into versions people are
>>>>> willing to run in production.
>>>>>
>>>>>
>>>>>
>>>>> On 9/14/16, 8:55 PM, "Jonathan Haddad" <j...@jonhaddad.com> wrote:
>>>>>
>>>>>> Common sense is what prevents someone from upgrading to yet another
>>>>>> completely unknown version with new features which have probably broken
>>>>>> even more stuff that nobody is aware of. The folks I'm helping right now
>>>>>> deployed 3.5 when they got started because http://cassandra.apache.org suggests
>>>>>> it's acceptable for production. It turns out using 4 of the built in
>>>>>> datatypes of the database results in the server being unable to restart
>>>>>> without clearing out the commit logs and running a repair. That screams
>>>>>> critical to me.
>>>>>> You shouldn't even be able to install 3.5 without the
>>>>>> patch I've supplied - that bug is a ticking time bomb for anyone that
>>>>>> installs it.
>>>>>>
>>>>>> On Wed, Sep 14, 2016 at 8:12 PM Michael Shuler <mich...@pbandjelly.org>
>>>>>> wrote:
>>>>>>
>>>>>>> What's preventing the use of the 3.6 or 3.7 releases where this bug is
>>>>>>> already fixed? This is also fixed in the 3.0.6/7/8 releases.
>>>>>>>
>>>>>>> Michael
>>>>>>>
>>>>>>> On 09/14/2016 08:30 PM, Jonathan Haddad wrote:
>>>>>>>> Unfortunately CASSANDRA-11618 was fixed in 3.6 but was not back ported to
>>>>>>>> 3.5 as well, and it makes Cassandra effectively unusable if someone is
>>>>>>>> using any of the 4 types affected in any of their schema.
>>>>>>>>
>>>>>>>> I have cherry picked & merged the patch back to here and will put it in a
>>>>>>>> JIRA as well tonight, I just wanted to get the ball rolling asap on this.
>>>>>>>>
>>>>>>>> https://github.com/rustyrazorblade/cassandra/tree/fix_commitlog_exception
>>>>>>>>
>>>>>>>> Jon