Re: Proposal - 3.5.1

Jeremiah D Jordan Thu, 15 Sep 2016 11:57:59 -0700

Because tick-tock started based off of the 3.0 big bang “we broke everything” 
release I don’t think we can judge whether or not it is working until we are 
another 6 months in.  AKA when we would have been releasing the next big bang 
release.  Right now a lot if not most of the bugs in a given tick tock release 
are bugs that were introduced in 3.0.  Even the bug mentioned here, it is not a 
tick tock bug, it is a 3.0 bug.



> On Sep 15, 2016, at 1:48 PM, Jake Luciani <[email protected]> wrote:
> 
> I'm pretty sure everyone will agree Tick-Tock didn't go well and needs to
> change.
> 
> The problem for me is going back to the old way doesn't sound great. There
> are parts of tick-tock I really like,
> for example, the cadence and limited scope per release.
> 
> I know at the summit there were a lot of ideas thrown around I can
> regurgitate but perhaps people
> who have been thinking about this would like to chime in and present ideas?
> 
> -Jake
> 
> On Thu, Sep 15, 2016 at 2:28 PM, Benedict Elliott Smith <[email protected]
>> wrote:
> 
>> I agree tick-tock is a failure.  But for two reasons IMO:
>> 
>> 1) Ultimately, the users are the real testers and it takes a while for a
>> release to percolate into the wild for feedback.  The reality is that a
>> release doesn't have its tires properly kicked for at least three months
>> after it's cut.  So if we are to have any tocks, they should be completely
>> unwed from the ticks, and should probably happen on a ~3M cadence to keep
>> the labour down but the utility up (and there should probably still be more
>> than one tock per tick)
>> 
>> 2) Those promised resources to improve the process never happened.  We haven't
>> even reached parity with the 2.1 release until very recently, i.e. no
>> failing u/dtests.
>> 
>> 
>> On 15 September 2016 at 19:08, Jeff Jirsa <[email protected]>
>> wrote:
>> 
>>> I know we’ve got a lot of folks following the dev list without a lot of
>>> background, so let’s make sure we get some context here so everyone can
>> be
>>> on the same page.
>>> 
>>> Going to preface this wall of text by saying I’m +1 on a 3.5.1 (and
>> 3.3.1,
>>> etc) if it’s done AFTER 3.9 (I think we need to get 3.9 out first before
>>> the RE manpower is spent on backporting fixes, even critical fixes,
>> because
>>> 3.9 has multiple critical fixes for people running 3.7).
>>> 
>>> Now some background:
>>> 
>>> For many years, Cassandra used to have a dev process that kept 3 active
>>> branches - “bleeding edge”, a “stable”, and an “old stable” branch, where
>>> developers would be committing ALL new contributions to the bleeding
>> edge,
>>> non-api-breaking changes to stable, and bugfixes only to old stable.
>> While
>>> the api changed and major features were added, that bleeding edge would
>>> just be ‘trunk’, and it’d get cut into a major version when it was ready
>> to
>>> ship. We saw that with 2.2 / 2.1 / 2.0 (and before that, 2.1 / 2.0 / 1.2,
>>> and before that 2.0 / 1.2 / 1.1 ). When that bleeding edge got released
>> as
>>> a major x.y.0, the third, oldest, most stable branch went EOL, and new
>>> features would go into trunk for the next major version.
>>> 
>>> There were two big negatives observed with this:
>>> 
>>> The first big negative is that if multiple major new features were in
>>> flight, releases were prone to delay. Nobody wants to break an API on a
>>> x.y.1 release, and nobody wants to add a new feature to a x.y.2 release,
>> so
>>> the project would delay the x.y releases if major features were close,
>> and
>>> then there’d be pressure to slip them in before they were fully tested,
>> or
>>> cut features to avoid delaying the release. This pressure was observed to
>>> be bad for the project – it forced technical compromises.
>>> 
>>> The second downside that was observed was that nobody would try to run
>> the
>>> new versions when they launched, because they were buggy because they
>> were
>>> filled with new features. 2.2, for example, introduced RBAC, commitlog
>>> compression, and user defined functions – major features that needed to
>> be
>>> tested. Unfortunately, because there were few real-world testers, there
>>> were still major bugs being found for months – the first production-ready
>>> version of 2.2 is probably in the 2.2.5 or 2.2.6 range.
>>> 
>>> For version 3, we moved to an alternate release model, based on Intel’s
>>> tick/tock: https://en.wikipedia.org/wiki/Tick-Tock_model
>>> 
>>> The intention was to allow new features into 3.even releases (3.0, 3.2,
>>> 3.4, 3.6, and so on), with bugfixes in 3.odd releases (3.1, … ). The hope
>>> was to allow more frequent releases to address the first big negative
>>> (flood of new features that blocked releases), while also helping to
>>> address the second – with fewer major features in a release, they would
>>> get more/better test coverage.
>>> 
>>> In the tick/tock model, anyone running 3.odd (like 3.5) should be looking
>>> for bugfixes in 3.7. It’s certainly true that 3.5 is horribly broken (as
>> is
>>> 3.3, and 3.4, etc), but with this release model, the bugfix SHOULD BE in
>>> 3.7. As I mentioned previously, we have precedent for backporting
>> critical
>>> fixes, but we don’t have a well defined bar (that I see) for what’s
>>> critical enough for a backport.
>>> 
>>> What Jon is noting (and what many of us who run Cassandra in production have
>>> really known for a very long time) is that nobody wants to run 3.newest
>>> (even or odd), because 3.newest is likely broken (because it’s a complex
>>> distributed database, and testing is hard, and it takes time and complex
>>> workloads to find bugs). In the tick/tock model, because new features went
>>> into 3.6, 3.7 contains new features that may not be adequately
>>> tested/validated; a user of 3.5 doesn’t want them and isn’t willing to
>>> accept the risk.
>>> 
>>> The bottom line here is that tick/tock is probably a well intentioned but
>>> failed attempt to bring stability to Cassandra’s releases. The problems
>>> tick/tock was meant to solve are real problems, but tick/tock doesn’t
>> seem
>>> to be addressing them – new features invalidate old testing, which makes
>> it
>>> difficult/impossible for real users to sit on the 3.odd versions.
>>> 
>>> We’re due for cutting 3.9 and 3.0.9, and we have limited RE manpower to
>>> get those out. Only after those are out would I be +1 on a 3.5.1, and
>> then
>>> only because if I were running 3.5, and I hit this bug, I wouldn’t want
>> to
>>> spend the ~$100k it would cost my organization to validate 3.7 prior to
>>> upgrading, and I don’t think it’s reasonable to ask users to recompile a
>>> release for a ~10 line fix for a very nasty bug.
>>> 
>>> I also very strongly recommend we (committers/PMC) reconsider tick/tock
>>> for 4.x releases, because this is exactly the type of problem that will
>>> continue to happen as we move forward. I suggest that we either need to
>> go
>>> back to the old model and do a better job of dealing with feature creep
>> and
>>> testing, or we need to better define what gets backported, because the
>>> community needs a stable version to run, and running latest odd release
>> of
>>> tick/tock isn’t it.
>>> 
>>> - Jeff
>>> 
>>> 
>>> On 9/15/16, 10:31 AM, "[email protected] on behalf of Dave Lester" <
>>> [email protected]> wrote:
>>> 
>>>> How would cutting a 3.5.1 release possibly confuse users of the
>> software?
>>> It would be easy to document the change and to send release notes.
>>>> 
>>>> Given the bug’s critical nature and that it's a minor fix, I’m +1
>>> (non-binding) to a new release.
>>>> 
>>>> Dave
>>>> 
>>>>> On Sep 15, 2016, at 7:18 AM, Jeremiah D Jordan <[email protected]> wrote:
>>>>> 
>>>>> I’m with Jeff on this, 3.7 (bug fixes on 3.6) has already been
>> released
>>> with the fix.  Since the fix applies cleanly anyone is free to put it on
>>> top of 3.5 on their own if they like, but I see no reason to put out a
>>> 3.5.1 right now and confuse people further.
>>>>> 
>>>>> -Jeremiah
>>>>> 
>>>>> 
>>>>>> On Sep 15, 2016, at 9:07 AM, Jonathan Haddad <[email protected]>
>>> wrote:
>>>>>> 
>>>>>> As a follow up, I suppose I'm only advocating for a fix to the odd
>>>>>> releases.  Sadly, Tick Tock versioning is misleading.
>>>>>> 
>>>>>> If tick tock were to continue (and I'm very much against how it
>>> currently
>>>>>> works) the whole even-features odd-fixes thing needs to stop ASAP,
>> all
>>> it
>>>>>> does is confuse people.
>>>>>> 
>>>>>> The follow up to 3.4 (3.5) should have been 3.4.1, following semver,
>> so
>>>>>> people know it's bug fixes only to 3.4.
>>>>>> 
>>>>>> Jon
>>>>>> 
>>>>>> On Wed, Sep 14, 2016 at 10:37 PM Jonathan Haddad <[email protected]>
>>> wrote:
>>>>>> 
>>>>>>> In this particular case, I'd say adding a bug fix release for every
>>>>>>> version that's affected would be the right thing.  The issue is so
>>> easily
>>>>>>> reproducible and will likely result in massive data loss for anyone
>>> on 3.X
>>>>>>> WHERE X < 6 and uses the "date" type.
>>>>>>> 
>>>>>>> This is how easy it is to reproduce:
>>>>>>> 
>>>>>>> 1. Start Cassandra 3.5
>>>>>>> 2. create KEYSPACE test WITH replication = {'class':
>> 'SimpleStrategy',
>>>>>>> 'replication_factor': 1};
>>>>>>> 3. use test;
>>>>>>> 4. create table fail (id int primary key, d date);
>>>>>>> 5. delete d from fail where id = 1;
>>>>>>> 6. Stop Cassandra
>>>>>>> 7. Start Cassandra
>>>>>>> 
>>>>>>> You will get this, and startup will fail:
>>>>>>> 
>>>>>>> ERROR 05:32:09 Exiting due to error while processing commit log
>> during
>>>>>>> initialization.
>>>>>>> org.apache.cassandra.db.commitlog.CommitLogReplayer$
>>> CommitLogReplayException:
>>>>>>> Unexpected error deserializing mutation; saved to
>>>>>>> /var/folders/0l/g2p6cnyd5kx_1wkl83nd3y4r0000gn/T/
>>> mutation6313332720566971713dat.
>>>>>>> This may be caused by replaying a mutation against a table with the
>>> same
>>>>>>> name but incompatible schema.  Exception follows:
>>>>>>> org.apache.cassandra.serializers.MarshalException: Expected 4 byte
>>> long for
>>>>>>> date (0)
>>>>>>> 
>>>>>>> I mean.. come on.  It's an easy fix.  It cleanly merges against 3.5
>>> (and
>>>>>>> probably the other releases) and requires very little investment
>> from
>>>>>>> anyone.
>>>>>>> 
>>>>>>> 
>>>>>>> On Wed, Sep 14, 2016 at 9:40 PM Jeff Jirsa <
>>> [email protected]>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> We did 3.1.1 and 3.2.1, so there’s SOME precedent for emergency
>>> fixes,
>>>>>>>> but we certainly didn’t/won’t go back and cut new releases from
>> every
>>>>>>>> branch for every critical bug in future releases, so I think we
>> need
>>> to
>>>>>>>> draw the line somewhere. If it’s fixed in 3.7 and 3.0.x (x >= 6),
>> it
>>> seems
>>>>>>>> like you’ve got options (either stay on the tick and go up to 3.7,
>>> or bail
>>>>>>>> down to 3.0.x)
>>>>>>>> 
>>>>>>>> Perhaps, though, this highlights the fact that tick/tock may not be
>>> the
>>>>>>>> best option long term. We’ve tried it for a year, perhaps we should
>>> instead
>>>>>>>> discuss whether or not it should continue, or if there’s another
>>> process
>>>>>>>> that gives us a better way to get useful patches into versions
>>> people are
>>>>>>>> willing to run in production.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 9/14/16, 8:55 PM, "Jonathan Haddad" <[email protected]> wrote:
>>>>>>>> 
>>>>>>>>> Common sense is what prevents someone from upgrading to yet
>> another
>>>>>>>>> completely unknown version with new features which have probably
>>> broken
>>>>>>>>> even more stuff that nobody is aware of.  The folks I'm helping
>>> right now
>>>>>>>>> deployed 3.5 when they got started because
>>>>>>>> http://cassandra.apache.org
>>>>>>>> suggests
>>>>>>>>> it's acceptable for production.  It turns out using 4 of the built
>>> in
>>>>>>>>> datatypes of the database results in the server being unable to
>>> restart
>>>>>>>>> without clearing out the commit logs and running a repair.  That
>>> screams
>>>>>>>>> critical to me.  You shouldn't even be able to install 3.5 without
>>> the
>>>>>>>>> patch I've supplied - that bug is a ticking time bomb for anyone
>>> that
>>>>>>>>> installs it.
>>>>>>>>> 
>>>>>>>>> On Wed, Sep 14, 2016 at 8:12 PM Michael Shuler <
>>> [email protected]>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> What's preventing the use of the 3.6 or 3.7 releases where this
>>> bug is
>>>>>>>>>> already fixed? This is also fixed in the 3.0.6/7/8 releases.
>>>>>>>>>> 
>>>>>>>>>> Michael
>>>>>>>>>> 
>>>>>>>>>> On 09/14/2016 08:30 PM, Jonathan Haddad wrote:
>>>>>>>>>>> Unfortunately CASSANDRA-11618 was fixed in 3.6 but was not back
>>>>>>>> ported to
>>>>>>>>>>> 3.5 as well, and it makes Cassandra effectively unusable if
>>> someone
>>>>>>>> is
>>>>>>>>>>> using any of the 4 types affected in any of their schema.
>>>>>>>>>>> 
>>>>>>>>>>> I have cherry picked & merged the patch back to here and will
>> put
>>> it
>>>>>>>> in a
>>>>>>>>>>> JIRA as well tonight, I just wanted to get the ball rolling asap
>>> on
>>>>>>>> this.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> https://github.com/rustyrazorblade/cassandra/tree/fix_commitlog_exception
>>>>>>>>>>> 
>>>>>>>>>>> Jon
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 
> 
> 
> -- 
> http://twitter.com/tjake
