> I'm not sure how the Apache team does this. Perhaps individual engineers
> can run some modern version at a company of theirs, although that seems
> unlikely, but as an Apache org, I just don't see how that happens.

> To me it seems like the Apache Cassandra infrastructure itself needs to
> stand up a multinode live instance running some 'real-world' example
> that is getting pounded, so that we can stage feature branches to really
> test them.

Not having access to test hardware as an Apache org is a problem, but there's
also a lot of room for improvement on the JUnit testing and testability side of
things. That's true for both local and distributed components, but more JUnit
coverage of the distributed mechanisms would make not having test hardware suck
less. With distributed algorithms (like gossip 2.0), one of the limitations of
testing with live nodes is that you're often just testing the happy path.
Reliably and repeatably testing how the system responds to weird edge cases
involving a specific ordering of events across nodes is very difficult to do.
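
To make that concrete, here's a rough sketch of what I mean (none of the class
names below are real Cassandra test APIs, it's just an illustration): route the
"network" through an in-memory queue the test owns, so delivery order is
something the test chooses and can replay exactly, not an accident of thread
scheduling.

import static org.junit.Assert.assertEquals;

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

import org.junit.Test;

public class OrderingControlledTest
{
    // Hypothetical stand-in for an internode message.
    static final class Message
    {
        final String payload;
        Message(String payload) { this.payload = payload; }
    }

    // Instead of a real socket, messages land in a queue the test owns.
    static final class FakeNetwork
    {
        private final Deque<Message> pending = new ArrayDeque<>();
        void send(Message m)    { pending.addLast(m); }
        Message deliverOldest() { return pending.removeFirst(); }
        Message deliverNewest() { return pending.removeLast(); }
    }

    // Hypothetical stand-in for a node applying the messages it receives.
    static final class FakeReplica
    {
        final List<String> applied = new ArrayList<>();
        void receive(Message m) { applied.add(m.payload); }
    }

    @Test
    public void deliveryOrderIsChosenByTheTest()
    {
        FakeNetwork network = new FakeNetwork();
        FakeReplica replica = new FakeReplica();

        // Two messages race toward the replica...
        network.send(new Message("A"));
        network.send(new Message("B"));

        // ...and the test, not the scheduler, decides the interleaving.
        replica.receive(network.deliverNewest()); // "B" arrives first
        replica.receive(network.deliverOldest()); // then "A"

        assertEquals(Arrays.asList("B", "A"), replica.applied);
    }
}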

I wrote epaxos with this sort of testing in mind, and was able to do a lot of
testing of obscure failure scenarios (see
https://github.com/bdeggleston/cassandra/blob/CASSANDRA-6246-trunk/test/unit/org/apache/cassandra/service/epaxos/integration/EpaxosIntegrationRF3Test.java#L144
for an example). This doesn't obviate the need to test on real clusters, of
course, but it does increase confidence that the system will behave correctly
under load, and it reduces the number of things you're relying on a loaded test
cluster to reveal.
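
The same in-memory-network idea extends naturally to failure scenarios: a
partitioned or dead node is just a node whose messages the test drops instead
of delivering. Another rough, self-contained sketch with made-up names (again,
not the actual epaxos test harness linked above):

import static org.junit.Assert.assertTrue;

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

import org.junit.Test;

public class DroppedMessageTest
{
    // Hypothetical in-memory "network" owned by the test.
    static final class FakeNetwork
    {
        private final Deque<String> pending = new ArrayDeque<>();
        void send(String payload) { pending.addLast(payload); }
        String deliverNext()      { return pending.removeFirst(); }
    }

    // Hypothetical stand-in for a replica applying writes it receives.
    static final class FakeReplica
    {
        final List<String> applied = new ArrayList<>();
        void receive(String payload) { applied.add(payload); }
    }

    @Test
    public void partitionedReplicaNeverSeesTheWrite()
    {
        FakeNetwork network = new FakeNetwork();
        FakeReplica replica = new FakeReplica();

        network.send("write-1");

        // Simulate the partition: the message is dropped, never delivered.
        network.deliverNext();

        // Now the test can assert exactly how the rest of the system should
        // behave given that this replica never saw the write.
        assertTrue(replica.applied.isEmpty());
    }
}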

On November 20, 2016 at 9:02:55 AM, Dave Brosius (dbros...@mebigfatguy.com) 
wrote:

>> We fully intend to "engineer and test the snot out of" the changes
>> we are working on as the whole point of us working on them is so we
>> *can* run them in production, at our scale.

I'm not sure how the Apache team does this. Perhaps individual engineers
can run some modern version at a company of theirs, although that seems
unlikely, but as an Apache org, I just don't see how that happens.

To me it seems like the Apache Cassandra infrastructure itself needs to  
stand up a multinode live instance running some 'real-world' example  
that is getting pounded, so that we can stage feature branches to really  
test them.  

Otherwise we will forever be basing versions on the poor test saps who
decide they are willing to risk it all to upgrade to the cutting edge, which
is why everyone believes in the adage: don't upgrade until at least .6.

--dave  


On 11/20/2016 09:50 AM, Jason Brown wrote:  
> Hey all,  
>  
> One of the goals on my team, when working on large patches, is to get  
> community feedback on these initiatives before throwing them into prod.  
> This gets us a wider net of feedback (see Sylvain's continuing excellent  
> rounds of feedback to my work on CASSANDRA-8457), as well as making sure we  
> don't go too far off the deep end in terms of straying from the community  
> version. The latter point is crucial because if we make too many  
> incompatible changes to, for example, the internode messaging protocol or  
> the CQL protocol or the sstable file format, and deploy that, it may be  
> very difficult, if not impossible, to reconcile with future, in-development  
> versions of Cassandra.  
>  
> We fully intend to "engineer and test the snot out of" the changes we are  
> working on as the whole point of us working on them is so we *can* run them  
> in production, at our scale. We aren't expecting others in the community to  
> dog food it for us. There will be a delay between committing something  
> upstream, and us backporting it to a current version we run in production  
> and actually deploying it. However, you can be sure that any bugs we find  
> will be fixed ASAP; we have many users counting on it.  
>  
> Thanks for listening,  
>  
> -Jason  
>  
>  
> On Sat, Nov 19, 2016 at 11:04 AM, Blake Eggleston <beggles...@apple.com>  
> wrote:  
>  
>> I think Ed's just using gossip 2.0 as a hypothetical example. His point is  
>> that we should only commit things when we have a high degree of confidence  
>> that they work correctly, not with the expectation that they don't.  
>>  
>>  
>> On November 19, 2016 at 10:52:38 AM, Michael Kjellman (  
>> mkjell...@internalcircle.com) wrote:  
>>  
>> Jason has asked for review and feedback many times. Maybe be constructive  
>> and review his code instead of just complaining (once again)?  
>>  
>> Sent from my iPhone  
>>  
>>> On Nov 19, 2016, at 1:49 PM, Edward Capriolo <edlinuxg...@gmail.com>
>>> wrote:
>>> I would say start with a mindset like 'people will run this in
>>> production', not like 'why would you expect this to work'.
>>>  
>>> Now how does this logic affect feature development? Maybe use gossip 2.0
>>> as an example.
>>>
>>> I will play my given Debbie Downer role. I could imagine 1 or 2 dtests,
>>> the logic of 'don't expect it to work', unleashing 4.0 onto hordes of
>>> newbies with a Twitter announcement of the release, and letting bugs
>>> trickle in.
>>>  
>>> One could also do something comprehensive like test on clusters of 2 to
>>> 1000 nodes. Test with Jepsen to see what happens during partitions,
>>> inject things like JVM pauses, and account for the behavior. Log
>>> convergence times after given events.
>>>
>>> Take a stand and say, look, "we engineered and beat the crap out of this
>>> feature. I deployed this released feature at my company and ate my own
>>> dogfood. You are not my crash test dummy."
>>>  
>>>  
>>>> On Saturday, November 19, 2016, Jeff Jirsa <jji...@gmail.com> wrote:  
>>>>  
>>>> Any proposal to solve the problem you describe?  
>>>>  
>>>> --  
>>>> Jeff Jirsa  
>>>>  
>>>>  
>>>>> On Nov 19, 2016, at 8:50 AM, Edward Capriolo <edlinuxg...@gmail.com>
>>>>> wrote:
>>>>> This is especially relevant if people wish to focus on removing things.
>>>>>
>>>>> For example, gossip 2.0 sounds great, but it seems geared toward huge
>>>>> clusters, which is not likely a majority of users. For those with a 20
>>>>> node cluster, are the indirect benefits worth it?
>>>>>  
>>>>> Also there seems to be a first push to remove things like compact
>>>>> storage or Thrift. Fine, great. But what is the realistic upgrade path
>>>>> for someone? If the big players are running 2.1 and maintaining
>>>>> backports, the average shop without a dedicated team is going to be
>>>>> stuck saying "great features in 4.0 that improve performance, I would
>>>>> probably switch, but it's not stable and we have that one compact
>>>>> storage CF, and who knows what is going to happen performance-wise".
>>>>>
>>>>> We really need to lose this "release won't be stable for 6 minor
>>>>> versions" concept.
>>>>>
>>>>> On Saturday, November 19, 2016, Edward Capriolo <edlinuxg...@gmail.com>
>>>>> wrote:
>>>>>  
>>>>>>  
>>>>>> On Friday, November 18, 2016, Jeff Jirsa <jeff.ji...@crowdstrike.com>
>>>>>> wrote:
>>>>>>> We should assume that we’re ditching tick/tock. I’ll post a thread on  
>>>>>>> 4.0-and-beyond here in a few minutes.  
>>>>>>>  
>>>>>>> The advantage of a prod release every 6 months is less incentive to
>>>>>>> push unfinished work into a release.
>>>>>>> The disadvantage of a prod release every 6 months is that we either
>>>>>>> have a very short lifespan per release, or we have to maintain lots
>>>>>>> of active releases.
>>>>>>>  
>>>>>>> 2.1 has been out for over 2 years, and a lot of people (including us)
>>>>>>> are running it in prod. If we have a release every 6 months, does that
>>>>>>> mean we'd be supporting 4+ releases at a time, just to keep parity
>>>>>>> with what we have now? Maybe that's ok, if we're very selective about
>>>>>>> 'support' for 2+ year old branches.
>>>>>>>  
>>>>>>>  
>>>>>>> On 11/18/16, 3:10 PM, "beggles...@apple.com on behalf of Blake
>>>>>>> Eggleston" <beggles...@apple.com> wrote:
>>>>>>>  
>>>>>>>>> While stability is important, if we push back large "core" changes
>>>>>>>>> until later we're just setting ourselves up to face the same issues
>>>>>>>>> later on
>>>>>>>> In theory, yes. In practice, when incomplete features are earmarked
>>>>>>>> for a certain release, those features are often rushed out, and not
>>>>>>>> always fully baked.
>>>>>>>> In any case, I don't think it makes sense to spend too much time
>>>>>>>> planning what goes into 4.0, and what goes into the next major
>>>>>>>> release, with so many release-strategy decisions still up in the
>>>>>>>> air. Are we going to ditch tick-tock? If so, what will its
>>>>>>>> replacement look like? Specifically, when will the next "production"
>>>>>>>> release happen? Without knowing that, it's hard to say if something
>>>>>>>> should go in 4.0, or 4.5, or 5.0, or whatever.
>>>>>>>> The reason I suggested a production release every 6 months is
>>>>>>>> because (in my mind) it's frequent enough that people won't be
>>>>>>>> tempted to rush features to hit a given release, but not so frequent
>>>>>>>> that it's not practical to support. It wouldn't be the end of the
>>>>>>>> world if some of these tickets didn't make it into 4.0, because 4.5
>>>>>>>> would be fine.
>>>>>>>> On November 18, 2016 at 1:57:21 PM, kurt Greaves
>>>>>>>> (k...@instaclustr.com) wrote:
>>>>>>>>> On 18 November 2016 at 18:25, Jason Brown <jasedbr...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>> #11559 (enhanced node representation) - decided it's *not* something
>>>>>>>>> we need wrt #7544 storage port configurable per node, so we are
>>>>>>>>> punting on
>>>>>>>> #12344 - Forward writes to replacement node with same address during
>>>>>>>> replace depends on #11559. To be honest I'd say #12344 is pretty
>>>>>>>> important, otherwise it makes it difficult to replace nodes without
>>>>>>>> potentially requiring client code/configuration changes. It would be
>>>>>>>> nice to get #12344 in for 4.0. It's marked as an improvement, but I'd
>>>>>>>> consider it a bug, and thus think it could be included in a later
>>>>>>>> minor release.
>>>>>>>>  
>>>>>>>>> Introducing all of these in a single release seems pretty risky. I
>>>>>>>>> think it would be safer to spread these out over a few 4.x releases
>>>>>>>>> (as they're finished) and give them time to stabilize before
>>>>>>>>> including them in an LTS release. The downside would be having to
>>>>>>>>> maintain backwards compatibility across the 4.x versions, but that
>>>>>>>>> seems preferable to delaying the release of 4.0 to include these,
>>>>>>>>> and having another big bang release.
>>>>>>>>  
>>>>>>>> I don't think anyone expects 4.0.0 to be stable. It's a major
>>>>>>>> version change with lots of new features; in the production world,
>>>>>>>> people don't normally move to a new major version until it has been
>>>>>>>> out for quite some time and several minor releases have passed.
>>>>>>>> Really, most people are only migrating to 3.0.x now. While stability
>>>>>>>> is important, if we push back large "core" changes until later we're
>>>>>>>> just setting ourselves up to face the same issues later on. There
>>>>>>>> should be enough uptake on the early releases of 4.0 from new users
>>>>>>>> to help test and get it to a production-ready state.
>>>>>>>>
>>>>>>>>
>>>>>>>> Kurt Greaves
>>>>>>>> k...@instaclustr.com
>>>>>>>  
>>>>>> I don't think anyone expects 4.0.0 to be stable
>>>>>>
>>>>>> Someone previously described 3.0 as the "break everything release".
>>>>>>
>>>>>> We know that many people are still on 2.1 and 3.0. Cassandra will
>>>>>> always be maintaining 3 or 4 active branches, and it will have adoption
>>>>>> issues if releases are not stable and usable.
>>>>>>
>>>>>> Given that Cassandra hit 1.0 years ago, I expect things to be stable.
>>>>>> Half-working features, or "adding this broke that", are not appealing
>>>>>> to me.
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Sorry this was sent from mobile. Will do less grammar and spell check
>>>>>> than usual.
>>>>>>  
