The benefit is that it more closely matches the design doc from 5 months ago, 
which is decidedly not about coordinating repair - it’s about a general purpose 
management tool, where repair is one of many proposed tasks.

https://docs.google.com/document/d/1UV9pE81NaIUF3g4L1wxq09nT11AkSQcMijgLFwGsY3s/edit


By starting with a tool that is built to run repair, you’re sacrificing 
generality and accepting something purpose-built for one subtask. It’s an 
important subtask, and it’s a nice tool, but it’s not an implementation of the 
proposal - it’s an alternative that happens to do some of what was proposed.

-- 
Jeff Jirsa


> On Sep 7, 2018, at 6:53 PM, Blake Eggleston <beggles...@apple.com> wrote:
> 
> What’s the benefit of doing it that way vs starting with Reaper and 
> integrating the Netflix scheduler? If Reaper were just a really inappropriate 
> choice for the Cassandra management process, I could see that being a better 
> approach, but I don’t think that’s the case.
> 
> If our management process isn’t a drop-in replacement for Reaper, then Reaper 
> will continue to exist, which will split the user and developer base between 
> the two projects. That won't be good for either project.
> 
> On September 7, 2018 at 6:12:01 PM, Jeff Jirsa (jji...@gmail.com) wrote:
> 
> I’d also like to see the end state you describe: the Reaper UI wrapping the 
> Netflix management process with pluggable scheduling (either as-is with 
> Reaper now, or using the Netflix scheduler), but I don’t think that means we 
> need to start with Reaper - I’d personally prefer the opposite direction, 
> starting with something small and isolated and layering on top.  
> 
> --  
> Jeff Jirsa  
> 
> 
>> On Sep 7, 2018, at 5:42 PM, Blake Eggleston <beggles...@apple.com> wrote:  
>> 
>> I think we should accept the Reaper project as-is and make that the Cassandra 
>> management process 1.0, then integrate the Netflix scheduler (and other new 
>> features) into that.  
>> 
>> The ultimate goal would be for the Netflix scheduler to become the default 
>> repair scheduler, but I think using Reaper as the starting point makes it 
>> easier to get there.  
>> 
>> Reaper would bring a prod user base that would realistically take 2-3 years 
>> to build up with a new project. As an operator, switching to a Cassandra 
>> management process that’s basically a re-brand of an existing and commonly 
>> used management process isn’t super risky. Asking operators to switch to a 
>> new process is a much harder sell.  
>> 
>> On September 7, 2018 at 4:17:10 PM, Jeff Jirsa (jji...@gmail.com) wrote:  
>> 
>> How can we continue moving this forward?  
>> 
>> Mick/Jon/TLP folks, is there a path here where we commit the  
>> Netflix-provided management process, and you augment Reaper to work with it? 
>>  
>> Is there a way we can make a larger umbrella that's modular and can  
>> support either/both?  
>> Does anyone believe there's a clear, objective argument that one is  
>> strictly better than the other? I haven't seen one.  
>> 
>> 
>> 
>> On Mon, Aug 20, 2018 at 4:14 PM Roopa Tangirala  
>> <rtangir...@netflix.com.invalid> wrote:  
>> 
>>> +1 to everything that Joey articulated, with emphasis on the fact that  
>>> contributions should be evaluated based on the merit of the code and its  
>>> value-add to the whole offering. I hope it does not matter whether a  
>>> contribution comes from a PMC member or from someone who is not a  
>>> committer. I would like the process to be such that it encourages new  
>>> members to be a part of the community and not shy away from contributing  
>>> code on the assumption that their contributions are valued differently  
>>> than those of committers or PMC members. It would be sad to see  
>>> contributions decrease if we go down that path.  
>>> 
>>> Regards,  
>>> 
>>> Roopa Tangirala  
>>> Engineering Manager CDE  
>>> (408) 438-3156 - mobile  
>>> 
>>> On Mon, Aug 20, 2018 at 2:58 PM Joseph Lynch <joe.e.ly...@gmail.com>  
>>> wrote:  
>>> 
>>>>> We are looking to contribute Reaper to the Cassandra project.  
>>>>> 
>>>> Just to clarify: are you proposing contributing Reaper as a project via  
>>>> donation, or are you planning on contributing the features of Reaper as  
>>>> patches to Cassandra? If the former, how far along are you in the  
>>>> donation process? If the latter, when do you think you would have patches  
>>>> ready for consideration / review?  
>>>> 
>>>> 
>>>>> Looking at the patch it's very similar in its base design already, but  
>>>>> Reaper does have a lot more to offer. We have all been working hard to  
>>>>> move it to also being a side-car so it can be contributed. This raises a  
>>>>> number of questions relevant to this thread: would we then accept both  
>>>>> works in the Cassandra project, and what burden would that put on the  
>>>>> current PMC to maintain both works?  
>>>>> 
>>>> I would hope that we would collaborate on merging the best parts of all  
>>>> into the official Cassandra sidecar, taking the always-on, shared-nothing,  
>>>> highly available system that we've contributed a patchset for, and adding  
>>>> in many of the repair features (e.g. schedules, a nice web UI) that  
>>>> Reaper has.  
>>>> 
>>>> 
>>>>> I share Stefan's concern that consensus had not been met around a  
>>>>> side-car, and that it was somehow default accepted before a patch  
>>>>> landed.  
>>>> 
>>>> 
>>>> I feel this is not correct or fair. The sidecar and repair discussions  
>>>> have been anything _but_ "default accepted". The timeline of consensus  
>>>> building involving the management sidecar and repair scheduling plans:  
>>>> 
>>>> Dec 2016: Vinay worked with Jon and Alex to try to collaborate on Reaper  
>>>> to come up with design goals for a repair scheduler that could work at  
>>>> Netflix scale.  
>>>> 
>>>> ~Feb 2017: Netflix concludes that fundamental design gaps prevent us from  
>>>> using Reaper, as it relies heavily on remote JMX connections and central  
>>>> coordination.  
>>>> 
>>>> Sep. 2017: Vinay gives a lightning talk at NGCC about a highly available  
>>>> and distributed repair scheduling sidecar/tool. He is encouraged by  
>>>> multiple committers to build repair scheduling into the daemon itself and  
>>>> not as a sidecar so the database is truly eventually consistent.  
>>>> 
>>>> ~Jun. 2017 - Feb. 2018: Based on internal need and the positive feedback  
>>>> at NGCC, Vinay and I prototype the distributed repair scheduler within  
>>>> Priam and roll it out at Netflix scale.  
>>>> 
>>>> Mar. 2018: I open a Jira (CASSANDRA-14346) along with a detailed 20-page  
>>>> design document for adding repair scheduling to the daemon itself, and  
>>>> open the design up for feedback from the community. We get feedback from  
>>>> Alex, Blake, Nate, Stefan, and Mick. As far as I know there were zero  
>>>> proposals to contribute Reaper at this point. We hear the consensus that  
>>>> the community would prefer repair scheduling in a separate distributed  
>>>> sidecar rather than in the daemon itself, and we re-work the design to  
>>>> match this consensus, re-aligning with our original proposal at NGCC.  
>>>> 
>>>> Apr 2018: Blake brings the discussion of repair scheduling to the dev list  
>>>> (https://lists.apache.org/thread.html/760fbef677f27aa5c2ab4c375c7efeb81304fea428deff986ba1c2eb@%3Cdev.cassandra.apache.org%3E).  
>>>> Many community members give positive feedback that we should solve it as  
>>>> part of Cassandra, and there is still no mention of contributing Reaper at  
>>>> this point. The last message is my attempted summary giving context on  
>>>> how we want to take the best of all the sidecars (OpsCenter, Priam, Reaper)  
>>>> and ship them with Cassandra.  
>>>> 
>>>> Apr. 2018: Dinesh opens CASSANDRA-14395 along with a public design  
>>>> document for gathering feedback on a general management sidecar. Sankalp  
>>>> and Dinesh encourage Vinay and me to kickstart that sidecar using the  
>>>> repair scheduler patch.  
>>>> 
>>>> Apr 2018: Dinesh reaches out to the dev list  
>>>> (https://lists.apache.org/thread.html/a098341efd8f344494bcd2761dba5125e971b59b1dd54f282ffda253@%3Cdev.cassandra.apache.org%3E)  
>>>> about the general management process to gain further feedback. All  
>>>> feedback remains positive, as it is a potential place for multiple  
>>>> community members to contribute their various sidecar functionality.  
>>>> 
>>>> May-Jul 2018: Vinay and I work on creating a basic sidecar for running  
>>>> the repair scheduler, based on the feedback from the community in  
>>>> CASSANDRA-14346 and CASSANDRA-14395.  
>>>> 
>>>> Jun 2018: I bump CASSANDRA-14346 indicating we're still working on this;  
>>>> nobody objects.  
>>>> 
>>>> Jul 2018: Sankalp asks on the dev list if anyone has feature Jiras they  
>>>> need review for before 4.0. I mention again that we've nearly got the  
>>>> basic sidecar and repair scheduling work done and will need help with  
>>>> review. No one responds.  
>>>> 
>>>> Aug 2018: We submit a patch that brings a basic distributed sidecar and  
>>>> robust distributed repair to Cassandra itself. Dinesh mentions that he  
>>>> will try to review. Now folks appear concerned about it being in-tree,  
>>>> and suggest maybe it should go in a different repo altogether. I don't  
>>>> think we have consensus on the repo choice yet.  
>>>> 
>>>>> This seems at odds with a project already struggling to keep up with the  
>>>>> incoming patches/contributions, and there could be other git repos in  
>>>>> the project we will need to support in the future too. But I'm also  
>>>>> curious about the whole "Community over Code" angle to this: how do we  
>>>>> encourage multiple external works to collaborate, building value in both  
>>>>> the technical and the community sense?  
>>>>> 
>>>> 
>>>> I view this management sidecar as a way for us to stop, as a community,  
>>>> building the same thing over and over again. Netflix maintains Priam, The  
>>>> Last Pickle maintains Reaper, DataStax maintains OpsCenter. Why can't we  
>>>> take the best of Reaper (e.g. schedules, diagnostic events, UI) and leave  
>>>> the worst (e.g. a centralized design with lots of locking), combine it  
>>>> with the best of Priam (a robust shared-nothing sidecar that makes  
>>>> Cassandra management easy) and leave the worst (a bunch of technical  
>>>> debt), and iterate towards one sidecar that allows Cassandra users to  
>>>> just run their database?  
>>>> 
>>>> 
>>>>> The Reaper project has worked hard at building both its user and  
>>>>> contributor base. And I would have thought these, including having the  
>>>>> contributor base overlap with the C* PMC, were prerequisites before  
>>>>> moving a larger body of work into the project (separate git repo or  
>>>>> not). I guess this isn't so much "Community over Code", but it  
>>>>> illustrates a concern regarding abandoned code when there's no existing  
>>>>> track record of maintaining it as OSS, as opposed to expecting an  
>>>>> existing "show, don't tell" culture. Reaper, for example, has stronger  
>>>>> indicators of ongoing support and an existing OSS user base: today the  
>>>>> C* committers who have contributed to Reaper are Jon, Stefan, Nate, and  
>>>>> myself, amongst 40 contributors in total. And we've been making steps to  
>>>>> involve it more in the C* community (eg the users ML), without being too  
>>>>> presumptuous.  
>>>> 
>>>> To be frank, I worry about this logic. Why do significant contributions  
>>>> need to come only from established C* PMC members? Shouldn't we strive to  
>>>> consider the relative merits of code that has actually been submitted to  
>>>> the project on the basis of the code, and not who sent the patches?  
>>>> 
>>>> 
>>>>> On the technical side: Reaper supports (or easily can support) all the  
>>>>> concerns that the proposal here raises: distributed nodetool commands,  
>>>>> centralising JMX interfacing, scheduling ops (repairs, snapshots,  
>>>>> compactions, cleanups, etc), monitoring and diagnostics, etc. It's  
>>>>> designed so that it can be a single instance, an instance per  
>>>>> datacenter, or a side-car (one per process). When there are multiple  
>>>>> instances in a datacenter you get HA. You have a choice of different  
>>>>> storage backends (memory, postgres, C*), and you can of course use a  
>>>>> separate C* cluster as a backend so as to keep infrastructure data  
>>>>> separate from production data. And it's got a UI for C* diagnostics  
>>>>> already (which imposes a different JMX interface, polling for events  
>>>>> rather than subscribing to JMX notifications, which we know is  
>>>>> problematic, thanks to Stefan - see the sketch below). Anyway, that's my  
>>>>> plug for Reaper :-)  
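>>>>> 
>>>>> For anyone unfamiliar with the distinction, here is a minimal sketch of  
>>>>> the two styles, using only the standard javax.management API against  
>>>>> Cassandra's real StorageService MBean (the host/port and the handling  
>>>>> logic are illustrative, not Reaper's actual code):  
>>>>> 
>>>>> import javax.management.MBeanServerConnection;  
>>>>> import javax.management.ObjectName;  
>>>>> import javax.management.remote.JMXConnector;  
>>>>> import javax.management.remote.JMXConnectorFactory;  
>>>>> import javax.management.remote.JMXServiceURL;  
>>>>> 
>>>>> public class JmxStyles {  
>>>>>     public static void main(String[] args) throws Exception {  
>>>>>         // 7199 is Cassandra's default JMX port.  
>>>>>         JMXServiceURL url = new JMXServiceURL(  
>>>>>             "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");  
>>>>>         try (JMXConnector jmx = JMXConnectorFactory.connect(url)) {  
>>>>>             MBeanServerConnection mbs = jmx.getMBeanServerConnection();  
>>>>>             ObjectName ss = new ObjectName(  
>>>>>                 "org.apache.cassandra.db:type=StorageService");  
>>>>> 
>>>>>             // Style 1: subscribe to JMX notifications (push). A dropped  
>>>>>             // connection silently loses events, hence the concern above.  
>>>>>             mbs.addNotificationListener(ss,  
>>>>>                 (notification, handback) ->  
>>>>>                     System.out.println("pushed: " + notification),  
>>>>>                 null, null);  
>>>>> 
>>>>>             // Style 2: poll current state periodically (pull); a  
>>>>>             // reconnect simply resumes polling.  
>>>>>             for (int i = 0; i < 3; i++) {  
>>>>>                 System.out.println("polled: "  
>>>>>                     + mbs.getAttribute(ss, "OperationMode"));  
>>>>>                 Thread.sleep(5_000);  
>>>>>             }  
>>>>>         }  
>>>>>     }  
>>>>> }  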
>>>>> 
>>>> Could we get some of these suggestions into the  
>>>> CASSANDRA-14346/CASSANDRA-14395 Jiras so we can debate the technical  
>>>> merits there?  
>>>> 
>>>>> There's been little effort in evaluating these two bodies of work, one  
>>>>> of which is largely unknown to us, and my concern is how we would fairly  
>>>>> support both going into the future.  
>>>>> 
>>>> 
>>>>> Another option would be that this side-car patch first exists as a  
>>>>> github project for a period of time, on par with how Reaper has. This  
>>>>> will help evaluate its use and first build up its contributors. That  
>>>>> makes it easier for the C* PMC to choose which projects it would want to  
>>>>> formally maintain, and to do so based on factors beyond the merely  
>>>>> technical. We may even see it converge (or collaborate more) with  
>>>>> Reaper, a win for everyone.  
>>>>> 
>>>> We could have put our distributed repair scheduler into Priam ages ago,  
>>>> which would have been much easier for us, and Priam also has an existing  
>>>> community, but we don't want to, because that would encourage the  
>>>> community to remain fractured on the most important management processes.  
>>>> Instead we seek to work with the community to take the lessons learned  
>>>> from all the various available sidecars owned by different organizations  
>>>> (DataStax, Netflix, TLP) and fix this once for the whole community. Can  
>>>> we work together to make Cassandra just work for our users out of the  
>>>> box?  
>>>> 
>>>> -Joey  
>>>> 
>>> 
> 