> I think we should accept the reaper project as is and make that cassandra management process 1.0, then integrate the Netflix scheduler (and other new features) into that.

Integrating the Netflix scheduler into Reaper would mostly mean refactoring Reaper code, since they are different architectures.
> Reaper would bring a prod user base that would realistically take 2-3 years to build up with a new project.

IMO, it is great if we have that, but this should not be the deciding factor.

> As an operator, switching to a cassandra management process that’s basically a re-brand of an existing and commonly used management process isn’t super risky. Asking operators to switch to a new process is a much harder sell.

Reaper is far away from becoming a "cassandra management process". I understand it does its job in doing repairs and snapshots (and other things, if I missed any), but the responsibilities of a Cassandra management sidecar process go far beyond those just mentioned. All the design goals mentioned in this thread from Joey (pluggable execution engine, backup, restore, ring health detection, sstable upgrade, rolling restarts, topology-aware maintenance, replacement of an entire fleet without compromising availability, etc.) are critical operations of a "cassandra management process" that are hard to add on to a system which is not architected for them. It would basically require a total rework/refactor of Reaper if we were to go down that path. And don't get me wrong: Reaper is a great repair tool for the C* community, with great visualization which makes it easy to use.

We prefer what Jeff proposes: starting with something small and isolated and layering the best of all sidecars incrementally on top.
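To make the "pluggable execution engine" goal concrete, here is a rough sketch of the kind of seam such a process needs; every name in it is hypothetical, and it is not code from Reaper, Priam, or the submitted patch:

    // Hypothetical sketch: each maintenance operation is a plugin behind a
    // single interface, and the engine owns scheduling and safety checks.
    interface ClusterState {
        boolean allReplicasHealthy(String keyspace);
    }

    interface MaintenanceTask {
        String name();                            // e.g. "repair", "backup", "sstable-upgrade"
        boolean isSafeToRun(ClusterState state);  // topology-aware gating
        void run() throws Exception;              // the work against the local node
    }

    interface ExecutionEngine {
        void submit(MaintenanceTask task);        // runs the task when it is safe
    }

Repair, backup, sstable upgrades, and rolling restarts then all become plugins behind one scheduling seam, which is exactly what is hard to retrofit onto a tool architected around a single operation.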
--Vinay Chella

On Fri, Sep 7, 2018 at 6:11 PM Jeff Jirsa <jji...@gmail.com> wrote:

> I’d also like to see the end state you describe: reaper UI wrapping the Netflix management process with pluggable scheduling (either as is with reaper now, or using the Netflix scheduler), but I don’t think that means we need to start with reaper - I'd personally prefer the opposite direction, starting with something small and isolated and layering on top.
>
> --
> Jeff Jirsa
>
> On Sep 7, 2018, at 5:42 PM, Blake Eggleston <beggles...@apple.com> wrote:
>
>> I think we should accept the reaper project as is and make that cassandra management process 1.0, then integrate the netflix scheduler (and other new features) into that.
>>
>> The ultimate goal would be for the netflix scheduler to become the default repair scheduler, but I think using reaper as the starting point makes it easier to get there.
>>
>> Reaper would bring a prod user base that would realistically take 2-3 years to build up with a new project. As an operator, switching to a cassandra management process that’s basically a re-brand of an existing and commonly used management process isn’t super risky. Asking operators to switch to a new process is a much harder sell.
>>
>> On September 7, 2018 at 4:17:10 PM, Jeff Jirsa (jji...@gmail.com) wrote:
>>
>> How can we continue moving this forward?
>>
>> Mick/Jon/TLP folks, is there a path here where we commit the Netflix-provided management process, and you augment Reaper to work with it?
>> Is there a way we can make a larger umbrella that's modular and can support either/both?
>> Does anyone believe there's a clear, objective argument that one is strictly better than the other? I haven't seen one.
>>
>> On Mon, Aug 20, 2018 at 4:14 PM Roopa Tangirala <rtangir...@netflix.com.invalid> wrote:
>>
>> +1 to everything that Joey articulated, with emphasis on the fact that contributions should be evaluated based on the merit of the code and its value add to the whole offering. I hope it does not matter whether that contribution comes from a PMC member or a person who is not a committer. I would like the process to be such that it encourages new members to be a part of the community and not shy away from contributing to the code, assuming their contributions are valued differently than those of committers or PMC members. It would be sad to see contributions decrease if we went down that path.
>>
>> *Regards,*
>>
>> *Roopa Tangirala*
>>
>> Engineering Manager CDE
>>
>> *(408) 438-3156 - mobile*
>>
>> On Mon, Aug 20, 2018 at 2:58 PM Joseph Lynch <joe.e.ly...@gmail.com> wrote:
>>
>>>> We are looking to contribute Reaper to the Cassandra project.
>>>
>>> Just to clarify: are you proposing contributing Reaper as a project via donation, or are you planning on contributing the features of Reaper as patches to Cassandra? If the former, how far along are you in the donation process? If the latter, when do you think you would have patches ready for consideration / review?
>>>
>>>> Looking at the patch it's very similar in its base design already, but Reaper does have a lot more to offer. We have all been working hard to move it to also being a side-car so it can be contributed. This raises a number of questions relevant to this thread: would we then accept both works in the Cassandra project, and what burden would it put on the current PMC to maintain both works?
>>>
>>> I would hope that we would collaborate on merging the best parts of all of them into the official Cassandra sidecar, taking the always-on, shared-nothing, highly available system that we've contributed a patchset for and adding in many of the repair features (e.g. schedules, a nice web UI) that Reaper has.
>>>
>>>> I share Stefan's concern that consensus had not been reached around a side-car, and that it was somehow default accepted before a patch landed.
>>>
>>> I feel this is not correct or fair. The sidecar and repair discussions have been anything _but_ "default accepted". The timeline of consensus building involving the management sidecar and repair scheduling plans:
>>>
>>> Dec 2016: Vinay worked with Jon and Alex to try to collaborate on Reaper to come up with design goals for a repair scheduler that could work at Netflix scale.
>>>
>>> ~Feb 2017: Netflix concludes that fundamental design gaps prevent us from using Reaper, as it relies heavily on remote JMX connections and central coordination.
>>>
>>> Sep 2017: Vinay gives a lightning talk at NGCC about a highly available and distributed repair scheduling sidecar/tool. He is encouraged by multiple committers to build repair scheduling into the daemon itself and not as a sidecar, so the database is truly eventually consistent.
>>>
>>> ~Jun 2017 - Feb 2018: Based on internal need and the positive feedback at NGCC, Vinay and I prototype the distributed repair scheduler within Priam and roll it out at Netflix scale.
>>>
>>> Mar 2018: I open a Jira (CASSANDRA-14346) along with a detailed 20-page design document for adding repair scheduling to the daemon itself, and open the design up for feedback from the community.
>>> We get feedback from Alex, Blake, Nate, Stefan, and Mick. As far as I know there were zero proposals to contribute Reaper at this point. We hear the consensus that the community would prefer repair scheduling in a separate distributed sidecar rather than in the daemon itself, and we re-work the design to match this consensus, re-aligning with our original proposal at NGCC.
>>>
>>> Apr 2018: Blake brings the discussion of repair scheduling to the dev list (https://lists.apache.org/thread.html/760fbef677f27aa5c2ab4c375c7efeb81304fea428deff986ba1c2eb@%3Cdev.cassandra.apache.org%3E). Many community members give positive feedback that we should solve it as part of Cassandra, and there is still no mention of contributing Reaper at this point. The last message is my attempted summary giving context on how we want to take the best of all the sidecars (OpsCenter, Priam, Reaper) and ship them with Cassandra.
>>>
>>> Apr 2018: Dinesh opens CASSANDRA-14395 along with a public design document for gathering feedback on a general management sidecar. Sankalp and Dinesh encourage Vinay and me to kickstart that sidecar using the repair scheduler patch.
>>>
>>> Apr 2018: Dinesh reaches out to the dev list (https://lists.apache.org/thread.html/a098341efd8f344494bcd2761dba5125e971b59b1dd54f282ffda253@%3Cdev.cassandra.apache.org%3E) about the general management process to gain further feedback. All feedback remains positive, as it is a potential place for multiple community members to contribute their various sidecar functionality.
>>>
>>> May-Jul 2018: Vinay and I work on creating a basic sidecar for running the repair scheduler, based on the feedback from the community in CASSANDRA-14346 and CASSANDRA-14395.
>>>
>>> Jun 2018: I bump CASSANDRA-14346 indicating we're still working on this; nobody objects.
>>>
>>> Jul 2018: Sankalp asks on the dev list if anyone has feature Jiras they need review for before 4.0. I mention again that we've nearly got the basic sidecar and repair scheduling work done and will need help with review. No one responds.
>>>
>>> Aug 2018: We submit a patch that brings a basic distributed sidecar and robust distributed repair to Cassandra itself. Dinesh mentions that he will try to review. Now folks appear concerned about it being in-tree, suggesting instead that maybe it should go in a different repo altogether. I don't think we have consensus on the repo choice yet.
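>>> To picture the shared-nothing design this timeline keeps referring to: every node runs an identical sidecar that derives its own work from replicated cluster state, so there is no central coordinator to fail over. A toy sketch of the loop (illustrative only, not our actual patch; all names are made up):
>>>
>>> import java.util.concurrent.Executors;
>>> import java.util.concurrent.ScheduledExecutorService;
>>> import java.util.concurrent.TimeUnit;
>>>
>>> final class RepairSchedulerLoop {
>>>     private final ScheduledExecutorService executor =
>>>             Executors.newSingleThreadScheduledExecutor();
>>>
>>>     void start() {
>>>         // Same code on every node; each instance only ever repairs the
>>>         // node it sits next to.
>>>         executor.scheduleWithFixedDelay(this::maybeRepairLocalRanges, 0, 5, TimeUnit.MINUTES);
>>>     }
>>>
>>>     private void maybeRepairLocalRanges() {
>>>         // 1. Read repair state from a shared, replicated table in the cluster itself.
>>>         // 2. If one of this node's token ranges is due and nothing conflicting
>>>         //    is running, claim it (e.g. via a lightweight transaction) and
>>>         //    repair the local node only; no remote JMX fan-out from a
>>>         //    central scheduler.
>>>         // 3. Write progress back so peer sidecars can see it.
>>>     }
>>> }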
>>>> This seems at odds when we're already struggling to keep up with the incoming patches/contributions, and there could be other git repos in the project we will need to support in the future too. But I'm also curious about the whole "Community over Code" angle to this: how do we encourage multiple external works to collaborate together, building value in both the technical and the community sense?
>>>
>>> I viewed this management sidecar as a way for us, as a community, to stop building the same thing over and over again. Netflix maintains Priam, The Last Pickle maintains Reaper, DataStax maintains OpsCenter. Why can't we take the best of Reaper (e.g. schedules, diagnostic events, UI) and leave the worst (e.g. centralized design with lots of locking), combine that with the best of Priam (a robust shared-nothing sidecar that makes Cassandra management easy) and leave the worst (a bunch of technical debt), and iterate towards one sidecar that allows Cassandra users to just run their database?
>>>
>>>> The Reaper project has worked hard in building both its user and contributor base. And I would have thought these, including having the contributor base overlap with the C* PMC, were prerequisites before moving a larger body of work into the project (separate git repo or not). I guess this isn't so much "Community over Code", but it illustrates a concern regarding abandoned code when there's no existing track record of maintaining it as OSS, as opposed to an existing "show, don't tell" culture. Reaper, for example, has stronger indicators for ongoing support and an existing OSS user base: the C* committers having contributed to Reaper today are Jon, Stefan, Nate, and myself, amongst 40 contributors in total. And we've been taking steps to involve it more in the C* community (eg the users ML), without being too presumptuous.
>>>
>>> I worry about this logic, to be frank. Why do significant contributions need to come only from established C* PMC members? Shouldn't we strive to consider the relative merits of code that has actually been submitted to the project on the basis of the code, and not of who sent the patches?
>>>
>>>> On the technical side: Reaper supports (or easily can) all the concerns that the proposal here raises: distributed nodetool commands, centralising jmx interfacing, scheduling ops (repairs, snapshots, compactions, cleanups, etc), monitoring and diagnostics, etc. It's designed so that it can be a single instance, instance-per-datacenter, or side-car (per process). When there are multiple instances in a datacenter you get HA. You have a choice of different storage backends (memory, postgres, c*). You can ofc use a separate C* cluster as a backend so as to separate infrastructure data from production data. And it's got a UI for C* Diagnostics already (which imposes a different jmx interface of polling for events rather than subscribing to jmx notifications, which we know is problematic, thanks to Stefan). Anyway, that's my plug for Reaper :-)
>>>
>>> Could we get some of these suggestions into the CASSANDRA-14346/CASSANDRA-14395 jiras so we can debate the technical merits there?
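>>> For anyone not following the jmx detail above: with the standard javax.management API, subscribing versus polling look roughly like this (illustrative sketch; the endpoint and bean names are the usual C* defaults):
>>>
>>> import javax.management.MBeanServerConnection;
>>> import javax.management.NotificationListener;
>>> import javax.management.ObjectName;
>>> import javax.management.remote.JMXConnector;
>>> import javax.management.remote.JMXConnectorFactory;
>>> import javax.management.remote.JMXServiceURL;
>>>
>>> public class JmxEventsDemo {
>>>     public static void main(String[] args) throws Exception {
>>>         JMXServiceURL url = new JMXServiceURL(
>>>                 "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
>>>         try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
>>>             MBeanServerConnection mbs = connector.getMBeanServerConnection();
>>>             ObjectName ss = new ObjectName("org.apache.cassandra.db:type=StorageService");
>>>
>>>             // Subscribing: simple, but JMX notifications are fire-and-forget;
>>>             // a slow client or a dropped connection silently loses events,
>>>             // which is the known problem with relying on them.
>>>             NotificationListener listener = (notification, handback) ->
>>>                     System.out.println(notification.getType() + ": " + notification.getMessage());
>>>             mbs.addNotificationListener(ss, listener, null, null);
>>>
>>>             // Polling: repeatedly read state instead, as Reaper's diagnostics
>>>             // UI does; a missed read is recovered on the next cycle.
>>>             System.out.println("mode: " + mbs.getAttribute(ss, "OperationMode"));
>>>
>>>             Thread.sleep(60_000); // stay connected for a minute to receive events
>>>         }
>>>     }
>>> }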
>>>> There's been little effort in evaluating these two bodies of work, one of which is largely unknown to us, and my concern is how we would fairly support both going into the future.
>>>
>>>> Another option would be that this side-car patch first exists as a github project for a period of time, on par with how Reaper has been. This will help evaluate its use and first build up its contributor base. That makes it easier for the C* PMC to choose which projects it would want to formally maintain, and to do so based on factors beyond the technical merits. We may even see it converge (or collaborate more) with Reaper, a win for everyone.
>>>
>>> We could have put our distributed repair scheduler into Priam ages ago, which would have been much easier for us, and Priam also has an existing community. But we don't want to, because that would encourage the community to remain fractured on the most important management processes. Instead we seek to work with the community to take the lessons learned from all the various available sidecars owned by different organizations (DataStax, Netflix, TLP) and fix this once for the whole community. Can we work together to make Cassandra just work for our users out of the box?
>>>
>>> -Joey