Re: Proposing an Apache Cassandra Management process

Joseph Lynch Mon, 20 Aug 2018 14:58:13 -0700

> We are looking to contribute Reaper to the Cassandra project.
>
Just to clarify, are you proposing contributing Reaper as a project via
donation, or are you planning on contributing the features of Reaper as
patches to Cassandra? If the former, how far along are you on the donation
process? If the latter, when do you think you would have patches ready for
consideration / review?



> Looking at the patch it's very similar in its base design already, but
> Reaper does have a lot more to offer. We have all been working hard to move
> it to also being a side-car so it can be contributed. This raises a number
> of relevant questions to this thread: would we then accept both works in
> the Cassandra project, and what burden would it put on the current PMC to
> maintain both works.
>
I would hope that we would collaborate on merging the best parts of all
into the official Cassandra sidecar, taking the always on, shared nothing,
highly available system that we've contributed a patchset for and adding in
many of the repair features (e.g. schedules, a nice web UI) that Reaper has.


> I share Stefan's concern that consensus had not been met around a
> side-car, and that it was somehow default accepted before a patch landed.


I feel this is not correct or fair. The sidecar and repair discussions have
been anything _but_ "default accepted". The timeline of consensus building
involving the management sidecar and repair scheduling plans:

Dec 2016: Vinay worked with Jon and Alex to try to collaborate on Reaper to
come up with design goals for a repair scheduler that could work at Netflix
scale.

~Feb 2017: Netflix concludes that fundamental design gaps prevent us
from using Reaper, as it relies heavily on remote JMX connections and
central coordination.

Sep. 2017: Vinay gives a lightning talk at NGCC about a highly available
and distributed repair scheduling sidecar/tool. He is encouraged by
multiple committers to build repair scheduling into the daemon itself and
not as a sidecar so the database is truly eventually consistent.

~Jun. 2017 - Feb. 2018: Based on internal need and the positive feedback at
NGCC, Vinay and I prototype the distributed repair scheduler within
Priam and roll it out at Netflix scale.

Mar. 2018: I open a Jira (CASSANDRA-14346) along with a detailed 20 page
design document for adding repair scheduling to the daemon itself and open
the design up for feedback from the community. We get feedback from Alex,
Blake, Nate, Stefan, and Mick. As far as I know there were zero proposals
to contribute Reaper at this point. We hear the consensus that the
community would prefer repair scheduling in a separate distributed sidecar
rather than in the daemon itself and we re-work the design to match this
consensus, re-aligning with our original proposal at NGCC.

Apr 2018: Blake brings the discussion of repair scheduling to the dev list (
https://lists.apache.org/thread.html/760fbef677f27aa5c2ab4c375c7efeb81304fea428deff986ba1c2eb@%3Cdev.cassandra.apache.org%3E).
Many community members give positive feedback that we should solve it as
part of Cassandra and there is still no mention of contributing Reaper at
this point. The last message is my attempted summary giving context on how
we want to take the best of all the sidecars (OpsCenter, Priam, Reaper) and
ship them with Cassandra.

Apr. 2018: Dinesh opens CASSANDRA-14395 along with a public design document
for gathering feedback on a general management sidecar. Sankalp and Dinesh
encourage Vinay and me to kickstart that sidecar using the repair
scheduler patch.

Apr 2018: Dinesh reaches out to the dev list (
https://lists.apache.org/thread.html/a098341efd8f344494bcd2761dba5125e971b59b1dd54f282ffda253@%3Cdev.cassandra.apache.org%3E)
about the general management process to gain further feedback. All feedback
remains positive as it is a potential place for multiple community members
to contribute their various sidecar functionality.

May-Jul 2018: Vinay and I work on creating a basic sidecar for running the
repair scheduler based on the feedback from the community in
CASSANDRA-14346 and CASSANDRA-14395.

Jun 2018: I bump CASSANDRA-14346 indicating we're still working on this.
Nobody objects.

Jul 2018: Sankalp asks on the dev list if anyone has feature Jiras that
need review before 4.0. I mention again that we've nearly got the
basic sidecar and repair scheduling work done and will need help with
review. No one responds.

Aug 2018: We submit a patch that brings a basic distributed sidecar and
robust distributed repair to Cassandra itself. Dinesh mentions that he will
try to review. Now folks appear concerned about it being in tree and
suggest that maybe it should go in a different repo altogether. I don't think
we have consensus on the repo choice yet.

> This seems at odds when we're already struggling to keep up with the
> incoming patches/contributions, and there could be other git repos in the
> project we will need to support in the future too. But I'm also curious
> about the whole "Community over Code" angle to this, how do we encourage
> multiple external works to collaborate together building value in both the
> technical and community.
>

I viewed this management sidecar as a way for us to stop, as a community,
building the same thing over and over again. Netflix maintains Priam, The Last
Pickle maintains Reaper, DataStax maintains OpsCenter. Why can't we take
the best of Reaper (e.g. schedules, diagnostic events, UI) and leave the
worst (e.g. centralized design with lots of locking) and combine it with
the best of Priam (robust shared nothing sidecar that makes Cassandra
management easy) and leave the worst (a bunch of technical debt), and
iterate towards one sidecar that allows Cassandra users to just run their
database?


> The Reaper project has worked hard in building both its user and
> contributor base. And I would have thought these, including having the
> contributor base overlap with the C* PMC, were prerequisites before moving
> a larger body of work into the project (separate git repo or not). I guess
> this isn't so much "Community over Code", but it illustrates a concern
> regarding abandoned code when there's no existing track record of
> maintaining it as OSS, as opposed to expecting an existing "show, don't
> tell" culture. Reaper for example has stronger indicators for ongoing
> support and an existing OSS user base: today C* committers having
> contributed to Reaper are Jon, Stefan, Nate, and myself, amongst the 40
> contributors in total. And we've been making steps to involve it more into
> the C* community (eg users ML), without being too presumptuous.

I worry about this logic, to be frank. Why do significant contributions need
to come only from established C* PMC members? Shouldn't we strive to
consider relative merits of code that has actually been submitted to the
project on the basis of the code and not who sent the patches?


> On the technical side: Reaper supports (or can easily) all the concerns
> that the proposal here raises: distributed nodetool commands, centralising
> jmx interfacing, scheduling ops (repairs, snapshots, compactions, cleanups,
> etc), monitoring and diagnostics, etc etc. It's designed so that it can be
> a single instance, instance-per-datacenter, or side-car (per process). When
> there are multiple instances in a datacenter you get HA. You have a choice
> of different storage backends (memory, postgres, c*). You can ofc use a
> separate C* cluster as a backend so to separate infrastructure data from
> production data. And it's got an UI for C* Diagnostics already (which
> imposes a different jmx interface of polling for events rather than
> subscribing to jmx notifications which we know is problematic, thanks to
> Stefan). Anyway, that's my plug for Reaper :-)
>
Could we get some of these suggestions into the
CASSANDRA-14346/CASSANDRA-14395 jiras and we can debate the technical
merits there?

> There's been little effort in evaluating these two bodies of work, one
> which is largely unknown to us, and my concern is how we would fairly
> support both going into the future?
>

> Another option would be that this side-car patch first exists as a github
> project for a period of time, on par to how Reaper has been. This will help
> evaluate its use and to first build up its contributors. This makes it
> easier for the C* PMC to choose which projects it would want to formally
> maintain, and to do so based on factors beyond merits of the technical. We
> may even see it converge (or collaborate more) with Reaper, a win for
> everyone.
>
We could have put our distributed repair scheduler as part of Priam ages
ago which would have been much easier for us and also has an existing
community, but we don't want to because that will encourage the community
to remain fractured on the most important management processes. Instead we
seek to work with the community to take the lessons learned from all the
various available sidecars owned by different organizations (DataStax,
Netflix, TLP) and fix this once for the whole community. Can we work
together to make Cassandra just work for our users out of the box?

-Joey
