Re: [Discuss] Repair inside C*

Jordan West Tue, 22 Oct 2024 08:36:12 -0700

Agreed with the sentiment that decomposition is a good target but out of
scope here. I’m personally excited to see an in-tree repair scheduler and
am supportive of the approach shared here.


Jordan

On Tue, Oct 22, 2024 at 08:12 Dinesh Joshi <djo...@apache.org> wrote:

> Decomposing Cassandra may be architecturally desirable but that is not the
> goal of this CEP. This CEP brings value to operators today so it should be
> considered on that merit. We definitely need to have a separate
> conversation on Cassandra's architectural direction.
>
> On Tue, Oct 22, 2024 at 7:51 AM Joseph Lynch <joe.e.ly...@gmail.com>
> wrote:
>
>> Definitely like this in C* itself. We only changed our proposal to
>> putting repair scheduling in the sidecar before because trunk was frozen
>> for the foreseeable future at that time. With trunk unfrozen and
>> development on the main process going at a fast pace I think it makes way
>> more sense to integrate natively as table properties as this CEP proposes.
>> Completely agree the scheduling overhead should be minimal.
>>
>> Moving the actual repair operation (comparing data and streaming
>> mismatches) along with compaction operations to a separate process long
>> term makes a lot of sense but imo only once we both have a release of
>> sidecar and a contract figured out between them on communication. I'm
>> watching CEP-38 there as I think CQL and virtual tables are looking much
>> stronger than when we wrote CEP-1 and chose HTTP but that's for that
>> discussion and not this one.
>>
>> -Joey
>>
>> On Mon, Oct 21, 2024 at 3:25 PM Francisco Guerrero <fran...@apache.org>
>> wrote:
>>
>>> Like others have said, I was expecting the scheduling portion of repair
>>> to be negligible. I was mostly curious whether you had something handy
>>> that you could quickly share.
>>>
>>> On 2024/10/21 18:59:41 Jaydeep Chovatia wrote:
>>> > >Jaydeep, do you have any metrics on your clusters comparing them
>>> > >before and after introducing repair scheduling into the Cassandra
>>> > >process?
>>> >
>>> > Yes, I had made some comparisons when I started rolling this feature
>>> > out to our production five years ago :)  Here are the details:
>>> > *The Scheduling*
>>> > The scheduling itself is exceptionally lightweight, as only one
>>> > additional thread monitors the repair activity, updating the status to
>>> > a system table once every few minutes or so. So, it does not appear
>>> > anywhere in the CPU charts, etc. Unfortunately, I do not have those
>>> > graphs now, but I can do a quick comparison if it helps!
>>> >
>>> > *The Repair Itself*
>>> > As we all know, the Cassandra repair algorithm is a heavyweight
>>> > process due to Merkle tree/streaming, etc., no matter how we schedule
>>> > it. But it is an orthogonal topic and folks are already discussing
>>> > creating a new CEP.
>>> >
>>> > Jaydeep
>>> >
>>> >
>>> > On Mon, Oct 21, 2024 at 10:02 AM Francisco Guerrero <fran...@apache.org>
>>> > wrote:
>>> >
>>> > > Jaydeep, do you have any metrics on your clusters comparing them
>>> > > before and after introducing repair scheduling into the Cassandra
>>> > > process?
>>> > >
>>> > > On 2024/10/21 16:57:57 "J. D. Jordan" wrote:
>>> > > > Sounds good. Just wanted to bring it up. I agree that the
>>> > > > scheduling bit is pretty lightweight and the ideal would be to
>>> > > > bring the whole of the repair external, which is a much bigger can
>>> > > > of worms to open.
>>> > > >
>>> > > >
>>> > > >
>>> > > > -Jeremiah
>>> > > >
>>> > > >
>>> > > >
>>> > > > > On Oct 21, 2024, at 11:21 AM, Chris Lohfink <clohfin...@gmail.com>
>>> > > > > wrote:
>>> > > > >
>>> > > > >
>>> > > >
>>> > > > > 
>>> > > > >
>>> > > > > > I actually think we should be looking at how we can move
>>> > > > > > things out of the database process.
>>> > > > >
>>> > > > >
>>> > > > >
>>> > > > >
>>> > > > >
>>> > > > > While worth pursuing, I think we would need a different CEP just
>>> > > > > to figure out how to do that. Not only is there a lot of
>>> > > > > infrastructure difficulty in running multi-process, the inter-app
>>> > > > > communication needs to be figured out better than JMX. Even with
>>> > > > > the sidecar, we don't have a solid story on how to ensure both
>>> > > > > are running or anything yet. It's up to each app owner to figure
>>> > > > > it out. Once we have a good thing in place I think we can start
>>> > > > > moving compactions, repairs, etc. out of the database. Even then
>>> > > > > it's the _repairs_ that are expensive, not the scheduling.
>>> > > > >
>>> > > > >
>>> > > > >
>>> > > > >
>>> > > > > On Mon, Oct 21, 2024 at 9:45 AM Jeremiah Jordan
>>> > > > > <[jeremiah.jor...@gmail.com](mailto:jeremiah.jor...@gmail.com)>
>>> > > wrote:
>>> > > > >
>>> > > > >
>>> > > >
>>> > > > >> I love the idea of a repair service being there by default for
>>> > > > >> an install of C*.  My main concern here is that it is putting
>>> > > > >> more services into the main database process.  I actually think
>>> > > > >> we should be looking at how we can move things out of the
>>> > > > >> database process.  The C* process being a giant monolith has
>>> > > > >> always been a pain point.  Is there any way it makes sense for
>>> > > > >> this to be an external process rather than a new thread pool
>>> > > > >> inside the C* process?
>>> > > >
>>> > > > >>
>>> > > >
>>> > > > >>
>>> > > > >
>>> > > > >>
>>> > > >
>>> > > > >> -Jeremiah Jordan
>>> > > >
>>> > > > >>
>>> > > >
>>> > > > >>
>>> > > > >
>>> > > > >>
>>> > > >
>>> > > > >> On Oct 18, 2024 at 2:58:15 PM, Mick Semb Wever
>>> > > > <[m...@apache.org](mailto:m...@apache.org)> wrote:
>>> > > > >
>>> > > > >>
>>> > > >
>>> > > > >>>
>>> > > > >
>>> > > > >>>
>>> > > >
>>> > > > >>> This is looking strong, thanks Jaydeep.
>>> > > >
>>> > > > >>>
>>> > > >
>>> > > > >>>
>>> > > > >
>>> > > > >>>
>>> > > >
>>> > > > >>> I would suggest folk take a look at the design doc and the PR
>>> > > > >>> in the CEP. A lot is there (that I have completely missed).
>>> > > >
>>> > > > >>>
>>> > > >
>>> > > > >>>
>>> > > > >
>>> > > > >>>
>>> > > >
>>> > > > >>> I would especially ask all authors of prior art (Reaper, DSE
>>> > > > >>> nodesync, ecchronos) to take a final review of the proposal.
>>> > > > >
>>> > > > >>>
>>> > > >
>>> > > > >>>
>>> > > > >
>>> > > > >>>
>>> > > >
>>> > > > >>> Jaydeep, can we ask for a two-week window while we reach out
>>> > > > >>> to these people?  There's a lot of prior art in this space, and
>>> > > > >>> it feels like we're in a good place now where it's clear this
>>> > > > >>> has legs and we can use that to bring folk in and make sure
>>> > > > >>> there are no remaining blind spots.
>>> > > >
>>> > > > >>>
>>> > > >
>>> > > > >>>
>>> > > > >
>>> > > > >>>
>>> > > >
>>> > > > >>>
>>> > > > >
>>> > > > >>>
>>> > > >
>>> > > > >>> On Fri, 18 Oct 2024 at 01:40, Jaydeep Chovatia
>>> > > > <[chovatia.jayd...@gmail.com](mailto:chovatia.jayd...@gmail.com)>
>>> > > wrote:
>>> > > > >
>>> > > > >>>
>>> > > >
>>> > > > >>>> Sorry, there is a typo in the CEP-37 link; here is the correct
>>> > > > >>>> [link](https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37+Apache+Cassandra+Unified+Repair+Solution)
>>> > > >
>>> > > > >>>>
>>> > > >
>>> > > > >>>>
>>> > > > >
>>> > > > >>>>
>>> > > >
>>> > > > >>>>
>>> > > > >
>>> > > > >>>>
>>> > > >
>>> > > > >>>> On Thu, Oct 17, 2024 at 4:36 PM Jaydeep Chovatia
>>> > > > <[chovatia.jayd...@gmail.com](mailto:chovatia.jayd...@gmail.com)>
>>> > > wrote:
>>> > > > >
>>> > > > >>>>
>>> > > >
>>> > > > >>>>> First, thank you for your patience while we strengthened
>>> > > > >>>>> CEP-37.
>>> > > >
>>> > > > >>>>>
>>> > > >
>>> > > > >>>>>
>>> > > > >
>>> > > > >>>>>
>>> > > >
>>> > > > >>>>> Over the last eight months, Chris Lohfink, Andy Tolbert, Josh
>>> > > > >>>>> McKenzie, Dinesh Joshi, Kristijonas Zalys, and I have done tons
>>> > > > >>>>> of work (online discussions/a dedicated Slack channel
>>> > > > >>>>> #cassandra-repair-scheduling-cep37) to come up with the best
>>> > > > >>>>> possible design that not only significantly simplifies repair
>>> > > > >>>>> operations but also includes the most common features that
>>> > > > >>>>> everyone will benefit from when running at scale.
>>> > > >
>>> > > > >>>>>
>>> > > >
>>> > > > >>>>> For example,
>>> > > >
>>> > > > >>>>>
>>> > > >
>>> > > > >>>>>   * Apache Cassandra must be capable of running multiple
>>> > > > >>>>>     repair types, such as Full, Incremental, Paxos, and Preview,
>>> > > > >>>>>     so the framework should be easily extendable with no
>>> > > > >>>>>     additional overhead from the operator’s point of view.
>>> > > >
>>> > > > >>>>>
>>> > > >
>>> > > > >>>>>   * An easy way to extend the token-split calculation
>>> > > > >>>>>     algorithm, with a default implementation, should exist.
>>> > > >
>>> > > > >>>>>
>>> > > >
>>> > > > >>>>>   * Running incremental repair reliably at scale is pretty
>>> > > > >>>>>     challenging, so we need to place safeguards, such as
>>> > > > >>>>>     migration/rollback without restart and stopping incremental
>>> > > > >>>>>     repair automatically if the disk is about to get full.
>>> > > >
>>> > > > >>>>>
>>> > > >
>>> > > > >>>>>
>>> > > >
>>> > > > >>>>>
>>> > > >
>>> > > > >>>>> We are glad to inform you that CEP-37 (i.e., Repair inside
>>> > > > >>>>> Cassandra) is now officially ready for review after multiple
>>> > > > >>>>> rounds of design, testing, code reviews, documentation reviews,
>>> > > > >>>>> and, more importantly, validation that it runs at scale!
>>> > > >
>>> > > > >>>>>
>>> > > >
>>> > > > >>>>>
>>> > > > >
>>> > > > >>>>>
>>> > > >
>>> > > > >>>>> Some facts about CEP-37.
>>> > > >
>>> > > > >>>>>
>>> > > >
>>> > > > >>>>>   * Multiple members have verified all aspects of CEP-37
>>> > > > >>>>>     numerous times.
>>> > > >
>>> > > > >>>>>
>>> > > >
>>> > > > >>>>>   * The design proposed in CEP-37 has been thoroughly tried
>>> > > > >>>>>     and tested on an immense scale (hundreds of unique Cassandra
>>> > > > >>>>>     clusters, tens of thousands of Cassandra nodes, with tens of
>>> > > > >>>>>     millions of QPS) on top of 4.1 open-source for more than
>>> > > > >>>>>     five years; please see more details
>>> > > > >>>>>     [here](https://www.uber.com/en-US/blog/how-uber-optimized-cassandra-operations-at-scale/).
>>> > > >
>>> > > > >>>>>
>>> > > >
>>> > > > >>>>>   * The following
>>> > > > >>>>>     [presentation](https://docs.google.com/presentation/d/1Zilww9c7LihHULk_ckErI2s4XbObxjWknKqRtbvHyZc/edit#slide=id.g30a4fd4fcf7_0_13),
>>> > > > >>>>>     which was given during last week’s Apache Cassandra Bay Area
>>> > > > >>>>>     [Meetup](https://www.meetup.com/apache-cassandra-bay-area/events/303469006/),
>>> > > > >>>>>     highlights the rigor applied to CEP-37.
>>> > > >
>>> > > > >>>>>
>>> > > >
>>> > > > >>>>>
>>> > > >
>>> > > >
>>> > > > >
>>> > > > >>>>>
>>> > > >
>>> > > > >>>>> Since things are massively overhauled, we believe it is
>>> > > > >>>>> almost ready for a final pass pre-VOTE. We would like you to
>>> > > > >>>>> please review the
>>> > > > >>>>> [CEP-37](https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37+Apache+Cassandra+Unified+Repair+Solution)
>>> > > > >>>>> and the associated detailed design
>>> > > > >>>>> [doc](https://docs.google.com/document/d/1CJWxjEi-mBABPMZ3VWJ9w5KavWfJETAGxfUpsViPcPo/edit#heading=h.r112r46toau0).
>>> > > >
>>> > > > >>>>>
>>> > > >
>>> > > > >>>>>
>>> > > > >
>>> > > > >>>>>
>>> > > >
>>> > > > >>>>> Thank you everyone!
>>> > > >
>>> > > > >>>>>
>>> > > >
>>> > > > >>>>> Chris, Andy, Josh, Dinesh, Kristijonas, and Jaydeep
>>> > > >
>>> > > > >>>>>
>>> > > >
>>> > > > >>>>>
>>> > > > >
>>> > > > >
>>> > > > >>>>>
>>> > > >
>>> > > > >>>>>
>>> > > > >
>>> > > > >>>>>
>>> > > >
>>> > > > >>>>> On Thu, Sep 19, 2024 at 11:26 AM Josh McKenzie
>>> > > > <[jmcken...@apache.org](mailto:jmcken...@apache.org)> wrote:
>>> > > > >
>>> > > > >>>>>
>>> > > >
>>> > > >
>>> > > > >>>>>>
>>> > > >
>>> > > > >>>>>> Not quite; finishing touches on the CEP and design doc are
>>> > > > >>>>>> in flight (as of last / this week).
>>> > > > >
>>> > > > >>>>>>
>>> > > >
>>> > > > >>>>>>
>>> > > > >
>>> > > > >>>>>>
>>> > > >
>>> > > > >>>>>> Soon(tm).
>>> > > >
>>> > > > >>>>>>
>>> > > >
>>> > > > >>>>>>
>>> > > > >
>>> > > > >>>>>>
>>> > > >
>>> > > > >>>>>> On Thu, Sep 19, 2024, at 2:07 PM, Patrick McFadin wrote:
>>> > > > >
>>> > > > >>>>>>
>>> > > >
>>> > > > >>>>>>> Is this CEP ready for a VOTE thread?
>>> > > > >>>>>>> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37+%28DRAFT%29+Apache+Cassandra+Unified+Repair+Solution>
>>> > >
>>> > > > >
>>> > > > >>>>>>>
>>> > > >
>>> > > > >>>>>>>
>>> > > > >
>>> > > > >>>>>>>
>>> > > >
>>> > > > >>>>>>> On Sun, Feb 25, 2024 at 12:25 PM Jaydeep Chovatia
>>> > > > <[chovatia.jayd...@gmail.com](mailto:chovatia.jayd...@gmail.com)>
>>> > > wrote:
>>> > > > >
>>> > > > >>>>>>>
>>> > > >
>>> > > > >>>>>>>> Thanks, Josh. I've just updated the
>>> > > > >>>>>>>> [CEP](https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37+%28DRAFT%29+Apache+Cassandra+Official+Repair+Solution)
>>> > > > >>>>>>>> and included all the solutions you mentioned below.
>>> > > > >
>>> > > > >>>>>>>>
>>> > > >
>>> > > > >>>>>>>>
>>> > > > >
>>> > > > >>>>>>>>
>>> > > >
>>> > > > >>>>>>>> Jaydeep
>>> > > > >
>>> > > > >>>>>>>>
>>> > > >
>>> > > > >>>>>>>>
>>> > > > >
>>> > > > >>>>>>>>
>>> > > >
>>> > > > >>>>>>>> On Thu, Feb 22, 2024 at 9:33 AM Josh McKenzie
>>> > > > <[jmcken...@apache.org](mailto:jmcken...@apache.org)> wrote:
>>> > > > >
>>> > > > >>>>>>>>
>>> > > >
>>> > > > >
>>> > > > >>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>> Very late response from me here (basically necro'ing this
>>> > > > >>>>>>>>> thread).
>>> > > > >
>>> > > > >>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>
>>> > > > >
>>> > > > >>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>> I think it'd be useful to get this condensed into a CEP
>>> > > > >>>>>>>>> that we can then discuss in that format. It's clearly
>>> > > > >>>>>>>>> something we all agree we need and having an implementation
>>> > > > >>>>>>>>> that works, even if it's not in your preferred execution
>>> > > > >>>>>>>>> domain, is vastly better than nothing IMO.
>>> > > > >
>>> > > > >>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>
>>> > > > >
>>> > > > >>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>> I don't have cycles (nor background ;) ) to do that, but
>>> > > > >>>>>>>>> it sounds like you do, Jaydeep, given the implementation you
>>> > > > >>>>>>>>> have on a private fork + design.
>>> > > > >
>>> > > > >>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>
>>> > > > >
>>> > > > >>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>> A non-exhaustive list of things that might be useful to
>>> > > > >>>>>>>>> incorporate into or reference from a CEP:
>>> > > > >
>>> > > > >>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>> Slack thread:
>>> > > > >>>>>>>>> <https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619>
>>> > > > >
>>> > > > >>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>> Joey's old C* ticket:
>>> > > > <https://issues.apache.org/jira/browse/CASSANDRA-14346>
>>> > > > >
>>> > > > >>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>> Even older automatic repair scheduling:
>>> > > > <https://issues.apache.org/jira/browse/CASSANDRA-10070>
>>> > > > >
>>> > > > >>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>> Your design gdoc:
>>> > > > >>>>>>>>> <https://docs.google.com/document/d/1CJWxjEi-mBABPMZ3VWJ9w5KavWfJETAGxfUpsViPcPo/edit#heading=h.r112r46toau0>
>>> > > > >
>>> > > > >>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>> PR with automated repair:
>>> > > > >>>>>>>>> <https://github.com/jaydeepkumar1984/cassandra/commit/ef6456d652c0d07cf29d88dfea03b73704814c2c>
>>> > >
>>> > > > >
>>> > > > >>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>
>>> > > > >
>>> > > > >>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>> My intuition is that we're all basically in agreement
>>> > > > >>>>>>>>> that this is something the DB needs, we're all willing to
>>> > > > >>>>>>>>> bikeshed for our personal preference on where it lives and
>>> > > > >>>>>>>>> how it's implemented, and at the end of the day, code talks.
>>> > > > >>>>>>>>> I don't think anyone's said they'll die on the hill of
>>> > > > >>>>>>>>> implementation details, so that feels like CEP time to me.
>>> > > > >
>>> > > > >>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>
>>> > > > >
>>> > > > >>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>> If you were willing and able to get a CEP together for
>>> > > > >>>>>>>>> automated repair based on the above material, given you've
>>> > > > >>>>>>>>> done the work and have the proof points it's working at
>>> > > > >>>>>>>>> scale, I think this would be a _huge contribution_ to the
>>> > > > >>>>>>>>> community.
>>> > > > >
>>> > > > >>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>
>>> > > > >
>>> > > > >>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>> On Thu, Aug 24, 2023, at 7:26 PM, Jaydeep Chovatia wrote:
>>> > > > >
>>> > > > >>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>> Is anyone going to file an official CEP for this?
>>> > > > >
>>> > > > >>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>> As mentioned in this email thread, here is one of the
>>> > > > >>>>>>>>>> solution's
>>> > > > >>>>>>>>>> [design doc](https://docs.google.com/document/d/1CJWxjEi-mBABPMZ3VWJ9w5KavWfJETAGxfUpsViPcPo/edit#heading=h.r112r46toau0)
>>> > > > >>>>>>>>>> and source code on a private Apache Cassandra patch. Could
>>> > > > >>>>>>>>>> you go through it and let me know what you think?
>>> > > > >
>>> > > > >>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>>
>>> > > > >
>>> > > > >>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>> Jaydeep
>>> > > > >
>>> > > > >>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>>
>>> > > > >
>>> > > > >>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>> On Wed, Aug 2, 2023 at 3:54 PM Jon Haddad
>>> > > > <[rustyrazorbl...@apache.org](mailto:rustyrazorbl...@apache.org)>
>>> > > wrote:
>>> > > > >
>>> > > > >>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>>> > That said I would happily support an effort to bring
>>> > > > >>>>>>>>>>> > repair scheduling to the sidecar immediately. This has
>>> > > > >>>>>>>>>>> > nothing blocking it, and would potentially enable the
>>> > > > >>>>>>>>>>> > sidecar to provide an official repair scheduling solution
>>> > > > >>>>>>>>>>> > that is compatible with current or even previous versions
>>> > > > >>>>>>>>>>> > of the database.
>>> > > > >
>>> > > > >>>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>>>
>>> > > > >
>>> > > > >>>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>>> This is something I hadn't thought much about, and is a
>>> > > > >>>>>>>>>>> pretty good argument for using the sidecar initially.
>>> > > > >>>>>>>>>>> There are a lot of deployments out there and having an
>>> > > > >>>>>>>>>>> official repair option would be a big win.
>>> > > > >
>>> > > > >>>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>>>
>>> > > > >
>>> > > > >>>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>>>
>>> > > > >
>>> > > > >>>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>>> On 2023/07/26 23:20:07 "C. Scott Andreas" wrote:
>>> > > > >
>>> > > > >>>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>>> > I agree that it would be ideal for Cassandra to have
>>> > > > >>>>>>>>>>> > a repair scheduler in-DB.
>>> > > > >
>>> > > > >>>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>>> >
>>> > > > >
>>> > > > >>>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>>> > That said I would happily support an effort to bring
>>> > > > >>>>>>>>>>> > repair scheduling to the sidecar immediately. This has
>>> > > > >>>>>>>>>>> > nothing blocking it, and would potentially enable the
>>> > > > >>>>>>>>>>> > sidecar to provide an official repair scheduling solution
>>> > > > >>>>>>>>>>> > that is compatible with current or even previous versions
>>> > > > >>>>>>>>>>> > of the database.
>>> > > > >
>>> > > > >>>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>>> >
>>> > > > >
>>> > > > >>>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>>> > Once TCM has landed, we’ll have much stronger
>>> > > > >>>>>>>>>>> > primitives for repair orchestration in the database
>>> > > > >>>>>>>>>>> > itself. But I don’t think that should block progress on a
>>> > > > >>>>>>>>>>> > repair scheduling solution in the sidecar, and there is
>>> > > > >>>>>>>>>>> > nothing that would prevent someone from continuing to use
>>> > > > >>>>>>>>>>> > a sidecar-based solution in perpetuity if they preferred.
>>> > > > >
>>> > > > >>>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>>> >
>>> > > > >
>>> > > > >>>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>>> > - Scott
>>> > > > >
>>> > > > >>>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>>> >
>>> > > > >
>>> > > > >>>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>>> > > On Jul 26, 2023, at 3:25 PM, Jon Haddad
>>> > > > <[rustyrazorbl...@apache.org](mailto:rustyrazorbl...@apache.org)>
>>> > > wrote:
>>> > > > >
>>> > > > >>>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>>> > >
>>> > > > >
>>> > > > >>>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>>> > > I'm 100% in favor of repair being part of the core
>>> > > > >>>>>>>>>>> > > DB, not the sidecar.  The current (and past) state of
>>> > > > >>>>>>>>>>> > > things where running the DB correctly *requires*
>>> > > > >>>>>>>>>>> > > running a separate process (either community maintained
>>> > > > >>>>>>>>>>> > > or official C* sidecar) is incredibly painful for
>>> > > > >>>>>>>>>>> > > folks.  The idea that your data integrity needs to be
>>> > > > >>>>>>>>>>> > > opt-in has never made sense to me from the perspective
>>> > > > >>>>>>>>>>> > > of either the product or the end user.
>>> > > > >
>>> > > > >>>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>>> > >
>>> > > > >
>>> > > > >>>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>>> > > I've worked with way too many teams that have either
>>> > > > >>>>>>>>>>> > > configured this incorrectly or not at all.
>>> > > > >
>>> > > > >>>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>>> > >
>>> > > > >
>>> > > > >>>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>>> > > Ideally Cassandra would ship with repair built in and
>>> > > > >>>>>>>>>>> > > on by default.  Power users can disable if they want to
>>> > > > >>>>>>>>>>> > > continue to maintain their own repair tooling for some
>>> > > > >>>>>>>>>>> > > reason.
>>> > > > >
>>> > > > >>>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>>> > >
>>> > > > >
>>> > > > >>>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>>> > > Jon
>>> > > > >
>>> > > > >>>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>>> > >
>>> > > > >
>>> > > > >>>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>>> > >> On 2023/07/24 20:44:14 German Eichberger via dev
>>> > > wrote:
>>> > > > >
>>> > > > >>>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>>> > >> All,
>>> > > > >
>>> > > > >>>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>>> > >> We had a brief discussion in [2] about the Uber
>>> > > > >>>>>>>>>>> > >> article [1] where they talk about having integrated
>>> > > > >>>>>>>>>>> > >> repair into Cassandra and how great that is. I
>>> > > > >>>>>>>>>>> > >> expressed my disappointment that they didn't work with
>>> > > > >>>>>>>>>>> > >> the community on that (Uber, if you are listening,
>>> > > > >>>>>>>>>>> > >> time to make amends 🙂) and it turns out Joey already
>>> > > > >>>>>>>>>>> > >> had the idea and wrote the code [3] - so I wanted to
>>> > > > >>>>>>>>>>> > >> start a discussion to gauge interest and maybe how to
>>> > > > >>>>>>>>>>> > >> revive that effort.
>>> > > > >
>>> > > > >>>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>>> > >> Thanks,
>>> > > > >
>>> > > > >>>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>>> > >> German
>>> > > > >
>>> > > > >>>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>>> > >> [1] <https://www.uber.com/blog/how-uber-optimized-cassandra-operations-at-scale/>
>>> > > > >
>>> > > > >>>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>>> > >> [2] <https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619>
>>> > > > >
>>> > > > >>>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>>> > >> [3] <https://issues.apache.org/jira/browse/CASSANDRA-14346>
>>> > > > >
>>> > > > >>>>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>>> >
>>> > > > >
>>> > > > >>>>>>>>>
>>> > > >
>>> > > > >>>>>>>>>
>>> > > > >
>>> > > > >>>>>>
>>> > > >
>>> > > > >>>>>>
>>> > > > >
>>> > > >
>>> > > >
>>> > >
>>> >
>>>
>>
