> The idea that your data integrity needs to be opt-in has never made sense to
> me from the perspective of either the product or the end user.

I could not agree with this more. 100%.
> The current (and past) state of things where running the DB correctly
> *requires* running a separate process (either community maintained or
> official C* sidecar) is incredibly painful for folks.

I'm 50/50 on this (and I have some opinions here; bear with me :D ).

To me this goes beyond the question of just "where do we coordinate repair" into "what role does a node play vs. the sidecar, and how does that intersect w/the industry today".

Having just 1 process you run on N machines is much nicer from an operations standpoint, and it's *much* cleaner and easier for us as a project to not have to deal with signaling, shmem, and going down the IPC rabbit hole. A modular monolith, if you will.

That said, I feel like the zeitgeist has been all-in on microservices and control planes, whether they're the right solution or not. The affordances for building out independent teams and large-organization dev velocity, never mind the ideal of being able to cleanly upgrade or rewrite internal components, are attractive enough on paper that most groups have gone that direction and accepted the perceived costs; I view Cassandra as something of an architectural anachronism at this point. And to call back to the prior paragraph, I *think* you get all those positive affordances w/a modular monolith. Sadly, google trends <https://trends.google.com/trends/explore?cat=32&date=today%205-y&q=microservices,modular%20monolith&hl=en> don't really give me a lot of hope there.

In an ideal world, operators (or better yet, an automated operations process) would be able to dynamically adjust resource allocation to nodes based on the "burstiness of the buffering" (i.e. lots of data building up in CLs needing to be flushed, or compaction need, or repair need). It's not immediately obvious to me how we'd gracefully do that in a single-process paradigm in containers w/out becoming a noisy neighbor, but it's not impossible. This kind of goes meta, outside C*'s scope, into how you're coordinating your hardware and software interactions; maybe that's the cleaner route: we clearly signal metrics for each major operation the DB needs to do to indicate its backlog, and an external orchestration process / system / ??? handles the resource allocation. i.e. we don't take that on.

Certainly we can do a lot better when it comes to the internal scheduling of DB operations relative to one another than we do today (start using CQL rate limiting, dynamically determine a rolling average of needs to smooth out burst requests, make byte-based rate limiting an option, user-space threads w/Loom and some kind of QoS prioritization based on backlogs, etc.).
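To make that last point a bit more concrete, here's a rough sketch of the kind of byte-based rate limiting driven by a rolling backlog average that I have in mind. Purely illustrative: none of these class or method names exist in the codebase, the 5-minute drain target and MiB bounds are made-up numbers, and Guava's RateLimiter is just a convenient stand-in for whatever we'd actually build.

    // Hypothetical sketch: a byte-based limiter for a maintenance task
    // (compaction, repair streaming, ...) whose per-second byte budget is
    // derived from a rolling average of observed backlog instead of a
    // fixed throughput cap. Illustrative only; not a Cassandra API.
    import com.google.common.util.concurrent.RateLimiter;
    import java.util.ArrayDeque;
    import java.util.Deque;

    public class BackloggedByteLimiter
    {
        private static final int WINDOW = 12;                      // samples in the rolling window
        private static final double MIN_BYTES_PER_SEC = 1 << 20;   // 1 MiB/s floor
        private static final double MAX_BYTES_PER_SEC = 256 << 20; // 256 MiB/s ceiling

        private final RateLimiter limiter = RateLimiter.create(MIN_BYTES_PER_SEC);
        private final Deque<Long> backlogSamples = new ArrayDeque<>();

        /** Feed the latest backlog measurement (e.g. pending compaction bytes). */
        public synchronized void recordBacklog(long backlogBytes)
        {
            backlogSamples.addLast(backlogBytes);
            if (backlogSamples.size() > WINDOW)
                backlogSamples.removeFirst();

            double avgBacklog = backlogSamples.stream().mapToLong(Long::longValue).average().orElse(0);
            // Aim to drain the average backlog over ~5 minutes, clamped to sane bounds.
            double target = Math.max(MIN_BYTES_PER_SEC, Math.min(MAX_BYTES_PER_SEC, avgBacklog / 300.0));
            limiter.setRate(target);
        }

        /** Block before writing/streaming the next chunk of work (bytes > 0). */
        public void acquire(int bytes)
        {
            limiter.acquire(bytes);
        }
    }

The idea being a task calls recordBacklog() from a periodic metrics tick and acquire() before each chunk it writes or streams; in practice you'd want a limiter per operation type plus some global QoS arbitration sitting on top, which is where the backlog-based prioritization would come in.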
I personally view moving maintenance tasks into the sidecar as a reasonable "first step satisficing compromise". If anything, that'd potentially give us some breathing room to get our house in order on the "I/O" process (as opposed to the sidecar as the "maintenance process"), and then re-integrate things in a cleaner, planned fashion with better tools to do it right.

~Josh

On Wed, Jul 26, 2023, at 7:20 PM, C. Scott Andreas wrote:
> I agree that it would be ideal for Cassandra to have a repair scheduler in-DB.
>
> That said I would happily support an effort to bring repair scheduling to the
> sidecar immediately. This has nothing blocking it, and would potentially
> enable the sidecar to provide an official repair scheduling solution that is
> compatible with current or even previous versions of the database.
>
> Once TCM has landed, we'll have much stronger primitives for repair
> orchestration in the database itself. But I don't think that should block
> progress on a repair scheduling solution in the sidecar, and there is nothing
> that would prevent someone from continuing to use a sidecar-based solution in
> perpetuity if they preferred.
>
> - Scott
>
> > On Jul 26, 2023, at 3:25 PM, Jon Haddad <rustyrazorbl...@apache.org> wrote:
> >
> > I'm 100% in favor of repair being part of the core DB, not the sidecar.
> > The current (and past) state of things where running the DB correctly
> > *requires* running a separate process (either community maintained or
> > official C* sidecar) is incredibly painful for folks. The idea that your
> > data integrity needs to be opt-in has never made sense to me from the
> > perspective of either the product or the end user.
> >
> > I've worked with way too many teams that have either configured this
> > incorrectly or not at all.
> >
> > Ideally Cassandra would ship with repair built in and on by default. Power
> > users can disable it if they want to continue to maintain their own repair
> > tooling for some reason.
> >
> > Jon
> >
> >> On 2023/07/24 20:44:14 German Eichberger via dev wrote:
> >> All,
> >>
> >> We had a brief discussion in [2] about the Uber article [1] where they
> >> talk about having integrated repair into Cassandra and how great that is.
> >> I expressed my disappointment that they didn't work with the community on
> >> that (Uber, if you are listening, time to make amends 🙂) and it turns out
> >> Joey already had the idea and wrote the code [3] - so I wanted to start a
> >> discussion to gauge interest and maybe how to revive that effort.
> >>
> >> Thanks,
> >> German
> >>
> >> [1] https://www.uber.com/blog/how-uber-optimized-cassandra-operations-at-scale/
> >> [2] https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619
> >> [3] https://issues.apache.org/jira/browse/CASSANDRA-14346