Re: [DISCUSS] Forking Cassandra utilities into a separately released library

Josh McKenzie Thu, 11 Jun 2026 08:17:09 -0700

> What do you mean by this?  A "branch of tools per GA branch" I don’t follow.
So if we have the following on C* as GA branches:
- cassandra-4.1
- cassandra-5.0
- cassandra-6.0


We'd have branches on the tools project for:
- cassandra-4.1
- cassandra-5.0
- cassandra-6.0

i.e. we mirror the C* upstream branching strategy and maintain compatibility 
between HEAD on both repos. That way we can make changes that are C*-version 
specific if needed w/out having to modernize the integration of tooling w/older 
C* branches.

If tooling for older branches is unlikely to change, then it seems like the 
following might be optimal:
 1. new repo
 2. branch strategy matching our primary consumer (C*)
   1. Backport changes selectively to older branches as needed
 3. embed the tooling as a submodule in C*
Is that a distillation of what you're thinking, David? Seems reasonable to me.
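
To make the branch mirroring concrete, here's a rough sketch of the mapping and the "backport selectively" step. Everything in it is an assumption for illustration: the function names are hypothetical, and the only real data is the GA branch list above.

```python
# Hypothetical sketch only: these helpers are illustrative, not project tooling.
# The branch names are the GA branches listed earlier in this message.
GA_BRANCHES = ["cassandra-4.1", "cassandra-5.0", "cassandra-6.0"]


def tools_branch_for(cassandra_branch: str) -> str:
    """Mirror the C* branch name 1:1 onto the tools repo."""
    if cassandra_branch not in GA_BRANCHES:
        raise ValueError(f"not a GA branch: {cassandra_branch}")
    return cassandra_branch


def backport_targets(landed_on: str, affected: set[str]) -> list[str]:
    """Older tools branches that should receive a selective backport.

    `landed_on` is the branch where the change first landed; `affected`
    is the set of branches someone has decided actually need it.
    """
    idx = GA_BRANCHES.index(landed_on)
    return [b for b in GA_BRANCHES[:idx] if b in affected]
```

With this shape, a fix that lands on cassandra-6.0 and only matters for 4.1 touches exactly one older tools branch, and a trunk-style feature with an empty `affected` set touches none.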

On Mon, Jun 8, 2026, at 6:24 PM, David Capwell wrote:
>> but that introduces the inverse problem where you'd have to make a change 
>> across N branches on the shared library if you have a patch that introduces 
>> testing that hits all our GA C* and need to backport that functionality 
>> instead of changing it in one place.
> 
> In the case I was talking about it’s the Property, Gen, and Gens classes, and 
> not cluster level tests (similar to python dtest); so I don’t think that would 
> happen?
> 
> 
>>  • Do we expect the shared functionality in this lib would change frequently 
>> in ways that would impact multiple branches, or do we think it would be 
>> mostly stable for older branches and mutate more frequently on trunk?
> 
> I went through our mailing list to see where this has been brought up and a 
> common set brought up are "executors/futures/collections/concurrency 
> utilities".  These cases I feel should be the same: new features are for 
> trunk and we don’t really need to backport to older branches unless there 
> are bug fixes (in which case we bump the version).  So I work with the 
> assumption that backports to older branches aren’t that likely.  Bug fixes 
> might need a version bump but should be backwards compatible; new features 
> should also not break the public API.
> 
> One advantage of being a separate and versioned dependency is it’s easier to 
> track when the API is broken; keeping it in tree makes this more painful.
> 
> Now, going through the history of this topic there is a group of things that 
> I don’t think make sense to fork, and it’s stuff like AbstractType / Index / 
> IAuthenticator, etc. Plugin authors want a way to handle building their 
> plugins without Cassandra-all and these APIs are structurally cassandra 
> related.  The stuff I propose extracting out of the code base are generic and 
> unaware of cassandra as a project.
> 
>>  • If the latter (mostly stable, trunk only changes) then having a branch of 
>> tools per GA branch would be optimal
> What do you mean by this?  A "branch of tools per GA branch" I don’t follow.
> 
>> From a workflow perspective, a shared library factored out to its own repo 
>> and embedded into C* branches as a submodule has some attractive properties 
>> either way. It gives you "best of both worlds" (or least-worst-option) by 
>> allowing you to work on things seamlessly as though they were one project 
>> but keep the branching strategies of the tooling and the dependents 
>> decoupled. Even if we only had 1 branch of the test tooling that all C* 
>> versions depended on, having it separate and embedded as a submodule should 
>> give us the same devx ergonomics while preserving the option to customize 
>> per C* branch fairly easily.
> 
> Yep!  While working on accord I never needed 2 different IDEs open, one for 
> accord and one for cassandra; I was able to make changes as if it was a 
> single project and the only complexity for development was making sure CI 
> knew about my accord branch (we have a script in tree for that) and merge is 
> 3 steps rather than 1 (merge accord, update cassandra to point to latest 
> accord, merge cassandra).
> 
> Submodules do have downsides we are currently living with (as you have seen 
> working with CI) and I do hope it’s been mostly seamless for people… 
> 
> I can also see us trying out a hybrid model… trunk is submodule but once we 
> fork a major branch we switch to release jars instead; we get the trunk level 
> velocity and lose all the pain points of submodules when working in a 
> release branch.
> 
>> On Jun 8, 2026, at 7:25 AM, Josh McKenzie wrote:
>> 
>>> One other motivation for forking is that we can fix issues one time rather 
>>> than have to fix in 5 branches that have slightly different versions of our 
>>> libraries. 
>> The pain on this one is real. Spit-balling, but I wonder if there'd be a way 
>> to sustainably have all GA branches depend on this code from trunk and we 
>> use testing and validation to ensure the code on trunk stays compatible with 
>> older releases.
>> 
>> There's a lot of complexity there since we'd need CI updated to run that 
>> subset of tooling tests across all GA branches before a commit (i.e. trunk 
>> only changes would then potentially impact all GA branches), but maybe that 
>> actually wouldn't be so bad if we just had a new pipeline that pulled and 
>> built all GA branches from HEAD and ran through the tooling test suites 
>> against those releases. That, and it'd only really be in scope if you were 
>> making changes to that tooling. That said, it would seem pretty weird for 
>> 5.0 to need to check out code from the trunk branch to build and run tests 
>> against though... =/
>> 
>>> My primary need is for test utilities so my focus is there.
>> Hm. Yeah, the more I think through this, having a versioned set of test 
>> utilities in trunk for instance would definitely feel like "crossing the 
>> streams" (i.e. PropertyTestingBase4.0, PropertyTestingBase4.1, etc). Big 
>> separation of concerns / scope failure if people working on a trunk branch 
>> in C* are having to think about other branches and API breakage with them 
>> (moreso than we already have to w/mixed version upgrades etc.)
>> 
>> Having things like that in a separate repo where we could iterate on 
>> things to update for a single branch would alleviate that immediate 
>> versioning / mismatch context leak, but that introduces the inverse problem 
>> where you'd have to make a change across N branches on the shared library if 
>> you have a patch that introduces testing that hits all our GA C* and need to 
>> backport that functionality instead of changing it in one place.
>> 
>> Blech.
>> 
>> So as I was drafting the above, my thinking has distilled down to the 
>> following as being important to have a shared mental model on:
>>  • Do we expect the shared functionality in this lib would change frequently 
>> in ways that would impact multiple branches, or do we think it would be 
>> mostly stable for older branches and mutate more frequently on trunk?
>>    • If the former (multi-branch impacting blast radius, we keep older GA 
>> branches in sync / compatible with test harness changes), a single golden 
>> copy of the shared code that each branch shares would minimize toil
>>    • If the latter (mostly stable, trunk only changes) then having a branch 
>> of tools per GA branch would be optimal
>> 
>> From a workflow perspective, a shared library factored out to its own repo 
>> and embedded into C* branches as a submodule has some attractive properties 
>> either way. It gives you "best of both worlds" (or least-worst-option) by 
>> allowing you to work on things seamlessly as though they were one project 
>> but keep the branching strategies of the tooling and the dependents 
>> decoupled. Even if we only had 1 branch of the test tooling that all C* 
>> versions depended on, having it separate and embedded as a submodule should 
>> give us the same devx ergonomics while preserving the option to customize 
>> per C* branch fairly easily.
>> 
>> On Fri, Jun 5, 2026, at 9:25 AM, David Capwell wrote:
>>> One other motivation for forking is that we can fix issues one time rather 
>>> than have to fix in 5 branches that have slightly different versions of our 
>>> libraries. A recent example is CASSANDRA-21216 which was a bug fix for 
>>> btree.  
>>> 
>>> One of the other reasons brought up in the past is that many libraries are 
>>> needed by accord but accord can’t depend on Cassandra else we have a 
>>> cyclical dependency, so forking off lets accord use our libraries.  For 
>>> the time being accord had to fork many libraries in accord to make 
>>> progress; this is a common issue right now.
>>> 
>>> 
>>> 
>>> Sent from my iPhone
>>> 
>>>> On Jun 3, 2026, at 1:45 PM, Josh McKenzie wrote:
>>>> 
>>>>> delays this effort for years as we need time to get people on board and 
>>>>> used to gradle before we flip that switch. 
>>>> Oof. I'm way more optimistic on this one; if we can get a PR that has ant 
>>>> targets as dumb wrappers that instead call gradle targets (i.e. all 
>>>> workflows and local scripting Just Work), I don't see why we couldn't 
>>>> merge that as soon as we ironed out kinks.
>>>> 
>>>> Is there anyone that's broadly against that approach? Or did I just 
>>>> misunderstand the other thread / JIRA you'd created David?
>>>> 
>>>> On Wed, Jun 3, 2026, at 1:21 PM, David Capwell wrote:
>>>>> Fair point but one thing to point out, if this work depends on gradle 
>>>>> that delays this effort for years as we need time to get people on board 
>>>>> and used to gradle before we flip that switch.  So leaving in tree means 
>>>>> we have to hand roll all that logic in ant. 
>>>>> 
>>>>> Sent from my iPhone
>>>>> 
>>>>>> On Jun 3, 2026, at 12:33 PM, Jon Haddad wrote:
>>>>>> 
>>>>>> Josh is right.  Gradle subprojects could allow this without dealing with 
>>>>>> separate repo.  I've done this before and am about to again for some 
>>>>>> stuff I maintain.  I spent a long time agonizing over this for my other 
>>>>>> projects and found it works exceptionally well, especially bc you 
>>>>>> frequently develop things that are tightly coupled.  
>>>>>> 
>>>>>> Juggling repos sucks, this solves it (imo) perfectly.
>>>>>> 
>>>>>> Jon
>>>>>> 
>>>>>> On Tue, Jun 2, 2026 at 1:18 PM Josh McKenzie wrote:
>>>>>>>> Is there a reason not to use a folder in the current repo that becomes 
>>>>>>>> its own jar?  It can even be published separately if we like?
>>>>>>> 
>>>>>>>> Mostly to decouple from Cassandra release.
>>>>>>> I *think* we could just have that .jar release on its own cadence 
>>>>>>> independently of the parent C* project.
>>>>>>> 
>>>>>>> Some of us have talked about taking this same approach to making some 
>>>>>>> code from C* available to the ecosystem (think I/O .jar that has 
>>>>>>> SSTable read/write, CommitLog read/write, etc). This feels like a very 
>>>>>>> similarly shaped thing.
>>>>>>> 
>>>>>>> I assume w/a modern build / publish / etc system we'd be able to 
>>>>>>> publish a release that represents a strict subset of the parent project 
>>>>>>> out of the repo right?
>>>>>>> 
>>>>>>> On Mon, Jun 1, 2026, at 8:18 PM, David Capwell wrote:
>>>>>>>> Mostly to decouple from Cassandra release.  If there is a feature 
>>>>>>>> added does it have to wait for the next major release of Cassandra so 
>>>>>>>> others can consume?  Even if we can get to yearly releases that’s 
>>>>>>>> still a long wait.
>>>>>>>> 
>>>>>>>> For example Alex and I have been talking about proper fuzz testing, so 
>>>>>>>> best case is a year before 3rd parties could use.
>>>>>>>> 
>>>>>>>> Sent from my iPhone
>>>>>>>> 
>>>>>>>>> On Jun 1, 2026, at 4:32 PM, Jeremiah Jordan wrote:
>>>>>>>>> 
>>>>>>>>> Does it need to be a separate repo? Is there a reason not to use a 
>>>>>>>>> folder in the current repo that becomes its own jar?  It can even be 
>>>>>>>>> published separately if we like?
>>>>>>>>> 
>>>>>>>>> -Jeremiah
>>>>>>>>> 
>>>>>>>>> On Jun 1, 2026 at 10:00:15 AM, David Capwell wrote:
>>>>>>>>>> Hi all,
>>>>>>>>>> 
>>>>>>>>>> We've discussed pulling utilities out of trunk before. I'd like to 
>>>>>>>>>> actually start.  My primary need is for test utilities so my focus 
>>>>>>>>>> is there.
>>>>>>>>>> 
>>>>>>>>>> This isn't just my need. Sidecar wants property/stateful tests but 
>>>>>>>>>> can't use ours without a published jar.
>>>>>>>>>> 
>>>>>>>>>> Proposed approach:
>>>>>>>>>> 
>>>>>>>>>> 1. Define scope — start with property/stateful test utilities
>>>>>>>>>> 2. Set up the repo and release independently of Cassandra
>>>>>>>>>> 3. ...
>>>>>>>>>> 4. Cassandra depends on the library
>>>>>>>>>> 
>>>>>>>>>> I'd focus on the fork first, before making Cassandra depend on it — 
>>>>>>>>>> keeps our builds simple and gives the lib room to stabilize. We can 
>>>>>>>>>> sort out the dependency question later (wait on releases, or use 
>>>>>>>>>> submodules?).
>>>>>>>>>> 
>>>>>>>>>> Happy to drive this if there's interest.
>>>>>>>>>> 
>>>>>>>>>> Sent from my iPhone
>>>>>>> 
>>>> 
>>
