Re: Intra-project dependencies

Henrik Ingo Mon, 16 Jan 2023 08:10:59 -0800

Hi all

I was invited to share my thoughts just as an additional and somewhat fresh
point of view...

On a high level: We talked through this with Mick and a few other
colleagues, and I/we came to the conclusion that fundamentally all of the
mentioned options 1-5 are just variations of the same problem being moved
into different places. That is to say there's complexity here that isn't
going away. This is good to recognize just so that you realize when you are
feeling that you don't quite like any of the available options, this is
why. At least for me it's somehow calming when you understand this is the
reality and you just have to face it.

It seems to me the fundamental question is, will the link from Cassandra to
Accord be a 1-1 or n-1 mapping? Superficially we would think that Accord is
a separate library and all future Cassandra versions will use the same
version of Accord. But is that really the case? Isn't it rather expected
that Cassandra 5.1, 5.2 will probably come with more and improved
functionality than what will be in 5.0? Fundamental additional
functionality like less-than-strict consistency, mvcc, and maybe one day
interactive transactions. What I'd expect to see here is then that the
separate Accord library in fact is rather closely tied to its parent
Cassandra release, and as soon as we have a 5.0 GA, we will also need a
stable Accord branch to match, while significant new development will
happen in tandem with Cassandra trunk/5.1?

If the latter scenario is more likely, then having Accord in tree seems to
be the easiest choice, because it's actually not the case that you are
maintaining three copies of the same codebase. (Any more than that's the
case for all Cassandra code.)

FWIW MongoDB does in fact use option 5: At build time there's a bash script
that copies your separate WiredTiger repository into the source tree, then
compiles. A major reason they did it this way was to support the possibility
that some modules would be closed source. Git submodules would not work - or
would at least be very annoying - for a case where the parent directory is
open source but the sub-module is not available to everyone.

But having used the MongoDB system - which apparently is also Accord's
system today - I'd say in the end it's just git submodules in a different
form: You get to choose whether to manage the library dependency with git
or a bash script.
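
To make the comparison concrete, here is a minimal sketch in Python of the
"build step grabs a pinned SHA" idea behind option 5. It is not MongoDB's
script and not the cep-15 build file; the pin file, the destination
directory and the repository URL are hypothetical names used only for
illustration.

```python
import subprocess
from pathlib import Path


def read_pinned_sha(pin_file):
    """Read a full 40-character commit SHA from a one-line pin file."""
    sha = Path(pin_file).read_text().strip().lower()
    if len(sha) != 40 or any(c not in "0123456789abcdef" for c in sha):
        raise ValueError(f"{pin_file} does not contain a full commit SHA")
    return sha


def fetch_library(repo_url, pin_file, dest, run=subprocess.run):
    """Place the library at exactly the pinned commit inside the source tree.

    `run` is injectable so the command sequence can be inspected without
    touching the network.
    """
    dest = Path(dest)
    sha = read_pinned_sha(pin_file)
    if (dest / ".git").exists():
        # Already cloned: refresh objects before moving to the pinned commit.
        commands = [["git", "-C", str(dest), "fetch", "origin"]]
    else:
        dest.parent.mkdir(parents=True, exist_ok=True)
        commands = [["git", "clone", "--no-checkout", repo_url, str(dest)]]
    # Detached checkout: the build uses the pinned commit, not a branch tip.
    commands.append(["git", "-C", str(dest), "checkout", "--detach", sha])
    for cmd in commands:
        run(cmd, check=True)
    return sha
```

Bumping the library then means changing one line in the pin file, which is
much the same bookkeeping that a submodule pointer does for you.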

Finally, and I know this was stated before as well, the Accord developers
seem hopeful that Accord will gain interest and contributors from outside
of Cassandra, and as such warrants its own repository. For argument's sake,
let's assume this is possible/likely...

I didn't write this email to support any particular alternative or opinion.
But combining the above thoughts, I feel like there is a conclusion
sticking out of this email... And the conclusion is of the form "we can
always change this later"...

It seems to me that especially now, and probably also after 5.0 is
released, we will in any case only have a single version of Cassandra using
a single version of Accord. So at least to begin with, it's the least
effort to keep it in-tree, to avoid the overhead of git submodules, or
having to make releases, etc. The separate constituency of Accord-only
developers can be satisfied by keeping Accord in its own directory (it could
even be a top-level directory) with a small build system that can build a
separate Accord jar file. You could even maintain a separate github repo
just for advertising purposes. (Just like github.com/apache/cassandra isn't
the official git repo for Cassandra either.)

If both of my assumptions above are true, then from a Cassandra point of
view there's not much benefit having Accord separately, but if 3rd party
interest in Accord grows, then it could indeed be split out into its own
repository at that point. The main motivation then would be to service
those 3rd party developers who aren't so interested in Cassandra. But this
split would only be done once it is known that such a community will form.

Thoughts?

henrik

On Mon, Jan 16, 2023 at 2:30 PM Josh McKenzie <[email protected]> wrote:

>  - permanence from a git SHA no longer exists
>
> With the caveat that I haven't worked w/submodules before and only know
> about them from a cursory search, it looks like git-submodule status would
> show us the sha for submodules and we could have parent projects reference
> specific shas to pull for submodules to build?
> https://git-scm.com/docs/git-submodule/#Documentation/git-submodule.txt-status--cached--recursive--ltpathgt82308203
>
> It seems like our use case is one of the primary ones git submodules are
> designed to address.
>
> On Mon, Jan 16, 2023, at 6:40 AM, Benedict wrote:
>
>
> I guess option 5 is what we have today in cep-15, have the build file grab
> the relevant SHA for the library. This way you maintain a precise SHA for
> builds and scripts don’t have to be modified.
>
> I believe this is also possible with git submodules, but I’m happy to bake
> this into our build file instead with a script.
>
> > As the library itself no longer has an explicit version, what I presume
> you meant by logical version.
>
> I mean that we don’t want to duplicate work and risk diverging
> functionality maintaining what is logically (meant to be) the same code. As
> a developer, managing all of the branches is already a pain. Libraries
> naturally have a different development cadence to the main project, and
> tying the development to C* versions is just an unnecessary ongoing burden
> (and risk) that we can avoid.
>
> There’s also an additional penalty: we reduce the likelihood of outside
> contributions to the libraries only. Accord in particular I hope will
> attract outside interest if it is maintained as a separate library, as it
> has broad applicability, and is likely of academic interest. Tying it to C*
> version and more tightly coupling with C* codebase makes that less likely.
> We might also see folk interested in our utilities, or our simulator
> framework, if they were to be maintained separately, which could be
> valuable.
>
>
>
>
> On 16 Jan 2023, at 10:49, Mick Semb Wever <[email protected]> wrote:
>
> 
>
> I think (4) is the only sensible option. It permits different development
> branches to easily reference different versions of a library and also to
> easily co-develop them - from within the same IDE project, even.
>
>
>
> I've only heard horror stories about submodules. The challenges they bring
> should be listed and checked.
>
> Some examples
>  - you can no longer just `git clone …`  (and we clone automatically in a
> number of places)
>  - same with `git pull …` (easy to be left with out-of-sync submodules)
>  - permanence from a git SHA no longer exists
>  - our releases get more complicated (our source tarballs are the asf
> releases)
>  - handling patches cover submodules
>  - switching branches, and using git worktrees, during dev
>
> I see (4) as a valid option, but concerned with the amount of work
> required to adapt to it, and whether it will only make it more complicated
> for the new contributor to the project. For example the first two points
> are addressed by remembering to do `git clone --recurse-submodules …` . And
> who would be fixing our build/test/release scripts to accommodate?
>
> Not blockers, just concerns we need to raise and address.
>
>
>
> We might even be able to avoid additional release votes as a matter of
> course, by compiling the library source as part of the C* release, so that
> they adopt the C* release vote (or else we may periodically release the
> library as we do other releases)
>
>
>
> Yes. Today we do a combination of first (3) and then (1). Having to make a
> release of these libraries every time a patch (/feature branch) is
> completing is a horror story in itself.
>
>
> I might be missing something, does anyone have any other bright ideas for
> approaching this problem? I’m sure there are plenty of opinions out there.
>
>
>
> Looking at the problem with these libraries,
>  - we don't need releases
>  - we don't have a clean version/branch parity to in-tree
>  - codebase parity between branches is important for upgrade tests (shared
> classloaders)
>
>  For (2) you mention drift of the "same" version, isn't this only a
> problem for dtest-api in the way it requires the "same version" of a
> codebase for compatibility when running upgrade tests? As the library
> itself no longer has an explicit version, what I presume you meant by
> logical version.
>
> To begin with, I'm leaning towards (2) because it is a cognitive re-use of
> our release branches, and the problems around classpath compatibility can
> be solved with tests. I'm sure I'm not seeing the whole picture though…
>
>
>

-- 

Henrik Ingo

