Interesting data, Michael. I am not sure, though, that the shared commits tell
us that there are people who contribute to both projects. Eventually, an
API change/update in Lucene will require a change in Solr (but not vice
versa). If the projects were split, those commits would still occur in both
projects, only on the Solr side they would occur when Solr upgrades to the
respective Lucene version.

I wonder if we can tell, out of the shared commits, how many started in
Lucene and ended in Solr because of the shared build (i.e. an API change
required Solr code changes for the build to pass), vs how many started in
Solr, and ended in Lucene because a core change was needed to support the
Solr feature/update. The first case does not indicate, IMO, a shared
contribution (whoever changes a Lucene API would not then go and update Solr
and Elasticsearch if the projects were split), while the second case is a
stronger indication of a shared contribution.
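
One rough way to approximate the two cases above, as a sketch only: the
project's commit subjects usually start with a JIRA key (LUCENE-1234 or
SOLR-1234), so the key could serve as a proxy for where a shared commit
started. The function names and the "lucene-driven"/"solr-driven" labels
below are made up for illustration, and the input is assumed to come from
something like `git log --pretty=%s --name-only`, parsed into a subject and
a list of paths per commit.

```python
import re
from typing import Iterable

# JIRA keys as they usually appear in commit subjects of this project.
ISSUE_KEY = re.compile(r"\b(LUCENE|SOLR)-\d+\b")


def spans_both(paths: Iterable[str]) -> bool:
    """True if the changed paths touch both top-level folders."""
    top_level = {p.split("/", 1)[0] for p in paths}
    return {"lucene", "solr"} <= top_level


def likely_origin(subject: str, paths: Iterable[str]) -> str:
    """Guess where a shared commit started (heuristic, not ground truth)."""
    if not spans_both(list(paths)):
        return "not-shared"
    keys = {m.group(1) for m in ISSUE_KEY.finditer(subject)}
    if keys == {"LUCENE"}:
        return "lucene-driven"  # e.g. an API change that forced Solr edits
    if keys == {"SOLR"}:
        return "solr-driven"    # e.g. a core change made for a Solr feature
    return "unclear"            # no key, or keys from both projects


if __name__ == "__main__":
    print(likely_origin(
        "LUCENE-0000: hypothetical API change",
        ["lucene/core/A.java", "solr/core/B.java"],
    ))
```

Commits without a key, or with keys from both projects, stay "unclear"; a
large share of those would mean this proxy is too coarse to answer the
question.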

Maybe if we could "label" committers as mostly Lucene/Solr, we could tell
more about the shared commits?
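
A minimal sketch of such labeling, assuming the two author lists are the
output of per-folder `git log --pretty=%an` runs (one name per commit). The
function name, the label strings and the 0.8 threshold are arbitrary choices
for illustration, not something derived from the data.

```python
from collections import Counter
from typing import Dict, Iterable


def label_committers(
    lucene_authors: Iterable[str],
    solr_authors: Iterable[str],
    threshold: float = 0.8,
) -> Dict[str, str]:
    """Label each author by where most of their commits landed.

    Each input has one entry per commit touching that folder, so a commit
    that spans both folders is counted once on each side.
    """
    lucene = Counter(lucene_authors)
    solr = Counter(solr_authors)
    labels = {}
    for author in lucene.keys() | solr.keys():
        total = lucene[author] + solr[author]
        if lucene[author] / total >= threshold:
            labels[author] = "mostly-lucene"
        elif solr[author] / total >= threshold:
            labels[author] = "mostly-solr"
        else:
            labels[author] = "mixed"
    return labels


if __name__ == "__main__":
    # Made-up names and counts, only to show the shape of the result.
    demo = label_committers(
        ["alice"] * 9 + ["bob"] + ["carol"] * 5,
        ["alice"] + ["bob"] * 9 + ["carol"] * 5,
    )
    for name, label in sorted(demo.items()):
        print(name, label)
```

With labels like these, each shared commit could be tallied by the label of
its author, which would show whether the shared commits come mostly from
"mostly-lucene" people, "mostly-solr" people, or the mixed group.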

Anyway, data is good, I agree.

Shai

On Mon, May 4, 2020 at 5:49 PM Michael Sokolov <msoko...@gmail.com> wrote:

> I always like to look at data when making a big decision, so I
> gathered some statistics about authors and commits to git over the
> history of the project. I wanted to see what these statistics could
> tell us about the degree of overlap between the two projects and
> whether it has changed over time. Using commands like
>
>      git log --pretty=%an --since=2012 -- lucene
>      git log --pretty=%an --since=2012 -- solr
>
> I looked at the authors of commits in the lucene and solr top-level
> folders of the project. I think this makes a reasonable proxy for
> contributors to the two projects. From there I found that since 2012,
> there are 60 Lucene-only authors, 71 Solr-only authors, and 101
> authors (or 43%) contributing at least one commit to each project.
> Since 2018, the percentage of both-project authors is somewhat lower:
> 36%.
>
> I also looked at commits spanning both projects. I'm not sure this
> captures all the work that touches both projects, but it's a window
> into that, at least. I found that since 2012, 1387/19063 (6.8%) of
> commits spanned both project folders. Since 2018, 7.4% did.
>
> I don't think you can really draw very many meaningful conclusions
> from this, but a few things jump out: First, it is clear that these
> projects are not completely separate today. A substantial number of
> people commit to both, over time, although most people do not. Also,
> relatively few commits span both projects. Some do though, and it's
> certainly worth considering what the workflow for such changes would
> be like in the split world. Maybe a majority of these are
> build-related; it's hard to tell from this coarse analysis.
>
>
> On Mon, May 4, 2020 at 5:11 AM Dawid Weiss <dawid.we...@gmail.com> wrote:
> >
> > Dear Lucene and Solr developers!
> >
> > A few days ago, I initiated a discussion among PMC members about
> > potential pros and cons of splitting the project into separate Lucene
> > and Solr entities by promoting Solr to its own top-level Apache
> > project (TLP). Let me share with you the motivation for such an action
> > and some follow-up thoughts I heard from other PMC members so far.
> >
> > Please read this e-mail carefully. Both the PMC and I look forward to
> > hearing your opinion. This is a DISCUSS thread and it will be followed
> > next week by a VOTE thread. This is our shared project and we should
> > all shape its future responsibly.
> >
> > The big question is this: “Is this the right time to split Solr and
> > Lucene into two independent projects?”.
> >
> > Here are several technical considerations that drove me to ask the
> > question above (in no order of priorities):
> >
> > 1) Precommit/test times. These are crazy high. If we split into two
> > projects we can pretty much cut all of Lucene testing out of Solr (and
> > vice versa), making development a bit more fun again.
> >
> > 2) Build system itself and source release packaging. The current
> > combined codebase is a *beast* to maintain. Working with gradle on
> > both projects at once made me realise how little the two have in
> > common. The code layout, the dependencies, even the workflow of people
> > working on these projects... The build (both ant and gradle) is full
> > of Solr and Lucene-specific exceptions and hooks that could be more
> > elegantly solved if moved to each project independently.
> >
> > 3) Packaging. There is no single source distribution package for
> > Solr+Lucene. They are already "independent" there. Why should Lucene
> > and Solr always be released at the same pace? Does it always make
> > sense?
> >
> > 4) Solr is essentially taking in Lucene and its dependencies as a
> > whole (so is Elasticsearch and many other projects). In my opinion
> > this makes Lucene eligible for refactoring and maintenance as a separate
> > component. The learning curve for people
> > coming to each project separately is going to be gentler than trying
> > to dive into the combined codebase.
> >
> > 5) Mailing lists, build servers. Mailing lists for users are already
> > separated. I think this is yet another indication that Solr is
> > something more than a component within Lucene. It is perceived as an
> > independent entity and used as an independent product. I would really
> > like to have separate mailing lists for these two projects (this
> > includes build and test results) as it would make life easier: if your
> > focus is more on Lucene (or Solr), you would only need to track half
> > of the current traffic.
> >
> >
> > As I already mentioned, the discussion among PMC members highlighted
> > some initial concerns and reasons why the project should perhaps
> > remain glued together. These are outlined below with some of the
> > counter-arguments presented under each concern to avoid repetition of
> > the same content from the PMC mailing list (they’re copied from the
> > private discussion list).
> >
> > 1) Both projects may gradually go their separate ways after the split
> > and even develop “against” each other like it used to be before the
> > merge.
> >
> > Whether this is a legitimate concern is hard to tell. If Solr goes TLP
> > then all existing Lucene committers will automatically become Solr
> > committers (unless they opt not to) so there will be both procedural
> > ways to prevent this from happening (vetoes) as well as common-sense
> > reasons to just cooperate.
> >
> > 2) Some people like parallel version numbering (concurrent Solr and
> > Lucene releases) as it gives instant clarity which Solr version uses
> > which version of Lucene.
> >
> > This can still be done on the Solr side (it is Solr’s decision to adopt
> > any versioning scheme the project feels comfortable with). I
> > personally (DW) think this kind of versioning is actually more
> > confusing than helpful; Solr should have its own cadence of releases
> > driven by features, not sub-component changes. If backwards
> > compatibility is a factor then a solution might be to sync on major
> > version releases only (e.g., this is how Elasticsearch is handling
> > this).
> >
> > 3) Solr tests are the first “battlefield” test zone for Lucene changes
> > - if Solr becomes a TLP this part will be gone.
> >
> > Yes, true. But realistically Solr will have to adopt some kind of
> > snapshot-based dependency on Lucene anyway (whether as a git submodule
> > or a maven snapshot dependency). So if there are bugs in Lucene they
> > will still be detected by Solr tests (and fairly early).
> >
> > 4) Why split now if we merged in the first place?
> >
> > Some of you may wonder why split the project that was initially
> > *merged* from two independent codebases (around 10 years ago). In
> > short, there was a lot of code duplication and interaction between
> > Solr and Lucene back then, with patches flying back and forth.
> > Integration into a single codebase seemed like a great idea to clean
> > things up and make things easier. In many ways this is exactly what
> > did happen: we have cleaned up code dependencies and reusable
> > components (on Lucene side) consumed by not just Solr but also other
> > projects (downstream from Lucene).
> >
> > The situation we find ourselves in now is different from what it was
> > before: recent and ongoing development for the most part falls within
> > Solr or Lucene exclusively.
> >
> >
> > This e-mail is for discussing the idea and presenting arguments/
> > counter-arguments for or against the split. It will be followed by a
> > separate VOTE thread e-mail next Monday. If the vote passes then there
> > are many questions about how this process should be arranged and
> > orchestrated. There are past examples even within Lucene [1] that we
> > can learn from, and there are people who know how to do it. The
> > actual process is of lesser concern at the moment; what we mostly want
> > to do is reach out to you, signal the idea and ask for your
> > opinion. Let us know what you think.
> >
> > [1] https://lists.apache.org/thread.html/15bf2dc6d6ccd25459f8a43f0122751eedd3834caa31705f790844d7%401270142638%40%3Cuser.nutch.apache.org%3E
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >
