Re: Solr Alpha (EA) release of Reference Branch

Anshum Gupta Tue, 06 Oct 2020 19:46:13 -0700

Thanks for initiating this discussion, Ishan.

For the sake of making sure that we are all on the same page, let me
summarize my understanding and take on this thread.


The current situation
Mark has a reference branch that, according to the folks who have looked at
it, is a much better, improved, reliable, and sustainable version of the
current master, i.e. it takes the same baseline and makes it better. We
would like to get those changes into the project, but aren’t sure how to do
so. Releasing the branch as an alpha release, once it’s ready to go, would
allow users to test it.

1. Is releasing the branch officially going to help us achieve the goal of
having a well-tested branch?
2. Assuming #1 is true, do we as a community want to release the branch
officially and assume responsibility?
3. What is our path forward after the release, i.e. do we merge the branch
into master or swap out the current master?

What do we plan to do (options).
I feel there is a consensus on everyone wanting the best for the project
and wanting Mark’s changes released.

#1 - There are differing opinions, and I personally think we can have our
test harnesses test the new branch, but I think most companies running Solr
at scale would have concerns with taking up an alpha release and deploying
it in production. The various tests that a bunch of folks are working on are
our best bet at testing out the branch, in which case I’m not sure if we
want an official release.

#2 - I feel that having an official release and having artifacts show up in
Maven Central will confuse people. The 4.0 alpha release was very different
in the sense that it was the same branch, and the code wasn’t replacing
anything existing but introducing a completely new feature, i.e. SolrCloud.

#3 - I’m still unclear on how these changes will be released in terms of
the community consensus. I’ve tried to merge parts of Mark’s effort from
another time into master, but it’s very difficult, almost impossible, to
isolate and extract commits on the basis of coverage/features/etc. This is
a lot of really great effort and after having spoken with Mark multiple
times, I really feel we should figure out a way to absorb this but I do
have concerns around replacing the master branch completely.

While I do like the idea that Tomás proposed, I also feel that maintaining
and managing cherry-picking across 9x, master, and ref branch will only
make it difficult for people to work through the duration of 9x.

I haven’t looked at the current ref branch recently, but to the folks who
have: if you think that this code can be merged into master, even as big
chunks, that’d be the most confidence-building way forward.




On Tue, Oct 6, 2020 at 11:37 AM Ilan Ginzburg <ilans...@gmail.com> wrote:

> Copying below Mark's posts from ASF Slack #solr-next-big-thing channel.
>
> The Solr Reference Branch.
> Document 1, a quick intro.
> You can think of the Solr Reference Branch as a remaster of Solr. It
> is not an attempt to redesign Solr or make it more fancy. The goal of
> the Solr Reference Branch is to be a better incarnation of the current
> Apache Solr, which will provide a base for future development and
> design.
> There are a variety of problems with Solr today that make it difficult
> to adopt and run. This is me being as honest and objective as I can
> be, though no doubt, many will see it as an exaggeration or negative
> focus. I just see it as the way it is and has been, it's just taken me
> a real long time to actually get all the way under the rug to find the
> really hardened nasty cockroaches burrowed in there.
> 1. Resource usage and management is wasteful, inefficient, buggy, and
> haphazard.
> 2. SolrCloud is not long term reliable. Exceptional cases will
> frequently flummox the system, and exceptional cases are supposed to
> be our wheelhouse and primary focus. Leaders will be lost and not
> recover, the Overseer will go away, GC storms will hit, tight loops in
> a bad case will crank up resources, and retries will be abundant and
> overaggressive.
> 3. Our blocking and locking is generally not efficient, especially in key
> paths.
> 4. We get thread safety wrong (too often) in some important spots.
> 5. Distributed updates have to be added locally before they are
> distributed, and then that distribution is generally inefficient,
> prone to blocking and/or timeouts, and hobbled by HTTP1.1 and our need
> for packing updates into a single request to achieve any kind of
> performance, losing proper error handling and eating the many rough
> edges of the ConcurrentUpdateSolrClient.
> 6. Our Zookeeper foundation code is often inefficient, buggy,
> unreliable, and improperly used (we don’t always use async or multi
> where we should, we force updates from zk instead of being notified,
> we don’t handle session expiration as well as we should, our
> algorithms are slow and buggy, we make a multitude more calls than we
> should (especially on cluster startup), etc, etc)
> 7. We have circular dependencies between major classes that can start
> threads in their constructors that start interacting with the other
> classes before construction is complete.
> 8. Our XML handling is abysmally outdated and slow for multiple
> reasons. Our heavy XPath usage is incredibly wasteful and expensive.
> 9. Our thread management is not understandable, not properly tunable,
> not efficient, sometimes buggy, not always consistent, and difficult
> to understand fundamentally.
> 10. Our Jetty configuration is lacking in a variety of ways,
> especially around shutdown and http2.
> 11. The dynamic schema feature can be very expensive and not fully thread
> safe.
> 12. The Overseer is extremely inefficient, can be extremely slow to
> stop, had a buggy leader election algorithm, doesn’t handle session
> expiration as well as it should, can keep trying to come back from the
> dead, and the list goes on.
> 13. Our connection reuse is often very poor or nonexistent; when
> it’s improved, it always reverts back to bad or worse.
> 14. HTTP1.1 is not great for our type of application in a variety of
> ways that HTTP2 solves – but we still use a lot of HTTP1.1 and HTTP2
> is not configured well and the client needs some work.
> 15. Lifecycle of important objects is often off, most things can and
> will leak (SolrCores, SolrIndexSearchers, Directory’s, Solr clients),
> some things will close objects more than once or that don’t belong to
> them, or close things in a bad order.
> 16. There are often sleeps and/or polling that are an order of magnitude
> slower than proper event driven waits.
> 17. Our tests are actually pretty unstable and making them stable is
> way, way more difficult than most people realize. I’m quite sure I’ve
> spent much, much more time on this than anyone out there, and I can
> tell you, the tests are not stable in 1,000 shifting ways that have
> and will continue to cause lots of damage.
> 18. We don’t have good async update/search support for scaling and
> better resource usage.
> 19. We often duplicate resources or create new pools instead of sharing.
> 20. We don’t do tons of parallelizable stuff in parallel, when we do
> it’s inconsistent.
> 21. Our Collections API can often not wait correctly for the proper
> state for what it did to be ready before returning. Even if it gets it
> right, a cloud client that made the request won’t necessarily have the
> updated state local when the request returns. Things often still work,
> but with a variety of interesting and slow results possible.
> 22. We don’t often holistically look at what we have built and how it
> fits together and so often there are silly things, bad fits, one off
> bad patterns, lazy attempts at something, etc.
> 24. Close and shutdown are inefficient and slow across a huge swath
> of our object tree. These issues tend to be growy and breed less
> concern over time.
> 25. There are a variety of ways and places that we can generate an
> absurd amount of unnecessary garbage.
> 26. SolrCore reload is not fully reliable and increasingly important and
> used.
> 27. The leader election code has a variety of ugly little bugs and is
> based on a recursive implementation that will eventually exhaust stack
> space – though it’s likely your cluster will be brought down by
> something else before that is a problem (unless you hit the infinite
> loop, no one can be leader, eat up the stack as fast as possible case
> – which should be hard these days with the leader election throttle).
> 28. The recovery processes, like almost everything you can imagine,
> have a variety of issues and rarer bad cases and effects.
> By and large, everything is inefficient and buggy and full of accepted
> compromise regardless.
> Interestingly, this does not make us an atypical open source Java
> distributed project. But, I’m kind of a software snob, and I would not
> run this thing and so I cannot work on it. What is there to do ...
> The Solr Reference Branch is intended to tackle every one of those
> issues. As well as about 1000+ more of varying and lesser importance.
> As all of that comes together, cool stuff starts to unlock, and you
> begin to see some phenomena that is together much greater than the sum
> of its many, many parts.
> 29. Our tests have been getting better and better at stamping out the
> legit noise they create - every scream a breadcrumb towards badness -
> but we have built a scream catching machine - though we will never be
> able to catch them all for a huge variety of deep reasons.
>
> The Solr Reference Branch
> Document 2
> While the extent of the previously mentioned issues was not clear to
> me, that is a deep rabbit hole, I’ve always, as have many others,
> known the current state of things with Solr at a higher, broader
> level.
> So what about this effort is different? Is this not just a bunch of my
> standard JIRA issues all crammed into one? Should we not break them
> out proper and do things sensible?
> Well, previously, as is probably common, I was both a bit lost on
> where we were exactly and certainly on where to find firmer ground for
> real, not just the mirage always just over the hill.
> I love performance and efficiency though. I’ve always avoided it as a
> focus with Solr and SolrCloud, thinking stability has to come first.
> Having given up on stability and scale after a good 8 years or
> something, completely tossed out as a pipe dream, I started work on
> something new, something really just for me. I started plugging in
> HTTP2. And the effort and work needed for that and the learning and
> some of the results, completely opened my eyes. I also attacked very
> different than I have in the past, doing something I like for me, I
> drowned myself in it. Spent 2-3 weeks at a time here and there sitting
> at the computer with intense focus for 16-20 hours a day. The more I
> did, the more I found, the more I understood, the more I discovered.
> I discovered a discovery process. It was leading me to everything I
> needed to do and I just had to follow the long, ever flowing path,
> keeping my mental models strong, re etching, ruminating, obsessing.
> I realized many test functions we have – most- should be taking on the
> order of milliseconds instead of seconds to dozens of seconds. I
> realized tons and tons of our issues and gremlins lived and prospered
> in our slow and inefficient smog. I realized that if I just spent the
> time to look where slowness and flakiness prevailed, really look, like
> take hours just for some random side road - build a bridge, burn it,
> and build one further down, etc, etc – that making huge improvement
> after huge improvement was actually very low hanging fruit, just
> hidden by some thorns and leaves and lack of any reasonable
> introspection into the system we have created and continue to build.
> Over time, I could see what had to be done and I could see what it
> would achieve. I built different parts at different times, lost them
> and rebuilt them a different way with different focus. I built and
> expanded my introspection and measuring tools and classes.
> That’s a sentence trying to cover a universe, but if you want to
> really boil it down even further, I’d invoke the normally faulty
> broken windows theory. There is magic in perfect windows that only
> those that have them know. Can we get perfect? I like to dream and
> there is no end to the introspection, experimentation, and
> improvements to try. The perfect landing aside though, no doubt we can
> move drastically from where we are.
>
> Another thing I learned is the crazy number of ways you can make all
> the tests pass like champions, and roll into production unusable.
> Which tells me that production users are a large part of our test
> strategy, and that can’t be to make any real change in a satisfactory
> way.
>
> The current goal is to have a mostly usable and testable system by
> mid-late October. Not everything 100%, some known caveats and cleanup
> and plenty to do, but it should be in good shape for a user to try out
> given the caveats outlined.
> The biggest risk currently is the absorption of the search side async
> work from master - I'm familiar with that, I've worked on it myself,
> the code involved is derived from an old branch of mine, but async is
> a whole different animal and trying to nail it without any downsides
> to the old synchronous model is a tough nut,
> one that I was already battling on the dist update side, so it's good
> stuff to work on and do, but it's taking some effort to get in shape.
>
> On Tue, Oct 6, 2020 at 8:00 PM Tomás Fernández Löbbe
> <tomasflo...@gmail.com> wrote:
> >
> > > Let's say we cut 9x and now there is a new master taken from the
> reference branch.
> > I never said “make a new master”, I said merge changes in ref branch
> into master. If things are broken into pieces like Ishan is suggesting,
> those changes can be merged into 9.x too. I only suggested this because you
> felt unsure about merging to master now and I guess this is due to fear of
> introducing bugs so close to a potential 9.0 release, is that not right?
> >
> >
> > > We will never be able to reconcile these 2 branches
> > Sorry, but how is that different if we do an alpha release from the
> branch now? What would be the process after that? Let's say people don't
> find issues and we want to merge those changes, what’s the plan then?
> >
> > > Choice 1:
> > I’m fine with choice 1 if that’s what you want, as long as it’s not an
> official release for the reasons stated above.
> >
> >
> > > I promise to do code review & cleanup as much as possible. But I'm
> hesitant to give a stamp of approval to make it THE official release
> > What do you mean? I thought this is what you were suggesting, make an
> official release from the reference_impl branch?
> >
> >
> > I think Ilan’s last email is spot on, and I agree 100% with what he can
> express much better than I can :)
> >
> > > Mark's descriptions in Slack go in the right way but are still too
> high level
> > Can someone share those here? or in Jira?
> >
> > On Tue, Oct 6, 2020 at 5:09 AM Noble Paul <noble.p...@gmail.com> wrote:
> >>
> >> > I think the danger is high to treat this branch as a black box (or an
> "all or nothing").
> >>
> >> True Ilan.  Ideally, I would like a few of us to study the code &
> >> start pulling in changes we are confident of (even to 8x branch, why
> >> not). We cannot burden a single developer to do everything.
> >>
> >> This cannot be a task just for one or 2 devs. We all will have to work
> >> together to decompose the changes and digest them into master. I can
> >> do my bit.
> >>
> >> But, I'm sure we may hit a point where certain changes cannot be
> >> isolated and absorbed. We will have to collectively make a call, how
> >> to absorb them
> >>
> >> On Tue, Oct 6, 2020 at 9:00 PM Ishan Chattopadhyaya
> >> <ichattopadhy...@gmail.com> wrote:
> >> >
> >> >
> >> > I'm willing to help and I believe others will too if the amount of
> work for contributing is reasonable (i.e. not a three months effort).
> >> >
> >> > I looked into the possibility of doing so. To me, it seemed to be
> that it is very hard to do so: possibly 1 year project for me. Problem is
> that it is hard to pull out a particular class of improvements (say thread
> management improvement) and have all tests pass with it (because tests have
> gotten extensive improvements of their own) and also observe the effect of
> the improvement. IIUC, every improvement to Solr seemed to require many
> iterations to get the tests happy. I remember Mark telling me that it may
> not even be possible for him to do something like that (i.e. bring all
> changes into master as tiny pieces).
> >> >
> >> > What I volunteered to do, however, is to decompose roughly all the
> general improvements into smaller, manageable commits. However, making sure
> all tests pass at every commit point is beyond my capability.
> >> >
> >> > On Tue, 6 Oct, 2020, 3:10 pm Ilan Ginzburg, <ilans...@gmail.com>
> wrote:
> >> >>
> >> >> Another option to integrate this work into the main code line would
> be to understand what changes have been made and where (Mark's descriptions
> in Slack go in the right way but are still too high level), and then port
> or even redo them in main, one by one.
> >> >>
> >> >> I think the danger is high to treat this branch as a black box (or
> an "all or nothing"). Using the merging itself to change our understanding
> and increase our knowledge of what was done can greatly reduce the risk.
> >> >>
> >> >> We do develop new features in Solr 9 without beta releasing them, so
> if we port Mark's improvements by small chunks (and maybe in the process
> decide that some should not be ported or not now) I don't see why this
> can't integrate to become like other improvements done to the code. If
> specific changes do require a beta release, do that release from master and
> pick the right moment.
> >> >>
> >> >> I'm willing to help and I believe others will too if the amount of
> work for contributing is reasonable (i.e. not a three months effort). This
> requires documenting the changes done in that branch, pointing to where
> these changes happened and then picking them up one by one and porting them
> more or less independently of each other. We might only port a subset of
> changes by the time 9.0 is released, that's fine we can continue in
> following releases.
> >> >>
> >> >> My 2 cents...
> >> >> Ilan
> >> >>
> >> >> Le mar. 6 oct. 2020 à 09:56, Noble Paul <noble.p...@gmail.com> a
> écrit :
> >> >>>
> >> >>> Yes, A docker image will definitely help. I wasn't trying to
> downplay that
> >> >>>
> >> >>> On Tue, Oct 6, 2020 at 6:55 PM Ishan Chattopadhyaya
> >> >>> <ichattopadhy...@gmail.com> wrote:
> >> >>> >
> >> >>> >
> >> >>> > > Docker is not a big requirement for large scale installations.
> Most of them already have their own install scripts. Availability of docker
> is not important for them. If a user is only encouraged to install Solr
because of a docker image, most likely they are not running a large enough
> cluster
> >> >>> >
> >> >>> > I disagree, Noble. Having a docker image is going to be useful to
> some clients, with complex usecases. Great point, David!
> >> >>> >
> >> >>> > On Tue, 6 Oct, 2020, 1:09 pm Ishan Chattopadhyaya, <
> ichattopadhy...@gmail.com> wrote:
> >> >>> >>
> >> >>> >> As I said, I'm *personally* not confident in putting such a big
> changeset into master that wasn't vetted in a real user environment widely.
> I have, in the past, done enough bad things to Solr (directly or
> indirectly), and I don't want to repeat the same. Also, I'll be very
> uncomfortable if someone else did so.
> >> >>> >>
> >> >>> >> Having said this, if someone else wants to port the changes over
> to master *without first getting enough real world testing*, feel free to
> do so, and I can focus my efforts elsewhere.
> >> >>> >>
> >> >>> >> On Tue, 6 Oct, 2020, 9:22 am Tomás Fernández Löbbe, <
> tomasflo...@gmail.com> wrote:
> >> >>> >>>
> >> >>> >>> I was thinking (and I haven’t fleshed it out completely but
> will throw the idea) that an alternative approach with this timeline could
> be to cut 9x branch around November/December? And then you could merge into
master, it would have the latest changes from master plus the ref branch
changes. From there any nightly build could be used to help test/debug.
> >> >>> >>>
> >> >>> >>> That said I don’t know for sure what are the changes in the
> branch that do not belong in 9. The problem with them being 10x only is
> that backports would potentially be more difficult for all the life of 9.
> >> >>> >>>
> >> >>> >>> On Mon, Oct 5, 2020 at 4:54 PM Noble Paul <noble.p...@gmail.com>
> wrote:
> >> >>> >>>>
> >> >>> >>>> >I don't think it can be said what committers do and don't do
> with regards to running Solr.  All of us would answer this differently and
> at different points in time.
> >> >>> >>>>
> >> >>> >>>> " I have run it in one large cluster, so it is certified to be
> bug free/stable" I don't think it's a reasonable approach. We need as much
feedback as possible from our users because each of them stresses Solr in a
different way. This is not to suggest that committers are not doing testing or their
> tests are not valid. When I talk to the committers out here they say they
> do not see any performance stability issues at all. But, my client reports
> issues on a day to day basis.
> >> >>> >>>>
> >> >>> >>>>
> >> >>> >>>>
> >> >>> >>>> > Definitely publish a Docker image BTW -- it's the best way
> to try out any software.
> >> >>> >>>>
> >> >>> >>>> Docker is not a big requirement for large scale installations.
> Most of them already have their own install scripts. Availability of docker
> is not important for them. If a user is only encouraged to install Solr
because of a docker image, most likely they are not running a large enough
> cluster
> >> >>> >>>>
> >> >>> >>>>
> >> >>> >>>>
> >> >>> >>>> On Tue, Oct 6, 2020, 6:30 AM David Smiley <dsmi...@apache.org>
> wrote:
> >> >>> >>>>>
> >> >>> >>>>> Thanks so much for your responses Ishan... I'm getting much
> more information in this thread than my attempts to get questions answered
> on the JIRA issue months ago.  And especially,  thank you for volunteering
> for the difficult porting efforts!
> >> >>> >>>>>
> >> >>> >>>>> Tomas said:
> >> >>> >>>>>>
> >> >>> >>>>>>  I do agree with the previous comments that calling it "Solr
> 10" (even with the "-alpha") would confuse users, maybe use "reference"? or
> maybe something in reference to SOLR-14788?
> >> >>> >>>>>
> >> >>> >>>>>
> >> >>> >>>>> I have the opposite opinion.  This word "reference" is
> baffling to me despite whatever Mark's explanation is.  I like the
> justification Ishan gave for 10-alpha and I don't think I could re-phrase
> his justification any better.  *If* the release was _not_ official (thus
> wouldn't show up in the usual places anyone would look for a release), I
> think it would alleviate that confusion concern even more, although I think
> "alpha" ought to be enough of a signal not to use it without digging deeper
> on what's going on.
> >> >>> >>>>>
> >> >>> >>>>> Alex then Ishan said:
> >> >>> >>>>>>
> >> >>> >>>>>> > Maybe we could release it to
> >> >>> >>>>>> > committers community first and dogfood it "internally"?
> >> >>> >>>>>>
> >> >>> >>>>>> Alex: It is meaningless. Committers don't run large scale
> installations. We barely even have time to take care of running unit tests
> before destabilizing our builds. We are not the right audience. However, we
> all can anyway check out the branch and start playing with it, even without
> a release. There are orgs that don't want to install any code that wasn't
> officially released; this release is geared towards them (to help us test
> this at their scale).
> >> >>> >>>>>
> >> >>> >>>>>
> >> >>> >>>>> I don't think it can be said what committers do and don't do
> with regards to running Solr.  All of us would answer this differently and
> at different points in time.  From time to time, though not at present,
> I've been well positioned to try out a new version of Solr in a stage/test
> environment to see how it goes.  (Putting on my Salesforce metaphorical
> hat...) Even though I'm not able to deploy it in a realistic way today, I'm
> able to run a battery of tests to see if one of the features we depend on
has changed or is broken.  That's useful feedback to an alpha release!
> And even though I'm saying I'm not well positioned to try out some new Solr
> release in a production-ish setting now, it's something I could make a good
> case for internally since upgrades take a lot of effort where I work.  It's
> in our interest for SolrCloud to be very stable (of course).
> >> >>> >>>>>
> >> >>> >>>>> Regardless, I think what you're driving at Ishan is that you
> want an "official" release -- one that goes through the whole ceremony.
> You believe that people would be more likely to use it.  I think all we
> need to do is announce (similar to a real release) that there is some
> unofficial alpha distribution and that we want to solicit your feedback --
> basically, help us find bugs.  Definitely publish a Docker image BTW --
> it's the best way to try out any software.  I'm -0 on doing an official
> release for alpha software because it's unnecessary to achieve the goals
> and somewhat confusing.  I think the Solr 4 alpha/beta situation was
> different -- it was not some fork a committer was maintaining; it was the
> master branch of its time, and it was destined to be the very next release,
> not some possible future release.
> >> >>> >>>>>
> >> >>> >>>>> ~ David Smiley
> >> >>> >>>>> Apache Lucene/Solr Search Developer
> >> >>> >>>>> http://www.linkedin.com/in/davidwsmiley
> >> >>>
> >> >>>
> >> >>>
> >> >>> --
> >> >>> -----------------------------------------------------
> >> >>> Noble Paul
> >> >>>
> >> >>>
> ---------------------------------------------------------------------
> >> >>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> >>> For additional commands, e-mail: dev-h...@lucene.apache.org
> >> >>>
> >>
> >>
> >> --
> >> -----------------------------------------------------
> >> Noble Paul
> >>
> >>
>
>

-- 
Anshum Gupta
