Copying Mark's posts below from the ASF Slack #solr-next-big-thing channel.

The Solr Reference Branch.
Document 1, a quick intro.
You can think of the Solr Reference Branch as a remaster of Solr. It
is not an attempt to redesign Solr or make it fancier. The goal of
the Solr Reference Branch is to be a better incarnation of the current
Apache Solr, one that will provide a base for future development and
design.
There are a variety of problems with Solr today that make it difficult
to adopt and run. This is me being as honest and objective as I can
be, though no doubt many will see it as an exaggeration or a negative
focus. I just see it as the way it is and has been; it's just taken me
a very long time to actually get all the way under the rug and find
the really hardened, nasty cockroaches burrowed in there.
1. Resource usage and management is wasteful, inefficient, buggy, and haphazard.
2. SolrCloud is not long term reliable. Exceptional cases will
frequently flummox the system, and exceptional cases are supposed to
be our wheelhouse and primary focus. Leaders will be lost and not
recover, the Overseer will go away, GC storms will hit, tight loops in
a bad case will crank up resources, and retries will be abundant and
overaggressive.
3. Our blocking and locking is generally not efficient, especially in key paths.
4. We get thread safety wrong (too often) in some important spots.
5. Distributed updates have to be added locally before they are
distributed, and then that distribution is generally inefficient,
prone to blocking and/or timeouts, and hobbled by HTTP/1.1 and by our
need to pack updates into a single request to achieve any kind of
performance, which loses proper error handling and exposes the many
rough edges of ConcurrentUpdateSolrClient.
6. Our ZooKeeper foundation code is often inefficient, buggy,
unreliable, and improperly used (we don't always use async or multi
where we should, we force reads from ZK instead of relying on watch
notifications, we don't handle session expiration as well as we
should, our algorithms are slow and buggy, we make a multitude more
calls than we should, especially on cluster startup, etc.).
7. We have circular dependencies between major classes that can start
threads in their constructors, and those threads start interacting
with the other classes before construction is complete (sketched
below, after this list).
8. Our XML handling is abysmally outdated and slow for multiple
reasons. Our heavy XPath usage is incredibly wasteful and expensive.
9. Our thread management is hard to understand at a fundamental
level, not properly tunable, not efficient, sometimes buggy, and not
always consistent.
10. Our Jetty configuration is lacking in a variety of ways,
especially around shutdown and HTTP/2.
11. The dynamic schema feature can be very expensive and is not fully thread-safe.
12. The Overseer is extremely inefficient, can be extremely slow to
stop, has a buggy leader election algorithm, doesn't handle session
expiration as well as it should, can keep trying to come back from the
dead, and the list goes on.
13. Our connection reuse is often very poor or nonexistent, and when
it is improved, it always reverts to bad or worse.
14. HTTP/1.1 is not great for our type of application in a variety of
ways that HTTP/2 solves, but we still use a lot of HTTP/1.1, our
HTTP/2 is not configured well, and the client needs some work.
15. The lifecycle of important objects is often off. Most things can
and will leak (SolrCores, SolrIndexSearchers, Directory instances,
Solr clients); some code closes objects more than once or closes
objects that don't belong to it, or closes things in a bad order.
16. There are often sleeps and/or polling loops that are an order of
magnitude slower than proper event-driven waits (the two styles are
contrasted in a sketch after this list).
17. Our tests are actually pretty unstable, and making them stable is
way, way more difficult than most people realize. I'm quite sure I've
spent much, much more time on this than anyone out there, and I can
tell you, the tests are unstable in 1,000 shifting ways that have
caused, and will continue to cause, lots of damage.
18. We don’t have good async update/search support for scaling and
better resource usage.
19. We often duplicate resources or create new pools instead of sharing.
20. We don't do tons of parallelizable stuff in parallel, and when we
do, it's inconsistent.
21. Our Collections API often fails to wait correctly for the state
it created to be ready before returning. Even when it gets this
right, a cloud client that made the request won't necessarily have
the updated state locally when the request returns. Things often
still work, but a variety of interesting and slow results are
possible.
22. We don't often look holistically at what we have built and how it
fits together, and so often there are silly things, bad fits, one-off
bad patterns, lazy attempts at something, etc.
23. Close and shutdown are inefficient and slow across a huge swath
of our object tree. These issues tend to grow, and to draw less
concern, over time.
24. There are a variety of ways and places that we can generate an
absurd amount of unnecessary garbage.
25. SolrCore reload is not fully reliable, yet it is increasingly important and increasingly used.
26. The leader election code has a variety of ugly little bugs and is
based on a recursive implementation that will eventually exhaust stack
space (an iterative alternative is sketched after this list), though
it's likely your cluster will be brought down by something else before
that is a problem, unless you hit the infinite-loop, no-one-can-be-
leader, eat-up-the-stack-as-fast-as-possible case, which should be
hard to hit these days with the leader election throttle.
27. The recovery processes, like almost everything you can imagine,
have a variety of issues, rarer bad cases, and bad effects.
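
A couple of these are easier to see in code. Here is a minimal sketch
of the constructor-started-thread problem from item 7. The names
(Peer, LeakyWatcher, SaferWatcher) are hypothetical, purely for
illustration; this is the shape of the bug, not actual Solr code.

    // Hypothetical types, not actual Solr classes.
    interface Peer { void ping(); }

    // Anti-pattern: the constructor starts a thread, so 'this' escapes
    // before construction is complete, and in the circular case the
    // peer handed in may itself still be under construction.
    class LeakyWatcher {
        private final Peer peer;

        LeakyWatcher(Peer peer) {
            this.peer = peer;
            new Thread(this::run).start(); // 'this' escapes mid-construction
        }

        private void run() { peer.ping(); } // peer may not be fully built yet
    }

    // Safer shape: finish constructing the whole object graph first,
    // then start the thread with an explicit lifecycle call.
    class SaferWatcher {
        private final Peer peer;

        SaferWatcher(Peer peer) { this.peer = peer; }

        void start() { // invoked only after both objects fully exist
            new Thread(() -> peer.ping()).start();
        }
    }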
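
And item 16's contrast between polling and event-driven waiting, as a
small self-contained Java example (a generic illustration, not code
from the branch):

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicBoolean;

    public class WaitStyles {
        public static void main(String[] args) throws InterruptedException {
            // Polling (the slow pattern): the waiter wakes at fixed
            // intervals no matter when the event actually fires.
            AtomicBoolean done = new AtomicBoolean(false);
            new Thread(() -> done.set(true)).start();
            while (!done.get()) {
                Thread.sleep(250); // up to 250ms of dead time per transition
            }

            // Event-driven: the waiter is released the instant the event
            // fires, and the timeout is explicit rather than an
            // accumulation of sleeps.
            CountDownLatch latch = new CountDownLatch(1);
            new Thread(latch::countDown).start();
            if (!latch.await(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("timed out waiting for state change");
            }
        }
    }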
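
Finally, the shape of the recursion problem from item 26.
ElectionContext and tryToBecomeLeader are hypothetical stand-ins, not
Solr's actual LeaderElector API; the point is only that a retry
written as recursion burns a stack frame per attempt, while a loop
does not.

    // Hypothetical stand-ins for illustration only.
    class ElectionContext { }

    class ElectionSketch {
        boolean tryToBecomeLeader(ElectionContext ctx) {
            return false; // placeholder; real code would consult ZK state
        }

        // Recursive retry: each failed attempt adds a stack frame, so a
        // pathological cluster state can eventually exhaust the stack.
        void joinElectionRecursive(ElectionContext ctx) {
            if (!tryToBecomeLeader(ctx)) {
                joinElectionRecursive(ctx);
            }
        }

        // Iterative retry: constant stack depth, however many attempts.
        void joinElectionIterative(ElectionContext ctx) throws InterruptedException {
            while (!tryToBecomeLeader(ctx)) {
                Thread.sleep(250); // real code would wait on a watch event instead
            }
        }
    }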
By and large, everything is inefficient and buggy and full of accepted
compromise regardless.
Interestingly, this does not make us an atypical open source Java
distributed project. But, I’m kind of a software snob, and I would not
run this thing and so I cannot work on it. What is there to do ...
The Solr Reference Branch is intended to tackle every one of those
issues, as well as about 1,000 more of varying and lesser importance.
As all of that comes together, cool stuff starts to unlock, and you
begin to see a phenomenon that is much greater than the sum of its
many, many parts.
28. Our tests have been getting better and better at stamping out the
legitimate noise they create - every scream a breadcrumb toward
badness - and we have built a scream-catching machine, though we will
never be able to catch them all, for a huge variety of deep reasons.

The Solr Reference Branch
Document 2
While the extent of the previously mentioned issues was not clear to
me (that is a deep rabbit hole), I have always, as have many others,
known the current state of things with Solr at a higher, broader
level.
So what about this effort is different? Is this not just a bunch of my
standard JIRA issues all crammed into one? Should we not break them
out properly and do things sensibly?
Well, previously, as is probably common, I was a bit lost both on
where we were exactly and, certainly, on where to find firmer ground
for real, not just the mirage always just over the hill.
I love performance and efficiency, though. I've always avoided them
as a focus with Solr and SolrCloud, thinking stability had to come
first. Having given up on stability and scale after a good 8 years or
so, completely tossing them out as a pipe dream, I started work on
something new, something really just for me. I started plugging in
HTTP/2. The effort and work needed for that, the learning, and some
of the results completely opened my eyes. I also attacked it very
differently than I have in the past: doing something I liked, for
myself, I drowned myself in it. I spent 2-3 weeks at a time here and
there sitting at the computer with intense focus for 16-20 hours a
day. The more I did, the more I found, the more I understood, the
more I discovered.
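
For context on what "plugging in HTTP/2" looks like at the client
level, here is a minimal sketch using Jetty's HTTP/2 client transport
(the Jetty 9.4-era API; the URL and endpoint are placeholders, and
this illustrates the idea rather than the branch's actual client
code):

    import org.eclipse.jetty.client.HttpClient;
    import org.eclipse.jetty.client.api.ContentResponse;
    import org.eclipse.jetty.http2.client.HTTP2Client;
    import org.eclipse.jetty.http2.client.http.HttpClientTransportOverHTTP2;

    public class Http2Probe {
        public static void main(String[] args) throws Exception {
            // HTTP/2 multiplexes many concurrent requests over one
            // connection, instead of an HTTP/1.1 connection pool.
            HttpClient client = new HttpClient(
                new HttpClientTransportOverHTTP2(new HTTP2Client()), null);
            client.start();
            try {
                ContentResponse rsp = client
                    .newRequest("http://localhost:8983/solr/admin/info/system")
                    .send();
                System.out.println("status: " + rsp.getStatus());
            } finally {
                client.stop();
            }
        }
    }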
I discovered a discovery process. It was leading me to everything I
needed to do, and I just had to follow the long, ever-flowing path,
keeping my mental models strong: re-etching, ruminating, obsessing.
I realized many test functions we have (most, really) should be
taking on the order of milliseconds instead of seconds to dozens of
seconds. I realized tons and tons of our issues and gremlins lived
and prospered in our slow and inefficient smog. I realized that if I
just spent the time to look where slowness and flakiness prevailed,
really look, like taking hours just for some random side road (build
a bridge, burn it, build one further down, and so on), then making
huge improvement after huge improvement was actually very low-hanging
fruit, just hidden by some thorns and leaves and by the lack of any
reasonable introspection into the system we have created and continue
to build.
Over time, I could see what had to be done and I could see what it
would achieve. I built different parts at different times, lost them,
and rebuilt them a different way with a different focus. I built and
expanded my introspection and measuring tools and classes.
That's a sentence trying to cover a universe, but if you want to
really boil it down even further, I'd invoke the normally faulty
broken windows theory. There is a magic in perfect windows that only
those who have them know. Can we get to perfect? I like to dream, and
there is no end to the introspection, experimentation, and
improvements to try. The perfect landing aside, though, no doubt we
can move drastically from where we are.

Another thing I learned is the crazy number of ways you can make all
the tests pass like champions and still roll into production
unusable. Which tells me that production users are a large part of
our test strategy, and that can't remain true if we want to make any
real change in a satisfactory way.

The current goal is to have a mostly usable and testable system by
mid-to-late October. Not everything will be 100%; there are some
known caveats, cleanup, and plenty left to do, but it should be in
good shape for a user to try out, given the caveats outlined.
The biggest risk currently is the absorption of the search-side async
work from master. I'm familiar with that work, I've worked on it
myself, and the code involved is derived from an old branch of mine,
but async is a whole different animal, and trying to nail it without
any downsides versus the old synchronous model is a tough nut, one
that I was already battling on the distributed-update side. So it's
good stuff to work on and do, but it's taking some effort to get it
into shape.

On Tue, Oct 6, 2020 at 8:00 PM Tomás Fernández Löbbe
<[email protected]> wrote:
>
> > Let's say we cut 9x and now there is a new master taken from the reference 
> > branch.
> I never said “make a new master”, I said merge changes in ref branch into 
> master. If things are broken into pieces like Ishan is suggesting, those 
> changes can be merged into 9.x too. I only suggested this because you felt 
> unsure about merging to master now and I guess this is due to fear of 
> introducing bugs so close to a potential 9.0 release, is that not right?
>
>
> > We will never be able to reconcile these 2 branches
> Sorry, but how is that different if we do an alpha release from the branch 
> now? What would be the process after that? Let's say people don't find issues 
> and we want to merge those changes, what’s the plan then?
>
> > Choice 1:
> I’m fine with choice 1 if that’s what you want, as long as it’s not an 
> official release for the reasons stated above.
>
>
> > I promise to do code review & cleanup as much as possible. But I'm hesitant 
> > to give a stamp of approval to make it THE official release
> What do you mean? I thought this is what you were suggesting, make an 
> official release from the reference_impl branch?
>
>
> I think Ilan’s last email is spot on, and I agree 100% with what he can 
> express much better than I can :)
>
> > Mark's descriptions in Slack go in the right way but are still too high 
> > level
> Can someone share those here? or in Jira?
>
> On Tue, Oct 6, 2020 at 5:09 AM Noble Paul <[email protected]> wrote:
>>
>> > I think the danger is high to treat this branch as a black box (or an "all 
>> > or nothing").
>>
>> True, Ilan. Ideally, I would like a few of us to study the code and
>> start pulling in changes we are confident of (even to the 8x branch,
>> why not). We cannot burden a single developer with doing everything.
>>
>> This cannot be a task for just one or two devs. We will all have to work
>> together to decompose the changes and digest them into master. I can
>> do my bit.
>>
>> But I'm sure we may hit a point where certain changes cannot be
>> isolated and absorbed. We will have to collectively make a call on how
>> to absorb them.
>>
>> On Tue, Oct 6, 2020 at 9:00 PM Ishan Chattopadhyaya
>> <[email protected]> wrote:
>> >
>> >
>> > I'm willing to help and I believe others will too if the amount of work 
>> > for contributing is reasonable (i.e. not a three months effort).
>> >
>> > I looked into the possibility of doing so. To me, it seemed that it 
>> > is very hard to do: possibly a one-year project for me. The problem is 
>> > that it is hard to pull out a particular class of improvements (say, 
>> > thread management improvements) and have all tests pass with it (because 
>> > the tests have gotten extensive improvements of their own) and also 
>> > observe the effect of the improvement. IIUC, every improvement to Solr 
>> > seemed to require many iterations to get the tests happy. I remember Mark 
>> > telling me that it may not even be possible for him to do something like 
>> > that (i.e. bring all changes into master as tiny pieces).
>> >
>> > What I volunteered to do, however, is to decompose roughly all the general 
>> > improvements into smaller, manageable commits. However, making sure all 
>> > tests pass at every commit point is beyond my capability.
>> >
>> > On Tue, 6 Oct, 2020, 3:10 pm Ilan Ginzburg, <[email protected]> wrote:
>> >>
>> >> Another option to integrate this work into the main code line would be to 
>> >> understand what changes have been made and where (Mark's descriptions in 
>> >> Slack go in the right way but are still too high level), and then port or 
>> >> even redo them in main, one by one.
>> >>
>> >> I think the danger is high to treat this branch as a black box (or an 
>> >> "all or nothing"). Using the merging itself to change our understanding 
>> >> and increase our knowledge of what was done can greatly reduce the risk.
>> >>
>> >> We do develop new features in Solr 9 without beta releasing them, so if 
>> >> we port Mark's improvements by small chunks (and maybe in the process 
>> >> decide that some should not be ported or not now) I don't see why this 
>> >> can't integrate to become like other improvements done to the code. If 
>> >> specific changes do require a beta release, do that release from master 
>> >> and pick the right moment.
>> >>
>> >> I'm willing to help and I believe others will too if the amount of work 
>> >> for contributing is reasonable (i.e. not a three months effort). This 
>> >> requires documenting the changes done in that branch, pointing to where 
>> >> these changes happened and then picking them up one by one and porting 
>> >> them more or less independently of each other. We might only port a 
>> >> subset of changes by the time 9.0 is released; that's fine, we can 
>> >> continue in following releases.
>> >>
>> >> My 2 cents...
>> >> Ilan
>> >>
>> >> On Tue, Oct 6, 2020 at 09:56, Noble Paul <[email protected]> wrote:
>> >>>
>> >>> Yes, a Docker image will definitely help. I wasn't trying to downplay 
>> >>> that.
>> >>>
>> >>> On Tue, Oct 6, 2020 at 6:55 PM Ishan Chattopadhyaya
>> >>> <[email protected]> wrote:
>> >>> >
>> >>> >
>> >>> > > Docker is not a big requirement for large scale installations. Most 
>> >>> > > of them already have their own install scripts. Availability of 
>> >>> > > docker is not important for them. If a user is only encouraged to 
>> >>> > > install Solr because of a docker image, most likely they are not 
>> >>> > > running a large enough cluster
>> >>> >
>> >>> > I disagree, Noble. Having a Docker image is going to be useful to some 
>> >>> > clients with complex use cases. Great point, David!
>> >>> >
>> >>> > On Tue, 6 Oct, 2020, 1:09 pm Ishan Chattopadhyaya, 
>> >>> > <[email protected]> wrote:
>> >>> >>
>> >>> >> As I said, I'm *personally* not confident in putting such a big 
>> >>> >> changeset into master that wasn't vetted in a real user environment 
>> >>> >> widely. I have, in the past, done enough bad things to Solr (directly 
>> >>> >> or indirectly), and I don't want to repeat the same. Also, I'd be 
>> >>> >> very uncomfortable if someone else did so.
>> >>> >>
>> >>> >> Having said this, if someone else wants to port the changes over to 
>> >>> >> master *without first getting enough real world testing*, feel free 
>> >>> >> to do so, and I can focus my efforts elsewhere.
>> >>> >>
>> >>> >> On Tue, 6 Oct, 2020, 9:22 am Tomás Fernández Löbbe, 
>> >>> >> <[email protected]> wrote:
>> >>> >>>
>> >>> >>> I was thinking (and I haven’t fleshed it out completely but will 
>> >>> >>> throw the idea out) that an alternative approach with this timeline 
>> >>> >>> could be to cut the 9x branch around November/December. Then you 
>> >>> >>> could merge into master, and it would have the latest changes from 
>> >>> >>> master plus the ref branch changes. From there, any nightly build 
>> >>> >>> could be used to help test/debug.
>> >>> >>>
>> >>> >>> That said, I don’t know for sure which changes in the branch do 
>> >>> >>> not belong in 9. The problem with them being 10x-only is that 
>> >>> >>> backports would potentially be more difficult for the whole life 
>> >>> >>> of 9.
>> >>> >>>
>> >>> >>> On Mon, Oct 5, 2020 at 4:54 PM Noble Paul <[email protected]> 
>> >>> >>> wrote:
>> >>> >>>>
>> >>> >>>> >I don't think it can be said what committers do and don't do with 
>> >>> >>>> >regards to running Solr.  All of us would answer this differently 
>> >>> >>>> >and at different points in time.
>> >>> >>>>
>> >>> >>>> " I have run it in one large cluster, so it is certified to be bug 
>> >>> >>>> free/stable" I don't think it's a reasonable approach. We need as 
>> >>> >>>> much feedback from our users because each of them stress Solr in a 
>> >>> >>>> different way. This is not to suggest that committers are not doing 
>> >>> >>>> testing or their tests are not valid. When I talk to the committers 
>> >>> >>>> out here they say they do not see any performance stability issues 
>> >>> >>>> at all. But, my client reports issues on a day to day basis.
>> >>> >>>>
>> >>> >>>>
>> >>> >>>>
>> >>> >>>> > Definitely publish a Docker image BTW -- it's the best way to try 
>> >>> >>>> > out any software.
>> >>> >>>>
>> >>> >>>> Docker is not a big requirement for large scale installations. Most 
>> >>> >>>> of them already have their own install scripts. Availability of 
>> >>> >>>> docker is not important for them. If a user is only encouraged to 
>> >>> >>>> install Solr because of a docker image, most likely they are not 
>> >>> >>>> running a large enough cluster
>> >>> >>>>
>> >>> >>>>
>> >>> >>>>
>> >>> >>>> On Tue, Oct 6, 2020, 6:30 AM David Smiley <[email protected]> 
>> >>> >>>> wrote:
>> >>> >>>>>
>> >>> >>>>> Thanks so much for your responses, Ishan... I'm getting much more 
>> >>> >>>>> information in this thread than from my attempts to get questions 
>> >>> >>>>> answered on the JIRA issue months ago. And especially, thank you 
>> >>> >>>>> for volunteering for the difficult porting efforts!
>> >>> >>>>>
>> >>> >>>>> Tomas said:
>> >>> >>>>>>
>> >>> >>>>>>  I do agree with the previous comments that calling it "Solr 10" 
>> >>> >>>>>> (even with the "-alpha") would confuse users, maybe use 
>> >>> >>>>>> "reference"? or maybe something in reference to SOLR-14788?
>> >>> >>>>>
>> >>> >>>>>
>> >>> >>>>> I have the opposite opinion.  This word "reference" is baffling to 
>> >>> >>>>> me despite whatever Mark's explanation is.  I like the 
>> >>> >>>>> justification Ishan gave for 10-alpha and I don't think I could 
>> >>> >>>>> re-phrase his justification any better.  *If* the release was 
>> >>> >>>>> _not_ official (thus wouldn't show up in the usual places anyone 
>> >>> >>>>> would look for a release), I think it would alleviate that 
>> >>> >>>>> confusion concern even more, although I think "alpha" ought to be 
>> >>> >>>>> enough of a signal not to use it without digging deeper on what's 
>> >>> >>>>> going on.
>> >>> >>>>>
>> >>> >>>>> Alex then Ishan said:
>> >>> >>>>>>
>> >>> >>>>>> > Maybe we could release it to
>> >>> >>>>>> > committers community first and dogfood it "internally"?
>> >>> >>>>>>
>> >>> >>>>>> Alex: It is meaningless. Committers don't run large scale 
>> >>> >>>>>> installations. We barely even have time to take care of running 
>> >>> >>>>>> unit tests before destabilizing our builds. We are not the right 
>> >>> >>>>>> audience. However, we all can anyway check out the branch and 
>> >>> >>>>>> start playing with it, even without a release. There are orgs 
>> >>> >>>>>> that don't want to install any code that wasn't officially 
>> >>> >>>>>> released; this release is geared towards them (to help us test 
>> >>> >>>>>> this at their scale).
>> >>> >>>>>
>> >>> >>>>>
>> >>> >>>>> I don't think it can be said what committers do and don't do with 
>> >>> >>>>> regards to running Solr.  All of us would answer this differently 
>> >>> >>>>> and at different points in time.  From time to time, though not at 
>> >>> >>>>> present, I've been well positioned to try out a new version of 
>> >>> >>>>> Solr in a stage/test environment to see how it goes.  (Putting on 
>> >>> >>>>> my Salesforce metaphorical hat...) Even though I'm not able to 
>> >>> >>>>> deploy it in a realistic way today, I'm able to run a battery of 
>> >>> >>>>> tests to see if one of the features we depend on has changed or 
>> >>> >>>>> is broken.  That's useful feedback to an alpha release!  And even 
>> >>> >>>>> though I'm saying I'm not well positioned to try out some new Solr 
>> >>> >>>>> release in a production-ish setting now, it's something I could 
>> >>> >>>>> make a good case for internally since upgrades take a lot of 
>> >>> >>>>> effort where I work.  It's in our interest for SolrCloud to be 
>> >>> >>>>> very stable (of course).
>> >>> >>>>>
>> >>> >>>>> Regardless, I think what you're driving at Ishan is that you want 
>> >>> >>>>> an "official" release -- one that goes through the whole ceremony. 
>> >>> >>>>>  You believe that people would be more likely to use it.  I think 
>> >>> >>>>> all we need to do is announce (similar to a real release) that 
>> >>> >>>>> there is some unofficial alpha distribution and that we want to 
>> >>> >>>>> solicit your feedback -- basically, help us find bugs.  Definitely 
>> >>> >>>>> publish a Docker image BTW -- it's the best way to try out any 
>> >>> >>>>> software.  I'm -0 on doing an official release for alpha software 
>> >>> >>>>> because it's unnecessary to achieve the goals and somewhat 
>> >>> >>>>> confusing.  I think the Solr 4 alpha/beta situation was different 
>> >>> >>>>> -- it was not some fork a committer was maintaining; it was the 
>> >>> >>>>> master branch of its time, and it was destined to be the very next 
>> >>> >>>>> release, not some possible future release.
>> >>> >>>>>
>> >>> >>>>> ~ David Smiley
>> >>> >>>>> Apache Lucene/Solr Search Developer
>> >>> >>>>> http://www.linkedin.com/in/davidwsmiley
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> -----------------------------------------------------
>> >>> Noble Paul
>> >>>
>>
>>
>> --
>> -----------------------------------------------------
>> Noble Paul
>>
