Thanks for the follow-up, Marc. I'm not familiar with this part of the code,
but reading through the original issue that prompted the change, the rationale
was to avoid a memory leak caused by a thread local. The LRU cache has
synchronized blocks sprinkled all over it - again, I haven't measured it, but
that overhead will likely be there regardless of the cache size.

That said, it seems you can plug in your own cache implementation, similar to
what the tests do - see TestDirectoryTaxonomyWriter.java or
TestConcurrentFacetedIndexing.java. That would let you restore the previous
implementation (or something even more fine-tuned to your needs).
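
For illustration, here's a rough, untested sketch of what I mean - the path
and the cache size below are placeholders, and you'd want to double-check the
constructor against the version you're on, but DirectoryTaxonomyWriter should
accept any TaxonomyWriterCache this way:

import java.nio.file.Paths;

import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyWriter;
import org.apache.lucene.facet.taxonomy.writercache.LruTaxonomyWriterCache;
import org.apache.lucene.facet.taxonomy.writercache.TaxonomyWriterCache;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LargeTaxoCacheSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical taxonomy directory; use whatever your indexer already opens.
    Directory taxoDir = FSDirectory.open(Paths.get("/tmp/taxo"));

    // Size the LRU cache above your number of distinct facet labels so it
    // effectively never evicts; 4_000_000 is only a placeholder value.
    TaxonomyWriterCache cache = new LruTaxonomyWriterCache(4_000_000);

    // Any TaxonomyWriterCache implementation (including a custom one, as the
    // tests mentioned above do) can be passed through this constructor.
    DirectoryTaxonomyWriter taxoWriter =
        new DirectoryTaxonomyWriter(taxoDir, OpenMode.CREATE_OR_APPEND, cache);

    // ... build documents with FacetsConfig.build(taxoWriter, doc) as usual ...

    taxoWriter.close();
    taxoDir.close();
  }
}

A custom implementation would just implement the TaxonomyWriterCache interface
and be handed to that same constructor.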

Dawid



On Mon, Apr 22, 2024 at 10:29 PM Marc Davenport
<madavenp...@cargurus.com.invalid> wrote:

> Hello,
> I've done a bisect between 9.4.2 and 9.5 and found the PR affecting my
> particular setup: https://github.com/apache/lucene/pull/12093
> This is the switch from UTF8TaxonomyWriterCache to an
> LruTaxonomyWriterCache. I don't see a way to control the size of this
> cache so that it never evicts items, matching the previous behavior.
> Marc
>
>
> On Fri, Apr 19, 2024 at 4:39 PM Marc Davenport <madavenp...@cargurus.com>
> wrote:
>
> > Hello,
> > Thanks for the leads. I haven't yet gone as far as doing a git bisect, but
> > I have found that the big jump in time is in the call to
> > facetsConfig.build(taxonomyWriter, doc);  I made a quick and dirty
> > instrumented version of the FacetsConfig class and found that calls to
> > TaxonomyWriter.add(FacetLabel) are significantly slower for me.
> >
> >
> >
> > https://github.com/apache/lucene/blob/releases/lucene/9.5.0/lucene/facet/src/java/org/apache/lucene/facet/FacetsConfig.java#L383
> >
> > I don't know what is special about my documents that would explain why I'm
> > seeing this change. I'm going to start dropping groups of our facets from
> > the documents to see if there is some threshold I'm hitting. I'll probably
> > start with our hierarchies, which are not particularly large but are the
> > most suspect.
> >
> > Thanks for any input,
> > Marc
> >
> > Time (ms) per document
> >
> > 9.4.2
> > facetConfig.build   : 0.9882365
> > Taxo Add            : 0.8334876
> >
> > 9.5
> > facetConfig.build   : 11.037549
> > Taxo Add            : 10.915726
> >
> > On Fri, Apr 19, 2024 at 2:56 AM Dawid Weiss <dawid.we...@gmail.com>
> > wrote:
> >
> >> Hi Marc,
> >>
> >> You could try running git bisect on the Lucene repository to pinpoint the
> >> commit that caused what you're observing. It'll take some time to build,
> >> but it's a logarithmic bisection and you'd know for sure where the
> >> problem is.
> >>
> >> D.
> >>
> >> On Thu, Apr 18, 2024 at 11:16 PM Marc Davenport
> >> <madavenp...@cargurus.com.invalid> wrote:
> >>
> >> > Hi Adrien et al,
> >> > I've been doing some investigation today and it looks like whatever the
> >> > change is, it happens between 9.4.2 and 9.5.0.
> >> > I made a smaller test setup for our code that mocks our documents and
> >> > just runs through the indexing portion of our code, sending in batches
> >> > of 4k documents at a time. This way I can run it locally.
> >> > 9.4.2: ~1200-2000 documents per second
> >> > 9.5.0: ~150-400 documents per second
> >> >
> >> > I'll continue investigating, but nothing in the release notes jumped
> >> > out to me.
> >> > https://lucene.apache.org/core/9_10_0/changes/Changes.html#v9.5.0
> >> >
> >> > Sorry I don't have anything more rigorous yet.  I'm doing this
> >> > investigation in parallel with some other things.
> >> > But any insight or suggestions on areas to look would be appreciated.
> >> > Thank you,
> >> > Marc
> >> >
> >> > On Wed, Apr 17, 2024 at 4:18 PM Adrien Grand <jpou...@gmail.com>
> >> > wrote:
> >> >
> >> > > Hi Marc,
> >> > >
> >> > > Nothing comes to mind as a potential cause for this 2x regression.
> >> > > It would be interesting to look at a profile.
> >> > >
> >> > > On Wed, Apr 17, 2024 at 9:32 PM Marc Davenport
> >> > > <madavenp...@cargurus.com.invalid> wrote:
> >> > >
> >> > > > Hello,
> >> > > > I'm finally migrating Lucene from 8.11.2 to 9.10.0 as our overall
> >> > > > build can now support Java 11. The quick first step of renaming
> >> > > > packages and importing the new libraries has gone well. I'm even
> >> > > > seeing a nice performance bump in our average query time. I am,
> >> > > > however, seeing a dramatic increase in our indexing time. We are
> >> > > > indexing ~3.1 million documents, each with about 100 attributes
> >> > > > used for facet filtering and sorting; no lexical text search. Our
> >> > > > indexing time has jumped from ~1k seconds to ~2k seconds. I have
> >> > > > yet to profile the individual aspects of how we convert our data to
> >> > > > records vs. the time for the index writer to accept the documents.
> >> > > > I'm curious if other users discovered this in their migrations at
> >> > > > some point, or if there are changes to the defaults that I did not
> >> > > > see in the migration guide that would account for it. Looking at
> >> > > > the logs I can see that as we are indexing the documents we commit
> >> > > > every 10 minutes.
> >> > > > Thank you,
> >> > > > Marc
> >> > > >
> >> > >
> >> > >
> >> > > --
> >> > > Adrien
> >> > >
> >> >
> >>
> >
>
