Marc,

We also ran into this problem when updating to Lucene 9.5. In our use case it was sufficient to bump the LRU cache size in the constructor high enough that it no longer posed a performance problem. The default of 4k entries was far too low for our millions of unique facet values, and 64k was still not enough. We have had no issues with a large value.
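To make the size/miss tradeoff concrete, here is a small stdlib-only sketch (not Lucene code; the class name, capacities, and label counts are toy stand-ins) of why an LRU bound below the unique-label count turns every lookup into a miss on repeated passes:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LruEvictionDemo {
    // Minimal access-ordered LRU map (illustrative only; NOT Lucene's
    // LruTaxonomyWriterCache).
    static Map<String, Integer> lru(int capacity) {
        return new LinkedHashMap<String, Integer>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Integer> eldest) {
                return size() > capacity;
            }
        };
    }

    // Counts cache misses over `passes` sequential scans of `uniqueLabels` labels.
    static int countMisses(int capacity, int uniqueLabels, int passes) {
        Map<String, Integer> cache = lru(capacity);
        int misses = 0;
        for (int p = 0; p < passes; p++) {
            for (int i = 0; i < uniqueLabels; i++) {
                String label = "facet/" + i;
                if (!cache.containsKey(label)) {
                    misses++;         // each miss would trigger a slow taxonomy lookup/add
                    cache.put(label, i);
                }
            }
        }
        return misses;
    }

    public static void main(String[] args) {
        // capacity 4 stands in for the 4k default; 10 labels stand in for
        // millions of unique facet values.
        System.out.println("misses=" + countMisses(4, 10, 2));  // prints misses=20
        System.out.println("misses=" + countMisses(16, 10, 2)); // prints misses=10
    }
}
```

With the cache smaller than the label set, the second pass misses on every label (20 misses total); once the capacity covers all unique labels, the second pass is all hits (10 misses). That is the behavior difference we saw at 4k/64k vs. a large value.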
taxoWriter =
    new DirectoryTaxonomyWriter(
        directory, OpenMode.CREATE_OR_APPEND, new LruTaxonomyWriterCache(32 * 1024 * 1024));

Full context here:
https://github.com/zuliaio/zuliasearch/blob/4.1.1/zulia-server/src/main/java/io/zulia/server/index/ShardWriteManager.java#L89

Matt

On Tue, Apr 23, 2024 at 3:56 AM Dawid Weiss <dawid.we...@gmail.com> wrote:

> Thanks for the follow-up, Marc. I'm not familiar with this part of the
> code, but reading through the original issue that changed this, the
> rationale was to avoid a memory leak from a thread local. The LRU cache
> has synchronized blocks sprinkled all over it - again, I haven't checked,
> but it seems the overhead will be there (regardless of cache size?).
>
> That said, it seems like you can use your own cache implementation,
> similar to what you can see in the tests - TestDirectoryTaxonomyWriter.java
> or TestConcurrentFacetedIndexing.java. This would allow you to plug in the
> previous implementation (or something even more fine-tuned to your needs)?
>
> Dawid
>
> On Mon, Apr 22, 2024 at 10:29 PM Marc Davenport
> <madavenp...@cargurus.com.invalid> wrote:
>
> > Hello,
> > I've done a bisect between 9.4.2 and 9.5 and found the PR affecting my
> > particular setup: https://github.com/apache/lucene/pull/12093
> > This is the switch from UTF8TaxonomyWriterCache to an
> > LruTaxonomyWriterCache. I don't see a way to control the size of this
> > cache so that it never expels items and matches the previous behavior.
> > Marc
> >
> > On Fri, Apr 19, 2024 at 4:39 PM Marc Davenport
> > <madavenp...@cargurus.com> wrote:
> >
> > > Hello,
> > > Thanks for the leads. I haven't yet gone as far as doing a git
> > > bisect, but I have found that the big jump in time is in the call to
> > > facetsConfig.build(taxonomyWriter, doc); I made a quick and dirty
> > > instrumented version of the FacetsConfig class and found that calls
> > > to TaxonomyWriter.add(FacetLabel) are significantly slower for me.
> > > https://github.com/apache/lucene/blob/releases/lucene/9.5.0/lucene/facet/src/java/org/apache/lucene/facet/FacetsConfig.java#L383
> > >
> > > I don't know what is special about my documents that I would be
> > > seeing this change. I'm going to start dropping groups of our facets
> > > from the documents and seeing if there is some threshold that I'm
> > > hitting. I'll probably start with our hierarchies, which are not
> > > particularly large but are the most suspect.
> > >
> > > Thanks for any input,
> > > Marc
> > >
> > > Time (ms) per document:
> > >
> > > 9.4.2
> > > facetConfig.build : 0.9882365
> > > Taxo Add          : 0.8334876
> > >
> > > 9.5
> > > facetConfig.build : 11.037549
> > > Taxo Add          : 10.915726
> > >
> > > On Fri, Apr 19, 2024 at 2:56 AM Dawid Weiss <dawid.we...@gmail.com>
> > > wrote:
> > >
> > > > Hi Marc,
> > > >
> > > > You could try git bisect on the Lucene repository to pinpoint the
> > > > commit that caused what you're observing. It'll take some time to
> > > > build, but it's a logarithmic bisection and you'd know for sure
> > > > where the problem is.
> > > >
> > > > D.
> > > >
> > > > On Thu, Apr 18, 2024 at 11:16 PM Marc Davenport
> > > > <madavenp...@cargurus.com.invalid> wrote:
> > > >
> > > > > Hi Adrien et al,
> > > > > I've been doing some investigation today and it looks like
> > > > > whatever the change is, it happens between 9.4.2 and 9.5.0.
> > > > > I made a smaller test setup for our code that mocks our documents
> > > > > and just runs through the indexing portion of our code, sending
> > > > > in batches of 4k documents at a time. This way I can run it
> > > > > locally.
> > > > > 9.4.2: ~1200-2000 documents per second
> > > > > 9.5.0: ~150-400 documents per second
> > > > >
> > > > > I'll continue investigating, but nothing in the release notes
> > > > > jumped out to me.
> > > > > https://lucene.apache.org/core/9_10_0/changes/Changes.html#v9.5.0
> > > > >
> > > > > Sorry I don't have anything more rigorous yet.
> > > > > I'm doing this investigation in parallel with some other things,
> > > > > but any insight or suggestions on areas to look at would be
> > > > > appreciated.
> > > > > Thank you,
> > > > > Marc
> > > > >
> > > > > On Wed, Apr 17, 2024 at 4:18 PM Adrien Grand <jpou...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Marc,
> > > > > >
> > > > > > Nothing jumps to mind as a potential cause for this 2x
> > > > > > regression. It would be interesting to look at a profile.
> > > > > >
> > > > > > On Wed, Apr 17, 2024 at 9:32 PM Marc Davenport
> > > > > > <madavenp...@cargurus.com.invalid> wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > > I'm finally migrating Lucene from 8.11.2 to 9.10.0, as our
> > > > > > > overall build can now support Java 11. The quick first step
> > > > > > > of renaming packages and importing the new libraries has gone
> > > > > > > well. I'm even seeing a nice performance bump in our average
> > > > > > > query time. I am, however, seeing a dramatic increase in our
> > > > > > > indexing time. We are indexing ~3.1 million documents, each
> > > > > > > with about 100 attributes used for facet filtering and
> > > > > > > sorting; no lexical text search. Our indexing time has jumped
> > > > > > > from ~1k seconds to ~2k seconds. I have yet to profile how
> > > > > > > long we spend converting our data to records vs. the time for
> > > > > > > the index writer to accept the documents. I'm curious whether
> > > > > > > other users discovered this in their migrations at some
> > > > > > > point, or if there are changes to defaults that I did not see
> > > > > > > in the migration guide that would account for this? Looking
> > > > > > > at the logs, I can see that as we are indexing the documents
> > > > > > > we commit every 10 minutes.
> > > > > > > Thank you,
> > > > > > > Marc
> > > > > >
> > > > > > --
> > > > > > Adrien
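For anyone following Dawid's suggestion of plugging in your own cache: a real replacement has to implement Lucene's TaxonomyWriterCache (see the test classes he mentions), but the never-expel semantics Marc describes from the old behavior can be sketched with the stdlib alone. This is a minimal stand-in, assuming a simple label-to-ordinal map (the class and method names are illustrative, not Lucene API):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class NeverEvictCacheDemo {
    // Stdlib-only sketch of a label -> ordinal cache that never expels
    // entries, approximating the pre-9.5 behavior Marc describes. Memory
    // grows with the number of unique labels - the tradeoff that motivated
    // the switch to a bounded LRU.
    private final Map<String, Integer> ordinals = new ConcurrentHashMap<>();
    private final AtomicInteger nextOrdinal = new AtomicInteger();
    final AtomicInteger misses = new AtomicInteger();

    int get(String label) {
        // computeIfAbsent runs the mapping function only on a miss; hits
        // avoid the coarse synchronization an LRU needs for its bookkeeping.
        return ordinals.computeIfAbsent(label, k -> {
            misses.incrementAndGet();   // would be the slow taxonomy add
            return nextOrdinal.getAndIncrement();
        });
    }

    public static void main(String[] args) {
        NeverEvictCacheDemo cache = new NeverEvictCacheDemo();
        for (int pass = 0; pass < 2; pass++) {
            for (int i = 0; i < 10; i++) {
                cache.get("facet/" + i);
            }
        }
        // With no size bound, each unique label misses exactly once.
        System.out.println("misses=" + cache.misses.get()); // prints misses=10
    }
}
```

The tradeoff is the one Dawid raises: unbounded growth in exchange for no evictions and no miss-driven taxonomy lookups, which is effectively what a very large LruTaxonomyWriterCache approximates.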