On Tue, May 2, 2023 at 12:49 PM Michael Froh <msf...@gmail.com> wrote: > > Hi all, > > I was looking into a customer issue where they noticed some increased GC time > after upgrading from Lucene 7.x to 9.x. After taking some heap dumps from > both systems, the big difference was tracked down to the float[256] allocated > (as a norms cache) when creating a BM25Scorer (in BM25Similarity.scorer()). > > The change seems to have come in with > https://github.com/apache/lucene/commit/8fd7ead940f69a892dfc951a1aa042e8cae806c1, > which removed some of the special-case logic around the "non-scoring > similarity" embedded in IndexSearcher (returned in the false case from the > old IndexSearcher#scorer(boolean needsScores)). > > While I really like that we no longer have that special-case logic in > IndexSearcher, we now have the issue that every time we create a new > TermWeight (or other Weight) it allocates a float[256], even if the > TermWeight doesn't need scores. Also, I think it's the exact same float[256] > for all non-scoring weights, since it's being computed using the same "all > 1s" CollectionStatistics and TermStatistics. > > (For the record, yes, the queries in question have an obscene number of > TermQueries, so 1024 bytes times lots of TermWeights, times multiple queries > running concurrently makes lots of heap allocation.) > > I'd like to submit a patch to fix this, but I'm wondering what approach to > take. One option I'm considering is precomputing a singleton float[256] for > the non-scoring case (where CollectionStatistics and TermStatistics are all > 1s). That would have the least functional impact, but would let all > non-scoring clauses share the same array. Is there a better way to tackle > this? >
This seems ok if it isn't invasive. I still feel like something is "off" if you are seeing GC time from 1KB-per-segment allocation. Do you have way too many segments? Originally (for various similar reasons) there was a place in the API to do this, so it would only happen per-Weight instead of per-Scorer, which was the SimWeight that got eliminated by the commit you point to. But I'd love if we could steer clear of that complexity: simplifying the API here was definitely the right move. Its been more than 5 years since this change was made, and this is the first complaint i've heard about the 1KB, which is why i asked about your setup. --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org