[ https://issues.apache.org/jira/browse/LUCENE-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006815#comment-17006815 ]
Tommaso Teofili commented on LUCENE-9107:
-----------------------------------------

Thanks, Adrien, for looking into this. I've tried a pure disjunction ({{BooleanQuery}}) and the numbers are about the same as with {{CommonTermsQuery}}.
{{ClassicSimilarity}} contributes non-trivially to the slowness: top-k scoring takes 2 to 2.5 seconds with {{ClassicSimilarity}}, versus 1.5 to 2 seconds with {{BM25Similarity}}.

> CommonsTermsQuery with huge no. of terms slower with top-k scoring
> ------------------------------------------------------------------
>
>                 Key: LUCENE-9107
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9107
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>   Affects Versions: 8.3
>            Reporter: Tommaso Teofili
>            Priority: Major
>
> In [1] a {{CommonTermsQuery}} is used to perform a query with lots of
> (duplicate) terms. Using a max term frequency cutoff of 0.999 for low
> frequency terms, the query, although big, finishes in around 200-300 ms with
> Lucene 7.6.0.
> However, after upgrading the code to Lucene 8.x, the query runs in 2-3 s
> instead [2].
> After digging into it a bit, the regression seems to come from top-k scoring,
> which is enabled by default since version 8, although I'm not sure where
> exactly in the code.
> When switching back to complete hit scoring [3], the speed goes back to the
> initial 200-300 ms in Lucene 8.3.x as well.
> It'd be nice to understand why this is happening and whether it only
> concerns {{CommonTermsQuery}} or affects {{BooleanQuery}} as well.
> If this depends on the data and application involved (Anserini in this
> case), the application should handle it; otherwise, if it is a
> regression/bug in Lucene, it'd be nice to fix it.
> [1] : https://github.com/tteofili/Anserini-embeddings/blob/nnsearch/src/main/java/io/anserini/embeddings/nn/fw/FakeWordsRunner.java
> [2] : https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/analysis/vectors/ApproximateNearestNeighborEval.java
> [3] : https://github.com/tteofili/anserini/blob/ann-paper-reproduce/src/main/java/io/anserini/analysis/vectors/ApproximateNearestNeighborEval.java#L174
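
The max term frequency cutoff mentioned in the description can be illustrated with a small stand-alone sketch. It assumes the documented reading of the cutoff: a value below 1 is a fraction of the number of documents in the index, and a value of 1 or more is an absolute document frequency. The class name {{TermFrequencySplit}}, the term names and the document frequencies are hypothetical, and the exact comparison and rounding inside {{CommonTermsQuery}} may differ from this simplified rule.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Lucene: splits terms by document frequency. */
final class TermFrequencySplit {

    /** Holds the two groups of terms produced by the split. */
    static final class Split {
        final List<String> lowFreq = new ArrayList<>();
        final List<String> highFreq = new ArrayList<>();
    }

    /**
     * A cutoff in [0, 1) is read as a fraction of maxDoc; a cutoff >= 1 is read
     * as an absolute document count. Terms whose document frequency exceeds the
     * resulting threshold are treated as high frequency.
     * ASSUMPTION: simplified rule; Lucene's own rounding may differ.
     */
    static Split split(Map<String, Integer> docFreqs, int maxDoc, double cutoff) {
        double threshold = cutoff >= 1.0 ? cutoff : Math.ceil(cutoff * maxDoc);
        Split result = new Split();
        for (Map.Entry<String, Integer> entry : docFreqs.entrySet()) {
            if (entry.getValue() > threshold) {
                result.highFreq.add(entry.getKey());
            } else {
                result.lowFreq.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up document frequencies for an index of 10,000 documents.
        Map<String, Integer> docFreqs = new LinkedHashMap<>();
        docFreqs.put("f12", 40);
        docFreqs.put("f87", 9_500);
        docFreqs.put("f3", 10_000); // present in every document

        Split s = split(docFreqs, 10_000, 0.999);
        System.out.println("low  = " + s.lowFreq);  // low  = [f12, f87]
        System.out.println("high = " + s.highFreq); // high = [f3]
    }
}
```

Under this reading, a cutoff of 0.999 puts only terms that occur in nearly every document into the high-frequency group, so almost all query terms are handled as low-frequency terms.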