[jira] [Comment Edited] (LUCENE-8943) Incorrect IDF in MultiPhraseQuery and SpanOrQuery

Christoph Goller (JIRA) Fri, 02 Aug 2019 05:40:13 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-8943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16898859#comment-16898859
 ]


Christoph Goller edited comment on LUCENE-8943 at 8/2/19 12:39 PM:
-------------------------------------------------------------------

Why is this an issue?

Because the IDFs of SpanOrQuery and MultiPhraseQuery can get gigantic, meaning 
that such queries have an unexpectedly high impact on the final score.


was (Author: gol...@detego-software.de):
Why is this an issue?

Because IDFs of SpanOrQuery and MultiPhraseQuery can get gigantic meaning that 
such queries get an unexpectedly high impact on the final score.

> Incorrect IDF in MultiPhraseQuery and SpanOrQuery
> -------------------------------------------------
>
>                 Key: LUCENE-8943
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8943
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/query/scoring
>    Affects Versions: 8.0
>            Reporter: Christoph Goller
>            Priority: Major
>
> I recently stumbled across a very old bug in the IDF computation for 
> MultiPhraseQuery and SpanOrQuery.
> BM25Similarity and TFIDFSimilarity / ClassicSimilarity have a method for 
> combining IDF values from more than one term / TermStatistics.
> I mean the method:
> Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics 
> termStats[])
> It simply adds up the IDFs from all termStats[].
> This method is used e.g. in PhraseQuery, where it makes sense. If we assume 
> that for the phrase "New York" the occurrences of both words are independent, 
> we can multiply their probabilities, and since IDFs are logarithmic, we add 
> them up. This seems to be a reasonable approximation. However, this method is 
> also used to add up the IDFs of all terms in a MultiPhraseQuery, as can be 
> seen in:
> Similarity.SimScorer getStats(IndexSearcher searcher)
> A MultiPhraseQuery is actually a PhraseQuery with alternatives at individual 
> positions. IDFs of alternative terms for one position should not be added up. 
> Instead, we should use the minimum IDF value as an approximation, because it 
> corresponds to the docFreq of the most frequent term, and we know that this 
> is a lower bound for the docFreq for this position.
> SpanOrQuery has the same problem. It uses buildSimWeight(...) from 
> SpanWeight and adds up the IDFs of all OR-clauses.
> If my arguments are not convincing, look at SynonymQuery / SynonymWeight in 
> the constructor:
> SynonymWeight(Query query, IndexSearcher searcher, ScoreMode scoreMode, float 
> boost) 
> A SynonymQuery is also a kind of OR-query and it uses the maximum of the 
> docFreq of all its alternative terms. I think this is how it should be.
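
To make the effect described above concrete, here is a minimal, self-contained sketch. It is not Lucene code and does not call any Lucene API: the class IdfCombinationSketch and its methods are hypothetical names chosen for illustration, the docFreq values are made up, and the idf formula is assumed to follow the BM25-style form log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)). It compares summing the IDFs of all alternatives at one position with using the IDF of the most frequent alternative.

```java
public class IdfCombinationSketch {

    // BM25-style IDF (assumed form, see lead-in): rarer terms get larger values.
    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Behaviour criticised in the issue: add up the IDFs of all alternatives.
    static double summedIdf(long[] docFreqs, long docCount) {
        double sum = 0.0;
        for (long docFreq : docFreqs) {
            sum += idf(docFreq, docCount);
        }
        return sum;
    }

    // Proposed approximation: the IDF of the most frequent alternative,
    // which is the minimum IDF among the alternatives.
    static double minIdf(long[] docFreqs, long docCount) {
        if (docFreqs.length == 0) {
            throw new IllegalArgumentException("need at least one alternative");
        }
        long maxDocFreq = docFreqs[0];
        for (long docFreq : docFreqs) {
            maxDocFreq = Math.max(maxDocFreq, docFreq);
        }
        return idf(maxDocFreq, docCount);
    }

    public static void main(String[] args) {
        long docCount = 1_000_000L;          // made-up collection size
        long[] one = {1000};                 // a single term at this position
        long[] many = {1000, 10, 10, 10};    // same term plus three rare alternatives

        // Adding alternatives can only match more documents, yet the summed
        // IDF grows with every extra alternative; the minimum IDF does not.
        System.out.printf("sum, 1 alternative:  %.3f%n", summedIdf(one, docCount));
        System.out.printf("sum, 4 alternatives: %.3f%n", summedIdf(many, docCount));
        System.out.printf("min, 4 alternatives: %.3f%n", minIdf(many, docCount));
    }
}
```

Under these assumptions, the summed value increases with each added alternative even though the position becomes easier to match, while the minimum stays at the IDF of the most frequent alternative.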



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
