[jira] [Commented] (LUCENE-8943) Incorrect IDF in MultiPhraseQuery and SpanOrQuery

Alan Woodward (JIRA) Mon, 05 Aug 2019 01:02:29 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-8943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899862#comment-16899862
 ]


Alan Woodward commented on LUCENE-8943:
---------------------------------------

Thanks for opening this issue Christoph. MultiPhraseQuery we can solve this for 
pretty easily, SpanOr will be slightly trickier I think but will be helped once 
LUCENE-8912 is merged and we can simplify SpanWeight.buildSimWeight()

> Incorrect IDF in MultiPhraseQuery and SpanOrQuery
> -------------------------------------------------
>
>                 Key: LUCENE-8943
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8943
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/query/scoring
>    Affects Versions: 8.0
>            Reporter: Christoph Goller
>            Priority: Major
>
> I recently stumbled across a very old bug in the IDF computation for 
> MultiPhraseQuery and SpanOrQuery.
> BM25Similarity and TFIDFSimilarity / ClassicSimilarity have a method for 
> combining IDF values from more than on term / TermStatistics.
> I mean the method:
> Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics 
> termStats[])
> It simply adds up the IDFs from all termStats[].
> This method is used e.g. in PhraseQuery where it makes sense. If we assume 
> that for the phrase "New York" the occurrences of both words are independent, 
> we can multiply their probabilitis and since IDFs are logarithmic we add them 
> up. Seems to be a reasonable approximation. However, this method is also used 
> to add up the IDFs of all terms in a MultiPhraseQuery as can be seen in:
> Similarity.SimScorer getStats(IndexSearcher searcher)
> A MultiPhraseQuery is actually a PhraseQuery with alternatives at individual 
> positions. IDFs of alternative terms for one position should not be added up. 
> Instead we should use the minimum value as an approcimation because this 
> corresponds to the docFreq of the most frequent term and we know that this is 
> a lower bound for the docFreq for this position.
> In SpanOrQuerry we have the same problem It uses buildSimWeight(...) from 
> SpanWeight and adds up all IDFs of all OR-clauses.
> If my arguments are not convincing, look at SynonymQuery / SynonymWeight in 
> the constructor:
> SynonymWeight(Query query, IndexSearcher searcher, ScoreMode scoreMode, float 
> boost) 
> A SynonymQuery is also a kind of OR-query and it uses the maximum of the 
> docFreq of all its alternative terms. I think this is how it should be.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-8943) Incorrect IDF in MultiPhraseQuery and SpanOrQuery

Reply via email to