[
https://issues.apache.org/jira/browse/LUCENE-8943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899862#comment-16899862
]
Alan Woodward commented on LUCENE-8943:
---------------------------------------
Thanks for opening this issue, Christoph. We can solve this for
MultiPhraseQuery pretty easily. SpanOr will be slightly trickier, I think, but
will be helped once LUCENE-8912 is merged and we can simplify
SpanWeight.buildSimWeight().
> Incorrect IDF in MultiPhraseQuery and SpanOrQuery
> -------------------------------------------------
>
> Key: LUCENE-8943
> URL: https://issues.apache.org/jira/browse/LUCENE-8943
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/query/scoring
> Affects Versions: 8.0
> Reporter: Christoph Goller
> Priority: Major
>
> I recently stumbled across a very old bug in the IDF computation for
> MultiPhraseQuery and SpanOrQuery.
> BM25Similarity and TFIDFSimilarity / ClassicSimilarity have a method for
> combining IDF values from more than one term / TermStatistics.
> I mean the method:
> Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics
> termStats[])
> It simply adds up the IDFs from all termStats[].
> This method is used e.g. in PhraseQuery where it makes sense. If we assume
> that for the phrase "New York" the occurrences of both words are independent,
> we can multiply their probabilities and since IDFs are logarithmic we add them
> up. Seems to be a reasonable approximation. However, this method is also used
> to add up the IDFs of all terms in a MultiPhraseQuery as can be seen in:
> Similarity.SimScorer getStats(IndexSearcher searcher)
> A MultiPhraseQuery is actually a PhraseQuery with alternatives at individual
> positions. IDFs of alternative terms for one position should not be added up.
> Instead we should use the minimum value as an approximation because this
> corresponds to the docFreq of the most frequent term and we know that this is
> a lower bound for the docFreq for this position.
> In SpanOrQuery we have the same problem. It uses buildSimWeight(...) from
> SpanWeight and adds up all IDFs of all OR-clauses.
> If my arguments are not convincing, look at SynonymQuery / SynonymWeight in
> the constructor:
> SynonymWeight(Query query, IndexSearcher searcher, ScoreMode scoreMode, float
> boost)
> A SynonymQuery is also a kind of OR-query and it uses the maximum of the
> docFreq of all its alternative terms. I think this is how it should be.
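
To make the difference concrete, here is a minimal standalone sketch (not Lucene
code) that compares the two ways of combining IDFs for the alternative terms at
one position. The class and method names (IdfCombination, summedIdf, minIdf) and
the document frequencies are made up for illustration. The idf formula is
assumed to match the one in BM25Similarity, log(1 + (docCount - docFreq + 0.5) /
(docFreq + 0.5)).

```java
final class IdfCombination {

    // BM25-style IDF for a single term (assumed to mirror BM25Similarity).
    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Behaviour described in the issue: add up the IDFs of all alternatives.
    static double summedIdf(long[] docFreqs, long docCount) {
        double sum = 0;
        for (long df : docFreqs) {
            sum += idf(df, docCount);
        }
        return sum;
    }

    // Proposed approximation: use the minimum IDF, which is the IDF of the
    // alternative with the largest docFreq (idf decreases as docFreq grows).
    static double minIdf(long[] docFreqs, long docCount) {
        long maxDocFreq = 0;
        for (long df : docFreqs) {
            maxDocFreq = Math.max(maxDocFreq, df);
        }
        return idf(maxDocFreq, docCount);
    }

    public static void main(String[] args) {
        long docCount = 1_000_000L;                  // hypothetical index size
        long[] alternatives = {50_000L, 2_000L, 10L}; // hypothetical docFreqs
        System.out.printf("summed IDF: %.4f%n", summedIdf(alternatives, docCount));
        System.out.printf("min IDF:    %.4f%n", minIdf(alternatives, docCount));
    }
}
```

With summing, every additional alternative raises the combined IDF, even though
adding alternatives can only make the position match more documents. The
minimum never exceeds the IDF of the most frequent alternative, which is in line
with the SynonymQuery behaviour mentioned above.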