[ 
https://issues.apache.org/jira/browse/LUCENE-8943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16903968#comment-16903968
 ] 

Jim Ferenczi commented on LUCENE-8943:
--------------------------------------

I don't think we can realistically approximate the doc freq of phrases, 
especially if you consider more than 2 terms. The issue with the score 
difference of "wifi" (single term) vs "wi fi" (multiple terms) is more a 
synonym issue where the association between these terms is made at search time. 
Currently BM25 similarity sums the idf values but this was done to limit the 
difference with the classic (tfidf) similarity. The other similarities take a 
simpler approach that just sums the scores of each term that appears in the 
query, as a boolean query would do (see MultiSimilarity). It's difficult to 
pick one approach over the other here, but the context is important. For 
single-term synonyms (terms that appear at the same position) we have the 
SynonymQuery, which is used to blend the scores of such terms. I tend to agree 
that the MultiPhraseQuery should take the same approach so that each position 
scores once instead of once per term. However, it is difficult to extend this 
strategy to variable-length multi-word synonyms. We could try a specialized 
MultiWordsSynonymQuery that would apply some strategy (approximation of the doc 
count like you propose, or anything else that makes sense here ;) ) to make 
sure that all variations are scored the same. Does this make sense?
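
To make the two strategies concrete, here is a minimal sketch in plain Java. 
It is not Lucene code: the class name, the method names and the document 
frequencies are all made up for illustration. It assumes the BM25-style idf 
ln(1 + (N - n + 0.5) / (n + 0.5)) and compares summing the idf of every term 
with scoring each position once, using the largest docFreq among the 
alternatives at that position.

```java
// Illustrative sketch only. Every name here is hypothetical, not a Lucene API.
class PhraseIdfSketch {

    // BM25-style idf: ln(1 + (N - n + 0.5) / (n + 0.5)),
    // where N is the number of documents and n the document frequency.
    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Strategy described in the issue as the current behaviour:
    // add up the idf of every term, including alternatives at the same position.
    // positions[i] holds the docFreqs of the alternative terms at position i.
    static double sumOverAllTerms(long[][] positions, long docCount) {
        double sum = 0;
        for (long[] alternatives : positions) {
            for (long docFreq : alternatives) {
                sum += idf(docFreq, docCount);
            }
        }
        return sum;
    }

    // Alternative strategy: each position contributes a single idf, computed
    // from the largest docFreq among its alternatives (i.e. the smallest idf).
    static double onePerPosition(long[][] positions, long docCount) {
        double sum = 0;
        for (long[] alternatives : positions) {
            long maxDocFreq = 0;
            for (long docFreq : alternatives) {
                maxDocFreq = Math.max(maxDocFreq, docFreq);
            }
            sum += idf(maxDocFreq, docCount);
        }
        return sum;
    }

    public static void main(String[] args) {
        long docCount = 1_000_000L;                      // made-up collection size
        long[][] plainPhrase = { {50_000L}, {20_000L} }; // one term per position
        long[][] withRareAlt = { {50_000L, 100L}, {20_000L} }; // rare alternative added

        System.out.println("plain phrase, summed:       " + sumOverAllTerms(plainPhrase, docCount));
        System.out.println("with alternative, summed:   " + sumOverAllTerms(withRareAlt, docCount));
        System.out.println("with alternative, per pos.: " + onePerPosition(withRareAlt, docCount));
    }
}
```

With one term per position the two strategies agree. Adding a rare 
alternative at a position inflates the summed value, while the per-position 
value stays the same because the most frequent alternative determines it.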

> Incorrect IDF in MultiPhraseQuery and SpanOrQuery
> -------------------------------------------------
>
>                 Key: LUCENE-8943
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8943
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/query/scoring
>    Affects Versions: 8.0
>            Reporter: Christoph Goller
>            Priority: Major
>
> I recently stumbled across a very old bug in the IDF computation for 
> MultiPhraseQuery and SpanOrQuery.
> BM25Similarity and TFIDFSimilarity / ClassicSimilarity have a method for 
> combining IDF values from more than one term / TermStatistics.
> I mean the method:
> Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics 
> termStats[])
> It simply adds up the IDFs from all termStats[].
> This method is used e.g. in PhraseQuery where it makes sense. If we assume 
> that for the phrase "New York" the occurrences of both words are independent, 
> we can multiply their probabilities and since IDFs are logarithmic we add them 
> up. Seems to be a reasonable approximation. However, this method is also used 
> to add up the IDFs of all terms in a MultiPhraseQuery as can be seen in:
> Similarity.SimScorer getStats(IndexSearcher searcher)
> A MultiPhraseQuery is actually a PhraseQuery with alternatives at individual 
> positions. IDFs of alternative terms for one position should not be added up. 
> Instead we should use the minimum value as an approximation because this 
> corresponds to the docFreq of the most frequent term and we know that this is 
> a lower bound for the docFreq for this position.
> In SpanOrQuery we have the same problem. It uses buildSimWeight(...) from 
> SpanWeight and adds up all IDFs of all OR-clauses.
> If my arguments are not convincing, look at SynonymQuery / SynonymWeight in 
> the constructor:
> SynonymWeight(Query query, IndexSearcher searcher, ScoreMode scoreMode, float 
> boost) 
> A SynonymQuery is also a kind of OR-query and it uses the maximum of the 
> docFreq of all its alternative terms. I think this is how it should be.



