[jira] [Commented] (LUCENE-8943) Incorrect IDF in MultiPhraseQuery and SpanOrQuery

Christoph Goller (JIRA) Mon, 12 Aug 2019 06:20:13 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-8943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16905178#comment-16905178
 ]


Christoph Goller commented on LUCENE-8943:
------------------------------------------

I agree, we cannot realistically approximate the doc freq of phrases.
And yes, actually the scoring problem I brought up is a kind of synonym issue.

Usually, when we use synonyms, we want exact query matches to score higher 
than synonym matches. That's probably one of the reasons why SynonymQuery 
allows boosts to be specified.

I have many multiword synonyms. For example, W2k16 is a synonym for "Windows 
Server 2016". Different boosts for multiword synonyms don't work reliably, since 
matches for "Windows Server 2016" may score much higher than those for W2k16 due 
to huge IDFs.
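
To illustrate the effect, here is a minimal sketch in plain Java (no Lucene 
dependency). The idf formula is the one used by BM25Similarity; the class name, 
docCount, all docFreq values and the boost of 1.5 are made-up assumptions for 
illustration only.

```java
// Illustration only: all numbers below are hypothetical.
final class PhraseIdfDemo {

    // BM25 IDF as in Lucene's BM25Similarity: ln(1 + (N - n + 0.5) / (n + 0.5)).
    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Phrase IDF obtained by adding up the IDFs of the individual terms.
    static double summedIdf(long[] docFreqs, long docCount) {
        double sum = 0;
        for (long df : docFreqs) {
            sum += idf(df, docCount);
        }
        return sum;
    }

    public static void main(String[] args) {
        long docCount = 1_000_000L;
        // Assumed docFreqs for "windows", "server", "2016".
        double phraseIdf = summedIdf(new long[] {50_000L, 80_000L, 20_000L}, docCount);
        // Assumed docFreq for the single term "w2k16", boosted by 1.5.
        double boostedSynonymIdf = 1.5 * idf(5_000L, docCount);
        System.out.printf("phrase IDF = %.3f, boosted W2k16 IDF = %.3f%n",
                phraseIdf, boostedSynonymIdf);
    }
}
```

With these assumed numbers the summed phrase IDF still exceeds the boosted IDF 
of the rarer single term, so the boost does not decide the ranking.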

I am not so much looking for an optimal BM25 scoring for Phrases / Multiphrases 
/ Spans. Instead I am looking for a way to score them within BM25 so that 
boosts work as expected.

One step in this direction would be to limit IDF values for Phrases / 
Multiphrases / Spans. In BM25 it seems to be very important that IDF saturates, 
and currently the behavior of Phrases / Multiphrases / Spans contradicts that. 
With the solution I proposed we can get rid of huge IDF values for Phrases / 
Multiphrases / Spans. Therefore I still think we should do it. In addition, it 
would make scores more comparable and boosts would work more reliably.
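
The saturation argument can be written out with the IDF formula of 
BM25Similarity. The symbols are my own notation: N is the number of documents, 
n_i >= 1 is the docFreq of term i, and k is the number of terms whose IDFs are 
added up.

```latex
\mathrm{idf}(n) = \ln\!\left(1 + \frac{N - n + 0.5}{n + 0.5}\right)
\;\le\; \ln\!\left(1 + \frac{N - 0.5}{1.5}\right) = \mathrm{idf}(1)
\qquad \text{for } n \ge 1,

\sum_{i=1}^{k} \mathrm{idf}(n_i) \;\le\; k \cdot \mathrm{idf}(1).
```

A single term can never exceed idf(1), but a sum over k terms can approach k 
times that bound, which is why summed IDFs do not saturate in the same way.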

Your post made me think of the problem in another way. If we had something like 
a MultiWordsSynonymQuery, we could have even more control. As in SynonymQuery, 
we could use one IDF value for all synonyms. Synonym boosts would then work 
much more reliably.

MultiWordsSynonymQuery could be very general. In my last post I suggested 
approximating docFreq instead of IDF in order to guarantee saturation. To 
implement this, I thought about adding a member variable pseudoStats 
(TermStatistics) to Weight, which would be used for computing the SimScorer. 
Usually the values for pseudoStats would be computed bottom-up (SpanWeight, 
PhraseWeight) from the subqueries. But we could implement a general 
MultiWordsSynonymQuery as a subclass of BooleanQuery (only allowing disjunction) 
which would set (adapt) pseudoStats in all its subweights (docFreq as the max 
docFreq of all synonyms, just as SynonymQuery currently does).
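
A rough sketch of the docFreq-based idea in plain Java, without any Lucene 
classes. The class and method names are hypothetical, and so are all numbers. 
Using the minimum term docFreq as the pseudo docFreq of a phrase is only one 
possible approximation chosen for this illustration (it is an upper bound, 
since every term of the phrase must occur in a matching document); it is not 
necessarily the approximation proposed earlier in this issue.

```java
import java.util.Arrays;

// Illustration only: hypothetical names and numbers, not Lucene API.
final class PseudoDocFreqSketch {

    // BM25 IDF as in Lucene's BM25Similarity: ln(1 + (N - n + 0.5) / (n + 0.5)).
    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Assumed approximation: a phrase cannot occur in more documents than
    // its rarest term, so the minimum term docFreq is an upper bound.
    static long phraseDocFreq(long[] termDocFreqs) {
        return Arrays.stream(termDocFreqs).min()
                .orElseThrow(() -> new IllegalArgumentException("no terms"));
    }

    // As in SynonymQuery: a disjunction of synonyms is at least as frequent
    // as its most frequent alternative, so take the maximum docFreq.
    static long synonymDocFreq(long[] alternativeDocFreqs) {
        return Arrays.stream(alternativeDocFreqs).max()
                .orElseThrow(() -> new IllegalArgumentException("no alternatives"));
    }

    public static void main(String[] args) {
        long docCount = 1_000_000L;
        long w2k16 = 5_000L;
        long phrase = phraseDocFreq(new long[] {50_000L, 80_000L, 20_000L});
        long shared = synonymDocFreq(new long[] {w2k16, phrase});
        // One IDF for all synonyms; only the boosts differentiate them.
        double sharedIdf = idf(shared, docCount);
        System.out.printf("shared docFreq = %d, shared IDF = %.3f%n", shared, sharedIdf);
    }
}
```

Because the IDF is computed from a single docFreq, it stays below the IDF of a 
term with docFreq 1, and all synonyms share the same value.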

> Incorrect IDF in MultiPhraseQuery and SpanOrQuery
> -------------------------------------------------
>
>                 Key: LUCENE-8943
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8943
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/query/scoring
>    Affects Versions: 8.0
>            Reporter: Christoph Goller
>            Priority: Major
>
> I recently stumbled across a very old bug in the IDF computation for 
> MultiPhraseQuery and SpanOrQuery.
> BM25Similarity and TFIDFSimilarity / ClassicSimilarity have a method for 
> combining IDF values from more than one term / TermStatistics.
> I mean the method:
> Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics 
> termStats[])
> It simply adds up the IDFs from all termStats[].
> This method is used e.g. in PhraseQuery where it makes sense. If we assume 
> that for the phrase "New York" the occurrences of both words are independent, 
> we can multiply their probabilities and since IDFs are logarithmic we add them 
> up. Seems to be a reasonable approximation. However, this method is also used 
> to add up the IDFs of all terms in a MultiPhraseQuery as can be seen in:
> Similarity.SimScorer getStats(IndexSearcher searcher)
> A MultiPhraseQuery is actually a PhraseQuery with alternatives at individual 
> positions. IDFs of alternative terms for one position should not be added up. 
> Instead we should use the minimum value as an approximation because this 
> corresponds to the docFreq of the most frequent term and we know that this is 
> a lower bound for the docFreq for this position.
> In SpanOrQuery we have the same problem. It uses buildSimWeight(...) from 
> SpanWeight and adds up all IDFs of all OR-clauses.
> If my arguments are not convincing, look at SynonymQuery / SynonymWeight in 
> the constructor:
> SynonymWeight(Query query, IndexSearcher searcher, ScoreMode scoreMode, float 
> boost) 
> A SynonymQuery is also a kind of OR-query and it uses the maximum of the 
> docFreq of all its alternative terms. I think this is how it should be.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
