[jira] [Commented] (LUCENE-8943) Incorrect IDF in MultiPhraseQuery and SpanOrQuery

2019-08-12 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905332#comment-16905332
 ] 

Jim Ferenczi commented on LUCENE-8943:
--

{quote}

Your post made me think of the problem in another way. If we had something like 
MultiWordsSynonymQuery, we could have even more control. Similar to 
SynonymQuery we could use one IDF value for all synonyms. Synonym boost would 
work much more reliably.

{quote}

 

Yes, that's what I tried to explain in my post. It is a specific issue with 
multi-word synonyms, so we should have a dedicated query. 

 

{quote}

Usually the values for pseudoStats would be computed bottom up (SpanWeight, 
PhraseWeight) from the subqueries. But we could implement a general 
MultiWordsSynonymQuery as subclass of BooleanQuery (only allowing disjunction) 
which would set (adapt) pseudoStats in all its subweights (docFreq as max 
docFreq of all synonyms just as SynonymQuery currently does).

{quote}

 

+1, that's how I'd start with this. We don't need to handle all types of 
queries though; only Term queries (e.g. body:ny), conjunctions of Term queries 
(e.g. body:new AND body:york) and phrase queries (e.g. "new york") should be 
accepted.

> Incorrect IDF in MultiPhraseQuery and SpanOrQuery
> -
>
> Key: LUCENE-8943
> URL: https://issues.apache.org/jira/browse/LUCENE-8943
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/query/scoring
>Affects Versions: 8.0
>Reporter: Christoph Goller
>Priority: Major
>
> I recently stumbled across a very old bug in the IDF computation for 
> MultiPhraseQuery and SpanOrQuery.
> BM25Similarity and TFIDFSimilarity / ClassicSimilarity have a method for 
> combining IDF values from more than one term / TermStatistics.
> I mean the method:
> Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics 
> termStats[])
> It simply adds up the IDFs from all termStats[].
> This method is used e.g. in PhraseQuery where it makes sense. If we assume 
> that for the phrase "New York" the occurrences of both words are independent, 
> we can multiply their probabilities and since IDFs are logarithmic we add them 
> up. Seems to be a reasonable approximation. However, this method is also used 
> to add up the IDFs of all terms in a MultiPhraseQuery as can be seen in:
> Similarity.SimScorer getStats(IndexSearcher searcher)
> A MultiPhraseQuery is actually a PhraseQuery with alternatives at individual 
> positions. IDFs of alternative terms for one position should not be added up. 
> Instead we should use the minimum value as an approximation because this 
> corresponds to the docFreq of the most frequent term and we know that this is 
> a lower bound for the docFreq for this position.
> In SpanOrQuery we have the same problem. It uses buildSimWeight(...) from 
> SpanWeight and adds up all IDFs of all OR-clauses.
> If my arguments are not convincing, look at SynonymQuery / SynonymWeight in 
> the constructor:
> SynonymWeight(Query query, IndexSearcher searcher, ScoreMode scoreMode, float 
> boost) 
> A SynonymQuery is also a kind of OR-query and it uses the maximum of the 
> docFreq of all its alternative terms. I think this is how it should be.
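
The difference between the two strategies described in the issue can be 
illustrated with a small standalone sketch. It does not use Lucene classes: the 
idf method mirrors the BM25 formula ln(1 + (N - n + 0.5) / (n + 0.5)), while 
the class name, the method names and the docFreq values are assumptions chosen 
for illustration only.

```java
import java.util.List;

public class MultiPhraseIdfSketch {
    // BM25 IDF: ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)).
    public static double idf(double docFreq, double docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Behavior criticized in the issue: add the IDF of every term,
    // including all alternatives at the same position.
    public static double sumOverAllTerms(List<long[]> positions, long docCount) {
        double sum = 0;
        for (long[] alternatives : positions) {
            for (long df : alternatives) {
                sum += idf(df, docCount);
            }
        }
        return sum;
    }

    // Proposal from the issue: one IDF per position, taken from the most
    // frequent alternative (max docFreq, i.e. the minimum IDF).
    public static double minPerPosition(List<long[]> positions, long docCount) {
        double sum = 0;
        for (long[] alternatives : positions) {
            long maxDf = 0;
            for (long df : alternatives) {
                maxDf = Math.max(maxDf, df);
            }
            sum += idf(maxDf, docCount);
        }
        return sum;
    }

    public static void main(String[] args) {
        long docCount = 10_000_000L;
        // Hypothetical docFreqs: two alternatives at position 0, one term at position 1.
        List<long[]> positions = List.of(new long[] {100_000, 500}, new long[] {10_000});
        System.out.printf("sum over all terms: %.2f%n", sumOverAllTerms(positions, docCount));
        System.out.printf("min per position:   %.2f%n", minPerPosition(positions, docCount));
    }
}
```

With the per-position variant, adding more alternatives at a position can only 
lower that position's IDF, never raise the total.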



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8943) Incorrect IDF in MultiPhraseQuery and SpanOrQuery

2019-08-12 Thread Christoph Goller (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905178#comment-16905178
 ] 

Christoph Goller commented on LUCENE-8943:
--

I agree, we cannot realistically approximate the doc freq of phrases.
And yes, actually the scoring problem I brought up is a kind of synonym issue.

Usually, if we are using synonyms we want to score exact query matches higher 
than synonym matches. That's probably one of the reasons why SynonymQuery 
allows specifying boosts.

I have lots of multiword synonyms. For example, W2k16 is a synonym for "Windows 
Server 2016". Different boosts for multiword synonyms don't work reliably since 
matches for "Windows Server 2016" may score much higher than those for W2k16 
due to huge IDFs.

I am not so much looking for an optimal BM25 scoring for Phrases / Multiphrases 
/ Spans. Instead I am looking for a way to score them within BM25 so that 
boosts work as expected.

One step in this direction would be to limit IDF values in case of Phrases / 
Multiphrases / Spans. In BM25 it seems to be very important that IDF saturates 
and currently the behavior of Phrases / Multiphrases / Spans contradicts that. 
With the solution I proposed we can get rid of huge IDF values for Phrases / 
Multiphrases / Spans. Therefore I still think we should do it. Plus it would 
make scores more comparable and boosts would work more reliably.

Your post made me think of the problem in another way. If we had something like 
MultiWordsSynonymQuery, we could have even more control. Similar to 
SynonymQuery we could use one IDF value for all synonyms. Synonym boost would 
work much more reliably.

MultiWordsSynonymQuery could be very general. In my last post I suggested 
approximating docFreq instead of IDFs in order to guarantee saturation. For 
implementing it, I thought about adding a member variable pseudoStats 
(TermStatistics) to Weight, which would be used for computing SimScorer. 
Usually the values for pseudoStats would be computed bottom up (SpanWeight, 
PhraseWeight) from the subqueries. But we could implement a general 
MultiWordsSynonymQuery as subclass of BooleanQuery (only allowing disjunction) 
which would set (adapt) pseudoStats in all its subweights (docFreq as max 
docFreq of all synonyms just as SynonymQuery currently does).
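
A rough sketch of the shared-statistics idea, outside of Lucene: every synonym 
variant is scored with one IDF derived from the maximum docFreq, so the 
per-variant boosts alone decide the relative weights. The class name, the 
docFreq values (including the pseudo docFreq assumed for the phrase) and the 
boosts are hypothetical; MultiWordsSynonymQuery and pseudoStats do not exist in 
Lucene and are not modeled here.

```java
public class SharedSynonymStatsSketch {
    // BM25 IDF: ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)).
    public static double idf(double docFreq, double docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Shared docFreq for all variants: the maximum, in the spirit of what
    // SynonymQuery is described as doing for single-term synonyms.
    public static double sharedDocFreq(double[] variantDocFreqs) {
        double max = 0;
        for (double df : variantDocFreqs) {
            max = Math.max(max, df);
        }
        return max;
    }

    public static void main(String[] args) {
        double docCount = 10_000_000;
        String[] variants = {"w2k16", "\"windows server 2016\""};
        double[] docFreqs = {50, 3};  // hypothetical (pseudo) docFreq per variant
        double[] boosts = {0.8, 1.0}; // hypothetical: the exact phrase is preferred

        double sharedIdf = idf(sharedDocFreq(docFreqs), docCount);
        for (int i = 0; i < variants.length; i++) {
            System.out.printf("%s: own idf=%.2f, boost * shared idf=%.2f%n",
                variants[i], idf(docFreqs[i], docCount), boosts[i] * sharedIdf);
        }
    }
}
```

Because both variants use the same IDF, the ratio of their weights equals the 
ratio of their boosts.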




[jira] [Commented] (LUCENE-8943) Incorrect IDF in MultiPhraseQuery and SpanOrQuery

2019-08-09 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903968#comment-16903968
 ] 

Jim Ferenczi commented on LUCENE-8943:
--

I don't think we can realistically approximate the doc freq of phrases, 
especially if you consider more than 2 terms. The issue with the score 
difference of "wifi" (single term) vs "wi fi" (multiple terms) is more a 
synonym issue where the association between these terms is made at search time. 
Currently BM25 similarity sums the idf values but this was done to limit the 
difference with the classic (tfidf) similarity. The other similarities take a 
simpler approach that just sums the score of each term that appears in the 
query like a boolean query would do (see MultiSimilarity). It's difficult to 
pick one approach over the other here but the context is important. For 
single-term synonyms (terms that appear at the same position) we have the 
SynonymQuery that is used to blend the score of such terms. I tend to agree 
that the MultiPhraseQuery should take the same approach so that each position 
scores once instead of once per term. However it is difficult to expand this 
strategy to variable-length multi-word synonyms. We could try with a 
specialized MultiWordsSynonymQuery that would apply some strategy 
(approximation of the doc count like you propose or anything that makes sense 
here ;) ) to make sure that all variations are scored the same. Does this make 
sense?




[jira] [Commented] (LUCENE-8943) Incorrect IDF in MultiPhraseQuery and SpanOrQuery

2019-08-06 Thread Christoph Goller (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901076#comment-16901076
 ] 

Christoph Goller commented on LUCENE-8943:
--

Thanks for your quick response Alan. I've been doing some thinking about 
adding up IDF values in case of simple phrase queries and I no longer think 
that is the way we should do it.

The problem is that we can get very high IDF values, i.e. values that are 
considerably higher than the maximum IDF value for a single term!

Consider an index with 10 million docs. The maximum IDF value (BM25) for a 
single term is 16.8. Assume we have 10 docs containing "wifi" and 10 docs 
containing "wi-fi" which is split by our tokenizer into 2 tokens. The IDF value 
for "wifi" will be 13.77. If we assume that "wi" and "fi" both occur only in 
"wi-fi" docs, we get an IDF of 27.5 for the "wi fi" phrase query which we need 
in order to find our 10 "wi-fi" docs. If we search for wifi OR "wi fi" the docs 
containing "wi-fi" will score much higher!
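
The numbers in this example can be reproduced with a few lines of plain Java. 
This is only an illustration: the idf method follows the BM25 formula 
ln(1 + (N - n + 0.5) / (n + 0.5)), the class name is made up, and the 16.8 
upper bound is taken here as the limit for a docFreq approaching 0.

```java
public class Bm25IdfExample {
    // BM25 IDF: ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)).
    public static double idf(double docFreq, double docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        double docCount = 10_000_000;

        double upperBound = idf(0, docCount); // about 16.81, the saturation value
        double wifi = idf(10, docCount);      // about 13.77
        // "wi" and "fi" each occur in 10 docs; their IDFs are summed for the phrase.
        double wiFiPhrase = idf(10, docCount) + idf(10, docCount); // about 27.53

        System.out.printf("upper bound:    %.2f%n", upperBound);
        System.out.printf("wifi:           %.2f%n", wifi);
        System.out.printf("\"wi fi\" summed: %.2f%n", wiFiPhrase);
    }
}
```

The summed phrase IDF ends up well above the upper bound for any single term, 
which is the saturation problem described above.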

Admittedly, it is easy to construct examples in which adding the IDF values 
of phrase parts yields values that are too high. The assumption of independence 
of phrase parts does not normally apply. But BM25 has a saturation for IDF 
values and adding up IDF values breaks it. This seems to be a serious 
drawback.

I propose to switch from combining IDF-values to calculating / approximating 
docFreq. For the OR-case SynonymQuery does this already. It uses the maximum. 
For the AND-case we could use something like

{{docFreqPhrase = (docFreq1 * docFreq2) / docCount}}

The intuition behind this is again independence of phrase parts. But by 
computing a docFreq we can guarantee the saturation for IDF.

For the "wi fi" example we get a docFreqPhrase of 10^-5, leading to an IDF of 
16.8 (saturation), and the difference to the IDF of wifi is considerably 
smaller compared to adding up IDFs. If phrase parts are rare, we quickly run 
into saturation of the IDF. But we also get some reasonable values. Consider 
the phrase "New York". Assume that 100,000 docs contain "new" and 10,000 docs 
contain "york". Applying the formula above, we get an IDF for the phrase 
"New York" of 11.5, which is roughly the number we get when we add up the IDFs 
of the parts (current Lucene behavior).
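
As a quick check of these figures, the following standalone sketch applies the 
proposed docFreq approximation and feeds the result into the BM25 IDF formula. 
The class and method names are hypothetical, and the fractional docFreq is 
passed as a double purely for illustration; nothing here is meant to describe 
how Lucene itself would represent such a value.

```java
public class PhraseDocFreqSketch {
    // BM25 IDF: ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)).
    public static double idf(double docFreq, double docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Proposed AND-case approximation: docFreqPhrase = (docFreq1 * docFreq2) / docCount.
    public static double approxPhraseDocFreq(long docFreq1, long docFreq2, long docCount) {
        return (double) docFreq1 * docFreq2 / docCount;
    }

    public static void main(String[] args) {
        long docCount = 10_000_000L;

        double wiFi = approxPhraseDocFreq(10, 10, docCount);            // 1e-5
        double newYork = approxPhraseDocFreq(100_000, 10_000, docCount); // 100

        System.out.printf("\"wi fi\":    docFreq=%.5f idf=%.2f%n", wiFi, idf(wiFi, docCount));
        System.out.printf("\"new york\": docFreq=%.1f idf=%.2f%n", newYork, idf(newYork, docCount));
        // For comparison: summing the IDFs of "new" and "york".
        System.out.printf("summed idf for new + york: %.2f%n",
            idf(100_000, docCount) + idf(10_000, docCount));
    }
}
```

For "wi fi" the approximated IDF stays at the saturation value of about 16.8, 
while for "New York" it comes out at about 11.5 under both approaches.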

We could even have some simple adjustments for the fact that usually the 
independence assumption is not correct. For both the OR-case and the AND-case 
we could make values a little bit higher. The exact way for approximating 
docFreq for the AND-case and the OR-case could be defined in the Similarity and 
it could be configurable.

I also did some research with Google: (multiword OR N-gram) AND BM25 AND IDF
Unfortunately, I did not find anything that helps.
Do you know about the benchmarks used to evaluate scoring in Lucene? Are 
there any phrase queries involved?
Robert told me it’s very TREC-like, so probably no phrase queries?

In my opinion something like BM25 can only get us to a certain level of 
relevance. Of course, we have to get it right. IDF values of phrases / 
SpanQueries should not have such a big effect on the score simply because they 
get too high IDF-values. We have to do something reasonable. But for real 
break-through improvements we need something like query segmentation or even 
query interpretation and proximity of query terms in documents should have a 
high impact on the score. That's why I think it is important to integrate 
PhraseQueries and SpanQueries properly into BM25.


[jira] [Commented] (LUCENE-8943) Incorrect IDF in MultiPhraseQuery and SpanOrQuery

2019-08-05 Thread Alan Woodward (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899862#comment-16899862
 ] 

Alan Woodward commented on LUCENE-8943:
---

Thanks for opening this issue Christoph. We can solve this for MultiPhraseQuery 
pretty easily. SpanOr will be slightly trickier, I think, but it will be helped 
once LUCENE-8912 is merged and we can simplify SpanWeight.buildSimWeight().




[jira] [Commented] (LUCENE-8943) Incorrect IDF in MultiPhraseQuery and SpanOrQuery

2019-08-02 Thread Christoph Goller (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898859#comment-16898859
 ] 

Christoph Goller commented on LUCENE-8943:
--

Why is this an issue?

Because IDFs of SpanOrQuery and MultiPhraseQuery can get gigantic, meaning that 
such queries get an unexpectedly high impact on the final score.
