Re: BlendedTermQuery causing negative IDF?

Robert Muir Tue, 19 Apr 2016 10:54:11 -0700

The scoring algorithm can't be expected to deal with totally bogus
(e.g. mathematically impossible) statistics, such as docFreq >
docCount. Many similarities may fall apart on such input. We should
try to improve that in BlendedTermQuery!
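
To make the docFreq > docCount problem concrete, here is a minimal sketch
of the IDF arithmetic, assuming the formula in the BM25Similarity source
linked further down in this thread. The class and method names are made up
for illustration; this is not Lucene's API. The inputs are the
docFreq/docCount pairs from the explain output quoted below.

```java
/** Illustrative sketch only: class and method names are hypothetical. */
final class Bm25IdfSketch {
    /** BM25 IDF with 1 added inside the log, as in the linked BM25Similarity source. */
    static double luceneStyleIdf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** Textbook BM25 IDF without the added 1, for comparison. */
    static double classicIdf(long docFreq, long docCount) {
        return Math.log((docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        // Valid statistics (docFreq <= docCount): stays positive even for a
        // term that occurs in every document.
        System.out.println(luceneStyleIdf(2, 2));
        // Impossible statistics (docFreq > docCount): the added 1 no longer
        // protects the log argument, so the value goes negative.
        System.out.println(luceneStyleIdf(2, 1));
        // The textbook form is already negative for the valid case above.
        System.out.println(classicIdf(2, 2));
    }
}
```

With valid statistics the log argument is always greater than 1, so the
result is positive. Once docFreq exceeds docCount by more than 0.5 the
argument drops below 1 and the IDF turns negative.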


SynonymQuery should not really exist. It exists because of problems
like that: what BlendedTermQuery tries to do (fuse terms from multiple
fields) is more complicated. SynonymQuery only works on one field.
Statistics are always in-bounds, and all statistics except for
docFreq are just what the situation would look like if all the terms
were synonyms at index-time: that is the whole goal. It sums up raw TF
values from the postings lists across all the synonyms into one
integer value before passing it to the similarity. The only shady part
is really the docFreq = max(docFreq), but it's always in-bounds at
least and a consistent value. Otherwise it is scored exactly as an
index-time synonym with respect to all other stats. So e.g. this is a
lot closer to the motivation behind what BM25F does, but it should
behave well with any similarity since the task is easier.
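
A rough sketch of that merging, under the assumption that the per-synonym
statistics have already been read from one field. The names below are
hypothetical and the numbers are made up; this is not the SynonymQuery
source, only the arithmetic described above.

```java
/** Illustrative only: not Lucene's SynonymQuery implementation. */
final class SynonymStatsSketch {
    /** docFreq handed to the similarity: the largest docFreq among the synonyms. */
    static long mergedDocFreq(long[] docFreqs) {
        long max = 0;
        for (long df : docFreqs) {
            max = Math.max(max, df);
        }
        return max;
    }

    /** Term frequency for one document: raw TF summed over every synonym. */
    static int mergedTermFreq(int[] termFreqsInDoc) {
        int sum = 0;
        for (int tf : termFreqsInDoc) {
            sum += tf;
        }
        return sum;
    }

    public static void main(String[] args) {
        long docCount = 10;            // documents that have the field (made-up value)
        long[] docFreqs = {3, 5, 2};   // per-synonym docFreq (made-up values)
        int[] termFreqs = {1, 0, 2};   // per-synonym TF in one document (made-up values)

        long df = mergedDocFreq(docFreqs);
        // Each input docFreq is <= docCount, so their max is too.
        System.out.println("docFreq=" + df + " inBounds=" + (df <= docCount));
        System.out.println("termFreq=" + mergedTermFreq(termFreqs));
    }
}
```

Because every synonym lives in the same field, each individual docFreq is
bounded by that field's docCount, and so is their maximum. Summing docFreq
values across fields has no such guarantee.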

Working across fields makes things more complex; it seems like we should try to improve it.

On Tue, Apr 19, 2016 at 11:33 AM, Ahmet Arslan
<iori...@yahoo.com.invalid> wrote:
> Thanks, Doug, for letting us know that Lucene's BM25 avoids negative IDF
> values.
> I didn't know that.
>
> Markus, out of curiosity, why do you need BlendedTermQuery?
> I know SynonymQuery is now part of the query parser base; I think they do
> similar things?
>
> Ahmet
>
> On Tuesday, April 19, 2016 5:33 PM, Doug Turnbull 
> <dturnb...@opensourceconnections.com> wrote:
> Lucene's BM25 avoids negative scores for this by adding 1 inside the log
> term of BM25's IDF.
>
> Compare this:
> https://github.com/apache/lucene-solr/blob/5e5fd662575105de88d8514b426bccdcb4c76948/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java#L71
>
> to the Wikipedia article's BM25 IDF
> https://en.wikipedia.org/wiki/Okapi_BM25
>
> Markus, another thing to add is that when Elasticsearch uses
> BlendedTermQuery, they add a lot of invariants that must be true. For
> example, the fields must share the same analyzer. You may need to research
> what else happens in Elasticsearch outside BlendedTermQuery to get this
> behavior to work.
>
> Another testing philosophy point: when I do this kind of work I like to
> isolate the Lucene behavior from the Solr behavior. I might
> suggest creating a Lucene unit test to validate your assumptions around
> BlendedTermQuery, just to help isolate the issues. Here are Lucene's tests
> for BlendedTermQuery as a basis:
>
> https://github.com/apache/lucene-solr/blob/5e5fd662575105de88d8514b426bccdcb4c76948/lucene/core/src/test/org/apache/lucene/search/TestBlendedTermQuery.java
>
>
> On Tue, Apr 19, 2016 at 10:16 AM Ahmet Arslan <iori...@yahoo.com.invalid>
> wrote:
>
>>
>>
>> Hi Markus,
>>
>> It is a known property of BM25. It produces negative scores for common
>> terms.
>> Most of the term-weighting models are developed for indices in which stop
>> words are eliminated.
>> Therefore, most of the term-weighting models have problems scoring common
>> terms.
>> By the way, the DFI model does a decent job of handling common terms.
>>
>> Ahmet
>>
>>
>>
>> On Tuesday, April 19, 2016 4:48 PM, Markus Jelsma <
>> markus.jel...@openindex.io> wrote:
>> Hello,
>>
>> I just made a Solr query parser for BlendedTermQuery on Lucene 6.0 using
>> BM25 similarity and I have a very simple unit test to see if something is
>> working at all. But to my surprise, one of the results has a negative
>> score, caused by a negative IDF because docFreq is higher than docCount for
>> that term on that field. Here are the test documents:
>>
>>     assertU(adoc("id", "1", "text", "rare term"));
>>     assertU(adoc("id", "2", "text_nl", "less rare term"));
>>     assertU(adoc("id", "3", "text_nl", "rarest term"));
>>     assertU(commit());
>>
>> My query parser creates the following Lucene query:
>> BlendedTermQuery(Blended(text:rare text:term text_nl:rare text_nl:term))
>> which looks fine to me. But this is what I am getting back for issuing
>> that query on the above set of documents. The third result (id 1) is the
>> one with a negative score.
>>
>> <result name="response" numFound="3" start="0" maxScore="0.1805489">
>>   <doc>
>>     <str name="id">3</str>
>>     <float name="score">0.1805489</float></doc>
>>   <doc>
>>     <str name="id">2</str>
>>     <float name="score">0.14785346</float></doc>
>>   <doc>
>>     <str name="id">1</str>
>>     <float name="score">-0.004004207</float></doc>
>> </result>
>> <lst name="debug">
>>   <str name="rawquerystring">{!blended fl=text,text_nl}rare term</str>
>>   <str name="querystring">{!blended fl=text,text_nl}rare term</str>
>>   <str name="parsedquery">BlendedTermQuery(Blended(text:rare text:term
>> text_nl:rare text_nl:term))</str>
>>   <str name="parsedquery_toString">Blended(text:rare text:term
>> text_nl:rare text_nl:term)</str>
>>   <lst name="explain">
>>     <str name="3">
>> 0.1805489 = max plus 0.01 times others of:
>>   0.1805489 = weight(text_nl:term in 2) [], result of:
>>     0.1805489 = score(doc=2,freq=1.0 = termFreq=1.0
>> ), product of:
>>       0.18232156 = idf(docFreq=2, docCount=2)
>>       0.9902773 = tfNorm, computed from:
>>         1.0 = termFreq=1.0
>>         1.2 = parameter k1
>>         0.75 = parameter b
>>         2.5 = avgFieldLength
>>         2.56 = fieldLength
>> </str>
>>     <str name="2">
>> 0.14785345 = max plus 0.01 times others of:
>>   0.14638956 = weight(text_nl:rare in 1) [], result of:
>>     0.14638956 = score(doc=1,freq=1.0 = termFreq=1.0
>> ), product of:
>>       0.18232156 = idf(docFreq=2, docCount=2)
>>       0.8029196 = tfNorm, computed from:
>>         1.0 = termFreq=1.0
>>         1.2 = parameter k1
>>         0.75 = parameter b
>>         2.5 = avgFieldLength
>>         4.0 = fieldLength
>>   0.14638956 = weight(text_nl:term in 1) [], result of:
>>     0.14638956 = score(doc=1,freq=1.0 = termFreq=1.0
>> ), product of:
>>       0.18232156 = idf(docFreq=2, docCount=2)
>>       0.8029196 = tfNorm, computed from:
>>         1.0 = termFreq=1.0
>>         1.2 = parameter k1
>>         0.75 = parameter b
>>         2.5 = avgFieldLength
>>         4.0 = fieldLength
>> </str>
>>     <str name="1">
>> -0.004004207 = max plus 0.01 times others of:
>>   -0.20021036 = weight(text:rare in 0) [], result of:
>>     -0.20021036 = score(doc=0,freq=1.0 = termFreq=1.0
>> ), product of:
>>       -0.22314355 = idf(docFreq=2, docCount=1)
>>       0.89722675 = tfNorm, computed from:
>>         1.0 = termFreq=1.0
>>         1.2 = parameter k1
>>         0.75 = parameter b
>>         2.0 = avgFieldLength
>>         2.56 = fieldLength
>>   -0.20021036 = weight(text:term in 0) [], result of:
>>     -0.20021036 = score(doc=0,freq=1.0 = termFreq=1.0
>> ), product of:
>>       -0.22314355 = idf(docFreq=2, docCount=1)
>>       0.89722675 = tfNorm, computed from:
>>         1.0 = termFreq=1.0
>>         1.2 = parameter k1
>>         0.75 = parameter b
>>         2.0 = avgFieldLength
>>         2.56 = fieldLength
>> </str>
>>
>> What am I doing wrong? Or did I catch a bug?
>>
>> Thanks,
>> Markus
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

