[
https://issues.apache.org/jira/browse/LUCENE-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16211242#comment-16211242
]
Robert Muir commented on LUCENE-8000:
-------------------------------------
My point is that defaults are for typical use cases, and the default of
discountOverlaps meets that goal. It results in better (measured) performance
for many commonly used token filters such as common-grams, WDF,
synonyms, etc. I ran these tests before proposing the default; it was not done
flying blind.
You can still turn it off if you have an atypical use case.
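For context, an "overlap" here is a token with position increment 0, as emitted by synonym or stacked-token filters. A minimal sketch of how discounting overlaps affects the length that goes into the norm, using a hypothetical Token record (term text plus position increment; not Lucene's actual API):

```java
import java.util.List;

public class OverlapDemo {
    // Hypothetical token: term text plus its position increment.
    public record Token(String term, int posIncr) {}

    // Length as FieldInvertState counts it: every token increments length,
    // but a token with posIncr == 0 also increments numOverlap.
    public static int length(List<Token> tokens, boolean discountOverlaps) {
        int length = 0, numOverlap = 0;
        for (Token t : tokens) {
            length++;
            if (t.posIncr() == 0) numOverlap++;
        }
        return discountOverlaps ? length - numOverlap : length;
    }

    public static void main(String[] args) {
        // "fast car" with a synonym "auto" stacked on the same position as "car".
        List<Token> tokens = List.of(
            new Token("fast", 1),
            new Token("car", 1),
            new Token("auto", 0)); // synonym: posIncr 0, overlaps "car"
        System.out.println(length(tokens, true));  // 2: the stacked synonym is discounted
        System.out.println(length(tokens, false)); // 3: every token counted
    }
}
```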
I don't think we need to modify the current computation based on
sumTotalTermFreq/docCount without relevance measurements (on multiple datasets)
indicating that it improves default/common use cases in statistically
significant ways. Index statistics are expensive, and we should keep things
simple and minimal.
Counting positions would be an entirely different thing, and it mixes in more
differences that would all need to be measured. For example, it means that
stopwords which were removed would now count against a document's length,
which they don't today.
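To illustrate that last point: if length were defined from positions (last position plus one) rather than from the token count, the gap a stop filter leaves behind (a position increment of 2) would still contribute to length. A sketch with the same kind of hypothetical Token record (not Lucene's API):

```java
import java.util.List;

public class PositionDemo {
    // Hypothetical token: term text plus its position increment.
    public record Token(String term, int posIncr) {}

    // Token-based length: what gets counted today (ignoring overlaps here).
    public static int tokenCount(List<Token> tokens) {
        return tokens.size();
    }

    // Position-based length: last position plus one, so gaps left by
    // removed stopwords still count against the document.
    public static int positionCount(List<Token> tokens) {
        int pos = -1;
        for (Token t : tokens) pos += t.posIncr();
        return pos + 1;
    }

    public static void main(String[] args) {
        // "quick fox" after a stop filter removed "the" between the two
        // tokens: "fox" carries posIncr 2 to preserve the gap.
        List<Token> tokens = List.of(new Token("quick", 1), new Token("fox", 2));
        System.out.println(tokenCount(tokens));    // 2: only surviving tokens
        System.out.println(positionCount(tokens)); // 3: the removed stopword's position counts
    }
}
```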
> Document Length Normalization in BM25Similarity correct?
> --------------------------------------------------------
>
> Key: LUCENE-8000
> URL: https://issues.apache.org/jira/browse/LUCENE-8000
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Christoph Goller
> Priority: Minor
>
> Length of individual documents only counts the number of positions of a
> document since discountOverlaps defaults to true.
> {quote}
> @Override
> public final long computeNorm(FieldInvertState state) {
>     final int numTerms = discountOverlaps
>         ? state.getLength() - state.getNumOverlap()
>         : state.getLength();
>     int indexCreatedVersionMajor = state.getIndexCreatedVersionMajor();
>     if (indexCreatedVersionMajor >= 7) {
>         return SmallFloat.intToByte4(numTerms);
>     } else {
>         return SmallFloat.floatToByte315((float) (1 / Math.sqrt(numTerms)));
>     }
> }
> {quote}
> Measuring document length this way seems perfectly OK to me. What bothers
> me is that average document length is based on sumTotalTermFreq for a field.
> As far as I understand, that sums up the totalTermFreq of every term of the
> field, therefore counting positions of terms including those that overlap.
> {quote}
> protected float avgFieldLength(CollectionStatistics collectionStats) {
>     final long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
>     if (sumTotalTermFreq <= 0) {
>         return 1f; // field does not exist, or stat is unsupported
>     } else {
>         final long docCount = collectionStats.docCount() == -1
>             ? collectionStats.maxDoc()
>             : collectionStats.docCount();
>         return (float) (sumTotalTermFreq / (double) docCount);
>     }
> }
> {quote}
> Are we comparing apples and oranges in the final scoring?
> I haven't run any benchmarks, and I am not sure whether this has a serious
> effect. It just means that documents with synonyms (or, in our case,
> different normal forms of tokens on the same position) appear shorter and
> therefore get higher scores than they should, and that we do not use the
> whole spectrum of relative document length in BM25.
> I think for BM25, discountOverlaps should default to false.
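A small numeric sketch of the mismatch described in the quoted issue, using assumed toy values (not from a real index): the per-document norm discounts overlaps, while sumTotalTermFreq counts every posting, so the average is inflated relative to the per-document lengths.

```java
public class AvgLengthMismatch {
    // Mirrors the quoted avgFieldLength computation: sumTotalTermFreq / docCount.
    public static double avgFieldLength(long sumTotalTermFreq, long docCount) {
        return sumTotalTermFreq / (double) docCount;
    }

    public static void main(String[] args) {
        // Assumed toy index: doc 1 has 10 ordinary tokens plus 5 stacked
        // synonyms; doc 2 has 10 ordinary tokens and no synonyms.
        int numTerms1 = 15 - 5; // computeNorm with discountOverlaps: length - numOverlap = 10
        int numTerms2 = 10 - 0; // = 10

        // sumTotalTermFreq counts the synonym postings too: 15 + 10 = 25.
        double avg = avgFieldLength(15 + 10, 2); // 12.5

        // A consistent average over the discounted lengths would be 10.0,
        // so both documents look shorter than "average" here.
        System.out.println(numTerms1 + " " + numTerms2 + " " + avg); // 10 10 12.5
    }
}
```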
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]