[ https://issues.apache.org/jira/browse/LUCENE-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16211286#comment-16211286 ]
Timothy M. Rodriguez commented on LUCENE-8000:
----------------------------------------------

Makes sense, agreed on both points.

> Document Length Normalization in BM25Similarity correct?
> --------------------------------------------------------
>
>                 Key: LUCENE-8000
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8000
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Christoph Goller
>            Priority: Minor
>
> The length of an individual document only counts the number of positions in
> the document, since discountOverlaps defaults to true.
> {quote}
>  @Override
>  public final long computeNorm(FieldInvertState state) {
>    final int numTerms = discountOverlaps ? state.getLength() - state.getNumOverlap() : state.getLength();
>    int indexCreatedVersionMajor = state.getIndexCreatedVersionMajor();
>    if (indexCreatedVersionMajor >= 7) {
>      return SmallFloat.intToByte4(numTerms);
>    } else {
>      return SmallFloat.floatToByte315((float) (1 / Math.sqrt(numTerms)));
>    }
>  }{quote}
> Measuring document length this way seems perfectly OK to me. What bothers
> me is that the average document length is based on sumTotalTermFreq for a
> field. As far as I understand, that sums up totalTermFreq for all terms of a
> field, and therefore counts every term occurrence, including those that
> overlap at the same position.
> {quote}
>  protected float avgFieldLength(CollectionStatistics collectionStats) {
>    final long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
>    if (sumTotalTermFreq <= 0) {
>      return 1f;       // field does not exist, or stat is unsupported
>    } else {
>      final long docCount = collectionStats.docCount() == -1 ? collectionStats.maxDoc() : collectionStats.docCount();
>      return (float) (sumTotalTermFreq / (double) docCount);
>    }
>  }{quote}
> Are we comparing apples and oranges in the final scoring?
> I haven't run any benchmarks and I am not sure whether this has a serious
> effect. It just means that documents that have synonyms or, in our case,
> different normal forms of tokens at the same position appear shorter and
> therefore get higher scores than they should, and that we do not use the
> whole spectrum of relative document length in BM25.
> I think for BM25 discountOverlaps should default to false.
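To make the suspected mismatch concrete, here is a minimal standalone sketch. It does not use Lucene: the class name, the helper methods, and the three example documents are all hypothetical, and the SmallFloat norm encoding is ignored. It assumes, as the report does, that the average is taken over all tokens (overlaps included), while the per-document length drops overlaps when discountOverlaps is true.

```java
// Hypothetical illustration only; these helpers mirror the quoted logic
// but are not Lucene classes or Lucene API.
class Bm25LengthMismatch {

    // Per-document length, following the numTerms expression in computeNorm.
    static int docLength(int length, int numOverlap, boolean discountOverlaps) {
        return discountOverlaps ? length - numOverlap : length;
    }

    // Average field length, following avgFieldLength: all tokens / doc count.
    static double avgFieldLength(long sumTotalTermFreq, long docCount) {
        return sumTotalTermFreq <= 0 ? 1.0 : sumTotalTermFreq / (double) docCount;
    }

    // BM25 length-normalization factor that multiplies k1 in the denominator.
    // A smaller factor means a higher score for the same term frequency.
    static double lengthNorm(double b, double docLen, double avgLen) {
        return 1 - b + b * docLen / avgLen;
    }

    public static void main(String[] args) {
        // Made-up field: total tokens per document and how many of them
        // share a position with another token (position increment 0).
        int[] lengths = {120, 100, 80};
        int[] overlaps = {20, 0, 0};

        long sumTotalTermFreq = 0;
        for (int len : lengths) {
            sumTotalTermFreq += len; // overlapping tokens are counted here
        }
        double avg = avgFieldLength(sumTotalTermFreq, lengths.length);

        double b = 0.75;
        for (int i = 0; i < lengths.length; i++) {
            int discounted = docLength(lengths[i], overlaps[i], true);
            int full = docLength(lengths[i], overlaps[i], false);
            System.out.printf("doc %d: discounted=%d norm=%.3f | full=%d norm=%.3f%n",
                    i, discounted, lengthNorm(b, discounted, avg),
                    full, lengthNorm(b, full, avg));
        }
    }
}
```

In this made-up field, the document with 20 overlapping tokens is compared against an average that still includes those tokens, so its normalization factor is smaller than it would be if both sides counted tokens the same way. That is the "apples and oranges" effect the report describes. Whether it matters in practice is not something this sketch can show.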