[
https://issues.apache.org/jira/browse/LUCENE-8025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Robert Muir updated LUCENE-8025:
--------------------------------
Attachment: LUCENE-8025.patch
Patch attached. It falls back to the bogus value only if sumDocFreq is
unavailable, which doesn't happen with any codec since Lucene 4 or so.
Note that for SimilarityBase it corrects not just avgdl but also
numberOfFieldTokens, which was previously (bogusly) set to docFreq, as if the
term being scored were the only one in the collection! I will update tests
across more sims that are sensitive to this, such as LM and DFI, to see
whether anything improves.
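
For readers without the patch at hand, the fallback order described above can
be sketched roughly as follows. This is not the patch itself: the class and
method names are hypothetical, the statistics are passed in as plain longs
rather than read from Lucene's collection statistics, and the use of -1 as the
"unavailable" sentinel is an assumption.

```java
// Hypothetical, self-contained sketch of the fallback order; not the actual patch.
final class AvgFieldLengthSketch {
    // Assumption: an unavailable statistic is reported as -1.
    static final long UNAVAILABLE = -1L;

    static float avgFieldLength(long sumTotalTermFreq, long sumDocFreq, long docCount) {
        long totalTokens = sumTotalTermFreq;
        if (totalTokens == UNAVAILABLE) {
            // Frequencies were omitted: every posting counts as tf = 1,
            // so the number of postings stands in for the token count.
            if (sumDocFreq == UNAVAILABLE) {
                // Last resort only: the old bogus value.
                return 1f;
            }
            totalTokens = sumDocFreq;
        }
        return (float) (totalTokens / (double) docCount);
    }

    public static void main(String[] args) {
        // Frequencies omitted, 50 postings over 10 documents.
        System.out.println(avgFieldLength(UNAVAILABLE, 50L, 10L));
        // Frequencies indexed: sumTotalTermFreq is used directly.
        System.out.println(avgFieldLength(120L, 50L, 10L));
        // Neither statistic available: falls back to 1.
        System.out.println(avgFieldLength(UNAVAILABLE, UNAVAILABLE, 10L));
    }
}
```

The same substituted token count would feed numberOfFieldTokens in the
SimilarityBase case, instead of docFreq.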
> compute avgdl correctly for DOCS_ONLY
> -------------------------------------
>
> Key: LUCENE-8025
> URL: https://issues.apache.org/jira/browse/LUCENE-8025
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Robert Muir
> Attachments: LUCENE-8025.patch
>
>
> Spinoff of LUCENE-8007:
> If you omit term frequencies, we should score as if all tf values were 1.
> This is the way it worked for e.g. ClassicSimilarity and you can understand
> how it degrades.
> However, for sims such as BM25, we currently bail out of computing avg
> doclength (and just return a bogus value of 1), which also screws up length
> normalization, a separate matter.
> Instead of a bogus value, we should substitute sumDocFreq for
> sumTotalTermFreq (all postings have freq of 1, since you omitted them).
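
To illustrate why a bogus avgdl of 1 matters, here is a small sketch of the
textbook BM25 term-frequency factor with tf fixed at 1. The class name and the
sample numbers are made up for illustration, and it assumes document length is
counted consistently with tf = 1 (one token per posting); k1 = 1.2 and
b = 0.75 are the usual BM25 defaults. It is not Lucene's actual scoring code.

```java
// Hypothetical illustration of BM25 length normalization; not Lucene code.
final class Bm25LengthNormSketch {
    static final double K1 = 1.2;
    static final double B = 0.75;

    // Textbook BM25 tf factor: tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl)).
    static double tfNorm(double tf, double dl, double avgdl) {
        double lengthNorm = 1 - B + B * (dl / avgdl);
        return tf * (K1 + 1) / (tf + K1 * lengthNorm);
    }

    public static void main(String[] args) {
        double tf = 1;   // frequencies omitted, so every tf is treated as 1
        double dl = 20;  // made-up document length, equal to the made-up average
        System.out.println("avgdl = 1 (bogus):   " + tfNorm(tf, dl, 1));
        System.out.println("avgdl = 20 (actual): " + tfNorm(tf, dl, 20));
    }
}
```

With the bogus average, a document of exactly average length is treated as if
it were twenty times longer than average and is penalized heavily; with the
substituted average it receives no length penalty.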
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)