[ 
https://issues.apache.org/jira/browse/LUCENE-8025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-8025:
--------------------------------
    Attachment: LUCENE-8025.patch

Patch attached. It falls back to the bogus value (1) only if sumDocFreq is also 
unavailable, which doesn't happen with any codec since Lucene 4 or so.
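
As a rough illustration of that fallback order (not the patch itself), here is a minimal standalone Java sketch. The class and method names are hypothetical, the statistics are passed in as plain longs rather than read from Lucene's CollectionStatistics, and it assumes -1 is the sentinel for an unavailable statistic.

```java
// Hypothetical sketch of the avgdl fallback; not the actual patch code.
final class AvgdlSketch {
    private AvgdlSketch() {}

    // Assumption: -1 marks a statistic that is not available.
    static final long UNAVAILABLE = -1;

    /**
     * Average field length: total tokens in the field divided by the
     * number of documents that have the field.
     */
    public static float avgFieldLength(long sumTotalTermFreq, long sumDocFreq,
                                       long docCount, long maxDoc) {
        final long tokens;
        if (sumTotalTermFreq != UNAVAILABLE) {
            // Frequencies are indexed: use the real token count.
            tokens = sumTotalTermFreq;
        } else if (sumDocFreq != UNAVAILABLE) {
            // Frequencies omitted: every posting counts as tf=1, so the
            // number of postings stands in for the number of tokens.
            tokens = sumDocFreq;
        } else {
            // Neither statistic is available: last-resort bogus value.
            return 1f;
        }
        // Fall back to maxDoc when the per-field doc count is unavailable.
        final long docs = (docCount == UNAVAILABLE) ? maxDoc : docCount;
        return (float) (tokens / (double) docs);
    }

    public static void main(String[] args) {
        // Frequencies omitted, 300 postings over 100 docs.
        System.out.println(avgFieldLength(UNAVAILABLE, 300, 100, 100));
        // Frequencies indexed, 500 tokens over 100 docs.
        System.out.println(avgFieldLength(500, 300, 100, 100));
    }
}
```

With frequencies omitted, the sketch reports an average length of 3 for the first call instead of the constant 1, so length normalization still sees a meaningful value.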

Note that for SimilarityBase the patch corrects not just avgdl but also 
numberOfFieldTokens, which was previously (bogusly) set to docFreq, as if the 
term being scored were the only one in the collection! I will update tests 
across more sims that are sensitive to this, such as LM and DFI, to check for 
any improvement.
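
To see why numberOfFieldTokens matters, consider a toy add-one-smoothed collection probability of the kind a language-model sim might use. This is a hypothetical sketch with made-up numbers, not code from SimilarityBase or from the patch; it assumes that with frequencies omitted a term's occurrence count is taken to be its docFreq.

```java
// Hypothetical sketch: effect of numberOfFieldTokens on a collection probability.
final class FieldTokensSketch {
    private FieldTokensSketch() {}

    /** Add-one-smoothed estimate of P(term | collection). */
    public static double collectionProbability(long termOccurrences,
                                               long numberOfFieldTokens) {
        return (termOccurrences + 1.0) / (numberOfFieldTokens + 1.0);
    }

    public static void main(String[] args) {
        long docFreq = 10;      // postings for the term being scored
        long sumDocFreq = 1000; // postings for all terms in the field

        // Old behavior: numberOfFieldTokens == docFreq, so the term looks
        // like the only one in the collection and the probability is 1.
        double before = collectionProbability(docFreq, docFreq);

        // Corrected behavior: numberOfFieldTokens == sumDocFreq.
        double after = collectionProbability(docFreq, sumDocFreq);

        System.out.println("before=" + before + " after=" + after);
    }
}
```

In the old setup every term gets the same collection probability of 1 regardless of how rare it is, which is why sims that depend on this statistic are the ones worth retesting.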

> compute avgdl correctly for DOCS_ONLY
> -------------------------------------
>
>                 Key: LUCENE-8025
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8025
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>         Attachments: LUCENE-8025.patch
>
>
> Spinoff of LUCENE-8007:
> If you omit term frequencies, we should score as if all tf values were 1. 
> This is the way it worked for e.g. ClassicSimilarity and you can understand 
> how it degrades. 
> However, for sims such as BM25, we currently bail out on computing the average 
> document length (and just return a bogus value of 1), which also breaks length 
> normalization, a separate matter from term frequency.
> Instead of a bogus value, we should substitute sumDocFreq for 
> sumTotalTermFreq (all postings have freq of 1, since you omitted them).



