[ https://issues.apache.org/jira/browse/LUCENE-8020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir updated LUCENE-8020:
--------------------------------
Attachment: LUCENE-8020.patch
I reviewed the callers of termStatistics and also found TermAutomatonQuery in
sandbox (it scores like a phrase query but currently has the same issue as
SpanOrQuery when some terms don't exist). I fixed it the same way and added
unit tests. I think it's ready.
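As a rough illustration of the idea (not the actual patch), the fix amounts to
dropping terms that do not exist before their statistics ever reach the
similarity. The sketch below is self-contained and does not use Lucene classes;
TermStat and existingOnly are hypothetical names made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class ExistingTermsSketch {

    /** Hypothetical holder for per-term statistics; not a Lucene class. */
    static final class TermStat {
        final String term;
        final long docFreq;
        final long totalTermFreq;

        TermStat(String term, long docFreq, long totalTermFreq) {
            this.term = term;
            this.docFreq = docFreq;
            this.totalTermFreq = totalTermFreq;
        }
    }

    /** Keeps only terms that occur in at least one document (docFreq > 0). */
    static List<TermStat> existingOnly(List<TermStat> candidates) {
        List<TermStat> kept = new ArrayList<>();
        for (TermStat s : candidates) {
            if (s.docFreq > 0) {
                kept.add(s);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<TermStat> all = new ArrayList<>();
        all.add(new TermStat("lucene", 12, 40));
        all.add(new TermStat("nosuchterm", 0, 0)); // term absent from the index
        // Only the first term would be handed to the scoring formula.
        System.out.println(existingOnly(all).size());
    }
}
```

With this kind of filtering in the caller, a formula never sees docFreq=0 and
needs no special casing for it.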
> Don't force sim to score bogus terms (e.g. docfreq=0)
> -----------------------------------------------------
>
> Key: LUCENE-8020
> URL: https://issues.apache.org/jira/browse/LUCENE-8020
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Robert Muir
> Attachments: LUCENE-8020.patch, LUCENE-8020.patch, LUCENE-8020.patch
>
>
> Today all sim formulas have to be "hacked" to deal with the fact that they
> may be passed stats such as docFreq=0, totalTermFreq=0. This happens easily
> with spans and there is even a dedicated test for it. All formulas have hacks
> such as what you see in https://issues.apache.org/jira/browse/LUCENE-6818:
> Instead of:
> {code}
> expected = stats.getTotalTermFreq() * docLen / stats.getNumberOfFieldTokens();
> {code}
> they must do tricks such as:
> {code}
> expected = (1 + stats.getTotalTermFreq()) * docLen / (1 + stats.getNumberOfFieldTokens());
> {code}
> There is no good reason for this; it is just sloppiness in the
> Query/Weight/Scorer API. I think formulas should work unmodified: we
> shouldn't pass terms that don't exist or bogus statistics.
> This adds a lot of complexity to the scoring API and makes it difficult to
> have meaningful/useful explanations, to debug problems, etc. It also makes it
> really hard to add a new sim.
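
To make the two formulas quoted above concrete, here is a minimal standalone
sketch that evaluates both with ordinary and with zero statistics. It assumes
docLen is a floating-point value and passes the statistics as plain long
arguments instead of a stats object; the class and method names are
hypothetical and not part of Lucene.

```java
final class ZeroStatsSketch {

    /** The formula as written in the paper: no special casing. */
    static double plain(long totalTermFreq, double docLen, long numberOfFieldTokens) {
        return totalTermFreq * docLen / numberOfFieldTokens;
    }

    /** The "+1" workaround that tolerates zero statistics. */
    static double padded(long totalTermFreq, double docLen, long numberOfFieldTokens) {
        return (1 + totalTermFreq) * docLen / (1 + numberOfFieldTokens);
    }

    public static void main(String[] args) {
        // Ordinary statistics: both variants give a finite value,
        // but the padded one no longer matches the original formula.
        System.out.println(plain(5, 10.0, 100));
        System.out.println(padded(5, 10.0, 100));

        // All-zero statistics: the plain formula computes 0.0 / 0.0, which is NaN.
        System.out.println(Double.isNaN(plain(0, 10.0, 0)));
        System.out.println(padded(0, 10.0, 0));
    }
}
```

The padded variant avoids NaN but changes the result for every real term as
well, which is the cost the description is objecting to.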
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)