[ https://issues.apache.org/jira/browse/LUCENE-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16225972#comment-16225972 ]
Robert Muir commented on LUCENE-8007:
-------------------------------------
Based on this issue, SimilarityBase can be improved and simplified considerably
with pseudocode like this:
{code}
long totalTermFreq = termStats.totalTermFreq();
long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
if (totalTermFreq == -1) {
  // term frequencies were omitted, so we assume all tf values for the field were 1
  assert sumTotalTermFreq == -1;
  totalTermFreq = termStats.docFreq(); // appears 1 time per docFreq
  sumTotalTermFreq = collectionStats.sumDocFreq(); // number of postings, where each freq is 1
}
{code}
I think this is better than bogusly setting sumTotalTermFreq to docFreq and
avgdl to 1 like we do today? It should behave much better for the omitTF case.
Note that BM25 has the same problem for its avgdl computation too, which should
also be fixed.
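
To make the avgdl difference concrete, here is a minimal self-contained sketch.
OmitTfFieldStats and both method names are hypothetical stand-ins, not Lucene
classes, and the "legacy" method only approximates the behaviour described
above (avgdl pinned to 1 when sumTotalTermFreq is -1).

```java
// Hypothetical stand-in for the few field-level statistics used here;
// this is not Lucene's CollectionStatistics.
final class OmitTfFieldStats {
    final long docCount;         // documents with at least one term in the field
    final long sumDocFreq;       // total number of postings in the field
    final long sumTotalTermFreq; // sum of all term freqs, or -1 if freqs were omitted

    OmitTfFieldStats(long docCount, long sumDocFreq, long sumTotalTermFreq) {
        this.docCount = docCount;
        this.sumDocFreq = sumDocFreq;
        this.sumTotalTermFreq = sumTotalTermFreq;
    }

    // Simplified version of the current fallback: pretend every document has length 1.
    static double avgLengthLegacy(OmitTfFieldStats s) {
        if (s.sumTotalTermFreq == -1) {
            return 1.0;
        }
        return (double) s.sumTotalTermFreq / s.docCount;
    }

    // Proposed fallback: count each posting as one occurrence of its term.
    static double avgLengthProposed(OmitTfFieldStats s) {
        long tokens = s.sumTotalTermFreq == -1 ? s.sumDocFreq : s.sumTotalTermFreq;
        return (double) tokens / s.docCount;
    }

    public static void main(String[] args) {
        // 4 documents, 10 postings, freqs omitted
        OmitTfFieldStats s = new OmitTfFieldStats(4, 10, -1);
        System.out.println(avgLengthLegacy(s));
        System.out.println(avgLengthProposed(s));
    }
}
```

With the proposed fallback the average length reflects how many distinct terms
a document has in the field, instead of collapsing to a constant.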
This makes me wonder whether we should reconsider returning -1 for a term's
totalTermFreq and a field's sumTotalTermFreq. We could instead substitute
docFreq and sumDocFreq, which are exactly the values we would get if we
actually tracked freqs of 1 and computed these stats across them for the field.
Postings lists already return 1 as the freq in such cases today, so this seems
consistent, and it may simplify code such as CheckIndex as well.
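
A small sketch of why the substitution is exact, using plain arrays instead of
real postings. FreqOneStats and its methods are hypothetical; each inner array
stands for one term's postings list (doc IDs), and every posting is assumed to
report a freq of 1.

```java
// Hypothetical model: field[t] holds the doc IDs in the postings list of term t.
final class FreqOneStats {

    // totalTermFreq for one term, summing the freq (always 1) of each posting.
    static long totalTermFreq(int[] postingList) {
        long sum = 0;
        for (int i = 0; i < postingList.length; i++) {
            sum += 1; // freq reported for this posting when freqs are omitted
        }
        return sum;
    }

    // sumTotalTermFreq for the field: totalTermFreq summed over all terms.
    static long sumTotalTermFreq(int[][] field) {
        long sum = 0;
        for (int[] postingList : field) {
            sum += totalTermFreq(postingList);
        }
        return sum;
    }

    // sumDocFreq for the field: docFreq (postings list length) summed over all terms.
    static long sumDocFreq(int[][] field) {
        long sum = 0;
        for (int[] postingList : field) {
            sum += postingList.length;
        }
        return sum;
    }

    public static void main(String[] args) {
        int[][] field = { {0, 2, 5}, {1}, {0, 1, 2, 3} };
        System.out.println(totalTermFreq(field[0]) == field[0].length);
        System.out.println(sumTotalTermFreq(field) == sumDocFreq(field));
    }
}
```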
Returning -1 doesn't really provide value; I think it's just the codec API
showing too much of its guts. If you really want to know whether freqs were
omitted for the field (versus all being 1), you can inspect the IndexOptions
for that.
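
As a rough illustration of that check: the enum below is a local stand-in that
mirrors the constant names of Lucene's IndexOptions so the example runs without
Lucene on the classpath. FreqsOmittedCheck and freqsOmitted are hypothetical
names; in real code the value would presumably come from
FieldInfo.getIndexOptions().

```java
final class FreqsOmittedCheck {

    // Local stand-in mirroring the constants of org.apache.lucene.index.IndexOptions.
    enum IndexOptions {
        NONE,
        DOCS,
        DOCS_AND_FREQS,
        DOCS_AND_FREQS_AND_POSITIONS,
        DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
    }

    // True when the field is indexed but term frequencies are not recorded.
    // NONE means the field is not indexed at all, so it is not counted here.
    static boolean freqsOmitted(IndexOptions options) {
        return options == IndexOptions.DOCS;
    }

    public static void main(String[] args) {
        for (IndexOptions options : IndexOptions.values()) {
            System.out.println(options + " -> freqs omitted: " + freqsOmitted(options));
        }
    }
}
```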
> Require that codecs always store totalTermFreq, sumDocFreq and
> sumTotalTermFreq
> -------------------------------------------------------------------------------
>
> Key: LUCENE-8007
> URL: https://issues.apache.org/jira/browse/LUCENE-8007
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Fix For: master (8.0)
>
> Attachments: LUCENE-8007.patch, LUCENE-8007.patch
>
>
> Javadocs allow codecs to not store some index statistics. Given discussion
> that occurred on LUCENE-4100, this was mostly implemented this way to support
> pre-flex codecs. We should now require that all codecs store these statistics.