[ 
https://issues.apache.org/jira/browse/LUCENE-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16225972#comment-16225972
 ] 

Robert Muir commented on LUCENE-8007:
-------------------------------------

Based on this issue, SimilarityBase can be improved and simplified with 
pseudocode like this:

{code}
long totalTermFreq = termStats.totalTermFreq();
long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
if (totalTermFreq == -1) {
    // term frequencies were omitted, so we assume all tf values for the field were 1
    assert sumTotalTermFreq == -1;
    totalTermFreq = termStats.docFreq(); // appears 1 time per docFreq
    sumTotalTermFreq = collectionStats.sumDocFreq(); // number of postings where each freq is 1
}
{code}

I think this is better than bogusly setting sumTotalTermFreq to docFreq and 
avgdl to 1 like we do today. It should behave much better for the omitTF case.
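
A minimal standalone sketch of that substitution, for illustration only. The 
class and method names (OmitTfStats, effectiveTotalTermFreq, 
effectiveSumTotalTermFreq) are hypothetical and not part of Lucene, and the 
stats are passed in as plain longs rather than read from termStats and 
collectionStats:

```java
final class OmitTfStats {
    /** Sentinel meaning "this statistic was not tracked". */
    static final long NOT_AVAILABLE = -1;

    /** Falls back to docFreq: with every freq assumed to be 1, a term occurs once per document. */
    static long effectiveTotalTermFreq(long totalTermFreq, long docFreq) {
        return totalTermFreq == NOT_AVAILABLE ? docFreq : totalTermFreq;
    }

    /** Falls back to sumDocFreq: the number of postings, each assumed to have freq 1. */
    static long effectiveSumTotalTermFreq(long sumTotalTermFreq, long sumDocFreq) {
        return sumTotalTermFreq == NOT_AVAILABLE ? sumDocFreq : sumTotalTermFreq;
    }
}
```

Tracked values pass through unchanged, so only the omitTF case is affected.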

Note that BM25 has the same problem for its avgdl computation too, which should 
also be fixed.
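
As a rough illustration of what that could mean for avgdl, here is a 
standalone sketch. It is not the actual BM25Similarity code: the class and 
method names are hypothetical, and it assumes a positive docCount is available 
for the field.

```java
final class AvgFieldLength {
    /**
     * Average field length, using sumDocFreq as the total length when
     * sumTotalTermFreq is -1 (freqs omitted, each posting counted as freq 1).
     */
    static float avgFieldLength(long sumTotalTermFreq, long sumDocFreq, long docCount) {
        if (docCount <= 0) {
            throw new IllegalArgumentException("docCount must be positive");
        }
        long totalLength = sumTotalTermFreq == -1 ? sumDocFreq : sumTotalTermFreq;
        return (float) (totalLength / (double) docCount);
    }
}
```

For example, a field with 3 documents and 12 postings gets an avgdl of 4 in 
the omitTF case rather than a hardcoded 1.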

It makes me wonder whether we should reconsider returning -1 for a term's 
totalTermFreq and a field's sumTotalTermFreq. We could instead substitute 
docFreq and sumDocFreq, which are the same values we would get if we had 
actually tracked freqs of 1 and computed these stats across them for the field. 
Postings lists already return 1 as the freq in such cases today, so this seems 
consistent and may simplify code like CheckIndex as well.

Returning -1 doesn't really provide value; I think it's just the codec API 
showing too much of its guts. If you really want to know whether freqs were 
omitted for the field (versus all being 1), you can inspect the IndexOptions 
for that.
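
A sketch of such a check, kept standalone so it runs without Lucene on the 
classpath. The nested enum is a local stand-in that mirrors the constants of 
Lucene's IndexOptions, and the class and method names are hypothetical; in real 
code the value would presumably come from the field's FieldInfo.

```java
final class FreqsOmittedCheck {
    // Local stand-in mirroring the constants of Lucene's IndexOptions enum.
    enum IndexOptions {
        NONE,
        DOCS,
        DOCS_AND_FREQS,
        DOCS_AND_FREQS_AND_POSITIONS,
        DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
    }

    /** True if the field is indexed but term frequencies were not recorded. */
    static boolean freqsOmitted(IndexOptions options) {
        return options == IndexOptions.DOCS;
    }
}
```

This separates "freqs were omitted" from "freqs were tracked and all happen to 
be 1" without relying on a -1 sentinel in the statistics.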


> Require that codecs always store totalTermFreq, sumDocFreq and 
> sumTotalTermFreq
> -------------------------------------------------------------------------------
>
>                 Key: LUCENE-8007
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8007
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>             Fix For: master (8.0)
>
>         Attachments: LUCENE-8007.patch, LUCENE-8007.patch
>
>
> Javadocs allow codecs to not store some index statistics. Given discussion 
> that occurred on LUCENE-4100, this was mostly implemented this way to support 
> pre-flex codecs. We should now require that all codecs store these statistics.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
