Dawid Weiss commented on LUCENE-8221:

Using a percentage of numDocs to set maxDocFreq effectively lowers the threshold, 
giving a stricter estimate: it may discard terms that look frequent only because 
docFreq still counts deleted documents, even though they would fall under the 
docFreq > maxDocFreq cutoff once deletions are taken into account (so they are no 
longer actually frequent). Conversely, using maxDoc on segments with heavy 
deletions potentially lets through many terms that could have been discarded 
early, without ever being inserted into the queue at all. 
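To make the trade-off concrete, here is a small worked example with made-up numbers (maxDoc, numDocs, docFreq, and the class name below are hypothetical, not taken from any real index or from the patch):

```java
public class MaxDocFreqThresholdDemo {
  // Percentage-based threshold, mirroring maxPercentage * docCount / 100.
  static int threshold(int pct, int docCount) {
    return pct * docCount / 100;
  }

  public static void main(String[] args) {
    // Hypothetical segment with heavy deletions.
    int maxDoc = 1_000_000;  // all documents, deleted ones included
    int numDocs = 600_000;   // live documents only
    int pct = 50;            // as if setMaxDocFreqPct(50) were called

    int thresholdFromNumDocs = threshold(pct, numDocs); // 300000 (stricter)
    int thresholdFromMaxDoc = threshold(pct, maxDoc);   // 500000 (looser)

    // docFreq is computed over all documents, deleted ones included,
    // so it can overstate how frequent a term really is among live docs.
    int docFreq = 400_000;

    // numDocs-based threshold: 400000 > 300000, the term is discarded.
    System.out.println("numDocs-based: discarded = " + (docFreq > thresholdFromNumDocs));
    // maxDoc-based threshold: 400000 <= 500000, the term survives into the queue.
    System.out.println("maxDoc-based:  discarded = " + (docFreq > thresholdFromMaxDoc));
  }
}
```

The same term is "too frequent" under one threshold and acceptable under the other, which is exactly the ambiguity described above.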

What further complicates things here is that the term-scoring similarity function 
in createQueue also takes numDocs as one of its arguments, so switching to maxDoc 
across the board would definitely change the output for people -- I don't know 
whether it'd be better or worse.
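As for the overflow itself, a minimal sketch of the kind of fix involved (validate the argument, widen to long before multiplying) -- this is an illustration with a hypothetical class name, not necessarily the exact shape of the attached patch:

```java
public class SetMaxDocFreqPctFix {
  // Broken variant from the issue: the int multiplication overflows
  // before the division whenever maxPercentage * numDocs exceeds
  // Integer.MAX_VALUE.
  static int overflowing(int maxPercentage, int numDocs) {
    return maxPercentage * numDocs / 100;
  }

  // Sketched fix: reject out-of-range percentages and do the arithmetic
  // in long, failing loudly if the result still exceeds int range.
  static int widened(int maxPercentage, int numDocs) {
    if (maxPercentage < 0 || maxPercentage > 100) {
      throw new IllegalArgumentException(
          "maxPercentage must be in [0..100], got: " + maxPercentage);
    }
    return Math.toIntExact(maxPercentage * (long) numDocs / 100);
  }

  public static void main(String[] args) {
    int numDocs = 30_000_000; // past the ~28.6M tipping point for 75%
    System.out.println(overflowing(75, numDocs)); // negative due to overflow
    System.out.println(widened(75, numDocs));     // 22500000
  }
}
```

With maxPercentage capped at 100, the long product can never exceed 100 * Integer.MAX_VALUE, so the division by 100 always brings the result back into int range and Math.toIntExact never actually throws.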

> MoreLikeThis.setMaxDocFreqPct can easily int-overflow on larger indexes
> -----------------------------------------------------------------------
>                 Key: LUCENE-8221
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8221
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Minor
>         Attachments: LUCENE-8221.patch
> {code}
>   public void setMaxDocFreqPct(int maxPercentage) {
>     this.maxDocFreq = maxPercentage * ir.numDocs() / 100;
>   }
> {code}
> The above overflows the integer range into negative numbers on even fairly 
> small indexes (for maxPercentage = 75, it happens at just over 28 million 
> documents).
> We should do the computation in long range so that it doesn't overflow, and 
> add stricter argument validation.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
