[ 
https://issues.apache.org/jira/browse/LUCENE-8221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16428670#comment-16428670
 ] 

Dawid Weiss commented on LUCENE-8221:
-------------------------------------

I don't mind changing the formula (even if I disagree that catering for 
internal representation of deleted documents justifies this), but not as part 
of this issue. Changing the formula will change the results people get from 
MLT: this should go into a major release, not a point release; what I patched 
was a trivial overflow problem that doesn't touch any internals.

bq. And so is the range check in your patch, because percentage can be larger 
than 100% with the broken numDocs formula used here. When a percentage can be 
bigger than 100, man that's your first sign that shit is wrong!

To me the percentage remains within 0-100% with numDocs; you compute the 
threshold against the current state of your index (live documents). The 
computed cutoff threshold is correct; it is the comparison against docFreq that 
isn't sound here, because docFreq carries no information about deleted 
documents. I don't quite understand the way you perceive only one of those as 
"correct" vs. "utter shit" and I don't think I want to explore this subject 
further.
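
To make the mismatch concrete, here is a small arithmetic sketch. It does not 
use Lucene; the class name, helper names and all document counts are made up 
for illustration. It assumes that terms whose docFreq exceeds the threshold 
are skipped, and that docFreq still counts deleted documents while numDocs 
counts only live ones.

```java
final class LiveVsTotalThresholdDemo {
  // Threshold as a percentage of some document count, computed in long arithmetic.
  static int threshold(int maxPercentage, int docCount) {
    return Math.toIntExact((long) maxPercentage * docCount / 100);
  }

  // docFreq expressed as a whole-number percentage of a document count.
  static int percentOf(int docFreq, int docCount) {
    return Math.toIntExact((long) docFreq * 100 / docCount);
  }

  public static void main(String[] args) {
    int maxDoc = 1_000;  // hypothetical: all documents, deleted ones included
    int numDocs = 400;   // hypothetical: live documents only
    int docFreq = 300;   // hypothetical: term frequency, still counting deleted documents
    int pct = 50;

    int liveThreshold = threshold(pct, numDocs);  // relative to live documents
    int totalThreshold = threshold(pct, maxDoc);  // relative to all documents

    System.out.println("numDocs threshold: " + liveThreshold
        + ", term skipped: " + (docFreq > liveThreshold));
    System.out.println("maxDoc threshold:  " + totalThreshold
        + ", term skipped: " + (docFreq > totalThreshold));

    // A docFreq that counts deleted documents can exceed the live document count.
    System.out.println("450 of " + numDocs + " live docs = "
        + percentOf(450, numDocs) + "%");
  }
}
```

Both thresholds are valid percentages of their own base. The disagreement is 
only about which base is comparable to docFreq.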

Is it OK if I apply the overflow fix to 7.x and master, then create a new issue 
that cuts over to maxDoc (everywhere in MLT) and apply that to master only? If 
not, speak up.

> MoreLikeThis.setMaxDocFreqPct can easily int-overflow on larger indexes
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-8221
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8221
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Minor
>         Attachments: LUCENE-8221.patch
>
>
> {code}
>   public void setMaxDocFreqPct(int maxPercentage) {
>     this.maxDocFreq = maxPercentage * ir.numDocs() / 100;
>   }
> {code}
> The above overflows the integer range into negative numbers even on fairly 
> small indexes (for maxPercentage = 75, it happens at just over 28 million 
> documents).
> We should perform the computation in long arithmetic so that it doesn't 
> overflow, and validate the argument more strictly.
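
For illustration, a standalone sketch of the overflow and of one possible 
long-based computation follows. It is not the attached patch: the class and 
method names are hypothetical, numDocs is passed in as a plain int instead of 
being read from an IndexReader, and the [0, 100] range check is an assumption 
about what stricter validation could look like.

```java
final class MaxDocFreqOverflowDemo {
  // Mirrors the expression from the issue: the multiplication happens in int
  // and wraps around once maxPercentage * numDocs exceeds Integer.MAX_VALUE.
  static int naiveMaxDocFreq(int maxPercentage, int numDocs) {
    return maxPercentage * numDocs / 100;
  }

  // Hypothetical alternative: validate the percentage, multiply in long, then
  // narrow back to int. With a percentage in [0, 100] and a non-negative
  // numDocs the result never exceeds numDocs, so narrowing cannot fail.
  static int safeMaxDocFreq(int maxPercentage, int numDocs) {
    if (maxPercentage < 0 || maxPercentage > 100) {
      throw new IllegalArgumentException(
          "maxPercentage must be in [0, 100]: " + maxPercentage);
    }
    long maxDocFreq = (long) maxPercentage * numDocs / 100;
    return Math.toIntExact(maxDocFreq);
  }

  public static void main(String[] args) {
    // Smallest document count for which 75 * numDocs exceeds Integer.MAX_VALUE.
    int numDocs = 28_633_116;
    System.out.println("int arithmetic:  " + naiveMaxDocFreq(75, numDocs));
    System.out.println("long arithmetic: " + safeMaxDocFreq(75, numDocs));
  }
}
```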



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

