[ https://issues.apache.org/jira/browse/LUCENE-8221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16428670#comment-16428670 ]
Dawid Weiss commented on LUCENE-8221:
-------------------------------------

I don't mind changing the formula (even if I disagree that catering for internal representation of deleted documents justifies this), but not as part of this issue. Changing the formula will change the results people get from MLT: this should go into a major release, not a point release; what I patched was a trivial overflow problem that doesn't touch any internals.

bq. And so is the range check in your patch, because percentage can be larger than 100% with the broken numDocs formula used here. When a percentage can be bigger than 100, man that's your first sign that shit is wrong!

To me the percentage remains within 0-100% with numDocs; you compute the threshold against the current state of your index (live documents). The computed value of the cutoff threshold is correct, it is the comparison against docFreq that isn't sound here because docFreq doesn't have deleted documents information. I don't quite understand the way you perceive only one of those as "correct" vs. "utter shit" and I don't think I want to explore this subject further.

Is it ok if I apply the overflow fix against 7.x, master and create a new issue cutting over to maxDoc (everywhere in mlt) and apply it to master only? If no, speak up.

> MoreLikeThis.setMaxDocFreqPct can easily int-overflow on larger indexes
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-8221
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8221
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Minor
>         Attachments: LUCENE-8221.patch
>
>
> {code}
>   public void setMaxDocFreqPct(int maxPercentage) {
>     this.maxDocFreq = maxPercentage * ir.numDocs() / 100;
>   }
> {code}
> The above overflows integer range into negative numbers on even fairly small
> indexes (for maxPercentage = 75, it happens for just over 28 million
> documents).
> We should make the computations on long range so that it doesn't overflow and
> have a more strict argument validation.
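
To make the overflow in the issue description concrete, here is a minimal standalone sketch. It is not the contents of LUCENE-8221.patch: the class MaxDocFreqSketch and both method names are hypothetical, and numDocs is passed in as a plain int instead of being read from an IndexReader. The first method mirrors the int-only formula quoted above; the second shows one possible way to do the multiplication in long arithmetic with a 0-100 range check.

```java
// Sketch only: MaxDocFreqSketch and its methods are hypothetical names, not Lucene API.
final class MaxDocFreqSketch {

  /** Mirrors the int-only formula from the issue; the multiplication wraps for large indexes. */
  static int overflowing(int maxPercentage, int numDocs) {
    return maxPercentage * numDocs / 100;
  }

  /** Widens to long before multiplying and rejects percentages outside 0..100. */
  static int widened(int maxPercentage, int numDocs) {
    if (maxPercentage < 0 || maxPercentage > 100) {
      throw new IllegalArgumentException(
          "maxPercentage must be in [0, 100], got: " + maxPercentage);
    }
    long threshold = (long) maxPercentage * numDocs / 100;
    // With maxPercentage <= 100 and numDocs >= 0 the result never exceeds numDocs,
    // so the narrowing conversion cannot fail for valid document counts.
    return Math.toIntExact(threshold);
  }

  public static void main(String[] args) {
    // Smallest document count for which 75 * numDocs exceeds Integer.MAX_VALUE.
    int numDocs = 28_633_116;
    System.out.println("int formula:  " + overflowing(75, numDocs));
    System.out.println("long formula: " + widened(75, numDocs));
  }
}
```

With maxPercentage = 75 the int product first exceeds Integer.MAX_VALUE (2,147,483,647) at 28,633,116 documents, which matches the "just over 28 million" figure in the description. The sketch deliberately stays with a numDocs-style count; whether the threshold should be computed against maxDoc instead is the separate question raised in the comment above.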