[jira] [Commented] (LUCENE-8221) MoreLikeThis.setMaxDocFreqPct can easily int-overflow on larger indexes

Robert Muir (JIRA) Fri, 06 Apr 2018 05:03:15 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-8221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16428239#comment-16428239 ]


Robert Muir commented on LUCENE-8221:
-------------------------------------

The skew is easy to document, and easy to see, because we already explain it 
elsewhere in the scoring code:

[https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/CollectionStatistics.java#L43]
[https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/TermStatistics.java#L42]
[https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java#L139-L142]

{quote}
What further complicates things here is that the term score similarity function 
in createQueue also uses numDocs as one of the arguments, so changing to maxDoc 
all across the board would definitely change the output for people – don't know 
whether it'd be better or worse.
{quote}

That is broken too, then.

And so is the range check in your patch, because percentage can be larger than 
100% with the broken numDocs formula used here. When a percentage can be bigger 
than 100, man that's your first sign that shit is wrong!
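
To illustrate the point about percentages above 100: docFreq still counts deleted documents until they are merged away, while numDocs only counts live ones, so dividing one by the other can exceed 100%. A minimal sketch with made-up numbers (the pct helper and the class name are hypothetical, not anything in MoreLikeThis):

```java
class DocFreqPercentage {
    // Hypothetical helper: percentage of documents containing a term,
    // relative to whichever document count is passed as the denominator.
    static long pct(int docFreq, int docCount) {
        return 100L * docFreq / docCount;
    }

    public static void main(String[] args) {
        int maxDoc = 100;  // assumed: all documents, deleted ones included
        int numDocs = 50;  // assumed: live documents only (50 deleted)
        int docFreq = 80;  // assumed: term's doc frequency, deletions not subtracted

        System.out.println(pct(docFreq, numDocs)); // 160
        System.out.println(pct(docFreq, maxDoc));  // 80
    }
}
```

With maxDoc as the denominator the percentage stays within 0..100, which is what a range check on the argument would assume.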

> MoreLikeThis.setMaxDocFreqPct can easily int-overflow on larger indexes
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-8221
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8221
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Minor
>         Attachments: LUCENE-8221.patch
>
>
> {code}
>   public void setMaxDocFreqPct(int maxPercentage) {
>     this.maxDocFreq = maxPercentage * ir.numDocs() / 100;
>   }
> {code}
> The above overflows integer range into negative numbers on even fairly small 
> indexes (for maxPercentage = 75, it happens for just over 28 million 
> documents).
> We should do the computation in long range so that it doesn't overflow, and 
> add stricter argument validation.
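
The overflow described above can be reproduced without an index. The sketch below is only an illustration: numDocs is a plain int standing in for ir.numDocs(), and maxDocFreqLong is a hypothetical variant, not the attached patch.

```java
class MaxDocFreqOverflow {
    // Mirrors the int arithmetic quoted in the issue;
    // numDocs stands in for ir.numDocs().
    static int maxDocFreqInt(int maxPercentage, int numDocs) {
        return maxPercentage * numDocs / 100;
    }

    // Hypothetical variant: validate the argument, then widen to long
    // before multiplying so the intermediate product cannot overflow.
    static int maxDocFreqLong(int maxPercentage, int numDocs) {
        if (maxPercentage < 0 || maxPercentage > 100) {
            throw new IllegalArgumentException(
                "maxPercentage must be in [0, 100], got: " + maxPercentage);
        }
        return (int) ((long) maxPercentage * numDocs / 100);
    }

    public static void main(String[] args) {
        // 75 * 28_633_116 = 2_147_483_700, just past Integer.MAX_VALUE.
        int numDocs = 28_633_116;
        System.out.println(maxDocFreqInt(75, numDocs));  // -21474835
        System.out.println(maxDocFreqLong(75, numDocs)); // 21474837
    }
}
```

The cast back to int is safe here because the result is at most numDocs once maxPercentage is limited to 0..100.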



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

