[ https://issues.apache.org/jira/browse/LUCENE-7589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15741431#comment-15741431 ]
Adrien Grand commented on LUCENE-7589:
--------------------------------------

Thanks Mike for having a look. The patch does not show much reduction in spite of the quality issues of the dataset, since existing fields tend not to have outliers. However, if you add a new field that stores the average number of miles per hour as a long doc values field, it highlights the quality issues of this dataset: disk usage for this field goes from 40 to 15.7 bits per value (-60%) with the patch.

> Prevent outliers from raising the number of bits of everyone with numeric doc values
> ------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7589
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7589
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-7589.patch
>
>
> Today we encode entire segments with a single number of bits per value. It
> was done this way because it was faster, but it also means a single outlier
> can significantly increase the space requirements. I think we should have
> protection against that.
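To illustrate why a single outlier matters, here is a minimal Python sketch that compares one bit width for a whole segment against one bit width per block. It is not the code from LUCENE-7589.patch: the block size of 128, the 72-bit per-block header (a 64-bit minimum plus an 8-bit width), and the function names are all assumptions chosen for illustration.

```python
def bits_required(values):
    """Bits per value when each value is stored as an offset from the minimum."""
    span = max(values) - min(values)
    return max(1, span.bit_length())


def single_width_bits(values):
    """Total bits when the whole segment shares one bit width."""
    return bits_required(values) * len(values)


def per_block_bits(values, block_size, header_bits=72):
    """Total bits when each block picks its own bit width.

    header_bits is a hypothetical per-block overhead (64-bit minimum
    plus 8-bit width); a real format may lay this out differently.
    """
    total = 0
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        total += header_bits + bits_required(block) * len(block)
    return total


# 4096 small values (0..99) with a single large outlier.
values = [i % 100 for i in range(4096)]
values[1234] = 1 << 40

whole = single_width_bits(values) / len(values)
blocked = per_block_bits(values, block_size=128) / len(values)

print(f"single width: {whole} bits per value")    # 41.0
print(f"per block:    {blocked} bits per value")  # 8.625
```

With a single width, the one outlier forces all 4096 values to 41 bits. With per-block widths, only the block containing the outlier pays that cost, and the other 31 blocks stay at 7 bits per value plus the assumed header.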