[ 
https://issues.apache.org/jira/browse/LUCENE-7589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15741431#comment-15741431
 ] 

Adrien Grand commented on LUCENE-7589:
--------------------------------------

Thanks Mike for having a look. The patch does not show much reduction on this 
dataset, in spite of its quality issues, because the existing fields tend not 
to have outliers. However, if you add a new field that stores the average 
speed in miles per hour as a long doc values field, the quality issues become 
visible: disk usage for this field goes from 40 to 15.7 bits per value (-60%) 
with the patch.
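
To illustrate why a single outlier matters, here is a minimal sketch that 
compares one bit width for a whole segment against one bit width per block. 
It is not the code from the patch: the class name, the method names and the 
block size of 1024 are assumptions made for the example, and the per-block 
overhead (storing each block's minimum and bit width) is ignored.

```java
final class BitsPerValueSketch {

    /** Bits needed to store any value in [min, max] as an unsigned delta from min. */
    static int bitsRequired(long min, long max) {
        // The subtraction may overflow; the result is then read as unsigned (64 bits).
        long range = max - min;
        return range == 0 ? 0 : 64 - Long.numberOfLeadingZeros(range);
    }

    /** Bit width chosen for values[start, end), based on that range's min and max. */
    static int widthOf(long[] values, int start, int end) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (int i = start; i < end; i++) {
            min = Math.min(min, values[i]);
            max = Math.max(max, values[i]);
        }
        return bitsRequired(min, max);
    }

    /** Bits per value when the whole array shares a single bit width. */
    static double singleWidth(long[] values) {
        return values.length == 0 ? 0 : widthOf(values, 0, values.length);
    }

    /** Average bits per value when each block of blockSize values picks its own width. */
    static double perBlockWidth(long[] values, int blockSize) {
        if (values.length == 0) {
            return 0;
        }
        long totalBits = 0;
        for (int start = 0; start < values.length; start += blockSize) {
            int end = Math.min(start + blockSize, values.length);
            totalBits += (long) widthOf(values, start, end) * (end - start);
        }
        return (double) totalBits / values.length;
    }

    public static void main(String[] args) {
        // Hypothetical data: plausible speeds plus one bogus value.
        long[] speeds = new long[4096];
        for (int i = 0; i < speeds.length; i++) {
            speeds[i] = 20 + (i % 60);
        }
        speeds[100] = 1_000_000_000_000L; // the outlier

        System.out.println("single width: " + singleWidth(speeds));
        System.out.println("per block:    " + perBlockWidth(speeds, 1024));
    }
}
```

With a single width, the outlier sets the width for every value. With 
per-block widths, only the block that contains the outlier pays for it.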

> Prevent outliers from raising the number of bits of everyone with numeric doc 
> values
> ------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7589
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7589
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-7589.patch
>
>
> Today we encode entire segments with a single number of bits per value. It 
> was done this way because it was faster, but it also means a single outlier 
> can significantly increase the space requirements. I think we should have 
> protection against that.



