[ 
https://issues.apache.org/jira/browse/ORC-210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219679#comment-16219679
 ] 

Owen O'Malley commented on ORC-210:
-----------------------------------

Ok, here are some large public data sets from UCI's machine learning data set 
archive to consider:

* HIGGS - https://archive.ics.uci.edu/ml/datasets/HIGGS
* HEPMASS - https://archive.ics.uci.edu/ml/datasets/HEPMASS

HIGGS has 11 million rows and 29 columns. There is still a lot of repetition, 
but in the first 16k rows of column 5, 99% of the values are distinct.
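
For anyone who wants to check this kind of ratio on their own copy of the data, 
here is a rough sketch. It assumes the data set has been downloaded as a gzipped 
CSV of numeric columns with no header row; the file name `HIGGS.csv.gz`, the 
helper names, and the choice of a 0-based column index are my assumptions, not 
something specified in this issue.

```python
import csv
import gzip
from itertools import islice


def distinct_ratio(lines, column, limit=16 * 1024):
    """Fraction of distinct values in `column` over the first `limit` rows.

    `lines` is any iterable of CSV text lines; `column` is a 0-based index
    (an assumption: the comment does not say how columns are counted).
    """
    seen = set()
    total = 0
    for row in islice(csv.reader(lines), limit):
        # Compare parsed values so "1.0" and "1.00" count as the same double.
        seen.add(float(row[column]))
        total += 1
    return len(seen) / total if total else 0.0


def distinct_ratio_gz(path, column, limit=16 * 1024):
    """Same as distinct_ratio, reading from a gzipped CSV file at `path`."""
    with gzip.open(path, "rt", newline="") as f:
        return distinct_ratio(f, column, limit)


# Hypothetical usage, assuming a local copy of the file:
# print(distinct_ratio_gz("HIGGS.csv.gz", column=5))
```

A ratio near 1.0 over a 16k-row window suggests a dictionary would buy little 
for that column, which is why data like this is useful for evaluating other 
encodings.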

> Add new ORC 2.0 encoding for Double, Float.
> -------------------------------------------
>
>                 Key: ORC-210
>                 URL: https://issues.apache.org/jira/browse/ORC-210
>             Project: ORC
>          Issue Type: Improvement
>          Components: encoding, Java
>    Affects Versions: 2.0.0
>            Reporter: Dapeng Sun
>            Assignee: Teddy Choi
>         Attachments: ORC-210.1.patch, ORC-210.2.patch, patch.txt
>
>
> Currently, Double and Float use PLAIN encoding. It would be better to support 
> encodings such as Dictionary or BitPacking to reduce the storage cost.



