[ https://issues.apache.org/jira/browse/ORC-210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148044#comment-16148044 ]
Owen O'Malley edited comment on ORC-210 at 8/31/17 4:52 PM:
------------------------------------------------------------

One point should have been obvious to me, but I wasn't thinking about it right: as the number of samples goes up, the number of uniques asymptotically approaches the number of distinct values in the population, so the ratio of samples to uniques depends on how many samples you take. On the taxi pickup location, you end up with the following curve:

|| samples || uniques || ratio ||
| 1,000 | 924 | 1.1 |
| 10,000 | 5,885 | 1.7 |
| 100,000 | 12,572 | 8.0 |
| 1,000,000 | 20,939 | 47.8 |
| 10,000,000 | 34,154 | 293 |
| 100,000,000 | 61,267 | 1,632 |

With that in view, the IoT data is a lot more dictionary friendly than the taxi data, which makes more sense. (If you take 20155 samples of the taxi data, you get 8275 uniques compared to 146 for the IoT data.)

was (Author: owen.omalley):

One point should have been obvious to me, but I wasn't thinking about it right: as the number of samples goes up, the number of uniques asymptotically approaches the number of distinct values in the population, so the ratio of samples to uniques depends on how many samples you take. On the taxi pickup location, you end up with the following curve:

|| samples || uniques || ratio ||
| 1,000 | 924 | 1.1 |
| 10,000 | 5,885 | 1.7 |
| 100,000 | 12,572 | 8.0 |
| 1,000,000 | 20,939 | 47.8 |
| 10,000,000 | 34,154 | 293 |

With that in view, the IoT data is a lot more dictionary friendly than the taxi data, which makes more sense. (If you take 20155 samples of the taxi data, you get 8275 uniques compared to 146 for the IoT data.)

> Add encoding for Double, Float.
> -------------------------------
>
>                 Key: ORC-210
>                 URL: https://issues.apache.org/jira/browse/ORC-210
>             Project: ORC
>          Issue Type: Improvement
>          Components: encoding, Java
>   Affects Versions: 1.5.0
>            Reporter: Dapeng Sun
>            Assignee: Teddy Choi
>         Attachments: ORC-210.1.patch, ORC-210.2.patch, patch.txt
>
>
> Currently, Double and Float use PLAIN encoding; it would be better to support
> an encoding such as Dictionary or BitPacking to reduce the storage cost.
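
The ratio column in the comment above is just samples divided by uniques, and it can be recomputed with a few lines of Python. The following is a minimal sketch, not code from ORC or from the attached patches: the names `TAXI_CURVE`, `ratio`, and `count_uniques` are made up for illustration, and reading a higher ratio as "more dictionary friendly" is an assumption drawn from the comment's wording.

```python
from itertools import islice

# (samples, uniques) pairs copied from the table in the comment above.
TAXI_CURVE = [
    (1_000, 924),
    (10_000, 5_885),
    (100_000, 12_572),
    (1_000_000, 20_939),
    (10_000_000, 34_154),
    (100_000_000, 61_267),
]


def ratio(samples, uniques):
    """Samples per unique value (assumed here to indicate dictionary friendliness)."""
    return samples / uniques


def count_uniques(values, n):
    """Count distinct values among the first n items of an iterable (hypothetical helper)."""
    return len(set(islice(values, n)))


if __name__ == "__main__":
    # Recompute the ratio column for the taxi pickup location curve.
    for samples, uniques in TAXI_CURVE:
        print(f"{samples:>11,} {uniques:>7,} {ratio(samples, uniques):8.1f}")
    # Compare both data sets at the same sample size (20155), as the comment does.
    print("taxi:", ratio(20_155, 8_275))
    print("iot: ", ratio(20_155, 146))
```

Because the ratio grows with the sample size, two columns are only comparable when `count_uniques` is evaluated at the same `n` for both, which is what the 20155-sample comparison in the comment does.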