[ 
https://issues.apache.org/jira/browse/ORC-210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148044#comment-16148044
 ] 

Owen O'Malley edited comment on ORC-210 at 8/31/17 4:52 PM:
------------------------------------------------------------

One point that should have been obvious to me, but that I wasn't thinking 
about correctly: as the number of samples goes up, the number of uniques 
asymptotically approaches the size of the population of distinct values. For 
the taxi pickup location, you end up with the following curve (ratio is 
samples divided by uniques):

|| samples || uniques || ratio ||
| 1,000 | 924 | 1.1 |
| 10,000 | 5,885 | 1.7 |
| 100,000 | 12,572 | 8.0 |
| 1,000,000 | 20,939 | 47.8 |
| 10,000,000 | 34,154 | 293 |
| 100,000,000 | 61,267 | 1,632 |
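
A curve like the one in the table above can be produced by drawing samples of 
increasing size and counting distinct values. The following is only a minimal 
sketch: the `unique_curve` helper and the synthetic `population` column are 
hypothetical stand-ins, not the taxi data or the code used for this comment.

```python
import random


def unique_curve(values, sample_sizes, seed=0):
    """Return (samples, uniques, ratio) rows for random samples of `values`."""
    rng = random.Random(seed)
    rows = []
    for n in sample_sizes:
        # Sample with replacement so n may exceed len(values).
        sample = rng.choices(values, k=n)
        uniques = len(set(sample))
        rows.append((n, uniques, n / uniques))
    return rows


# Hypothetical stand-in column: 5,000 distinct double values.
population = [float(i) for i in range(5000)]

for n, uniques, ratio in unique_curve(population, [1_000, 10_000, 100_000]):
    # Print rows in the same layout as the table above.
    print(f"| {n:,} | {uniques:,} | {ratio:.1f} |")
```

Because the number of uniques can never exceed the number of distinct values 
in the population, the ratio keeps growing once the sample is large enough.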

With that in view, the IoT data is a lot more dictionary-friendly than the taxi 
data, which makes more sense. (If you take 20155 samples of the taxi data, you 
get 8275 uniques, compared to 146 for the IoT data.)
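
To make "dictionary-friendly" concrete, here is a rough back-of-the-envelope 
estimate for the two figures above. It assumes a deliberately simplified 
model (8-byte doubles, one dictionary entry per unique value, fixed-width 
bit-packed indexes, no further compression). This is an assumption made for 
illustration, not ORC's actual on-disk layout, and the function names are 
hypothetical.

```python
def plain_bytes(samples, value_bytes=8):
    """Size when every value is stored as-is."""
    return samples * value_bytes


def dictionary_bytes(samples, uniques, value_bytes=8):
    """Size of a dictionary plus fixed-width bit-packed indexes (simplified model)."""
    # Bits needed to address `uniques` dictionary entries (at least 1).
    index_bits = max(1, (uniques - 1).bit_length())
    # Round the packed index stream up to whole bytes.
    index_bytes = (samples * index_bits + 7) // 8
    return uniques * value_bytes + index_bytes


samples = 20155
for name, uniques in [("taxi", 8275), ("iot", 146)]:
    plain = plain_bytes(samples)
    encoded = dictionary_bytes(samples, uniques)
    print(f"{name}: plain={plain} dictionary={encoded} ({encoded / plain:.0%})")
```

Under these assumptions the taxi sample needs 101,472 bytes against 161,240 
bytes plain (roughly 63%), while the IoT sample needs 21,323 bytes (roughly 
13%).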


> Add encoding for Double, Float.
> -------------------------------
>
>                 Key: ORC-210
>                 URL: https://issues.apache.org/jira/browse/ORC-210
>             Project: ORC
>          Issue Type: Improvement
>          Components: encoding, Java
>    Affects Versions: 1.5.0
>            Reporter: Dapeng Sun
>            Assignee: Teddy Choi
>         Attachments: ORC-210.1.patch, ORC-210.2.patch, patch.txt
>
>
> Currently, Double and Float use PLAIN encoding. It would be better to support 
> an encoding such as Dictionary or BitPacking to reduce the storage cost.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
