[ https://issues.apache.org/jira/browse/ORC-210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218025#comment-16218025 ]
Owen O'Malley commented on ORC-210:
-----------------------------------
Ok, there are a couple of points that are clear:
* We must have better datasets.
* TPC-DS data is decimal, not floating point. It is also synthetic rather
than real.
* We need datasets of ~10 million values.
* I'd propose the nyc-taxi drop-off longitude & latitude data.
* We need some non-repetitive datasets (cardinality / count > 99%)
* The interesting metrics are:
  * Write speed
  * Read speed
  * Compression
* The multiply-by-100 trick in your modified FPC method is too closely tied to
these particular datasets and isn't generally useful.
* We need some high-cardinality datasets because the current ones would be
best encoded using a dictionary.
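
As a rough sketch of the non-repetitive criterion above (cardinality / count > 99%), something like the following could be used to screen a candidate dataset before benchmarking. The function names and the 0.99 default threshold are illustrative assumptions, not part of ORC or of any patch on this issue. Values are compared by their IEEE-754 bit pattern, so 0.0 and -0.0 count as distinct, while NaNs with the same bit pattern count as one value.

```python
import struct


def cardinality_ratio(values):
    """Return distinct / total for a sequence of doubles.

    Values are compared by their 8-byte IEEE-754 encoding rather than
    by float equality (hypothetical helper, for illustration only).
    """
    values = list(values)
    if not values:
        raise ValueError("empty dataset")
    distinct = {struct.pack("<d", v) for v in values}
    return len(distinct) / len(values)


def is_non_repetitive(values, threshold=0.99):
    """True if the distinct-value ratio exceeds the threshold (assumed 99%)."""
    return cardinality_ratio(values) > threshold


if __name__ == "__main__":
    repetitive = [1.5, 1.5, 2.25, 2.25]        # a dictionary would do well here
    unique = [i * 0.001 for i in range(1000)]  # every value is different
    print(cardinality_ratio(repetitive), is_non_repetitive(repetitive))
    print(cardinality_ratio(unique), is_non_repetitive(unique))
```

A dataset that fails this check is one where dictionary encoding is likely to win, so it says little about a floating point encoding on its own.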
> Add new ORC 2.0 encoding for Double, Float.
> -------------------------------------------
>
> Key: ORC-210
> URL: https://issues.apache.org/jira/browse/ORC-210
> Project: ORC
> Issue Type: Improvement
> Components: encoding, Java
> Affects Versions: 2.0.0
> Reporter: Dapeng Sun
> Assignee: Teddy Choi
> Attachments: ORC-210.1.patch, ORC-210.2.patch, patch.txt
>
>
> Currently, Double and Float use PLAIN encoding. It would be better to support
> an encoding such as Dictionary or BitPacking to reduce the storage cost.