[ 
https://issues.apache.org/jira/browse/CARBONDATA-431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jihong MA updated CARBONDATA-431:
---------------------------------
    Description: 
Carbon has a better compression ratio than Parquet and ORC for the String type,
but the worst for numeric data types. Identify the issues with the current
numeric datatype compression in Carbon to get a better compression ratio.

DataType                  | Text | Parquet | Orc | Carbon
decimal                   | 16G  | 11G     | 6G  | 13G
int                       | 5G   | 1G      | 1G  | 3G
String (high cardinality) | 24G  | 22G     | 11G | 3G (no dictionary)
String (low cardinality)  | 30G  | 4G      | 4G  | 1G (dictionary encode)
                          |      |         |     | 1G (dictionary encode without inverted index)
                          |      |         |     | 3G (no dictionary encode)


  was:
For data types, Carbon's string type has a better compression ratio, but for
numeric types ORC has the best compression. We should analyze the numeric
datatypes for Carbon to get a better compression ratio.

DataType                  | Text | Parquet | Orc | Carbon
decimal                   | 16G  | 11G     | 6G  | 13G
int                       | 5G   | 1G      | 1G  | 3G
String (high cardinality) | 24G  | 22G     | 11G | 3G (no dictionary)
String (low cardinality)  | 30G  | 4G      | 4G  | 1G (dictionary encode)
                          |      |         |     | 1G (dictionary encode without inverted index)
                          |      |         |     | 3G (no dictionary encode)


        Summary: Improve compression ratio for numeric datatype   (was: 
Analysis compression for numeric datatype compared with Parquet/ORC)

> Improve compression ratio for numeric datatype 
> -----------------------------------------------
>
>                 Key: CARBONDATA-431
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-431
>             Project: CarbonData
>          Issue Type: Sub-task
>            Reporter: suo tong
>            Assignee: Ashok Kumar
>             Fix For: 1.0.0-incubating
>
>          Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> Carbon has a better compression ratio than Parquet and ORC for the String
> type, but the worst for numeric data types. Identify the issues with the
> current numeric datatype compression in Carbon to get a better compression
> ratio.
>
> DataType                  | Text | Parquet | Orc | Carbon
> decimal                   | 16G  | 11G     | 6G  | 13G
> int                       | 5G   | 1G      | 1G  | 3G
> String (high cardinality) | 24G  | 22G     | 11G | 3G (no dictionary)
> String (low cardinality)  | 30G  | 4G      | 4G  | 1G (dictionary encode)
>                           |      |         |     | 1G (dictionary encode without inverted index)
>                           |      |         |     | 3G (no dictionary encode)
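
The sizes in the table above can be turned into compression ratios (text size
divided by encoded size) to make the comparison explicit. The following is a
minimal sketch, not part of the issue itself: the row labels, the function
names, and the choice of the 1G dictionary-encoded figure for the
low-cardinality String row under Carbon are assumptions made for illustration.

```python
# Sizes in GB copied from the table in this issue. Row labels are shorthand
# introduced here; for low-cardinality String under Carbon, the 1G
# dictionary-encoded figure is used (an assumption, the table lists three).
SIZES_GB = {
    "decimal": {"Text": 16, "Parquet": 11, "Orc": 6, "Carbon": 13},
    "int": {"Text": 5, "Parquet": 1, "Orc": 1, "Carbon": 3},
    "String (high cardinality)": {"Text": 24, "Parquet": 22, "Orc": 11, "Carbon": 3},
    "String (low cardinality)": {"Text": 30, "Parquet": 4, "Orc": 4, "Carbon": 1},
}


def compression_ratios(sizes):
    """Return text_size / format_size for every non-text format."""
    text = sizes["Text"]
    return {fmt: text / size for fmt, size in sizes.items() if fmt != "Text"}


def best_format(sizes):
    """Format with the highest compression ratio (first one wins on ties)."""
    ratios = compression_ratios(sizes)
    return max(ratios, key=ratios.get)


def worst_format(sizes):
    """Format with the lowest compression ratio (first one wins on ties)."""
    ratios = compression_ratios(sizes)
    return min(ratios, key=ratios.get)


for dtype, sizes in SIZES_GB.items():
    ratios = compression_ratios(sizes)
    summary = ", ".join(f"{fmt} {ratio:.2f}x" for fmt, ratio in ratios.items())
    print(f"{dtype}: {summary} (best: {best_format(sizes)}, worst: {worst_format(sizes)})")
```

With these figures, Carbon comes out as the lowest ratio for both numeric
rows and the highest for both String rows, which is the gap this sub-task
sets out to close.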



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
