Discussion how to crate the CarbonData table with good performance
Suggestion to create Carbon table
Recently we used CarbonData to do the performance in Telecommunication filed
and summarize some of the Suggestions while creating the CarbonData table.We
have tables which range from 10 thousand rows to 10 billion rows and have
from 100 columns to 300 columns. Following are some of the columns used in
the table.
Column name     Data type       Cardinality     Attribution
msisdn  String  30 million      dimension
BEGIN_TIME      bigint  10 thousand     dimension
HOST    String  1 million       dimension
Dime_1  String  1 thousand      dimension
counter_1       numeric(20,0)   NA      measure
...     ...     NA      ...
counter_100     numeric(20,0)   NA      measure
We have about more than 50 test cases; according to the test case we
summarize some suggestion to create the table which can have a better query
performance. 1. Put the frequently-used column filter in the beginning. For
example, MSISDN filter is used in most of the query then put the MSISDN in
the first column. The create table command can be as follows, the query
which has MSISDN as a filter will be good (because the MSISDN is high
cardinality, if create table like this the compress ratio will be
decreased)/create table carbondata_table(//msisdn String,//...//)STORED BY
'org.apache.carbondata.format' //TBLPROPERTIES (
multiple column which is frequently-use in the filter, put it to the front
in the order as low cardinality to high cardinality.For example if msisdn,
host and dime_1 is frequently-used column, the table column order can be
like dime_1->host->msisdn, because the dime_1 cardinality is low. Create
table command can be as follows. This will increase the compression ratio
and good performance for filter on dime_1, host and msisdn. /create table
carbondata_table(Dime_1 String,HOST String,MSISDN String,...)STORED BY
'org.apache.carbondata.format' /TBLPROPERTIES (
no column is frequent-use in filter, then can put all the dimension column
order as from low cardinality to high cardinality. Create table command can
be as following: /create table carbondata_table(Dime_1 String,BEGIN_TIME
bigintHOST String,MSISDN String,...)STORED BY 'org.apache.carbondata.format'
For measure that needs no high accuracy, then no need to use numeric(20,0)
data type, suggestion is to use double to replace it than will increase the
query performance. If one test case uses double to replace the numeric (20,
0) the query improve 5 times from 15 second to 3 second. Create table
command can be as follows. /create table carbondata_table(Dime_1
String,BEGIN_TIME bigintHOST String,MSISDN String,counter_1 double,counter_2
double,...counter_100 double,)STORED BY 'org.apache.carbondata.format'
If the column which is always incremental like start_time. For example one
scenario: every day we will load data into carbon and the start_time is
incremental for each load. For this scenario you can put the start_time
column in the back of dimension, because always incremental value can use
the min/max index well always. Create table command can be as following.
/create table carbondata_table(Dime_1 String,HOST String,MSISDN
String,counter_1 double,counter_2 double,BEGIN_TIME bigint,...counter_100
double,)STORED BY 'org.apache.carbondata.format' TBLPROPERTIES (
more is for the dimension whether dictionary is needed or not, we suggest if
the cardinality higher than 50 thousand do not put it as dictionary column.
If high cardinality column put as dictionary will impact the load

View this message in context: 
Sent from the Apache CarbonData Mailing List archive mailing list archive at 

Reply via email to