Re: [DISCUSS] For the dimension default should be no dictionary

Jacky Li Tue, 28 Feb 2017 16:19:13 -0800

> On Feb 28, 2017, at 8:35 PM, Liang Chen <chenliang6...@gmail.com> wrote:
> 
> Hi
> 
> A couple of questions:
> 
> 1) For the SORT_KEY option: are the "MDK index, inverted index, minmax
> index" built only for the columns specified in the option (SORT_KEY)?
> 
Yes, build the MDK index, inverted index, and minmax index for the columns in SORT_KEY.


> 2) If users don't specify TABLE_DICTIONARY, then no columns use
> dictionary encoding, and all shuffle operations are based on the actual
> values. Is my understanding right?
> -------------------------------------------------------------------------------------------------------
> If this option is not specified by the user, all columns are encoded without
> global dictionary support. A normal shuffle on decoded values will be applied
> when doing a group by operation.
> 
Yes
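
To make the difference concrete, here is a minimal sketch in Python. It is not CarbonData code; the function names and sample data are made up for illustration. It groups once on the original values (no dictionary) and once on integer surrogate keys from a dictionary, decoding only at the end. Both give the same result.

```python
from collections import defaultdict


def build_dictionary(values):
    """Assign an integer surrogate key to each distinct value (hypothetical helper)."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def group_sum(keys, amounts):
    """Sum amounts per key; the key may be an original value or a surrogate key."""
    totals = defaultdict(int)
    for key, amount in zip(keys, amounts):
        totals[key] += amount
    return dict(totals)


# Made-up sample data.
cities = ["shenzhen", "beijing", "shenzhen", "hangzhou", "beijing"]
amounts = [10, 20, 30, 40, 50]

# No dictionary: group (and shuffle) directly on the original values.
plain = group_sum(cities, amounts)

# With a dictionary: group on integer keys, decode only the final result.
dictionary = build_dictionary(cities)
encoded = [dictionary[c] for c in cities]
reverse = {key: value for value, key in dictionary.items()}
decoded = {reverse[k]: total for k, total in group_sum(encoded, amounts).items()}
```

In this sketch the dictionary path only changes what is used as the grouping key; the final answer is identical.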

> 3) After introducing the two options "SORT_KEY and TABLE_DICTIONARY",
> suppose "C2" is specified in SORT_KEY but not in
> TABLE_DICTIONARY; how does the system handle this case?
> -----------------------------------------------------------------------------------------------------------
> For example, SORT_COLUMNS=“C1,C2,C3”, means C1,C2,C3 is MDK and encoded as
> Inverted Index and with Minmax Index
> 
Sort it using the original values.
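
As a rough illustration only (Python, not CarbonData's implementation; the helper names, sample rows, and blocklet size are assumptions), sorting on the original values of the sort columns and then recording a min/max pair per blocklet could look like this:

```python
def sort_by_key(rows, sort_columns):
    """Sort rows by the tuple of original values of the sort columns."""
    return sorted(rows, key=lambda row: tuple(row[c] for c in sort_columns))


def minmax_per_blocklet(rows, column, blocklet_size):
    """Return one (min, max) pair per fixed-size group of rows for a column."""
    index = []
    for start in range(0, len(rows), blocklet_size):
        chunk = [row[column] for row in rows[start:start + blocklet_size]]
        index.append((min(chunk), max(chunk)))
    return index


# Made-up rows; C2 is a plain string column with no dictionary encoding.
rows = [
    {"C1": "b", "C2": "x", "C3": 3},
    {"C1": "a", "C2": "z", "C3": 1},
    {"C1": "a", "C2": "y", "C3": 2},
    {"C1": "b", "C2": "w", "C3": 4},
]

sorted_rows = sort_by_key(rows, ["C1", "C2", "C3"])
c2_minmax = minmax_per_blocklet(sorted_rows, "C2", blocklet_size=2)
```

The comparison works on the raw strings, so no dictionary lookup is needed for either the sort or the min/max pairs.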

> Regards
> Liang
> 
> 2017-02-28 19:35 GMT+08:00 Jacky Li <jacky.li...@qq.com>:
> 
>> Yes, first we should simplify the DDL options. I propose the following options,
>> please check whether they miss any scenario.
>> 
>> 1. SORT_COLUMNS, or SORT_KEY
>> This indicates three things:
>> 1) All columns specified in options will be used to construct
>> Multi-Dimensional Key, which will be sorted along this key
>> 2) They will be encoded as Inverted Index and thus again sorted within
>> column chunk in one blocklet
>> 3) Minmax index will also be created for these columns
>> 
>> When to use: This option is designed for accelerating filter queries, so put
>> all filter columns into this option. The order can be either:
>> 1) From low cardinality to high cardinality. This gives the most
>> compression and fits scenarios without frequent filters on a
>> high-cardinality column.
>> 2) Put the high-cardinality column first, then the others. This fits
>> scenarios with frequent filters on a high-cardinality column.
>> 
>> For example, SORT_COLUMNS=“C1,C2,C3”, means C1,C2,C3 is MDK and encoded as
>> Inverted Index and with Minmax Index
>> Note that C1,C2,C3 can be dimensions, but they can also be measures. So
>> if the user needs to filter on a measure column, it can be put in the
>> SORT_COLUMNS option.
>> 
>> If this option is not specified by user, carbon will pick MDK as it is now.
>> 
>> 2. TABLE_DICTIONARY
>> This specifies the table-level dictionary columns. A global dictionary
>> will be created for all columns in this option on every data load.
>> 
>> When to use: The option is designed for accelerating aggregate queries, so
>> put group by columns into this option.
>> 
>> For example, TABLE_DICTIONARY=“C2,C3,C5”
>> 
>> If this option is not specified by the user, all columns are encoded without
>> global dictionary support. A normal shuffle on decoded values will be applied
>> when doing a group by operation.
>> 
>> I think these two options should be the basic options for normal users. The
>> goal is to satisfy most scenarios without deep tuning of the table.
>> For advanced users who want to do deep tuning, we can debate adding more
>> options. But we first need to identify which scenarios are not satisfied by
>> these two options.
>> 
>> Regards,
>> Jacky
>> 
>> 
>> 
>> --
>> View this message in context: http://apache-carbondata-
>> mailing-list-archive.1130556.n5.nabble.com/DISCUSS-For-the-
>> dimension-default-should-be-no-dictionary-tp8010p8081.html
>> Sent from the Apache CarbonData Mailing List archive mailing list archive
>> at Nabble.com.
>> 
> 
> 
> -- 
> Regards
> Liang
