Re: [DISCUSS] For the dimension default should be no dictionary

Jacky Li Tue, 28 Feb 2017 03:41:00 -0800

Yes, first we should simplify the DDL options. I propose the following options; 
please check whether they miss any scenario.


1. SORT_COLUMNS, or SORT_KEY
This indicates three things:
1) All columns specified in this option will be used to construct the 
Multi-Dimensional Key (MDK), and the data will be sorted along this key
2) They will be encoded with an Inverted Index and thus sorted again within each 
column chunk in a blocklet
3) A Minmax index will also be created for these columns

When to use: This option is designed to accelerate filter queries, so put all 
filter columns into this option. The column order can be either:
1) From low cardinality to high cardinality. This gives the best compression 
and fits scenarios without frequent filters on high cardinality columns
2) High cardinality column first, then the others. This fits scenarios with 
frequent filters on the high cardinality column
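
To illustrate ordering 1), here is a minimal Python sketch, not CarbonData code. It assumes run-length style compression, where fewer runs of equal adjacent values compress better. The column names (region, user_id) and helper functions are hypothetical and chosen only for illustration.

```python
from itertools import groupby


def count_runs(values):
    """Count runs of equal adjacent values (fewer runs suit run-length encoding)."""
    return sum(1 for _ in groupby(values))


def total_runs(rows, sort_order):
    """Sort rows by the given column order, then sum the run counts of each column."""
    ordered = sorted(rows, key=lambda r: tuple(r[c] for c in sort_order))
    return sum(count_runs([r[c] for r in ordered]) for c in sort_order)


def order_by_cardinality(rows, columns):
    """Return the columns ordered from low to high distinct-value count."""
    return sorted(columns, key=lambda c: len({r[c] for r in rows}))


# Hypothetical data: 'region' has 2 distinct values, 'user_id' has 6.
rows = [{"region": r, "user_id": u} for r in range(2) for u in range(6)]

low_first = order_by_cardinality(rows, ["user_id", "region"])
high_first = list(reversed(low_first))

runs_low_first = total_runs(rows, low_first)
runs_high_first = total_runs(rows, high_first)
```

On this toy data the low-cardinality-first order produces fewer runs in total than the reverse order, which is the compression effect described in 1). Real data may behave differently depending on how the columns correlate.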

For example, SORT_COLUMNS=“C1,C2,C3” means C1, C2 and C3 form the MDK and are 
encoded with an Inverted Index and a Minmax Index.
Note that C1, C2 and C3 can be dimensions, but they can also be measures. So if 
a user needs to filter on a measure column, it can be put in the SORT_COLUMNS option.

If this option is not specified by the user, Carbon will pick the MDK as it does now.
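
The effect of sorting plus a Minmax index on filter queries can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the CarbonData implementation: blocklets are modelled as fixed-size row groups, the inverted index is left out, and all function names and the blocklet size are hypothetical.

```python
def build_blocklets(rows, sort_columns, blocklet_size):
    """Sort rows along the multi-column key and record min/max per blocklet."""
    ordered = sorted(rows, key=lambda r: tuple(r[c] for c in sort_columns))
    blocklets = []
    for start in range(0, len(ordered), blocklet_size):
        chunk = ordered[start:start + blocklet_size]
        minmax = {
            c: (min(r[c] for r in chunk), max(r[c] for r in chunk))
            for c in sort_columns
        }
        blocklets.append({"rows": chunk, "minmax": minmax})
    return blocklets


def filter_equals(blocklets, column, value):
    """Return matching rows and the number of blocklets that had to be scanned."""
    hits, scanned = [], 0
    for blocklet in blocklets:
        low, high = blocklet["minmax"][column]
        if value < low or value > high:
            continue  # pruned by the min/max index without reading rows
        scanned += 1
        hits.extend(r for r in blocklet["rows"] if r[column] == value)
    return hits, scanned


# Hypothetical table with three sort columns.
rows = [{"c1": i % 4, "c2": i % 3, "c3": i} for i in range(24)]
blocklets = build_blocklets(rows, ["c1", "c2", "c3"], blocklet_size=6)
hits, scanned = filter_equals(blocklets, "c1", 2)
```

Because the rows are sorted on c1 first, all rows with c1 = 2 land in a single blocklet here, so the filter reads one blocklet out of four. A filter on a later key column such as c3 would typically prune less, which is why the column order matters.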

2. TABLE_DICTIONARY
This specifies the table level dictionary columns. A global dictionary will be 
created for all columns in this option on every data load.

When to use: This option is designed to accelerate aggregate queries, so put the 
group by columns into this option

For example, TABLE_DICTIONARY=“C2,C3,C5”

If this option is not specified by the user, all columns are encoded without 
global dictionary support. A normal shuffle on decoded values will apply when 
doing a group by operation.
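
As a rough illustration of why a global dictionary helps group by, here is a small Python sketch. It is an assumption-laden model, not CarbonData internals: the dictionary is a plain mapping from value to integer code, and the function names and sample city values are hypothetical.

```python
from collections import Counter


def build_dictionary(values):
    """Assign each distinct value a small integer code (sorted for determinism)."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


def encode(values, dictionary):
    """Replace each value with its dictionary code."""
    return [dictionary[v] for v in values]


def group_count_encoded(codes, dictionary):
    """Aggregate on the integer codes and decode only the final group keys."""
    counts = Counter(codes)  # grouping works on compact integers
    reverse = {code: value for value, code in dictionary.items()}
    return {reverse[code]: n for code, n in counts.items()}  # late decoding


# Hypothetical column values.
cities = ["Shenzhen", "Bangalore", "Shenzhen", "Beijing", "Bangalore", "Shenzhen"]
dictionary = build_dictionary(cities)
codes = encode(cities, dictionary)
result = group_count_encoded(codes, dictionary)
```

The grouping step only touches integer codes, and the original strings are looked up once per group at the end. Without a global dictionary, the group by would have to compare and shuffle the decoded values themselves.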

I think these two options should be the basic options for normal users. Their 
goal is to satisfy most scenarios without deep tuning of the table.
For advanced users who want to do deep tuning, we can discuss adding more 
options. But first we need to identify which scenarios are not satisfied by 
these two options.

Regards,
Jacky

> On 27 February 2017, at 8:27 PM, Ravindra Pesala <ravi.pes...@gmail.com> wrote:
> 
> Hi Bill,
> 
> I got your point, but making no-dictionary the default may not be a perfect
> solution. Basically, no-dictionary columns are only meant for high
> cardinality dimensions, so the usage may change from user to user or
> scenario to scenario.
> This is a basic DDL usability issue, so please first focus on simplifying
> the DDL.
> 
> For example, if we have 6 columns, we can write the DDL as below.
> case 1:
> SORT_COLUMNS="C1,C2,C3"
> NON_SORT_COLUMNS="C4,C5,C6"
> In the above case C1, C2 and C3 are sort columns and part of the MDK key, and
> C4, C5 and C6 become non-sort columns (measure/complex).
> 
> DICTIONARY_EXCLUDE= 'ALL'
> DICTIONARY_INCLUDE='C3'
> In the above case all sort columns (C1,C2,C3) are no-dictionary columns
> except C3; here C3 is a dictionary column.
> 
> case 2:
> SORT_COLUMNS="ALL"
> NON_SORT_COLUMNS="C6"
> In this case all columns are sort columns except C6.
> 
> DICTIONARY_EXCLUDE= 'C2'
> DICTIONARY_INCLUDE='ALL'
> In the above case all sort columns (C1,C2,C3,C4,C5) are dictionary columns
> except C2; here C2 is a no-dictionary column.
> 
> The above is just my idea of how to simplify the DDL to handle all
> scenarios. We can discuss further how to simplify the DDL.
> 
> Regards,
> Ravindra.
> 
> On 27 February 2017 at 12:38, bill.zhou <zgcsk...@163.com> wrote:
> 
>> Dear Vishal & Ravindra
>> 
>>  Thanks for your reply. I think I didn't describe it clearly, so you
>> didn't get the full idea.
>> 1. Dictionary is an important feature in CarbonData, and we introduce it to
>> every new customer. So a new customer will understand it clearly and will
>> set the dictionary columns when creating a table.
>> 2. All customers, such as bank, telecom and traffic customers, share the
>> same scenario: they have many columns but set only a few as dictionary.
>>    For example, a telecom customer with 300 columns sets only 5 columns as
>> dictionary; the other dims are not set as dictionary.
>>    For example, a bank customer with 100 columns sets only about 5 columns
>> as dictionary; the other dims are not set as dictionary.
>> *In current customers' actual usage, they only set as dictionary the dims
>> used for filter and group by*
>> 3. My suggestion is that the no-dictionary default applies only to the dims
>> that are not put into the dictionary_include property, not to all dim
>> columns. If a customer always adds the 5 columns used into
>> dictionary_include and leaves the other columns as no dictionary, this will
>> not impact query performance.
>> 
>> So I suggest that dim columns not added to the dictionary_include property
>> default to no dictionary.
>> 
>> Regards
>> Bill
>> 
>> 
>> 
>> kumarvishal09 wrote
>>> Hi,
>>>    I completely agree with Ravindra's points. A larger number of no
>>> dictionary columns will impact IO for both reading and writing, as data
>>> size increases in the no dictionary case. Late decoding is one of the main
>>> advantages; no dictionary column aggregation will be slower. Filter
>>> queries will suffer because for a dictionary column we compare on the
>>> byte-packed value, while for no dictionary it is on the actual value.
>>> 
>>> -Regards
>>> Kumar Vishal
>>> 
>>> On Mon, Feb 27, 2017 at 12:34 AM, Ravindra Pesala <ravi.pesala@> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> I feel there are more disadvantages than advantages in this approach. In
>>>> your current scenario you want to set dictionary only for columns which
>>>> are used as filters, but the use of dictionary is not limited to
>>>> filters; it can reduce the store size and improve aggregation queries.
>>>> I think you should set no_inverted_index false on non filtered columns
>>>> to reduce the store size and improve the performance.
>>>> 
>>>> If we make no dictionary the default, then the user need not set those
>>>> columns in the DDL but needs to set the dictionary columns instead. If
>>>> the user wants to set more dictionary columns, the same problem you
>>>> mentioned arises again, so it does not solve the problem. I feel we
>>>> should give more flexibility in our DDL to simplify these scenarios, and
>>>> we should have more discussion on it.
>>>> 
>>>> Pros & Cons of your suggestion.
>>>> Advantages :
>>>> 1. Decoding/Encoding of dictionary could be avoided.
>>>> 
>>>> Disadvantages :
>>>> 1. Store size will increase drastically.
>>>> 2. IO will increase so query performance will come down.
>>>> 3. Aggregation query performance will suffer.
>>>> 
>>>> 
>>>> 
>>>> Regards,
>>>> Ravindra.
>>>> 
>>>> On 26 February 2017 at 20:04, bill.zhou <zgcsky08@> wrote:
>>>> 
>>>>> hi All
>>>>>    Currently, when creating a CarbonData table, if a dimension is not
>>>>> added to the dictionary_exclude property, it is treated as a dictionary
>>>>> column by default. I think the default should be no dictionary.
>>>>> 
>>>>>    For example, when I did a POC for one customer, the table had 300
>>>>> columns and 200 dimensions, but only 5 columns were used for filter, so
>>>>> he only needs to set these 5 columns as dictionary and leave the other
>>>>> 195 columns as no dictionary. But now he needs to specify the 195
>>>>> columns in the dictionary_exclude property, which wastes time, makes the
>>>>> create table command huge, and also impacts load performance.
>>>>> 
>>>>>    So I suggest that the dimension default should be no dictionary. This
>>>>> can also help the customer easily know which dictionary columns are
>>>>> useful.
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/DISCUSS-For-the-dimension-default-should-be-no-dictionary-tp8010.html
>>>>> Sent from the Apache CarbonData Mailing List archive at Nabble.com.
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Thanks & Regards,
>>>> Ravi
>>>> 
>> 
>> 
>> 
>> 
>> 
>> 
> 
> 
> -- 
> Thanks & Regards,
> Ravi
