Re: [DISCUSS] For the dimension default should be no dictionary

bill.zhou Thu, 02 Mar 2017 08:23:36 -0800

Hi all,
To summarize this discussion:
1. To keep CarbonData compatible with older versions, keep
DICTIONARY_INCLUDE and DICTIONARY_EXCLUDE, with no dictionary as the default.
The suggestion is not to change these properties to table_dictionary.
2. Keep the sort column properties in the same style as the dictionary
properties, so the new properties would be SORT_INCLUDE and SORT_EXCLUDE,
with no sort as the default.
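
To make the two points above concrete, here is a minimal Python sketch that assembles a CREATE TABLE statement carrying these table properties. The helper build_create_table, the table name and the column names are hypothetical. SORT_INCLUDE is only the name proposed in this thread, not a confirmed CarbonData property.

```python
def build_create_table(table, columns, properties):
    """Return a CREATE TABLE statement with the given TBLPROPERTIES.

    columns: list of (name, type) pairs.
    properties: dict mapping a property name to a list of column names.
    """
    known = {name for name, _ in columns}
    for prop, cols in properties.items():
        unknown = [c for c in cols if c not in known]
        if unknown:
            raise ValueError(f"{prop} references unknown columns: {unknown}")
    col_sql = ", ".join(f"{name} {ctype}" for name, ctype in columns)
    # Empty properties are skipped, so the defaults discussed here
    # (no dictionary, no sort) would apply to the remaining columns.
    prop_sql = ", ".join(
        f"'{prop}'='{','.join(cols)}'" for prop, cols in properties.items() if cols
    )
    ddl = f"CREATE TABLE {table} ({col_sql}) STORED BY 'carbondata'"
    if prop_sql:
        ddl += f" TBLPROPERTIES ({prop_sql})"
    return ddl


ddl = build_create_table(
    "sales",  # hypothetical table
    [("country", "STRING"), ("user_id", "STRING"), ("amount", "DOUBLE")],
    {
        "DICTIONARY_INCLUDE": ["country"],       # existing property
        "SORT_INCLUDE": ["country", "user_id"],  # name proposed in this thread only
    },
)
print(ddl)
```

Columns not listed in any property fall back to the defaults, which is what keeps older DDL statements valid under this proposal.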


Regards
Bill 


ravipesala wrote
> Hi All,
> 
> To make no-dictionary columns the default, we should first improve the
> storage and performance for these columns. I have sent another mail to
> discuss the improvement points. Please comment on it.
> 
> Regards,
> Ravindra
> 
> On 1 March 2017 at 10:12, Ravindra Pesala <ravi.pesala@> wrote:
> 
>> Hi Likun,
>>
>> The same applies if we make all columns no-dictionary by default: it
>> will increase the store size and decrease performance, and poor
>> performance does not encourage more users either.
>>
>> If we need to make no-dictionary columns the default, then we should first
>> focus on reducing the store size and improving filter queries on
>> non-dictionary columns. Memory usage is also higher while querying
>> non-dictionary columns.
>>
>> Regards,
>> Ravindra.
>>
>> On 1 March 2017 at 06:00, Jacky Li <jacky.likun@> wrote:
>>
>>> Yes, I agree with your point. My only concern is loading: I have seen
>>> many users accidentally put a high-cardinality column into the
>>> dictionary columns, and then loading failed with out-of-memory errors or
>>> became very slow. I guess they either do not know to use
>>> DICTIONARY_EXCLUDE for these columns, or they do not have an easy way to
>>> identify the high-cardinality columns. I feel preventing such misuse is
>>> important in order to encourage more users to use CarbonData.
>>>
>>> Any suggestion on solving this issue?
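
One possible way to spot high-cardinality columns before choosing dictionary columns is to count distinct values on a sample of the input. The Python sketch below is only an illustration of that idea. The function name, the sample data and the threshold are all hypothetical, and none of this is existing CarbonData behaviour.

```python
from collections import defaultdict


def high_cardinality_columns(rows, threshold):
    """Return the names of columns whose distinct-value count exceeds threshold."""
    distinct = defaultdict(set)
    for row in rows:
        for name, value in row.items():
            distinct[name].add(value)
    return sorted(name for name, values in distinct.items() if len(values) > threshold)


# Hypothetical sample: two countries, one unique user id per row.
rows = [{"country": "CN" if i % 2 else "IN", "user_id": f"u{i}"} for i in range(1000)]
candidates = high_cardinality_columns(rows, threshold=100)
print(candidates)
```

Columns reported here would be candidates for DICTIONARY_EXCLUDE. Exact counting is used for simplicity and assumes the sample fits in memory.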
>>>
>>>
>>> Regards,
>>> Likun
>>>
>>>
>>> > On 28 February 2017, at 10:20 PM, Ravindra Pesala <ravi.pesala@> wrote:
>>> >
>>> > Hi Likun,
>>> >
>>> > You mentioned that if the user does not specify dictionary columns, then
>>> > by default they are treated as no-dictionary columns.
>>> > But as I mentioned in my earlier mail, keeping no dictionary as the
>>> > default has many disadvantages. We initially introduced no-dictionary
>>> > columns to handle high-cardinality dimensions, but making everything
>>> > no-dictionary by default loses our unique feature compared to Parquet.
>>> > Dictionary columns were introduced not only for aggregation queries but
>>> > also for better compression and better filter queries. Without a
>>> > dictionary, the store size will increase a lot.
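
The compression argument above can be illustrated with a toy calculation. The Python sketch below compares the raw size of a string column with a simple dictionary-encoded layout. It assumes a fixed 4-byte code per row and ignores any further compression, so it is not CarbonData's actual storage format. All names and data are hypothetical.

```python
def dictionary_encode(values):
    """Map each distinct value to a small integer code, in first-seen order."""
    dictionary = {}
    codes = []
    for v in values:
        codes.append(dictionary.setdefault(v, len(dictionary)))
    return dictionary, codes


def raw_size(values):
    """Bytes needed to store every value as UTF-8 text."""
    return sum(len(v.encode("utf-8")) for v in values)


def encoded_size(dictionary, codes, code_bytes=4):
    """Bytes for the dictionary entries (stored once) plus one code per row."""
    return raw_size(dictionary.keys()) + code_bytes * len(codes)


low_card = ["shenzhen", "bangalore", "beijing"] * 1000   # 3 distinct values
high_card = [f"user-{i:06d}" for i in range(3000)]        # all values distinct

for name, column in (("low_card", low_card), ("high_card", high_card)):
    dictionary, codes = dictionary_encode(column)
    print(name, raw_size(column), encoded_size(dictionary, codes))
```

Under these assumptions the low-cardinality column shrinks while the all-distinct column grows, which matches both sides of the discussion: dictionaries help compression on low-cardinality columns and hurt on high-cardinality ones.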
>>> >
>>> > Regards,
>>> > Ravindra.
>>> >
>>> > On 28 February 2017 at 18:05, Liang Chen <chenliang6136@> wrote:
>>> >
>>> >> Hi
>>> >>
>>> >> A couple of questions:
>>> >>
>>> >> 1) For the SORT_KEY option: are the "MDK index, inverted index, minmax
>>> >> index" built only for the columns specified in the option (SORT_KEY)?
>>> >>
>>> >> 2) If users don't specify TABLE_DICTIONARY, then no columns are
>>> >> dictionary encoded, and all shuffle operations are based on the actual
>>> >> values. Is my understanding right?
>>> >> ----------------------------------------------------------------
>>> >> If this option is not specified by the user, all columns are encoded
>>> >> without global dictionary support. A normal shuffle on decoded values
>>> >> will be applied when doing a group-by operation.
>>> >>
>>> >> 3) After introducing the two options "SORT_KEY and TABLE_DICTIONARY",
>>> >> suppose "C2" is specified in SORT_KEY but not in TABLE_DICTIONARY. How
>>> >> does the system handle this case?
>>> >> ----------------------------------------------------------------
>>> >> For example, SORT_COLUMNS=“C1,C2,C3” means C1, C2, C3 form the MDK and
>>> >> are encoded as Inverted Index with a Minmax Index
>>> >>
>>> >> Regards
>>> >> Liang
>>> >>
>>> >> 2017-02-28 19:35 GMT+08:00 Jacky Li <jacky.likun@>:
>>> >>
>>> >>> Yes, first we should simplify the DDL options. I propose the following
>>> >>> options; please check whether they miss any scenario.
>>> >>>
>>> >>> 1. SORT_COLUMNS, or SORT_KEY
>>> >>> This indicates three things:
>>> >>> 1) All columns specified in the option will be used to construct the
>>> >>> Multi-Dimensional Key (MDK), and the data will be sorted along this key
>>> >>> 2) They will be encoded as Inverted Index and thus sorted again within
>>> >>> the column chunk in one blocklet
>>> >>> 3) A Minmax index will also be created for these columns
>>> >>>
>>> >>> When to use: This option is designed for accelerating filter queries, so
>>> >>> put all filter columns into this option. The column order can be:
>>> >>> 1) From low cardinality to high cardinality. This gives the most
>>> >>> compression and fits scenarios that do not filter frequently on
>>> >>> high-cardinality columns
>>> >>> 2) Put the high-cardinality column first, then the others. This fits
>>> >>> frequent filtering on a high-cardinality column
>>> >>>
>>> >>> For example, SORT_COLUMNS=“C1,C2,C3” means C1, C2, C3 form the MDK and
>>> >>> are encoded as Inverted Index with a Minmax Index
>>> >>> Note that C1, C2, C3 can be dimensions, but they can also be measures. So
>>> >>> if the user needs to filter on a measure column, it can be put in the
>>> >>> SORT_COLUMNS option.
>>> >>>
>>> >>> If this option is not specified by the user, carbon will pick the MDK as
>>> >>> it does now.
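
The effect of sorting along a multi-column key and keeping min/max values per blocklet can be sketched in a few lines of Python. This is a simplified model with hypothetical names and a tiny blocklet size. It leaves out the inverted index encoding and is not CarbonData's implementation.

```python
def build_blocklets(rows, sort_columns, blocklet_size):
    """Sort rows by the sort columns, cut them into fixed-size blocklets,
    and record the min and max of each sort column per blocklet."""
    ordered = sorted(rows, key=lambda r: tuple(r[c] for c in sort_columns))
    blocklets = []
    for start in range(0, len(ordered), blocklet_size):
        chunk = ordered[start:start + blocklet_size]
        minmax = {c: (min(r[c] for r in chunk), max(r[c] for r in chunk))
                  for c in sort_columns}
        blocklets.append({"rows": chunk, "minmax": minmax})
    return blocklets


def scan_equals(blocklets, column, value):
    """Return the matching rows and the number of blocklets actually read."""
    read = 0
    hits = []
    for b in blocklets:
        lo, hi = b["minmax"][column]
        if lo <= value <= hi:  # min/max check lets us skip whole blocklets
            read += 1
            hits.extend(r for r in b["rows"] if r[column] == value)
    return hits, read


rows = [{"c1": i % 4, "c2": i} for i in range(16)]
blocklets = build_blocklets(rows, ["c1", "c2"], blocklet_size=4)

hits_c1, read_c1 = scan_equals(blocklets, "c1", 2)  # filter on leading sort column
hits_c2, read_c2 = scan_equals(blocklets, "c2", 6)  # filter on trailing sort column
print(read_c1, read_c2)
```

In this toy data the filter on the leading column reads far fewer blocklets than the filter on the trailing column, which is why the order of columns in the option matters.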
>>> >>>
>>> >>> 2. TABLE_DICTIONARY
>>> >>> This specifies the table-level dictionary columns. A global dictionary
>>> >>> will be created for all columns in this option on every data load.
>>> >>>
>>> >>> When to use: This option is designed for accelerating aggregate queries,
>>> >>> so put group-by columns into this option.
>>> >>>
>>> >>> For example, TABLE_DICTIONARY=“C2,C3,C5”
>>> >>>
>>> >>> If this option is not specified by the user, all columns are encoded
>>> >>> without global dictionary support. A normal shuffle on decoded values
>>> >>> will be applied when doing a group-by operation.
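
The difference between grouping on dictionary codes and grouping on decoded values can be sketched as follows. This Python example is only a conceptual illustration with hypothetical names and data. It assumes a global dictionary is already available and does not model a distributed shuffle.

```python
from collections import defaultdict


def group_sum_encoded(codes, amounts):
    """Aggregate on integer dictionary codes; no string values are touched here."""
    totals = defaultdict(float)
    for code, amount in zip(codes, amounts):
        totals[code] += amount
    return totals


def decode(totals, dictionary):
    """Late decode: translate each code back to its value once per group."""
    reverse = {code: value for value, code in dictionary.items()}
    return {reverse[code]: total for code, total in totals.items()}


dictionary = {"CN": 0, "IN": 1}        # hypothetical global dictionary
codes = [0, 1, 0, 0, 1]                 # encoded group-by column
amounts = [10.0, 5.0, 2.5, 1.0, 4.0]    # measure column

result = decode(group_sum_encoded(codes, amounts), dictionary)
print(result)
```

Without a global dictionary, the grouping keys would be the decoded values themselves, which is the "normal shuffle on decoded value" case described above.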
>>> >>>
>>> >>> I think these two options should be the basic options for normal users;
>>> >>> their goal is to satisfy most scenarios without deep tuning of the table.
>>> >>> For advanced users who want to do deep tuning, we can debate adding more
>>> >>> options. But we first need to identify which scenarios are not satisfied
>>> >>> by these two options.
>>> >>>
>>> >>> Regards,
>>> >>> Jacky
>>> >>>
>>> >>>
>>> >>>
>>> >>> --
>>> >>> View this message in context:
>>> >>> http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/DISCUSS-For-the-dimension-default-should-be-no-dictionary-tp8010p8081.html
>>> >>> Sent from the Apache CarbonData Mailing List archive at Nabble.com.
>>> >>>
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Regards
>>> >> Liang
>>> >>
>>> >
>>> >
>>> > --
>>> > Thanks & Regards,
>>> > Ravi
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Thanks & Regards,
>> Ravi
>>
> 
> 
> 
> -- 
> Thanks & Regards,
> Ravi





--
View this message in context: 
http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/DISCUSS-For-the-dimension-default-should-be-no-dictionary-tp8010p8198.html
Sent from the Apache CarbonData Mailing List archive at Nabble.com.
