Hi

Some more info:
In release 1.1.1, there was a good improvement, "measure filter
optimization": the system now uses the min/max index to filter on measure
columns.

So for INT measure columns, filter queries can benefit from this as well.
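
For illustration, a filter like the one below can be pruned via the min/max index on the measure column (the table and column names here are hypothetical):

```sql
-- Hypothetical table "sales" with an INT measure column "quantity".
-- Since 1.1.1, this filter can skip data blocks using the min/max index,
-- without quantity being in sort_columns or dictionary-encoded.
SELECT * FROM sales WHERE quantity > 1000;
```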

Regards
Liang

2017-07-22 9:22 GMT+08:00 Liang Chen <chenliang...@apache.org>:

> Hi Swapnil
>
> Actually, the current system's behavior is: index and dictionary encoding
> are decoupled; there is no relationship between them.
>
> 1. If you want certain columns to have good filter performance, just add
> these columns to sort_columns (e.g. tblproperties('sort_columns'='empno'))
> to build a good MDK index on them. For an INT column, just add it to the
> sort_columns list for filtering.
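>
> For example, a minimal sketch (the table and column names are hypothetical):
>
> ```sql
> -- empno is placed in sort_columns so filters on it can use the MDK index
> CREATE TABLE employee (empno INT, empname STRING, salary DOUBLE)
> STORED BY 'carbondata'
> TBLPROPERTIES ('SORT_COLUMNS'='empno');
> ```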
>
> 2. If you want certain columns to aggregate well in group by queries, just
> dictionary-encode them. By default an INT column is not dictionary-encoded,
> so there is no need to add it to "DICTIONARY_EXCLUDE". If the INT column is
> low cardinality and you also want good aggregation on it, use
> "DICTIONARY_INCLUDE" for that column.
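>
> For example (hypothetical table and column names):
>
> ```sql
> -- dept_id is a low-cardinality INT; dictionary encoding it speeds up
> -- group by queries on that column
> CREATE TABLE employee (empno INT, dept_id INT, salary DOUBLE)
> STORED BY 'carbondata'
> TBLPROPERTIES ('DICTIONARY_INCLUDE'='dept_id');
> ```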
>
> So, in a word: a high-cardinality INT column has no DICTIONARY_EXCLUDE
> scenario :)
>
> HTH.
>
> Regards
> Liang
>
>
> 2017-07-22 6:09 GMT+08:00 Swapnil Shinde <swapnilushi...@gmail.com>:
>
>> Thank you Jacky! The encoding property above makes sense. How would you
>> handle an INT column with high cardinality? As per my understanding, this
>> column will be considered a measure, and the only way to make it a
>> dimension is to specify "dictionary_include" for that column.
>> Any reason why a column being a dimension or measure is tied to
>> dictionary encoding? Does it make sense to have a column as a dimension
>> with no encoding, so that indexes can be used for filtering?
>>
>> Thanks
>> Swapnil
>>
>>
>> On Fri, Jul 21, 2017 at 12:30 PM, Jacky Li <jacky.li...@qq.com> wrote:
>>
>> > Hi Swapnil,
>> >
>> > Dictionary is beneficial for aggregation queries (carbon will leverage
>> > late decode optimization in the SQL optimizer), so you can use it for
>> > columns on which you frequently do group by. While it can improve query
>> > performance, it also requires more memory and CPU while loading.
>> > Normally, you should consider using dictionary only on low-cardinality
>> > columns.
>> >
>> > In the current apache master branch (and all releases before 1.2),
>> > carbon data’s default encoding strategy favors query performance over
>> > loading performance. By default, all string data types are encoded as
>> > dictionary. But this sometimes creates problems; for example, if there
>> > is a high-cardinality column in the table, loading may fail due to
>> > insufficient memory in the JVM. To avoid this, we added the
>> > DICTIONARY_EXCLUDE option so that the user can disable this default
>> > behavior manually. So the DICTIONARY_EXCLUDE property is designed for
>> > String columns only.
>> >
>> > And if you have a low-cardinality integer column (like some ID field),
>> > you can choose to encode it as dictionary by specifying
>> > DICTIONARY_INCLUDE, so group by on this integer column will be faster.
>> >
>> > All these describe the current behavior, and there has been discussion
>> > about changing this behavior and giving more control to the user in the
>> > coming release (1.2). The newly proposed target behavior will be:
>> > 1. There will be a default encoding strategy for each data type. If the
>> > user does not specify any encoding-related property in CREATE TABLE,
>> > carbon will use the default encoding strategy for each column.
>> > 2. There will be an ENCODING property through which the user can
>> > override the system default strategy. For example, the user can create
>> > a table by:
>> >
>> > CREATE TABLE t1 (city_name STRING, city_id INT, population INT, area
>> > DOUBLE)
>> > TBLPROPERTIES ('ENCODING' = 'city_name: dictionary, city_id:
>> > {dictionary, RLE}, population: delta')
>> >
>> > This SQL means city_name is encoded using dictionary; city_id is
>> > encoded using dictionary and then RLE (for the numeric values);
>> > population is encoded using delta encoding; and area is encoded using
>> > the system default encoding for the double data type.
>> >
>> > This change is still going on (CARBONDATA-1014,
>> > https://issues.apache.org/jira/browse/CARBONDATA-1014), on the
>> > apache/encoding_override branch. Once it is done and stable, it will be
>> > merged into master.
>> >
>> > Please advise if you have any suggestions.
>> >
>> > Regards,
>> > Jacky
>> >
>> >
>> > > On Jul 21, 2017, at 12:12 AM, Swapnil Shinde <swapnilushi...@gmail.com> wrote:
>> > >
>> > > Ok. Just curious: any reason not to support numeric columns with
>> > > dictionary_exclude? Wouldn't it be useful for a unique numeric column
>> > > which should be a dimension while avoiding dictionary creation (as it
>> > > may not be beneficial)?
>> > >
>> > > Thanks
>> > > Swapnil
>> > >
>> > >
>> > > On Thu, Jul 20, 2017 at 4:20 AM, manishgupta88 <
>> > tomanishgupt...@gmail.com>
>> > > wrote:
>> > >
>> > >> No, Dictionary_Exclude is supported only for String data type columns.
>> > >>
>> > >> Regards
>> > >> Manish Gupta
>> > >>
>> > >>
>> > >>
>> > >> --
>> > >> View this message in context: http://apache-carbondata-dev-
>> > >> mailing-list-archive.1130556.n5.nabble.com/carbon-data-
>> > >> performance-doubts-tp18438p18559.html
>> > >> Sent from the Apache CarbonData Dev Mailing List archive at
>> > >> Nabble.com.
>> > >>
>> >
>> >
>>
>
>
