Hi,

Some more info: in release 1.1.1 there was a good improvement, "measure filter optimization". The system will use the min/max index to evaluate filters on measure columns.
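(A hypothetical illustration of that optimization, using the `t1` table from Jacky's example below: `population` is an INT measure, not in sort_columns, yet a filter on it can still be pruned via per-block min/max statistics.)

```sql
-- population is a measure column; since 1.1.1 this predicate can skip
-- blocks whose min/max range excludes the filter value.
SELECT city_name, population
FROM t1
WHERE population > 1000000;
```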
So for INT

Regards,
Liang

2017-07-22 9:22 GMT+08:00 Liang Chen <chenliang...@apache.org>:

> Hi Swapnil,
>
> Actually, the current system's behavior is: index and dictionary encoding
> are decoupled; there is no relationship between them.
>
> 1. If you want some columns to filter well, just add those columns to
> sort_columns (e.g. tblproperties('sort_columns'='empno')) to build a good
> MDK index for them. To filter on an INT column, add it to the
> sort_columns list.
>
> 2. If you want some columns to aggregate well for GROUP BY, dictionary-
> encode those columns. By default an INT column is not dictionary-encoded,
> so there is no need to add it to DICTIONARY_EXCLUDE. If an INT column has
> low cardinality and you also want good aggregation on it, add it to
> DICTIONARY_INCLUDE.
>
> So, in a word: an INT column with high cardinality has no
> DICTIONARY_EXCLUDE scenario :)
>
> HTH.
>
> Regards,
> Liang
>
> 2017-07-22 6:09 GMT+08:00 Swapnil Shinde <swapnilushi...@gmail.com>:
>
>> Thank you Jacky! The encoding property above makes sense. How would you
>> handle an INT column with high cardinality? As per my understanding,
>> this column will be considered a measure, and the only way to make it a
>> dimension is to specify DICTIONARY_INCLUDE for that column.
>> Any reason why a column being a dimension or measure is tied to
>> dictionary encoding? Does it make sense to have a column as a dimension
>> with no encoding, so indexes can be used for filters?
>>
>> Thanks,
>> Swapnil
>>
>> On Fri, Jul 21, 2017 at 12:30 PM, Jacky Li <jacky.li...@qq.com> wrote:
>>
>>> Hi Swapnil,
>>>
>>> Dictionary encoding is beneficial for aggregation queries (carbon will
>>> leverage the late-decode optimization in the SQL optimizer), so you can
>>> use it for columns on which you frequently do GROUP BY. While it can
>>> improve query performance, it also requires more memory and CPU while
>>> loading.
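(A minimal DDL sketch of the two properties Liang describes above; the table and column names are hypothetical, and the syntax assumes CarbonData's 1.x Spark integration.)

```sql
CREATE TABLE sales (
  empno INT,
  dept_id INT,
  amount DOUBLE
)
STORED BY 'carbondata'
TBLPROPERTIES (
  'sort_columns'='empno',          -- builds the MDK index on empno, so filters on it are fast
  'DICTIONARY_INCLUDE'='dept_id'   -- low-cardinality INT: dictionary-encode it for fast GROUP BY
)
```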
>>> Normally, you should consider using dictionary encoding only on
>>> low-cardinality columns.
>>>
>>> In the current apache master branch (and all releases before 1.2),
>>> carbon data's default encoding strategy favors query performance over
>>> loading performance. By default, all STRING columns are encoded as
>>> dictionary. But this sometimes creates problems: for example, if there
>>> is a high-cardinality column in the table, loading may fail due to
>>> insufficient memory in the JVM. To avoid this, we added the
>>> DICTIONARY_EXCLUDE option so that the user can disable this default
>>> behavior manually. So the DICTIONARY_EXCLUDE property is designed for
>>> STRING columns only.
>>>
>>> And if you have a low-cardinality integer column (like some ID field),
>>> you can choose to encode it as dictionary by specifying
>>> DICTIONARY_INCLUDE, so GROUP BY on that integer column will be faster.
>>>
>>> All of the above is the current behavior, and there has been discussion
>>> about changing it and giving more control to the user in the coming
>>> release (1.2). The newly proposed target behavior is:
>>> 1. There will be a default encoding strategy for each data type. If the
>>> user does not specify any encoding-related property in CREATE TABLE,
>>> carbon will use the default encoding strategy for each column.
>>> 2. There will be an ENCODING property through which the user can
>>> override the system default strategy. For example, the user can create
>>> a table by:
>>>
>>> CREATE TABLE t1 (city_name STRING, city_id INT, population INT,
>>>   area DOUBLE)
>>> TBLPROPERTIES ('ENCODING' = 'city_name: dictionary,
>>>   city_id: {dictionary, RLE}, population: delta')
>>>
>>> This SQL means: city_name is encoded using dictionary; city_id is
>>> encoded using dictionary and then RLE (for the numeric values);
>>> population is encoded using delta encoding; and area is encoded using
>>> the system default encoding for the DOUBLE data type.
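(A sketch of the DICTIONARY_EXCLUDE case Jacky describes: a high-cardinality STRING column opting out of the default dictionary encoding. Table and column names are hypothetical.)

```sql
CREATE TABLE events (
  session_id STRING,   -- high cardinality: dictionary here could exhaust JVM memory at load
  event_type STRING,   -- low cardinality: the default dictionary encoding is fine
  ts BIGINT
)
STORED BY 'carbondata'
TBLPROPERTIES ('DICTIONARY_EXCLUDE'='session_id')
```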
>>> This change is still in progress (CARBONDATA-1014,
>>> https://issues.apache.org/jira/browse/CARBONDATA-1014), on the
>>> apache/encoding_override branch. Once it is done and stable, it will be
>>> merged into master.
>>>
>>> Please advise if you have any suggestions.
>>>
>>> Regards,
>>> Jacky
>>>
>>> On 2017-07-21 at 12:12 AM, Swapnil Shinde <swapnilushi...@gmail.com> wrote:
>>>
>>>> Ok. Just curious: any reason not to support numeric columns with
>>>> DICTIONARY_EXCLUDE? Wouldn't it be useful for a unique numeric column
>>>> that should be a dimension but avoid creating a dictionary (as the
>>>> dictionary may not be beneficial)?
>>>>
>>>> Thanks,
>>>> Swapnil
>>>>
>>>> On Thu, Jul 20, 2017 at 4:20 AM, manishgupta88
>>>> <tomanishgupt...@gmail.com> wrote:
>>>>
>>>>> No, DICTIONARY_EXCLUDE is supported only for STRING data type
>>>>> columns.
>>>>>
>>>>> Regards,
>>>>> Manish Gupta