Hi all,

Let me summarize this discussion.

1. To keep CarbonData compatible with older versions, keep DICTIONARY_INCLUDE and DICTIONARY_EXCLUDE, with no dictionary as the default. I suggest not renaming these properties to TABLE_DICTIONARY.

2. I suggest keeping the sort column properties in the same style as the dictionary ones, so the new properties would be SORT_INCLUDE and SORT_EXCLUDE, with no sort as the default.
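To make the two points concrete, here is a minimal Python sketch of how a CREATE TABLE statement could be assembled under this suggestion. It is only an illustration: SORT_INCLUDE and SORT_EXCLUDE are the names suggested in this mail and are not existing properties, and the helper build_create_table, the table name and the column names are hypothetical. Columns that are not listed get no dictionary and no sort, so the EXCLUDE lists are accepted mainly for compatibility with older DDL.

```python
from typing import Dict, Sequence


def build_create_table(
    table: str,
    columns: Dict[str, str],
    dictionary_include: Sequence[str] = (),
    dictionary_exclude: Sequence[str] = (),
    sort_include: Sequence[str] = (),
    sort_exclude: Sequence[str] = (),
) -> str:
    """Render a CREATE TABLE statement (hypothetical helper, illustration only)."""
    groups = (
        ("DICTIONARY_INCLUDE", dictionary_include),
        ("DICTIONARY_EXCLUDE", dictionary_exclude),
        ("SORT_INCLUDE", sort_include),  # proposed name, not an existing property
        ("SORT_EXCLUDE", sort_exclude),  # proposed name, not an existing property
    )

    # A column must not be both included and excluded for the same feature.
    for kind, inc, exc in (
        ("DICTIONARY", dictionary_include, dictionary_exclude),
        ("SORT", sort_include, sort_exclude),
    ):
        both = sorted(set(inc) & set(exc))
        if both:
            raise ValueError(f"{kind}_INCLUDE and {kind}_EXCLUDE overlap: {both}")

    # Only emit properties the user actually set; anything unset falls back
    # to the defaults (no dictionary, no sort).
    props: Dict[str, str] = {}
    for name, cols in groups:
        unknown = [c for c in cols if c not in columns]
        if unknown:
            raise ValueError(f"{name} refers to unknown columns: {unknown}")
        if cols:
            props[name] = ",".join(cols)

    col_defs = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    ddl = f"CREATE TABLE {table} ({col_defs}) STORED BY 'carbondata'"
    if props:
        rendered = ", ".join(f"'{k}'='{v}'" for k, v in props.items())
        ddl += f" TBLPROPERTIES ({rendered})"
    return ddl


if __name__ == "__main__":
    print(
        build_create_table(
            "sales",
            {"city": "STRING", "user_id": "STRING", "amount": "DOUBLE"},
            dictionary_include=["city"],
            sort_include=["city", "user_id"],
        )
    )
```

With this shape, a table created without any of the four properties simply has no TBLPROPERTIES clause, which corresponds to the "no dictionary, no sort" default suggested above.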
Regards
Bill

ravipesala wrote
> Hi All,
>
> To make no-dictionary columns the default, we should improve the
> storage and performance for these columns. I have sent another mail to
> discuss the improvement points. Please comment on it.
>
> Regards,
> Ravindra
>
> On 1 March 2017 at 10:12, Ravindra Pesala <ravi.pesala@> wrote:
>
>> Hi Likun,
>>
>> It would be the same case if we use all non-dictionary columns by default:
>> it will increase the store size and decrease the performance, so it also
>> does not encourage more users if performance is poor.
>>
>> If we need to make no-dictionary columns the default, then we should first
>> focus on reducing the store size and improving the filter queries on
>> non-dictionary columns. Even memory usage is higher while querying the
>> non-dictionary columns.
>>
>> Regards,
>> Ravindra.
>>
>> On 1 March 2017 at 06:00, Jacky Li <jacky.likun@> wrote:
>>
>>> Yes, I agree with your point. The only concern I have is about loading: I
>>> have seen many users accidentally put a high cardinality column into a
>>> dictionary column, and then the loading failed because of out of memory
>>> or was very slow. I guess they just do not know to use DICTIONARY_EXCLUDE
>>> for these columns, or they do not have an easy way to identify the high
>>> cardinality columns. I feel preventing such misuse is important in order
>>> to encourage more users to use CarbonData.
>>>
>>> Any suggestion on solving this issue?
>>>
>>> Regards,
>>> Likun
>>>
>>> > On 28 February 2017 at 10:20 PM, Ravindra Pesala <ravi.pesala@> wrote:
>>> >
>>> > Hi Likun,
>>> >
>>> > You mentioned that if the user does not specify dictionary columns, then
>>> > by default those are chosen as no-dictionary columns.
>>> > But we have many disadvantages, as I mentioned in the above mail, if you
>>> > keep no dictionary as the default.
>>> > We initially introduced no-dictionary columns
>>> > to handle high cardinality dimensions, but making everything
>>> > no-dictionary by default loses our unique feature compared to Parquet.
>>> > Dictionary columns were introduced not only for aggregation queries but
>>> > also for better compression and better filter queries. Without
>>> > dictionary, store size will increase a lot.
>>> >
>>> > Regards,
>>> > Ravindra.
>>> >
>>> > On 28 February 2017 at 18:05, Liang Chen <chenliang6136@> wrote:
>>> >
>>> >> Hi
>>> >>
>>> >> A couple of questions:
>>> >>
>>> >> 1) For the SORT_KEY option: are the "MDK index, inverted index, minmax
>>> >> index" built only for the columns specified in the option (SORT_KEY)?
>>> >>
>>> >> 2) If users don't specify TABLE_DICTIONARY, then no columns use
>>> >> dictionary encoding, and all shuffle operations are based on the fact
>>> >> value. Is my understanding right?
>>> >> ------------------------------------------------------------
>>> >> If this option is not specified by user, means all columns encoding without
>>> >> global dictionary support. Normal shuffle on decoded value will be applied
>>> >> when doing group by operation.
>>> >>
>>> >> 3) After introducing the two options "SORT_KEY and TABLE_DICTIONARY",
>>> >> suppose "C2" is specified in SORT_KEY but not in TABLE_DICTIONARY;
>>> >> how does the system handle this case?
>>> >> ------------------------------------------------------------
>>> >> For example, SORT_COLUMNS=“C1,C2,C3”, means C1,C2,C3 is MDK and encoded as
>>> >> Inverted Index and with Minmax Index
>>> >>
>>> >> Regards
>>> >> Liang
>>> >>
>>> >> 2017-02-28 19:35 GMT+08:00 Jacky Li <jacky.likun@>:
>>> >>
>>> >>> Yes, first we should simplify the DDL options.
>>> >>> I propose the following options;
>>> >>> please check whether they miss any scenario.
>>> >>>
>>> >>> 1. SORT_COLUMNS, or SORT_KEY
>>> >>> This indicates three things:
>>> >>> 1) All columns specified in the option will be used to construct the
>>> >>> Multi-Dimensional Key (MDK), and the data will be sorted along this key
>>> >>> 2) They will be encoded as Inverted Index and thus again sorted within
>>> >>> the column chunk in one blocklet
>>> >>> 3) Minmax index will also be created for these columns
>>> >>>
>>> >>> When to use: This option is designed for accelerating filter queries, so
>>> >>> put all filter columns into this option. The order can be:
>>> >>> 1) From low cardinality to high cardinality. This gives the most
>>> >>> compression and fits scenarios that do not filter frequently on a high
>>> >>> cardinality column
>>> >>> 2) Put the high cardinality column first, then the others. This fits
>>> >>> frequent filtering on a high cardinality column
>>> >>>
>>> >>> For example, SORT_COLUMNS=“C1,C2,C3” means C1,C2,C3 form the MDK and are
>>> >>> encoded as Inverted Index and with Minmax Index.
>>> >>> Note that while C1,C2,C3 can be dimensions, they can also be measures. So
>>> >>> if the user needs to filter on a measure column, it can be put in the
>>> >>> SORT_COLUMNS option.
>>> >>>
>>> >>> If this option is not specified by the user, carbon will pick the MDK as
>>> >>> it does now.
>>> >>>
>>> >>> 2. TABLE_DICTIONARY
>>> >>> This is to specify the table level dictionary columns. A global
>>> >>> dictionary will be created for all columns in this option for every
>>> >>> data load.
>>> >>>
>>> >>> When to use: The option is designed for accelerating aggregate queries,
>>> >>> so put group by columns into this option.
>>> >>>
>>> >>> For example, TABLE_DICTIONARY=“C2,C3,C5”
>>> >>>
>>> >>> If this option is not specified by the user, all columns are encoded
>>> >>> without global dictionary support.
>>> >>> Normal shuffle on decoded values will be applied
>>> >>> when doing a group by operation.
>>> >>>
>>> >>> I think these two options should be the basic options for normal users;
>>> >>> their goal is to satisfy most scenarios without deep tuning of the
>>> >>> table.
>>> >>> For advanced users who want to do deep tuning, we can debate adding more
>>> >>> options. But first we need to identify which scenarios are not satisfied
>>> >>> by these two options.
>>> >>>
>>> >>> Regards,
>>> >>> Jacky
>>> >>>
>>> >>> --
>>> >>> View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/DISCUSS-For-the-dimension-default-should-be-no-dictionary-tp8010p8081.html
>>> >>> Sent from the Apache CarbonData Mailing List archive mailing list archive
>>> >>> at Nabble.com.
>>> >>
>>> >> --
>>> >> Regards
>>> >> Liang
>>> >
>>> > --
>>> > Thanks & Regards,
>>> > Ravi
>>
>> --
>> Thanks & Regards,
>> Ravi
>
> --
> Thanks & Regards,
> Ravi

--
View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/DISCUSS-For-the-dimension-default-should-be-no-dictionary-tp8010p8198.html
Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com.