> On Feb 28, 2017, at 8:35 PM, Liang Chen <chenliang6...@gmail.com> wrote:
>
> Hi
>
> A couple of questions:
>
> 1) For the SORT_KEY option: are the "MDK index, inverted index, minmax
> index" built only for the columns specified in the option (SORT_KEY)?
>
Yes, the MDK index, inverted index, and minmax index are built for the columns in SORT_KEY.

> 2) If users don't specify TABLE_DICTIONARY, then no columns are
> dictionary encoded, and all shuffle operations are based on the actual
> values. Is my understanding right?
> ----------------------------------------------------------------------
> If this option is not specified by the user, all columns are encoded without
> global dictionary support. A normal shuffle on decoded values will be applied
> when doing a group-by operation.
>
Yes

> 3) After introducing the two options SORT_KEY and TABLE_DICTIONARY,
> suppose "C2" is specified in SORT_KEY but not in TABLE_DICTIONARY.
> How does the system handle this case?
> ----------------------------------------------------------------------
> For example, SORT_COLUMNS="C1,C2,C3" means C1, C2, C3 form the MDK and are
> encoded with an inverted index and a minmax index.
>
It is sorted using the original values.

> Regards
> Liang
>
> 2017-02-28 19:35 GMT+08:00 Jacky Li <jacky.li...@qq.com>:
>
>> Yes, first we should simplify the DDL options. I propose the following
>> options; please check whether they miss any scenario.
>>
>> 1. SORT_COLUMNS, or SORT_KEY
>> This indicates three things:
>> 1) All columns specified in the option will be used to construct the
>> Multi-Dimensional Key (MDK), and the data will be sorted along this key.
>> 2) They will be encoded with an inverted index and thus sorted again within
>> the column chunk in one blocklet.
>> 3) A minmax index will also be created for these columns.
>>
>> When to use: This option is designed for accelerating filter queries, so put
>> all filter columns into this option. The order can be either:
>> 1) From low cardinality to high cardinality. This gives the best
>> compression and fits scenarios that do not filter frequently on
>> high-cardinality columns.
>> 2) Put the high-cardinality columns first, then the others.
>> This fits frequent filtering on high-cardinality columns.
>>
>> For example, SORT_COLUMNS="C1,C2,C3" means C1, C2, C3 form the MDK and are
>> encoded with an inverted index and a minmax index.
>> Note that C1, C2, C3 can be dimensions, but they can also be measures. So
>> if the user needs to filter on a measure column, it can be put in the
>> SORT_COLUMNS option.
>>
>> If the user does not specify this option, Carbon will pick the MDK as it
>> does now.
>>
>> 2. TABLE_DICTIONARY
>> This specifies the table-level dictionary columns. A global dictionary
>> will be created for all columns in this option on every data load.
>>
>> When to use: This option is designed for accelerating aggregate queries, so
>> put the group-by columns into this option.
>>
>> For example, TABLE_DICTIONARY="C2,C3,C5"
>>
>> If the user does not specify this option, all columns are encoded without
>> global dictionary support. A normal shuffle on decoded values will be applied
>> when doing a group-by operation.
>>
>> I think these two options should be the basic options for normal users.
>> Their goal is to satisfy most scenarios without deep tuning of the table.
>> For advanced users who want to do deep tuning, we can discuss adding more
>> options. But first we need to identify which scenarios are not satisfied by
>> these two options.
>>
>> Regards,
>> Jacky
>>
>> --
>> View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/DISCUSS-For-the-dimension-default-should-be-no-dictionary-tp8010p8081.html
>> Sent from the Apache CarbonData Mailing List archive at Nabble.com.
>
> --
> Regards
> Liang
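
To make the proposal in this thread easier to check against concrete cases, here is a minimal Python sketch of how the two options could resolve into per-column settings, and of the two column-ordering strategies. This is not CarbonData code: `ColumnPlan`, `resolve_options`, and `order_sort_columns` are hypothetical names made up for illustration. The thread only states the sort behaviour for a sort column without a dictionary (sorted on original values), so the sketch leaves the other case unset. It also does not model the default MDK that Carbon picks when SORT_COLUMNS is omitted, and ordering the remaining columns from low to high cardinality in the second strategy is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass(frozen=True)
class ColumnPlan:
    """Hypothetical per-column outcome of the two proposed options."""
    name: str
    in_mdk: bool              # part of the Multi-Dimensional Key
    inverted_index: bool      # sorted again within the column chunk
    minmax_index: bool
    global_dictionary: bool   # table-level dictionary built on data load
    sort_value: Optional[str]  # "original", or None where the thread is silent


def resolve_options(columns: Sequence[str],
                    sort_columns: Sequence[str],
                    table_dictionary: Sequence[str]) -> Dict[str, ColumnPlan]:
    """Map SORT_COLUMNS / TABLE_DICTIONARY to per-column settings."""
    unknown = (set(sort_columns) | set(table_dictionary)) - set(columns)
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    plans = {}
    for name in columns:
        is_sort = name in sort_columns
        has_dict = name in table_dictionary
        plans[name] = ColumnPlan(
            name=name,
            in_mdk=is_sort,
            inverted_index=is_sort,
            minmax_index=is_sort,
            global_dictionary=has_dict,
            # Sort column without dictionary: sorted on original values.
            sort_value="original" if is_sort and not has_dict else None,
        )
    return plans


def order_sort_columns(cardinality: Dict[str, int],
                       filter_heavy: Sequence[str] = ()) -> List[str]:
    """Order sort columns: frequently filtered high-cardinality columns
    first (if any are named), then the rest from low to high cardinality."""
    rest = sorted((c for c in cardinality if c not in filter_heavy),
                  key=lambda c: cardinality[c])
    return list(filter_heavy) + rest


# The example from the thread: SORT_COLUMNS="C1,C2,C3",
# TABLE_DICTIONARY="C2,C3,C5", on a table that also has C4.
plans = resolve_options(["C1", "C2", "C3", "C4", "C5"],
                        ["C1", "C2", "C3"],
                        ["C2", "C3", "C5"])
for plan in plans.values():
    print(plan)
```

With an empty `filter_heavy`, `order_sort_columns` gives the first strategy (low to high cardinality, favouring compression). Naming the frequently filtered high-cardinality columns gives the second.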