Hi Jacky,

I agree with Ravindra's point: making columns no-dictionary by default will increase the store size, and it will impact IO. In addition, carbon currently supports only the String data type for no-dictionary columns, so we cannot make dimension columns no-dictionary by default.
-Regards
Kumar Vishal

On Wed, Mar 1, 2017 at 12:42 PM, Ravindra Pesala <ravi.pes...@gmail.com> wrote:

> Hi Likun,
>
> It would be the same case if we use all non-dictionary columns by default:
> it will increase the store size and decrease the performance, so it also
> does not encourage more users if performance is poor.
>
> If we need to make no-dictionary columns the default, then we should first
> focus on reducing the store size and improving filter queries on
> non-dictionary columns. Even memory usage is higher while querying
> non-dictionary columns.
>
> Regards,
> Ravindra.
>
> On 1 March 2017 at 06:00, Jacky Li <jacky.li...@qq.com> wrote:
>
> > Yes, I agree with your point. The only concern I have is for loading. I
> > have seen many users accidentally put a high cardinality column into a
> > dictionary column, and then the loading failed because it ran out of
> > memory, or the loading was very slow. I guess they just do not know to
> > use DICTIONARY_EXCLUDE for these columns, or they do not have an easy
> > way to identify the high cardinality columns. I feel preventing such
> > misuse is important in order to encourage more users to use carbondata.
> >
> > Any suggestion on solving this issue?
> >
> > Regards,
> > Likun
> >
> > > On Feb 28, 2017, at 10:20 PM, Ravindra Pesala <ravi.pes...@gmail.com> wrote:
> > >
> > > Hi Likun,
> > >
> > > You mentioned that if the user does not specify dictionary columns,
> > > then by default those are chosen as no-dictionary columns.
> > > But we have many disadvantages, as I mentioned in the above mail, if
> > > you keep no dictionary as the default. We initially introduced
> > > no-dictionary columns to handle high cardinality dimensions, but
> > > making everything no-dictionary by default loses our unique feature
> > > compared to Parquet.
> > > Dictionary columns were introduced not only for aggregation queries,
> > > but also for better compression and better filter queries.
> > > Without a dictionary, the store size will increase a lot.
> > >
> > > Regards,
> > > Ravindra.
> > >
> > > On 28 February 2017 at 18:05, Liang Chen <chenliang6...@gmail.com> wrote:
> > >
> > >> Hi
> > >>
> > >> A couple of questions:
> > >>
> > >> 1) For the SORT_KEY option: are "MDK index, inverted index, minmax
> > >> index" built only for the columns specified in the option (SORT_KEY)?
> > >>
> > >> 2) If users don't specify TABLE_DICTIONARY, then no columns use
> > >> dictionary encoding, and all shuffle operations are based on the fact
> > >> value. Is my understanding right?
> > >> -------------------------------------------------------------------
> > >> If this option is not specified by the user, all columns are encoded
> > >> without global dictionary support. A normal shuffle on decoded values
> > >> will be applied when doing a group by operation.
> > >>
> > >> 3) After introducing the two options "SORT_KEY and TABLE_DICTIONARY",
> > >> suppose "C2" is specified in SORT_KEY but not in TABLE_DICTIONARY.
> > >> How does the system handle this case?
> > >> -------------------------------------------------------------------
> > >> For example, SORT_COLUMNS=“C1,C2,C3” means C1, C2, C3 form the MDK and
> > >> are encoded as Inverted Index with a Minmax Index.
> > >>
> > >> Regards
> > >> Liang
> > >>
> > >> 2017-02-28 19:35 GMT+08:00 Jacky Li <jacky.li...@qq.com>:
> > >>
> > >>> Yes, first we should simplify the DDL options. I propose the following
> > >>> options; please check whether they miss some scenario.
> > >>>
> > >>> 1. SORT_COLUMNS, or SORT_KEY
> > >>> This indicates three things:
> > >>> 1) All columns specified in the option will be used to construct the
> > >>> Multi-Dimensional Key (MDK), and the data will be sorted along this key
> > >>> 2) They will be encoded as Inverted Index and thus sorted again within
> > >>> the column chunk in one blocklet
> > >>> 3) A Minmax index will also be created for these columns
> > >>>
> > >>> When to use: This option is designed for accelerating filter queries,
> > >>> so put all filter columns into this option. The order can be:
> > >>> 1) From low cardinality to high cardinality. This gives the most
> > >>> compression and fits scenarios that do not filter frequently on a high
> > >>> cardinality column
> > >>> 2) Put the high cardinality column first, then the others. This fits
> > >>> frequent filtering on a high cardinality column
> > >>>
> > >>> For example, SORT_COLUMNS=“C1,C2,C3” means C1, C2, C3 form the MDK and
> > >>> are encoded as Inverted Index with a Minmax Index.
> > >>> Note that while C1, C2, C3 can be dimensions, they can also be measures.
> > >>> So if the user needs to filter on a measure column, it can be put in
> > >>> the SORT_COLUMNS option.
> > >>>
> > >>> If this option is not specified by the user, carbon will pick the MDK
> > >>> as it does now.
> > >>>
> > >>> 2. TABLE_DICTIONARY
> > >>> This specifies the table level dictionary columns. A global dictionary
> > >>> will be created for all columns in this option on every data load.
> > >>>
> > >>> When to use: This option is designed for accelerating aggregate
> > >>> queries, so put group by columns into this option.
> > >>>
> > >>> For example, TABLE_DICTIONARY=“C2,C3,C5”
> > >>>
> > >>> If this option is not specified by the user, all columns are encoded
> > >>> without global dictionary support. A normal shuffle on decoded values
> > >>> will be applied when doing a group by operation.
> > >>>
> > >>> I think these two options should be the basic options for normal
> > >>> users; the goal is to satisfy most scenarios without deep tuning of
> > >>> the table.
> > >>> For advanced users who want to do deep tuning, we can debate adding
> > >>> more options. But we first need to identify which scenarios are not
> > >>> satisfied by these two options.
> > >>>
> > >>> Regards,
> > >>> Jacky
> > >>>
> > >>> --
> > >>> View this message in context:
> > >>> http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/DISCUSS-For-the-dimension-default-should-be-no-dictionary-tp8010p8081.html
> > >>> Sent from the Apache CarbonData Mailing List archive mailing list
> > >>> archive at Nabble.com.
> > >>
> > >> --
> > >> Regards
> > >> Liang
> > >
> > > --
> > > Thanks & Regards,
> > > Ravi
>
> --
> Thanks & Regards,
> Ravi
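On the question raised in this thread of how a user can spot high cardinality columns before loading: one possible approach is to count distinct values per column on a sample of the data, then use the counts both to decide which columns to keep out of the dictionary and to order sort columns from low to high cardinality (the first ordering described above). The following is only a rough sketch in Python, not CarbonData code. The function names, the row format (a list of dicts), and the threshold value are all hypothetical; the thread does not give a cut-off for "high cardinality".

```python
# Hypothetical cut-off: a column with more distinct values than this is
# treated as high cardinality. The thread does not specify a number.
HIGH_CARDINALITY_THRESHOLD = 3


def column_cardinalities(rows):
    """Return the number of distinct values per column for a list of dict rows."""
    distinct = {}
    for row in rows:
        for name, value in row.items():
            distinct.setdefault(name, set()).add(value)
    return {name: len(values) for name, values in distinct.items()}


def suggest_options(rows, threshold=HIGH_CARDINALITY_THRESHOLD):
    """Suggest a low-to-high cardinality column order and dictionary exclusions.

    This is an illustration of the idea only; it does not call any
    CarbonData API and does not generate DDL.
    """
    cards = column_cardinalities(rows)
    # Order by cardinality; break ties by column name so the result is stable.
    ordered = sorted(cards, key=lambda name: (cards[name], name))
    # Columns above the threshold are candidates for DICTIONARY_EXCLUDE.
    exclude = [name for name in ordered if cards[name] > threshold]
    return {"sort_order_low_to_high": ordered, "dictionary_exclude": exclude}
```

Calling `suggest_options(sample_rows)` on a small sample returns the candidate column order and the columns that exceed the threshold. A real check would have to run on a representative sample, since distinct counts on a few rows say little about the full data set.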