Re: [DISCUSS] For the dimension default should be no dictionary

Kumar Vishal Wed, 01 Mar 2017 04:29:12 -0800

Hi Jacky,
I agree with Ravindra's point: making dimension columns no-dictionary by
default will increase the store size, which in turn increases IO. Also, Carbon
currently supports only the String data type for no-dictionary columns, so we
cannot make no dictionary the default for dimension columns.
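
To illustrate the store-size point, here is a minimal sketch in plain Python.
It is not CarbonData code and does not model Carbon's actual file format: it
assumes raw values cost their UTF-8 length and that each dictionary key is a
fixed 4-byte integer. The function names and the sample data are made up for
illustration only.

```python
def plain_size(values):
    """Bytes needed to store every value as a raw UTF-8 string."""
    return sum(len(v.encode("utf-8")) for v in values)


def dictionary_size(values, key_bytes=4):
    """Bytes for one copy of each distinct value plus a fixed-width key per row."""
    distinct = set(values)
    dict_bytes = sum(len(v.encode("utf-8")) for v in distinct)
    return dict_bytes + key_bytes * len(values)


# Low cardinality: 3 distinct city names repeated over 30,000 rows.
low_card = ["shenzhen", "bangalore", "beijing"] * 10_000
# High cardinality: every one of the 30,000 rows carries a distinct id.
high_card = [f"user-{i:08d}" for i in range(30_000)]

for name, column in (("low_card", low_card), ("high_card", high_card)):
    print(name, plain_size(column), dictionary_size(column))
```

Under these assumptions the dictionary halves the low-cardinality column but
makes the high-cardinality column larger, which is why high-cardinality
columns are the ones to exclude from the dictionary rather than everything.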


-Regards
Kumar Vishal

On Wed, Mar 1, 2017 at 12:42 PM, Ravindra Pesala <ravi.pes...@gmail.com>
wrote:

> Hi Likun,
>
> The same concern applies if we make all columns no-dictionary by default: it
> will increase the store size and reduce performance, and poor performance
> will not encourage more users either.
>
> If we want to make no-dictionary columns the default, we should first
> focus on reducing the store size and improving filter queries on
> no-dictionary columns. Memory usage is also higher when querying
> no-dictionary columns.
>
> Regards,
> Ravindra.
>
> On 1 March 2017 at 06:00, Jacky Li <jacky.li...@qq.com> wrote:
>
> > Yes, I agree with your point. My only concern is loading: I have seen
> > many users accidentally declare a high-cardinality column as a dictionary
> > column, and then loading failed with out of memory or became very slow.
> > I guess they either do not know to use DICTIONARY_EXCLUDE for these
> > columns or do not have an easy way to identify the high-cardinality
> > columns. I feel preventing such misuse is important in order to encourage
> > more users to use CarbonData.
> >
> > Any suggestion on solving this issue?
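
One possible way to flag such columns before loading is to sample the input
and compare distinct counts with row counts. The sketch below is plain Python,
not an existing CarbonData tool; the function name high_cardinality_columns
and the 0.5 threshold are hypothetical, chosen only for illustration.

```python
def high_cardinality_columns(rows, threshold=0.5):
    """Return column names whose distinct/total ratio in the sample exceeds threshold."""
    if not rows:
        return []
    total = len(rows)
    flagged = []
    for name in rows[0]:
        distinct = len({row[name] for row in rows})
        if distinct / total > threshold:
            flagged.append(name)
    return flagged


# Synthetic sample: user_id is unique per row, country has only two values.
sample = [{"user_id": f"u{i}", "country": ("IN", "CN")[i % 2]} for i in range(1000)]
candidates = high_cardinality_columns(sample)
print(candidates)
```

Columns flagged this way would be candidates for DICTIONARY_EXCLUDE.
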
> >
> >
> > Regards,
> > Likun
> >
> >
> > > On 28 February 2017, at 10:20 PM, Ravindra Pesala <ravi.pes...@gmail.com> wrote:
> > >
> > > Hi Likun,
> > >
> > > You mentioned that if the user does not specify dictionary columns, then
> > > by default they are treated as no-dictionary columns.
> > > But as I mentioned in my earlier mail, keeping no dictionary as the
> > > default has many disadvantages. We initially introduced no-dictionary
> > > columns to handle high-cardinality dimensions, but making everything a
> > > no-dictionary column by default loses our unique feature compared to
> > > Parquet. Dictionary columns were introduced not only for aggregation
> > > queries but also for better compression and better filter queries.
> > > Without a dictionary, the store size will increase a lot.
> > >
> > > Regards,
> > > Ravindra.
> > >
> > > On 28 February 2017 at 18:05, Liang Chen <chenliang6...@gmail.com> wrote:
> > >
> > >> Hi
> > >>
> > >> A couple of questions:
> > >>
> > > >> 1) For the SORT_KEY option: are the "MDK index, inverted index, minmax
> > > >> index" built only for the columns specified in the option (SORT_KEY)?
> > > >>
> > > >> 2) If users don't specify TABLE_DICTIONARY, then no column gets
> > > >> dictionary encoding, and all shuffle operations are based on the fact
> > > >> value. Is my understanding right?
> > >> ------------------------------------------------------------
> > >> -------------------------------------------
> > >> If this option is not specified by user, means all columns encoding
> > without
> > >> global dictionary support. Normal shuffle on decoded value will be
> > applied
> > >> when doing group by operation.
> > >>
> > > >> 3) After introducing the two options "SORT_KEY and TABLE_DICTIONARY",
> > > >> suppose "C2" is specified in SORT_KEY but not in TABLE_DICTIONARY.
> > > >> How does the system handle this case?
> > > >> ------------------------------------------------------------
> > > >> For example, SORT_COLUMNS=“C1,C2,C3” means C1, C2, C3 form the MDK and
> > > >> are encoded as Inverted Index with a Minmax Index
> > >>
> > >> Regards
> > >> Liang
> > >>
> > >> 2017-02-28 19:35 GMT+08:00 Jacky Li <jacky.li...@qq.com>:
> > >>
> > > >>> Yes, first we should simplify the DDL options. I propose the following
> > > >>> options; please check whether they miss any scenario.
> > >>>
> > >>> 1. SORT_COLUMNS, or SORT_KEY
> > >>> This indicates three things:
> > > >>> 1) All columns specified in the option will be used to construct the
> > > >>> Multi-Dimensional Key (MDK), and data will be sorted along this key
> > > >>> 2) They will be encoded as Inverted Index and thus sorted again within
> > > >>> the column chunk in one blocklet
> > > >>> 3) Minmax index will also be created for these columns
> > >>>
> > > >>> When to use: This option is designed for accelerating filter queries,
> > > >>> so put all filter columns into this option. The order can be:
> > > >>> 1) From low cardinality to high cardinality. This gives the most
> > > >>> compression and fits scenarios without frequent filters on
> > > >>> high-cardinality columns
> > > >>> 2) Put the high-cardinality column first, then the others. This fits
> > > >>> frequent filters on the high-cardinality column
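
As a rough illustration of why ordering 1) compresses best, the sketch below
counts runs of equal consecutive values per column under both sort orders.
Fewer runs generally mean better run-length style compression. It is a toy
model in plain Python, not CarbonData's MDK or encoder; the helper total_runs
and the synthetic data are assumptions made for this example.

```python
from itertools import groupby, product


def total_runs(rows):
    """Count runs of equal consecutive values, summed over all columns."""
    columns = zip(*rows)
    return sum(len(list(groupby(col))) for col in columns)


# Column 0 has 2 distinct values (low cardinality), column 1 has 100.
rows = list(product(range(2), range(100)))

low_first = sorted(rows)                               # sort key: (low, high)
high_first = sorted(rows, key=lambda r: (r[1], r[0]))  # sort key: (high, low)

print(total_runs(low_first), total_runs(high_first))
```
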
> > >>>
> > > >>> For example, SORT_COLUMNS=“C1,C2,C3” means C1, C2, C3 form the MDK and
> > > >>> are encoded as Inverted Index with a Minmax Index.
> > > >>> Note that C1, C2, C3 can be dimensions, but they can also be measures.
> > > >>> So if the user needs to filter on a measure column, it can be put in
> > > >>> the SORT_COLUMNS option.
> > > >>>
> > > >>> If this option is not specified by the user, carbon will pick the MDK
> > > >>> as it does now.
> > >>>
> > >>> 2. TABLE_DICTIONARY
> > > >>> This specifies the table-level dictionary columns. A global dictionary
> > > >>> will be created for all columns in this option on every data load.
> > > >>>
> > > >>> When to use: This option is designed for accelerating aggregate
> > > >>> queries, so put group by columns into this option.
> > > >>>
> > > >>> For example, TABLE_DICTIONARY=“C2,C3,C5”
> > > >>>
> > > >>> If this option is not specified by the user, all columns are encoded
> > > >>> without global dictionary support. A normal shuffle on decoded values
> > > >>> will be applied when doing a group by operation.
> > >>>
> > > >>> I think these two options should be the basic options for normal
> > > >>> users; the goal is to satisfy most scenarios without deep tuning of
> > > >>> the table.
> > > >>> For advanced users who want to do deep tuning, we can debate adding
> > > >>> more options. But first we need to identify which scenarios are not
> > > >>> satisfied by these two options.
> > >>>
> > >>> Regards,
> > >>> Jacky
> > >>>
> > >>>
> > >>>
> > >>> --
> > >>> View this message in context: http://apache-carbondata-
> > >>> mailing-list-archive.1130556.n5.nabble.com/DISCUSS-For-the-
> > >>> dimension-default-should-be-no-dictionary-tp8010p8081.html
> > >>> Sent from the Apache CarbonData Mailing List archive mailing list
> > archive
> > >>> at Nabble.com.
> > >>>
> > >>
> > >>
> > >>
> > >> --
> > >> Regards
> > >> Liang
> > >>
> > >
> > >
> > > --
> > > Thanks & Regards,
> > > Ravi
> >
> >
> >
> >
>
>
> --
> Thanks & Regards,
> Ravi
>
