Re: [DISCUSS] For the dimension default should be no dictionary

Liang Chen Mon, 27 Feb 2017 18:57:02 -0800

Hi

+1. Adding DICTIONARY_EXCLUDE='ALL' and DICTIONARY_INCLUDE='ALL' would
improve the usability of the DDL.
This solution is more flexible than making no-dictionary the default.
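
As a rough illustration of how the 'ALL' keyword in Ravindra's two cases
quoted below could be resolved, here is a minimal Python sketch. The function
name resolve_dictionary_columns and its arguments are hypothetical and are not
part of CarbonData. It only covers the two cases where exactly one property is
'ALL'; what unlisted columns should default to otherwise is the open question
of this thread.

```python
def resolve_dictionary_columns(sort_columns, include, exclude):
    """Split sort columns into (dictionary, no_dictionary) lists.

    Hypothetical helper, not CarbonData code. `include` and `exclude` mimic
    the proposed DICTIONARY_INCLUDE / DICTIONARY_EXCLUDE property values:
    either the keyword 'ALL' or a comma-separated list of column names.
    """
    def parse(value):
        # Turn "C1, C2" into ["C1", "C2"], ignoring empty entries.
        return [c.strip() for c in value.split(",") if c.strip()]

    include_all = include.strip().upper() == "ALL"
    exclude_all = exclude.strip().upper() == "ALL"

    # Assumption: exactly one of the two properties is 'ALL'.
    if include_all == exclude_all:
        raise ValueError("exactly one property must be 'ALL' in this sketch")

    if exclude_all:
        # Everything is no-dictionary except the explicitly included columns.
        keep = set(parse(include))
        dictionary = [c for c in sort_columns if c in keep]
    else:
        # Everything is dictionary except the explicitly excluded columns.
        drop = set(parse(exclude))
        dictionary = [c for c in sort_columns if c not in drop]

    no_dictionary = [c for c in sort_columns if c not in dictionary]
    return dictionary, no_dictionary


# Case 1 from the quoted message: sort columns C1,C2,C3; only C3 is dictionary.
case1 = resolve_dictionary_columns(["C1", "C2", "C3"], include="C3", exclude="ALL")

# Case 2: sort columns C1..C5; all are dictionary except C2.
case2 = resolve_dictionary_columns(
    ["C1", "C2", "C3", "C4", "C5"], include="ALL", exclude="C2"
)
```

With either form, the user lists only the exceptions, whichever default
suits the table.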


Regards
Liang

2017-02-27 20:27 GMT+08:00 Ravindra Pesala <ravi.pes...@gmail.com>:

> Hi Bill,
>
> I got your point, but making no-dictionary the default may not be a
> perfect solution. Basically, no-dictionary columns are only meant for
> high cardinality dimensions, so the usage may change from user to user or
> scenario to scenario.
> This is basically a DDL usability issue, so please focus first on
> simplifying the DDL.
>
> For example, if we have 6 columns, we can write the DDL as below.
> case 1 :
> SORT_COLUMNS="C1,C2,C3"
> NON_SORT_COLUMNS="C4,C5,C6"
> In the above case C1, C2, C3 are sort columns and part of the MDK key, and
> C4, C5, C6 become non-sort columns (measure/complex).
>
> DICTIONARY_EXCLUDE= 'ALL'
> DICTIONARY_INCLUDE='C3'
> In the above case all sort columns (C1,C2,C3) are non-dictionary columns
> except C3; here C3 is a dictionary column.
>
> case 2:
> SORT_COLUMNS="ALL"
> NON_SORT_COLUMNS="C6"
> In this case all columns are sort columns except C6.
>
> DICTIONARY_EXCLUDE= 'C2'
> DICTIONARY_INCLUDE='ALL'
> In the above case all sort columns (C1,C2,C3,C4,C5) are dictionary columns
> except C2; here C2 is a no-dictionary column.
>
> The above is just my idea of how to simplify the DDL to handle all
> scenarios. We can discuss further how to simplify the DDL.
>
> Regards,
> Ravindra.
>
> On 27 February 2017 at 12:38, bill.zhou <zgcsk...@163.com> wrote:
>
> > Dear Vishal & Ravindra
> >
> >   Thanks for your reply. I think I didn't describe it clearly, so you
> > didn't get the full idea.
> > 1. Dictionary is an important feature in CarbonData, and we introduce it
> > to every new customer. So a new customer will know it clearly and will
> > set the dictionary columns when creating a table.
> > 2. All customers, such as bank, telecom and traffic customers, have the
> > same scenario: they have many columns but set only a few columns as
> > dictionary.
> >     For a telecom customer, out of 300 columns only 5 are set as
> > dictionary; the other dims are not set as dictionary.
> >     For a bank customer, out of 100 columns only about 5 are set as
> > dictionary; the other dims are not set as dictionary.
> > *In customers' actual usage scenarios today, they only set as dictionary
> > the dims used for filter and group by.*
> > 3. My suggestion is that the no-dictionary default applies only to the
> > dims that are not put into the dictionary_include property, not to all
> > dim columns. If a customer always adds the 5 columns they use to
> > dictionary_include and leaves the other columns as no dictionary, this
> > will not impact query performance.
> >
> > So I suggest that dim columns which are not added to the
> > dictionary_include property default to no dictionary.
> >
> > Regards
> > Bill
> >
> >
> >
> > kumarvishal09 wrote
> > > Hi,
> > >     I completely agree with Ravindra's points. A larger number of no
> > > dictionary columns will impact IO for both reading and writing, since
> > > the data size increases in the no dictionary case. Late decoding is
> > > one of the main advantages, and no dictionary column aggregation will
> > > be slower. Filter queries will suffer because for a dictionary column
> > > we compare on the byte-packed value, whereas for a no dictionary
> > > column we compare on the actual value.
> > >
> > > -Regards
> > > Kumar Vishal
> > >
> > > On Mon, Feb 27, 2017 at 12:34 AM, Ravindra Pesala <ravi.pesala@> wrote:
> > >
> > >> Hi,
> > >>
> > >> I feel there are more disadvantages than advantages in this approach.
> > >> In your current scenario you want to set dictionary only for columns
> > >> which are used as filters, but the usage of dictionary is not limited
> > >> to filters; it can also reduce the store size and improve aggregation
> > >> queries.
> > >> I think you should set no_inverted_index false on non-filtered columns
> > >> to reduce the store size and improve the performance.
> > >>
> > >> If we make no dictionary the default, then the user need not set those
> > >> columns in the DDL, but the user needs to set the dictionary columns.
> > >> If the user wants to set more dictionary columns, the same problem you
> > >> mentioned arises again, so it does not solve the problem. I feel we
> > >> should give more flexibility in our DDL to simplify these scenarios,
> > >> and we should have more discussion on it.
> > >>
> > >> Pros & Cons of your suggestion.
> > >> Advantages :
> > >> 1. Decoding/Encoding of dictionary could be avoided.
> > >>
> > >> Disadvantages :
> > >> 1. Store size will increase drastically.
> > >> 2. IO will increase so query performance will come down.
> > >> 3. Aggregation query performance will suffer.
> > >>
> > >>
> > >>
> > >> Regards,
> > >> Ravindra.
> > >>
> > >> On 26 February 2017 at 20:04, bill.zhou <zgcsky08@> wrote:
> > >>
> > >> > Hi all,
> > >> >     Currently, when creating a CarbonData table, if a dimension is
> > >> > not added to the dictionary_exclude property, it is treated as a
> > >> > dictionary column by default. I think the default should be no
> > >> > dictionary.
> > >> >
> > >> >     For example, when I did a POC for one customer, the table had
> > >> > 300 columns and 200 dimensions, but only 5 columns were used for
> > >> > filtering, so he only needs to set these 5 columns as dictionary and
> > >> > leave the other 195 columns as no dictionary. But now he needs to
> > >> > list the 195 columns in the dictionary_exclude property, which
> > >> > wastes time and makes the create table command huge. It will also
> > >> > impact the load performance.
> > >> >
> > >> >     So I suggest the dimension default should be no dictionary. This
> > >> > can also help the customer easily see which dictionary columns are
> > >> > useful.
> > >> >
> > >> >
> > >> >
> > >> > --
> > >> > View this message in context: http://apache-carbondata-
> > >> > mailing-list-archive.1130556.n5.nabble.com/DISCUSS-For-the-
> > >> > dimension-default-should-be-no-dictionary-tp8010.html
> > >> > Sent from the Apache CarbonData Mailing List archive mailing list
> > >> archive
> > >> > at Nabble.com.
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >> Thanks & Regards,
> > >> Ravi
> > >>
> >
> >
> >
> >
> >
> >
>
>
>
> --
> Thanks & Regards,
> Ravi
>



-- 
Regards
Liang
