Re: Kylin Performance

Alberto Ramón Tue, 27 Dec 2016 10:40:47 -0800

KYLIN-2165 will be nice.

Yes, 30% of the total cube build time, because the cardinality of the DIMs was low (2K and 11K).


You are right: when the cardinality of the DIMs is 1M, the intermediate
table is only 5% of the total. See the picture (I don't know whether you can
see pictures on this mailing list):
[image: Inline images 1]


2016-12-27 2:32 GMT+01:00 ShaoFeng Shi <[email protected]>:

> Alberto, I didn't test ORC format; but as you know, Kylin consumes the
> source data row by row (all columns at once), so I guess columnar format
> like ORC may not benefit much. But it is worth trying; if there is a better
> format we can switch to it.
>
> The "redistribute flat hive table" will add time but it can reduce time in
> subsequent cube building (avoid data skew), especially when there are lots
> of records. Usually it is fast (a couple of minutes to ten or twenty minutes)
> compared to the cube build time. You mentioned it took 30% of the total time;
> what is the total time and how many input records are there? When the input
> is small, the overhead may outweigh the benefit.
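
To illustrate the data-skew point in the paragraph above, here is a minimal Python sketch. It is not Kylin's code: the input (90% of rows sharing one key), the function names, and the choice of random assignment as the redistribution strategy are all assumptions made for illustration.

```python
import random
from collections import Counter


def sizes_by_key(keys, num_partitions):
    """Partition sizes when each row goes to the partition given by its key."""
    counts = Counter(k % num_partitions for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def sizes_random(keys, num_partitions, seed=0):
    """Partition sizes when each row is assigned to a random partition."""
    rng = random.Random(seed)
    counts = Counter(rng.randrange(num_partitions) for _ in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


# Hypothetical skewed input: 9000 rows share key 7, 1000 rows have distinct keys.
keys = [7] * 9000 + list(range(1000))

by_key = sizes_by_key(keys, 10)
rand = sizes_random(keys, 10)

# The largest partition determines how long the slowest task runs.
print("largest partition, by key:", max(by_key))
print("largest partition, random:", max(rand))
```

With the key-based assignment one partition holds almost all the rows, while the random assignment keeps every partition close to the average. The extra pass over the data is the overhead mentioned above, which is why it may not pay off for small inputs.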
>
> For the method you mentioned (count on fact table, then put the
> redistribute to step 1), actually it is supported in Kylin 1.5.4 (maybe
> also 1.5.3) with a config parameter; but that method is not recommended as
> it is unstable: in some cases (e.g., the fact table is a big Hive view, or
> it is a big table but not partitioned by date), a simple "select count(*)
> from fact_table" will cost lots of resources on Hadoop, and a second "create
> intermediate_table as select ..." will start the same mappers again.
>
> In contrast, the as-is method is relatively stable in extreme cases;
> usually the intermediate table is much smaller than the fact table, so count
> and redistribute on it are low-cost. In the next version there will be a
> further optimization (https://issues.apache.org/jira/browse/KYLIN-2165) to
> reduce the time in this step.
>
>
> 2016-12-27 1:20 GMT+08:00 Alberto Ramón <[email protected]>:
>
> > Hello
> >
> > Resending v0 with the English syntax corrected.
> >
> >
> > After tuning the cube:
> >   -  Use a compressed Hive input table
> >   -  Define Hierarchy, Joint, Dim
> >   -  . . .
> >
> > Now: 57% is for the first steps (flat table, steps 1, 2, 3) and 43% is for
> > building the cube.
> >
> > I saw that the flat table uses SEQUENCEFILE, so I tested
> >    ORC,
> >    ORC + Snappy
> >    ORC + Snappy + Vectorization
> >
> > without good results. Any more ideas?
> >
> >
> > I think 'Redistribute Flat Hive Table' is a simple count, yet it uses
> > *30% of total time*.
> >   Is this the normal case?
> >   Could we approximate this count with the count of the Fact Table (which
> > will be correct 99% of the time) and run it in parallel (//) with step 1?
> > Is it necessary to be precise?
> >
> > 2016-12-22 14:00 GMT+01:00 Li Yang <[email protected]>:
> >
> > > Very good work!
> > >
> > > Btw, we are also doing benchmarks on SSB and TPC-H data sets, based on
> > > the work below. Will share more info soon.
> > >
> > > - http://www.cs.umb.edu/~poneil/StarSchemaB.PDF
> > > - https://github.com/hortonworks/hive-testbench
> > >
> > >
> > > Cheers
> > > Yang
> > >
> > > On Wed, Dec 21, 2016 at 8:45 PM, Alberto Ramón <[email protected]> wrote:
> > >
> > > > When KYLIN-2149 <https://issues.apache.org/jira/browse/KYLIN-2149> is
> > > > solved, performance will *improve even more*, because:
> > > >
> > > > You know that 2016-05-05 belongs to May, week 18, and is a Thursday, but
> > > > Kylin doesn't know it.
> > > > It will try to calculate the combinations of 2016-05-05 with January,
> > > > February, March, ...; Monday, Tuesday, ...; W1, W2, ...; Q2, Q3, Q4 ==>
> > > > a lot of combinations are wasted.
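
The arithmetic behind the point quoted above can be sketched in Python. This is only an illustration of how many value combinations exist versus how many can actually occur; it makes no claim about how Kylin enumerates combinations internally. The year 2016, the use of ISO week numbers, and the helper names are assumptions chosen for the example.

```python
from datetime import date, timedelta


def days_of_year(year):
    """Yield every calendar day of the given year."""
    d = date(year, 1, 1)
    while d.year == year:
        yield d
        d += timedelta(days=1)


def derived(d):
    """Attributes fully determined by the date: month, ISO week, weekday, quarter."""
    iso_week = d.isocalendar()[1]
    quarter = (d.month - 1) // 3 + 1
    return (d.month, iso_week, d.isoweekday(), quarter)


days = list(days_of_year(2016))
months = {derived(d)[0] for d in days}
weeks = {derived(d)[1] for d in days}
weekdays = {derived(d)[2] for d in days}
quarters = {derived(d)[3] for d in days}

# Each date has exactly one month, week, weekday and quarter.
valid = len(days)

# Treating the five columns as independent multiplies their cardinalities.
cartesian = len(days) * len(months) * len(weeks) * len(weekdays) * len(quarters)

print("valid combinations:    ", valid)
print("cartesian combinations:", cartesian)
print("2016-05-05 ->", derived(date(2016, 5, 5)))
```

Only one combination per date can ever occur, so everything beyond `valid` is work that knowledge of the date hierarchy would avoid.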
> > > >
> > > > 2016-12-21 12:57 GMT+01:00 Luke_Selina <[email protected]>:
> > > >
> > > > > Great, and agreed! But like Alberto, I still have a question: why
> > > > > can one dim use only one rule (mandatory, joint, hierarchy) within
> > > > > an AGG?
> > > > >
> > > > > --
> > > > > View this message in context: http://apache-kylin.74782.x6.nabble.com/Kylin-Performance-tp6713p6728.html
> > > > > Sent from the Apache Kylin mailing list archive at Nabble.com.
> > > > >
> > > >
> > >
> >
>
>
>
> --
> Best regards,
>
> Shaofeng Shi 史少锋
>
