Re: Kylin Performance

Alberto Ramón Wed, 28 Dec 2016 02:32:48 -0800

Don't worry, I'm going to complete my KylinPerformace_I.pdf with new tests
and some notes.


2016-12-28 11:19 GMT+01:00 ShaoFeng Shi <[email protected]>:

> Alberto, the image can not be displayed :-<
>
> 2016-12-28 2:39 GMT+08:00 Alberto Ramón <[email protected]>:
>
> > KYLIN-2165 will be nice.
> >
> > Yes, 30% of the total cube build time, because the cardinality of the
> > dimensions was low (2K and 11K).
> >
> > You are right: when the cardinality of the dimensions is 1M, the
> > intermediate table is only 5% of the total. See the picture (I don't know
> > whether pictures can be seen on this mailing list):
> > [image: Inline image 1]
> >
> >
> > 2016-12-27 2:32 GMT+01:00 ShaoFeng Shi <[email protected]>:
> >
> >> Alberto, I didn't test the ORC format; but as you know, Kylin consumes the
> >> source data row by row (all columns at once), so I guess a columnar format
> >> like ORC may not benefit much. But this is a good try; if there is a better
> >> format we can switch to it.
> >>
> >> The "redistribute flat hive table" step adds time, but it can reduce the
> >> time of the subsequent cube build (it avoids data skew), especially when
> >> there are lots of records. Usually it is fast (a couple of minutes to ten
> >> or twenty minutes) compared to the cube build time. You mentioned it took
> >> 30% of the total time; what is the total time and how many input records
> >> are there? When the input is small, the overhead may outweigh the benefit.
> >>
> >> The method you mentioned (count on the fact table, then move the
> >> redistribute to step 1) is actually supported in Kylin 1.5.4 (maybe also
> >> 1.5.3) with a config parameter; but that method is not recommended as it
> >> is unstable: in some cases (e.g., the fact table is a big Hive view, or it
> >> is a big table but not partitioned by date), a simple "select count(*)
> >> from fact_table" will cost lots of resources on Hadoop, and a second
> >> "create intermediate_table as select ..." will start the same mappers
> >> again.
> >>
> >> In contrast, the as-is method is relatively stable in extreme cases;
> >> usually the intermediate table is much smaller than the fact table, so
> >> count and redistribute on it will be low-cost. In the next version there
> >> will be a further optimization
> >> (https://issues.apache.org/jira/browse/KYLIN-2165) to reduce the time in
> >> this step.
> >>
> >>
> >> 2016-12-27 1:20 GMT+08:00 Alberto Ramón <[email protected]>:
> >>
> >> > Hello
> >> >
> >> > Compared with v0, I have corrected the English syntax.
> >> >
> >> >
> >> > After tuning the cube:
> >> >   -  Use a compressed Hive input table
> >> >   -  Define Hierarchy, Joint, Dim
> >> >   -  . . .
> >> >
> >> > Now: 57% is for the first steps (flat table, steps 1, 2, 3) and 43% for
> >> > building the cube.
> >> >
> >> > I saw that the flat table uses SEQUENCEFILE, so I tested:
> >> >    ORC,
> >> >    ORC + Snappy
> >> >    ORC + Snappy + Vectorization
> >> >
> >> > without good results. Any more ideas?
> >> >
> >> >
> >> > I think 'Redistribute Flat Hive Table' is a simple count, and it uses
> >> > *30% of total time*.
> >> >   Is this the normal case?
> >> >   We could approximate this count with the count of the fact table (which
> >> > will be true 99% of the time) and run it in parallel (//) with step 1. Is
> >> > it necessary to be precise?
> >> >
> >> > 2016-12-22 14:00 GMT+01:00 Li Yang <[email protected]>:
> >> >
> >> > > Very good work!
> >> > >
> >> > > Btw, we are also doing benchmarks on the SSB and TPC-H data sets, based
> >> > > on the work below. Will share more info soon.
> >> > >
> >> > > - http://www.cs.umb.edu/~poneil/StarSchemaB.PDF
> >> > > - https://github.com/hortonworks/hive-testbench
> >> > >
> >> > >
> >> > > Cheers
> >> > > Yang
> >> > >
> >> > > On Wed, Dec 21, 2016 at 8:45 PM, Alberto Ramón <[email protected]>
> >> > > wrote:
> >> > >
> >> > > > When KYLIN-2149 <https://issues.apache.org/jira/browse/KYLIN-2149>
> >> > > > is solved, the performance will *improve even more*, because:
> >> > > >
> >> > > > you know that 2016-05-05 belongs to May, week 18, and is a Thursday,
> >> > > > but Kylin doesn't know it.
> >> > > > It will try to calculate the combinations of 2016-05-05 with January,
> >> > > > February, March, ..., Monday, Tuesday, ..., W1, W2, ..., Q2, Q3, Q4
> >> > > > ==> a lot of combinations are wasted.
> >> > > >
> >> > > > 2016-12-21 12:57 GMT+01:00 Luke_Selina <[email protected]>:
> >> > > >
> >> > > > > Great, and agreed! But I still have a question, like Alberto: why
> >> > > > > can one dim use only one rule (mandatory, joint, hierarchy) in an
> >> > > > > aggregation group?
> >> > > > >
> >> > > > > --
> >> > > > > View this message in context:
> >> > > > > http://apache-kylin.74782.x6.nabble.com/Kylin-Performance-tp6713p6728.html
> >> > > > > Sent from the Apache Kylin mailing list archive at Nabble.com.
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >>
> >>
> >> --
> >> Best regards,
> >>
> >> Shaofeng Shi 史少锋
> >>
> >
> >
>
>
> --
> Best regards,
>
> Shaofeng Shi 史少锋
>
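A rough sketch of the "count and redistribute" idea discussed above: count the rows of the intermediate table, derive a number of output files from that count, and spread the rows so that no file is much larger than the others. The function names, the rows_per_file value and the random assignment are assumptions made for illustration only; this is not Kylin's code, and the real step runs as Hive jobs.

```python
import math
import random
from collections import Counter


def file_count(row_count, rows_per_file=1_000_000):
    """Hypothetical sizing rule: one output file per rows_per_file rows."""
    return max(1, math.ceil(row_count / rows_per_file))


def partition_sizes(keys, n_parts, by_key, seed=42):
    """Return the number of rows that land in each of n_parts partitions.

    by_key=True  : rows with the same key go to the same partition (skew-prone).
    by_key=False : rows are assigned at random (roughly even sizes).
    """
    rng = random.Random(seed)
    sizes = Counter()
    for key in keys:
        part = key % n_parts if by_key else rng.randrange(n_parts)
        sizes[part] += 1
    return [sizes[p] for p in range(n_parts)]


# A skewed input: 90% of the rows share key 0 (integer keys, for simplicity).
keys = [0] * 9000 + list(range(1, 1001))

skewed = partition_sizes(keys, n_parts=4, by_key=True)
even = partition_sizes(keys, n_parts=4, by_key=False)
print("by key:", skewed)
print("random:", even)
print("files for 2.5M rows:", file_count(2_500_000))
```

With the keyed assignment one partition receives almost all rows, which is the data skew the redistribute step is meant to avoid; the random assignment needs the row count only to decide how many files to create.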

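The KYLIN-2149 argument earlier in the thread can be put into numbers with a small sketch. Month, weekday, week and quarter are all determined by the date itself, so for a given date only one combination of them is valid. The count below is purely illustrative (the helper names are hypothetical, and this is not how Kylin enumerates cuboids); it only shows how many pairings are possible if the date attributes are treated as independent.

```python
from datetime import date

MONTHS = range(1, 13)    # January .. December
WEEKDAYS = range(1, 8)   # ISO weekdays: Monday=1 .. Sunday=7
WEEKS = range(1, 54)     # ISO weeks: 1 .. 53
QUARTERS = range(1, 5)   # Q1 .. Q4


def derived_attributes(d):
    """The single valid (month, weekday, week, quarter) tuple for a date."""
    _, iso_week, iso_weekday = d.isocalendar()
    quarter = (d.month - 1) // 3 + 1
    return (d.month, iso_weekday, iso_week, quarter)


def naive_combinations():
    """Pairings considered when the attributes are treated as independent."""
    return len(MONTHS) * len(WEEKDAYS) * len(WEEKS) * len(QUARTERS)


d = date(2016, 5, 5)
print(derived_attributes(d))   # (5, 4, 18, 2): May, Thursday, week 18, Q2
print(naive_combinations())    # 17808
```

Under these assumptions, 17807 of the 17808 candidate pairings for 2016-05-05 can never contain data.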