Re: Kylin Performance

Alberto Ramón Fri, 30 Dec 2016 16:16:44 -0800

About Kylin performance, I have completed some use cases:


https://github.com/albertoRamon/Kylin/tree/master/KylinPerformance


Any contribution or correction will be appreciated.
BR, Alb

2016-12-28 11:32 GMT+01:00 Alberto Ramón <a.ramonporto...@gmail.com>:

> Don't worry, I'm going to complete my KylinPerformace_I.pdf with new
> tests and some notes.
>
> 2016-12-28 11:19 GMT+01:00 ShaoFeng Shi <shaofeng...@apache.org>:
>
>> Alberto, the image cannot be displayed :-<
>>
>> 2016-12-28 2:39 GMT+08:00 Alberto Ramón <a.ramonporto...@gmail.com>:
>>
>> > Kylin 2165 will be nice
>> >
>> > Yes, 30% of the total cube build time, because the cardinality of the
>> > dimensions was low (2K and 11K).
>> >
>> > You are right: when the cardinality of the dimensions is 1M, the
>> > intermediate table is only 5% of the total. See the picture (I don't know
>> > whether pictures can be seen on this mailing list):
>> > [image: Inline images 1]
>> >
>> >
>> > 2016-12-27 2:32 GMT+01:00 ShaoFeng Shi <shaofeng...@apache.org>:
>> >
>> >> Alberto, I didn't test the ORC format; but as you know, Kylin consumes
>> >> the source data row by row (all columns at once), so I guess a columnar
>> >> format like ORC may not help much. Still, it is a good thing to try; if
>> >> there is a better format we can switch to it.
>> >>
>> >> The "redistribute flat hive table" step adds time, but it can reduce the
>> >> time of the subsequent cube build (by avoiding data skew), especially
>> >> when there are lots of records. Usually it is fast (a couple of minutes
>> >> to ten or twenty minutes) compared to the cube build time. You mentioned
>> >> it took 30% of the total time; what is the total time, and how many input
>> >> records are there? When the input is small, the overhead may outweigh
>> >> the benefit.
>> >>
>> >> For the method you mentioned (count on the fact table, then put the
>> >> redistribute in step 1), it is actually supported in Kylin 1.5.4 (maybe
>> >> also 1.5.3) through a config parameter; but that method is not
>> >> recommended because it is unstable. In some cases (e.g., the fact table
>> >> is a big Hive view, or it is a big table but not partitioned by date), a
>> >> simple "select count(*) from fact_table" will cost lots of resources on
>> >> Hadoop, and the second "create intermediate_table as select ..." will
>> >> start the same mappers again.
>> >>
>> >> In contrast, the as-is method is relatively stable in extreme cases;
>> >> usually the intermediate table is much smaller than the fact table, so
>> >> count and redistribute on it are low-cost. In the next version there will
>> >> be a further optimization
>> >> (https://issues.apache.org/jira/browse/KYLIN-2165) to reduce the time of
>> >> this step.
>> >>
>> >>
>> >> 2016-12-27 1:20 GMT+08:00 Alberto Ramón <a.ramonporto...@gmail.com>:
>> >>
>> >> > Hello
>> >> >
>> >> > Since v0, I have corrected the English syntax.
>> >> >
>> >> >
>> >> > After tuning the cube:
>> >> >   -  Use a compressed Hive input table
>> >> >   -  Define Hierarchy, Joint, Dim
>> >> >   -  . . .
>> >> >
>> >> > Now: 57% is for the first steps (flat table, steps 1, 2, 3) and 43% is
>> >> > for building the cube.
>> >> >
>> >> > I saw that the flat table uses SEQUENCEFILE, so I tested using
>> >> >    ORC,
>> >> >    ORC + Snappy
>> >> >    ORC + Snappy + Vectorization
>> >> >
>> >> > without good results. Any more ideas?
>> >> >
>> >> >
>> >> > I'm thinking that 'Redistribute Flat Hive Table' is a simple count, and
>> >> > it uses *30% of total time*.
>> >> >   Is this the normal case?
>> >> >   We could approximate this count with the count of the fact table
>> >> > (this will be true 99% of the time) and run it in parallel with step 1.
>> >> > Is it necessary to be precise?
>> >> >
>> >> > 2016-12-22 14:00 GMT+01:00 Li Yang <liy...@apache.org>:
>> >> >
>> >> > > Very good work!
>> >> > >
>> >> > > Btw, we are also doing benchmarks on the SSB and TPC-H data sets, based
>> >> > > on the work below. Will share more info soon.
>> >> > >
>> >> > > - http://www.cs.umb.edu/~poneil/StarSchemaB.PDF
>> >> > > - https://github.com/hortonworks/hive-testbench
>> >> > >
>> >> > >
>> >> > > Cheers
>> >> > > Yang
>> >> > >
>> >> > > On Wed, Dec 21, 2016 at 8:45 PM, Alberto Ramón <a.ramonporto...@gmail.com> wrote:
>> >> > >
>> >> > > > When Kylin 2149 <https://issues.apache.org/jira/browse/KYLIN-2149>
>> >> > > > is solved, the performance will *improve even more*, because:
>> >> > > >
>> >> > > > You know that 2016-05-05 belongs to May, week 18, and is a Thursday,
>> >> > > > but Kylin doesn't know it.
>> >> > > > It will try to calculate the combinations of 2016-05-05 with January,
>> >> > > > February, March, ..., Monday, Tuesday, ..., W1, W2, ..., Q2, Q3, Q4
>> >> > > > ==> a lot of combinations are wasted.
>> >> > > >
>> >> > > > 2016-12-21 12:57 GMT+01:00 Luke_Selina <huangzhendon...@gmail.com>:
>> >> > > >
>> >> > > > > Great, and I agree! But like Alberto, I still have a question:
>> >> > > > > why can one dim in an AGG use only one rule (mandatory, joint,
>> >> > > > > hierarchy)?
>> >> > > > >
>> >> > > > > --
>> >> > > > > View this message in context:
>> >> > > > > http://apache-kylin.74782.x6.nabble.com/Kylin-Performance-tp6713p6728.html
>> >> > > > > Sent from the Apache Kylin mailing list archive at Nabble.com.
>> >> > > > >
>> >> > > >
>> >> > >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Best regards,
>> >>
>> >> Shaofeng Shi 史少锋
>> >>
>> >
>> >
>>
>>
>> --
>> Best regards,
>>
>> Shaofeng Shi 史少锋
>>
>
>
