Re: Kylin Performance

ShaoFeng Shi Wed, 28 Dec 2016 02:20:45 -0800

Alberto, the image cannot be displayed :-<

2016-12-28 2:39 GMT+08:00 Alberto Ramón <[email protected]>:


> KYLIN-2165 will be nice.
>
> Yes, 30% of the total, because the cardinality of the dimensions was low
> (2K and 11K).
>
> You are right: when the cardinality of the dimensions is 1M, the
> intermediate table is only 5% of the total. See the picture (I don't know
> whether pictures can be seen on this mailing list):
> [image: Inline images 1]
>
>
> 2016-12-27 2:32 GMT+01:00 ShaoFeng Shi <[email protected]>:
>
>> Alberto, I didn't test the ORC format; but as you know, Kylin consumes the
>> source data row by row (all columns at once), so I guess a columnar format
>> like ORC may not benefit much. Still, it is a good thing to try; if there
>> is a better format, we can switch to it.
>>
>> The "redistribute flat hive table" step adds time, but it can reduce the
>> time of the subsequent cube building (it avoids data skew), especially
>> when there are lots of records. Usually it is fast (from a couple of
>> minutes to ten or twenty minutes) compared to the cube build time. You
>> mentioned it took 30% of the total time; what is the total time, and how
>> many input records are there? When the input is small, the overhead may
>> outweigh the benefit.
>>
>> The method you mentioned (count on the fact table, then move the
>> redistribute into step 1) is actually supported in Kylin 1.5.4 (maybe
>> also 1.5.3) with a config parameter, but it is not recommended because
>> it is unstable. In some cases (e.g., the fact table is a big Hive view,
>> or a big table that is not partitioned by date), a simple
>> "select count(*) from fact_table" will cost lots of resources on Hadoop,
>> and a second "create intermediate_table as select ..." will start the
>> same mappers again.
>>
>> In contrast, the as-is method is relatively stable in extreme cases;
>> usually the intermediate table is much smaller than the fact table, so
>> the count and redistribute on it are low-cost. In the next version there
>> will be a further optimization
>> (https://issues.apache.org/jira/browse/KYLIN-2165) to reduce the time of
>> this step.
>>
>>
>> 2016-12-27 1:20 GMT+08:00 Alberto Ramón <[email protected]>:
>>
>> > Hello
>> >
>> > Compared with v0, I have corrected the English syntax.
>> >
>> >
>> > After tuning the cube:
>> >   -  Use a compressed Hive input table
>> >   -  Define Hierarchy, Joint, Dim
>> >   -  . . .
>> >
>> > Now: 57% is for the first steps (flat table, steps 1, 2, 3) and 43% is
>> > for building the cube.
>> >
>> > I saw that the flat table uses SEQUENCEFILE, so I tested using
>> >    ORC,
>> >    ORC + Snappy
>> >    ORC + Snappy + Vectorization
>> >
>> > without good results. More ideas?
>> >
>> >
>> > I think that 'Redistribute Flat Hive Table' is a simple count, and it
>> > uses *30% of total time*.
>> >   Is this the normal case?
>> >   Could we approximate this count by the count of the fact table (which
>> >   will be true 99% of the time) and run it in parallel with step 1? Is
>> >   it necessary to be precise?
>> >
>> > 2016-12-22 14:00 GMT+01:00 Li Yang <[email protected]>:
>> >
>> > > Very good work!
>> > >
>> > > Btw, we are also doing benchmarks on the SSB and TPC-H data sets, based
>> > > on the work below. Will share more info soon.
>> > >
>> > > - http://www.cs.umb.edu/~poneil/StarSchemaB.PDF
>> > > - https://github.com/hortonworks/hive-testbench
>> > >
>> > >
>> > > Cheers
>> > > Yang
>> > >
>> > > On Wed, Dec 21, 2016 at 8:45 PM, Alberto Ramón <[email protected]>
>> > > wrote:
>> > >
>> > > > When KYLIN-2149 <https://issues.apache.org/jira/browse/KYLIN-2149>
>> > > > is solved, the performance will *improve even more*, because:
>> > > >
>> > > > You know that 2016-05-05 belongs to May, week 18, and is a Thursday,
>> > > > but Kylin doesn't know it.
>> > > > It will try to calculate the combinations of 2016-05-05 with January,
>> > > > February, March, ..., Monday, Tuesday, ..., W1, W2, ..., Q2, Q3, Q4
>> > > > ==> a lot of combinations are wasted.
>> > > >
>> > > > 2016-12-21 12:57 GMT+01:00 Luke_Selina <[email protected]>:
>> > > >
>> > > > > Great, and agreed! But like Alberto, I still have a question: why
>> > > > > can one dimension use only one rule (mandatory, joint, hierarchy)
>> > > > > within an aggregation group?
>> > > > >
>> > > > > --
>> > > > > View this message in context:
>> > > > > http://apache-kylin.74782.x6.nabble.com/Kylin-Performance-tp6713p6728.html
>> > > > > Sent from the Apache Kylin mailing list archive at Nabble.com.
>> > > > >
>> > > >
>> > >
>> >
>>
>>
>>
>> --
>> Best regards,
>>
>> Shaofeng Shi 史少锋
>>
>
>


-- 
Best regards,

Shaofeng Shi 史少锋
