About Kylin performance, I completed some use cases:
https://github.com/albertoRamon/Kylin/tree/master/KylinPerformance

Any contribution or correction will be appreciated.

BR, Alb

2016-12-28 11:32 GMT+01:00 Alberto Ramón <a.ramonporto...@gmail.com>:

> Don't worry, I'm going to complete my KylinPerformace_I.pdf with new
> tests and some notes.
>
> 2016-12-28 11:19 GMT+01:00 ShaoFeng Shi <shaofeng...@apache.org>:
>
>> Alberto, the image cannot be displayed :-<
>>
>> 2016-12-28 2:39 GMT+08:00 Alberto Ramón <a.ramonporto...@gmail.com>:
>>
>> > KYLIN-2165 will be nice.
>> >
>> > Yes, 30% of the total cube build time, because the cardinality of
>> > the dimensions was low (2K and 11K).
>> >
>> > You are right: when the cardinality of the dimensions is 1M, the
>> > intermediate table is only 5% of the total. Picture (I don't know
>> > whether pictures can be seen on this mailing list):
>> > [image: Inline image 1]
>> >
>> > 2016-12-27 2:32 GMT+01:00 ShaoFeng Shi <shaofeng...@apache.org>:
>> >
>> >> Alberto, I didn't test the ORC format, but as you know, Kylin
>> >> consumes the source data row by row (all columns at once), so I
>> >> guess a columnar format like ORC may not help much. But this is a
>> >> good try; if there is a better format we can switch to it.
>> >>
>> >> The "redistribute flat hive table" step adds time, but it can
>> >> reduce the time of the subsequent cube building (it avoids data
>> >> skew), especially when there are lots of records. Usually it is
>> >> fast (a couple of minutes to ten or twenty minutes) compared to
>> >> the cube build time. You mentioned it took 30% of the total time;
>> >> what is the total time and how many input records are there? When
>> >> the input is small, the overhead may outweigh the benefit.
>> >>
>> >> For the method you mentioned (count on the fact table, then move
>> >> the redistribute to step 1), it is actually supported in Kylin
>> >> 1.5.4 (maybe also 1.5.3) with a config parameter, but that method
>> >> is not recommended because it is unstable: in some cases (e.g.,
>> >> the fact table is a big Hive view, or it is a big table but not
>> >> partitioned by date), a simple "select count(*) from fact_table"
>> >> will cost lots of resources on Hadoop, and a second "create
>> >> intermediate_table as select ..." will start the same mappers
>> >> again.
>> >>
>> >> In contrast, the as-is method is relatively stable for extreme
>> >> cases; usually the intermediate table is much smaller than the
>> >> fact table, so count and redistribute on it will be low-cost. In
>> >> the next version there will be a further optimization
>> >> (https://issues.apache.org/jira/browse/KYLIN-2165) to reduce the
>> >> time in this step.
>> >>
>> >> 2016-12-27 1:20 GMT+08:00 Alberto Ramón <a.ramonporto...@gmail.com>:
>> >>
>> >> > Hello
>> >> >
>> >> > Since v0, I have corrected the English syntax.
>> >> >
>> >> > After tuning the cube:
>> >> > - Use a compressed Hive input table
>> >> > - Define Hierarchy, Joint, Dim
>> >> > - . . .
>> >> >
>> >> > Now: 57% is for the first steps (flat table, steps 1, 2, 3) and
>> >> > 43% for building the cube.
>> >> >
>> >> > I saw that the flat table uses SEQUENCEFILE, so I tested:
>> >> > ORC,
>> >> > ORC + Snappy
>> >> > ORC + Snappy + Vectorization
>> >> >
>> >> > without good results. More ideas?
>> >> >
>> >> > I think 'Redistribute Flat Hive Table' is a simple count, and it
>> >> > uses
>> >> >
>> >> > *30% of total time*
>> >> > Is this the normal case?
>> >> > We could approximate this count by the count of the fact table
>> >> > (which will be true 99% of the time) and run it in parallel (//)
>> >> > with step 1. Is it necessary to be precise?
>> >> >
>> >> > 2016-12-22 14:00 GMT+01:00 Li Yang <liy...@apache.org>:
>> >> >
>> >> > > Very good work!
>> >> > >
>> >> > > Btw, we are also doing benchmarks on the SSB and TPC-H data
>> >> > > sets, based on the work below. Will share more info soon.
>> >> > >
>> >> > > - http://www.cs.umb.edu/~poneil/StarSchemaB.PDF
>> >> > > - https://github.com/hortonworks/hive-testbench
>> >> > >
>> >> > > Cheers
>> >> > > Yang
>> >> > >
>> >> > > On Wed, Dec 21, 2016 at 8:45 PM, Alberto Ramón
>> >> > > <a.ramonporto...@gmail.com> wrote:
>> >> > >
>> >> > > > When KYLIN-2149
>> >> > > > <https://issues.apache.org/jira/browse/KYLIN-2149> is
>> >> > > > solved, performance will *improve even more*, because:
>> >> > > >
>> >> > > > You know that 2016-05-05 belongs to May, week 18, and is a
>> >> > > > Thursday, but Kylin doesn't know it.
>> >> > > > It will try to calculate the combinations of 2016-05-05 with
>> >> > > > January, February, March, ..., Monday, Tuesday, ..., W1, W2,
>> >> > > > ..., Q2, Q3, Q4 ==> a lot of combinations are wasted.
>> >> > > >
>> >> > > > 2016-12-21 12:57 GMT+01:00 Luke_Selina <huangzhendon...@gmail.com>:
>> >> > > >
>> >> > > > > Great, and I agree! But I still have a question, like
>> >> > > > > Alberto: why can one dim in an AGG use only one rule
>> >> > > > > (mandatory, joint, hierarchy)?
>> >> > > > >
>> >> > > > > --
>> >> > > > > View this message in context:
>> >> > > > > http://apache-kylin.74782.x6.nabble.com/Kylin-Performance-tp6713p6728.html
>> >> > > > > Sent from the Apache Kylin mailing list archive at Nabble.com.
>> >>
>> >> --
>> >> Best regards,
>> >>
>> >> Shaofeng Shi 史少锋
>>
>> --
>> Best regards,
>>
>> Shaofeng Shi 史少锋
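
As a rough illustration of the KYLIN-2149 point quoted above (a date
already determines its month, week, weekday and quarter), here is a
minimal Python sketch. It is not Kylin code: the dimension names, the
calendar_rows and cuboid_sizes helpers, and the use of ISO week numbers
are all assumptions made for this example. It enumerates every
combination of five date-derived dimensions for the year 2016 and counts
the distinct rows each combination would hold.

```python
from datetime import date, timedelta
from itertools import combinations

# Hypothetical dimension names, chosen only for this illustration.
DIMS = ["date", "month", "week", "weekday", "quarter"]


def calendar_rows(year):
    """Build one row per day of `year` with its derived date attributes."""
    rows = []
    day = date(year, 1, 1)
    while day.year == year:
        rows.append({
            "date": day.isoformat(),
            "month": day.month,
            "week": day.isocalendar()[1],   # ISO week number (an assumption)
            "weekday": day.isoweekday(),    # 1 = Monday ... 7 = Sunday
            "quarter": (day.month - 1) // 3 + 1,
        })
        day += timedelta(days=1)
    return rows


def cuboid_sizes(rows, dims):
    """Map every non-empty combination of dims to its distinct row count."""
    sizes = {}
    for k in range(1, len(dims) + 1):
        for combo in combinations(dims, k):
            sizes[combo] = len({tuple(r[c] for c in combo) for r in rows})
    return sizes


rows = calendar_rows(2016)
sizes = cuboid_sizes(rows, DIMS)

# Combinations that include the date itself.
with_date = [combo for combo in sizes if "date" in combo]

print("combinations in total:       ", len(sizes))
print("combinations including date: ", len(with_date))
print("distinct sizes among those:  ", {sizes[c] for c in with_date})
```

Every combination that includes the date holds exactly as many rows as
the date alone, so building those combinations separately adds no
information; that is the waste the quoted message describes.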