Don't worry, I'm going to complete my KylinPerformace_I.pdf with new tests and some notes.
2016-12-28 11:19 GMT+01:00 ShaoFeng Shi <[email protected]>:

> Alberto, the image can not be displayed :-<
>
> 2016-12-28 2:39 GMT+08:00 Alberto Ramón <[email protected]>:
>
> > Kylin 2165 will be nice
> >
> > Yes, 30% of the total cube, because the cardinality of the DIMs was low
> > (2K and 11K)
> >
> > You are right: when the cardinality of the DIMs is 1M, the intermediate
> > table is only 5% of the total: Picture (I don't know whether you can see
> > pictures on this mailing list)
> > [image: Imágenes integradas 1]
> >
> > 2016-12-27 2:32 GMT+01:00 ShaoFeng Shi <[email protected]>:
> >
> >> Alberto, I didn't test the ORC format; but as you know, Kylin consumes
> >> the source data row by row (all columns at once), so I guess a columnar
> >> format like ORC may not benefit much. But this is a good try; if there
> >> is a better format we can switch to it.
> >>
> >> The "redistribute flat hive table" step will add time, but it can reduce
> >> the time of the subsequent cube building (it avoids data skew),
> >> especially when there are lots of records. Usually it is fast (a couple
> >> of minutes to ten or twenty minutes) compared to the cube build time.
> >> You mentioned it took 30% of the total time; what is the total time and
> >> what is the number of input records? When the input is small, the
> >> overhead may outweigh the benefit.
> >>
> >> The method you mentioned (count on the fact table, then move the
> >> redistribute to step 1) is actually supported in Kylin 1.5.4 (maybe also
> >> 1.5.3) with a config parameter; but that method is not recommended as it
> >> is unstable: in some cases (e.g., the fact table is a big hive view, or
> >> it is a big table but not partitioned by date), a simple "select
> >> count(*) from fact_table" will cost lots of resources on Hadoop, and a
> >> second "create intermediate_table as select ..." will start the same
> >> mappers again.
> >>
> >> In contrast, the as-is method is relatively stable for extreme cases;
> >> usually the intermediate table is much smaller than the fact table, so
> >> count and redistribute on it will be low-cost. In the next version there
> >> will be a further optimization
> >> (https://issues.apache.org/jira/browse/KYLIN-2165) to reduce the time in
> >> this step.
> >>
> >> 2016-12-27 1:20 GMT+08:00 Alberto Ramón <[email protected]>:
> >>
> >> > Hello
> >> >
> >> > Since v0, I have corrected the English syntax
> >> >
> >> > After tuning the cube:
> >> > - Use Hive input compress table
> >> > - Define Hierarchy, Joint, Dim
> >> > - . . .
> >> >
> >> > Now: 57% is for the first steps (flat table, steps 1, 2, 3) and 43%
> >> > for building the cube
> >> >
> >> > I saw that the flat table uses SEQUENCEFILE, so I tested:
> >> > ORC,
> >> > ORC + Snappy
> >> > ORC + Snappy + Vectorization
> >> >
> >> > without good results. Any more ideas?
> >> >
> >> > I'm thinking that 'Redistribute Flat Hive Table' is a simple count and
> >> > uses *30% of total time*
> >> > Is this the normal case?
> >> > Can we approximate this count by the count of the fact table (which
> >> > will be true 99% of the time) and run it in parallel with step 1? Is
> >> > it necessary to be precise?
> >> >
> >> > 2016-12-22 14:00 GMT+01:00 Li Yang <[email protected]>:
> >> >
> >> > > Very good work!
> >> > >
> >> > > Btw, we are also doing benchmarks on the SSB and TPC-H data sets,
> >> > > based on the work below. Will share more info soon.
> >> > >
> >> > > - http://www.cs.umb.edu/~poneil/StarSchemaB.PDF
> >> > > - https://github.com/hortonworks/hive-testbench
> >> > >
> >> > > Cheers
> >> > > Yang
> >> > >
> >> > > On Wed, Dec 21, 2016 at 8:45 PM, Alberto Ramón <[email protected]>
> >> > > wrote:
> >> > >
> >> > > > When Kylin 2149 <https://issues.apache.org/jira/browse/KYLIN-2149>
> >> > > > is solved, the performance will *improve even more*, because:
> >> > > >
> >> > > > You know that 2016-05-05 belongs to May, Week 18, and Thursday,
> >> > > > but Kylin doesn't know it.
> >> > > > It will try to calculate the combinations of 2016-05-05 with
> >> > > > January, February, March, ..., Monday, Tuesday, ..., W1, W2, ...,
> >> > > > Q2, Q3, Q4 ==> There are a lot of wasted combinations
> >> > > >
> >> > > > 2016-12-21 12:57 GMT+01:00 Luke_Selina <[email protected]>:
> >> > > >
> >> > > > > Great, and agreed! But I still have a question, like Alberto:
> >> > > > > why can one dim in an AGG use only one rule (mandatory, joint,
> >> > > > > hierarchy)?
> >> > > > >
> >> > > > > --
> >> > > > > View this message in context:
> >> > > > > http://apache-kylin.74782.x6.nabble.com/Kylin-Performance-tp6713p6728.html
> >> > > > > Sent from the Apache Kylin mailing list archive at Nabble.com.
> >>
> >> --
> >> Best regards,
> >>
> >> Shaofeng Shi 史少锋
>
> --
> Best regards,
>
> Shaofeng Shi 史少锋
