Alberto, the image cannot be displayed :-<

2016-12-28 2:39 GMT+08:00 Alberto Ramón <[email protected]>:
> KYLIN-2165 will be nice
>
> Yes, 30% of the total cube build, because the cardinality of the
> dimensions was low (2K and 11K)
>
> You are right: when the cardinality of the dimension is 1M, the
> intermediate table is only 5% of the total: see the picture (I don't know
> whether you can see pictures on this mailing list)
> [image: Inline image 1]
>
>
> 2016-12-27 2:32 GMT+01:00 ShaoFeng Shi <[email protected]>:
>
>> Alberto, I didn't test the ORC format; but as you know, Kylin consumes the
>> source data row by row (all columns at once), so I guess a columnar format
>> like ORC may not benefit much. But it is a good try; if there is a better
>> format we can switch to it.
>>
>> The "redistribute flat hive table" step adds time, but it can reduce time
>> in the subsequent cube building (it avoids data skew), especially when
>> there are lots of records. Usually it is fast (a couple of minutes to ten
>> or twenty minutes) compared to the cube build time. You mentioned it took
>> 30% of the total time; what is the total time and how many input records
>> are there? When the input is small, the overhead may outweigh the benefit.
>>
>> The method you mentioned (count on the fact table, then move the
>> redistribute to step 1) is actually supported in Kylin 1.5.4 (maybe
>> also 1.5.3) with a config parameter; but that method is not recommended as
>> it is unstable: in some cases (e.g., the fact table is a big Hive view, or
>> it is a big table but not partitioned by date), a simple "select count(*)
>> from fact_table" will cost lots of resources on Hadoop, and a second
>> "create intermediate_table as select ..." will start the same mappers
>> again.
>>
>> In contrast, the as-is method is relatively stable for extreme cases;
>> usually the intermediate table is much smaller than the fact table, so
>> count and redistribute on it will be low-cost. In the next version there
>> will be a further optimization
>> (https://issues.apache.org/jira/browse/KYLIN-2165) to reduce the time in
>> this step.
>>
>>
>> 2016-12-27 1:20 GMT+08:00 Alberto Ramón <[email protected]>:
>>
>> > Hello
>> >
>> > Since v0, I have corrected the English syntax.
>> >
>> >
>> > After tuning the cube:
>> > - Use a compressed Hive input table
>> > - Define Hierarchy, Joint, Dim
>> > - . . .
>> >
>> > Now: 57% is for the first steps (flat table, steps 1, 2, 3) and 43% is
>> > for building the cube
>> >
>> > I saw that the flat table uses SEQUENCEFILE, so I tested:
>> > ORC,
>> > ORC + Snappy
>> > ORC + Snappy + Vectorization
>> >
>> > without good results. Any more ideas?
>> >
>> >
>> > I'm thinking that 'Redistribute Flat Hive Table' is a simple count and
>> > uses
>> >
>> > *30% of total time*
>> > Is this the normal case?
>> > We could approximate this count by the count of the fact table (which
>> > will be true 99% of the time) and run it in parallel (//) with step 1.
>> > Does it need to be precise?
>> >
>> > 2016-12-22 14:00 GMT+01:00 Li Yang <[email protected]>:
>> >
>> > > Very good work!
>> > >
>> > > Btw, we are also doing benchmarks on SSB and TPC-H data sets, based on
>> > > the work below. Will share more info soon.
>> > >
>> > > - http://www.cs.umb.edu/~poneil/StarSchemaB.PDF
>> > > - https://github.com/hortonworks/hive-testbench
>> > >
>> > >
>> > > Cheers
>> > > Yang
>> > >
>> > > On Wed, Dec 21, 2016 at 8:45 PM, Alberto Ramón <[email protected]>
>> > > wrote:
>> > >
>> > > > When KYLIN-2149 <https://issues.apache.org/jira/browse/KYLIN-2149>
>> > > > is solved, the performance will *improve even more*, because:
>> > > >
>> > > > You know that 2016-05-05 belongs to May, Week 18, and is a Thursday,
>> > > > but Kylin doesn't know it.
>> > > > It will try to calculate the combinations of 2016-05-05 with January
>> > > > February March, ... Monday Tuesday ..., W1 W2 ..., Q2 Q3 Q4 ==> a lot
>> > > > of combinations are wasted
>> > > >
>> > > > 2016-12-21 12:57 GMT+01:00 Luke_Selina <[email protected]>:
>> > > >
>> > > > > Great, and I agree! But I still have a question, like Alberto: why,
>> > > > > in an AGG (aggregation group), can one dim use only one rule
>> > > > > (mandatory, joint, hierarchy)?
>> > > > >
>> > > > > --
>> > > > > View this message in context:
>> > > > > http://apache-kylin.74782.x6.nabble.com/Kylin-Performance-tp6713p6728.html
>> > > > > Sent from the Apache Kylin mailing list archive at Nabble.com.
>> > > > >
>> > > >
>> > >
>> >
>>
>>
>>
>> --
>> Best regards,
>>
>> Shaofeng Shi 史少锋
>>
>

--
Best regards,

Shaofeng Shi 史少锋
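
Editor's sketch for the KYLIN-2149 remark quoted above: the point is that month, week, weekday and quarter are all determined by the date, so most combinations of those five columns can never occur in the data. The Python below only counts calendar combinations for 2016; it does not model Kylin's cuboid builder. The names `derived_attributes`, `observed` and `independent` are made up for illustration, and ISO week numbering is an assumption (the thread does not say which week convention is used).

```python
from datetime import date, timedelta

YEAR = 2016  # year of the example date mentioned in the thread


def derived_attributes(d):
    """Return (month, ISO week, ISO weekday, quarter), all fixed by the date."""
    _, iso_week, iso_weekday = d.isocalendar()
    quarter = (d.month - 1) // 3 + 1
    return d.month, iso_week, iso_weekday, quarter


# Every calendar day of the year.
days = []
current = date(YEAR, 1, 1)
while current.year == YEAR:
    days.append(current)
    current += timedelta(days=1)

# Combinations that really occur: exactly one per date.
observed = {(d,) + derived_attributes(d) for d in days}

# Distinct values per derived column.
months = {row[1] for row in observed}
weeks = {row[2] for row in observed}
weekdays = {row[3] for row in observed}
quarters = {row[4] for row in observed}

# Combinations if the five columns were treated as independent of each other.
independent = (
    len(days) * len(months) * len(weeks) * len(weekdays) * len(quarters)
)

print("combinations that occur:", len(observed))
print("combinations if independent:", independent)
```

The gap between the two printed numbers is the "wasted" space Alberto describes: only one combination per date is real, and everything else exists only if the dependency between the columns is ignored.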
