Awesome, very nice job! So happy to have bought you guys ice cream today ;-) For people using the 0.7.x version: we are working on backporting this to the 0.7 branch now, and will release it as 0.7.3 or 0.7.4 after testing it against our internal real-world cases, once we have more confidence.
Since 0.8.x is still a very early version, we do not recommend using it for your production case yet. Thanks.

Best Regards!

---------------------
Luke Han

On Wed, Jul 1, 2015 at 6:45 PM, Li Yang <[email protected]> wrote:

> We've got good test results for the new cubing algorithm: ~50% shorter
> build time according to initial tests. The implementation (KYLIN-607) is
> on branch 0.8.
> - Test 1: 128 MB input, cube size 3.3 GB, expansion rate 26, build time
> reduced from 112 min to 49 min.
> - Test 2: 50 GB input, cube size 1.6 GB, expansion rate 0.03, build time
> reduced from 75 min to 45 min.
>
> The new cube build has the following steps.
>
> #1 Create Intermediate Flat Hive Table
> - Same as before
> #2 Extract Fact Table Distinct Columns
> - One round of MR that scans the full input
> - Extract the distinct values of each column that requires a dictionary
> (same as before)
> - Estimate each cuboid's size using HyperLogLog
> #3 Save Cuboid Statistics
> - Saves the stats collected in #2 to the metadata store
> #4 Build Dimension Dictionary
> - Same as before
> #5 Create HTable
> - Create the HTable according to the stats collected in #2
> #6 Build Cube
> - One round of MR that calculates the whole cube from the input.
> - Each mapper takes a split and calculates all cuboids in memory; its
> output is like a cube segment built from that split.
> - Each reducer corresponds to an HTable region; it takes input from the
> mappers and does a final round of aggregation where the mappers' key
> spaces overlap.
> #7 Load HFile to HBase Table
> #8 Update Cube Info
> #9 Garbage Collection
>
> Why the new build is faster:
>
> - The new algorithm reduces shuffling a lot, because aggregation happens
> first in the mapper and only the aggregated result is shuffled to the
> reducer. In the current algorithm, it is the records BEFORE aggregation
> that get shuffled.
> - There are only two MR jobs, which saves some MR overhead, especially
> if your cluster is busy.
> - The mapper becomes a CPU-intensive task because it does the in-memory
> cubing.
> - Mapper splits are cut very small, say 10 MB each, because due to cube
> expansion a 10 MB input may already yield 2.5 GB of output bytes without
> compression.
> - The new build creates HFiles directly in #6, whereas in the current
> build cuboids are saved to HDFS first and an additional step is required
> to convert them to HFiles.
>
> We will continue more thorough testing of the new algorithm with larger
> data sets.
>
> Cheers
> Yang
>
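To make the HyperLogLog estimation in step #2 above concrete, here is a minimal, self-contained Java sketch of estimating distinct row counts per cuboid. To be clear, this is not Kylin's actual implementation: the class and method names (CuboidSizeEstimator, Hll, hash64) are made up for illustration, and any good 64-bit hash could stand in for the FNV-1a used here.

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal HyperLogLog-style estimator of distinct row counts per cuboid.
// Illustrative only; Kylin has its own HLL counter class, and these names
// are invented for this sketch.
public class CuboidSizeEstimator {

    // One HLL sketch: 2^p registers, each holding the max "rank" seen.
    static final class Hll {
        final int p;
        final byte[] registers;

        Hll(int p) {
            this.p = p;
            this.registers = new byte[1 << p];
        }

        void add(long hash) {
            int idx = (int) (hash >>> (64 - p));   // top p bits pick a register
            long rest = hash << p;                 // the remaining 64-p bits
            int rank = (rest == 0) ? (64 - p + 1)
                                   : Long.numberOfLeadingZeros(rest) + 1;
            if (rank > registers[idx])
                registers[idx] = (byte) rank;
        }

        double estimate() {
            int m = registers.length;
            double sum = 0;
            int zeros = 0;
            for (byte r : registers) {
                sum += Math.pow(2.0, -r);
                if (r == 0) zeros++;
            }
            double alpha = 0.7213 / (1.0 + 1.079 / m);
            double e = alpha * m * m / sum;
            if (e <= 2.5 * m && zeros > 0)         // small-range correction
                e = m * Math.log((double) m / zeros);
            return e;
        }
    }

    // 64-bit FNV-1a; any decent 64-bit hash works here.
    static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xffL);
            h *= 0x100000001b3L;
        }
        return h;
    }

    // Feed every row of the scan into one sketch per cuboid (a cuboid is
    // identified by the column indexes it groups by). Each estimate times an
    // average row size approximates the cuboid's size, which is the kind of
    // stat that drives the HTable creation in step #5.
    public static Map<String, Double> estimate(List<String[]> rows,
                                               Map<String, int[]> cuboids) {
        Map<String, Hll> sketches = new HashMap<>();
        cuboids.keySet().forEach(name -> sketches.put(name, new Hll(14)));
        for (String[] row : rows) {
            for (Map.Entry<String, int[]> c : cuboids.entrySet()) {
                StringBuilder key = new StringBuilder();
                for (int col : c.getValue())
                    key.append(row[col]).append('\u0000'); // join the cuboid's dims
                sketches.get(c.getKey()).add(hash64(key.toString()));
            }
        }
        Map<String, Double> result = new HashMap<>();
        sketches.forEach((name, hll) -> result.put(name, hll.estimate()));
        return result;
    }
}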

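Similarly, a tiny sketch of the first point in the "why faster" discussion above: with in-mapper aggregation, the volume crossing the shuffle is bounded by the number of distinct keys in a split rather than by the number of input rows. Again, this is an illustrative stand-in, not the actual Kylin mapper; all names below are made up.

import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Contrast of shuffle volume: raw emit vs. aggregate-then-emit.
// Illustrative stand-in for the real MR mappers; not Kylin code.
public class ShuffleVolumeDemo {

    // A fact row reduced to: encoded cuboid key -> one numeric measure.
    record Row(String cuboidKey, long measure) {}

    // Current build: one record per input row crosses the shuffle.
    static List<Map.Entry<String, Long>> mapRaw(List<Row> split) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (Row r : split)
            out.add(new SimpleEntry<>(r.cuboidKey, r.measure));
        return out;
    }

    // New build: the mapper aggregates in memory first, so only one record
    // per distinct key crosses the shuffle; the reducer then merges the
    // partial aggregates where mapper key spaces overlap.
    static Map<String, Long> mapAggregated(List<Row> split) {
        Map<String, Long> partial = new HashMap<>();
        for (Row r : split)
            partial.merge(r.cuboidKey, r.measure, Long::sum);
        return partial;
    }

    public static void main(String[] args) {
        List<Row> split = List.of(
                new Row("2015-07|US", 1), new Row("2015-07|US", 1),
                new Row("2015-07|CN", 1), new Row("2015-07|US", 1));
        System.out.println("raw records shuffled:        " + mapRaw(split).size());        // 4
        System.out.println("aggregated records shuffled: " + mapAggregated(split).size()); // 2
    }
}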