We've got good test results for the new cubing algorithm: roughly 50% shorter build time according to initial tests. The implementation (KYLIN-607) is on branch 0.8.

- Test 1: 128 MB input, cube size 3.3 GB, expansion rate 26, build time reduced from 112 min to 49 min.
- Test 2: 50 GB input, cube size 1.6 GB, expansion rate 0.03, build time reduced from 75 min to 45 min.
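
A quick sanity check of those expansion rates, assuming "expansion rate" here simply means cube size divided by input size:

Test 1: 3.3 GB / 128 MB ≈ 3379 MB / 128 MB ≈ 26
Test 2: 1.6 GB / 50 GB = 0.032 ≈ 0.03

So the first cube blows up by a factor of ~26, while the second ends up far smaller than its source data.
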
The new cube build has the following steps.

#1 Create Intermediate Flat Hive Table
- Same as before

#2 Extract Fact Table Distinct Columns
- One round of MR that scans the full input
- Extracts the distinct values of each column that requires a dictionary (same as before)
- Estimates cuboid sizes using HyperLogLog (a rough sketch of this estimate is in the P.S. below)

#3 Save Cuboid Statistics
- Saves the stats collected in #2 to the metadata store

#4 Build Dimension Dictionary
- Same as before

#5 Create HTable
- Creates the HTable according to the stats collected in #2

#6 Build Cube
- One round of MR that calculates the whole cube from the input.
- Each mapper takes a split and calculates all cuboids in memory. The output is like a cube segment built from that split.
- Each reducer corresponds to an HTable region; it takes input from the mappers and does a final round of aggregation where mapper key spaces overlap.

#7 Load HFile to HBase Table

#8 Update Cube Info

#9 Garbage Collection

Discussion of why the new build is faster:

- The new algorithm reduces shuffling a lot, because aggregation happens first in the mapper and only the aggregated result is shuffled to the reducers. In the current algorithm, it is the records BEFORE aggregation that get shuffled (see the second sketch in the P.S.).
- Only two MR jobs, which saves some MR overhead, especially if your cluster is busy.
- The mapper becomes a CPU-intensive task because it does the in-memory cubing.
- Mapper splits are cut very small, say 10 MB each, because of cube expansion: 10 MB of input may already yield 2.5 GB of output bytes without compression.
- The new build creates HFiles directly in #6, whereas in the current build cuboids are saved to HDFS first and an additional step is required to convert them to HFiles.

We will continue to do more thorough testing of the new algorithm with larger data sets.

Cheers
Yang
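
P.S. For anyone curious about the HyperLogLog estimate in step #2, below is a rough, self-contained sketch of the idea; it is NOT the KYLIN-607 code. It uses stream-lib's HyperLogLog (com.clearspring.analytics.stream.cardinality) as a stand-in for whatever counter Kylin uses internally, works on plain String rows, and naively keeps one counter per cuboid, which is only practical for a small number of dimensions.

import com.clearspring.analytics.stream.cardinality.HyperLogLog;
import java.util.HashMap;
import java.util.Map;

// Sketch only: estimate the row count of every cuboid in a single scan of the input.
public class CuboidSizeEstimatorSketch {

    private final int numDims;
    // One HLL counter per cuboid; a cuboid id is a bitmask over the dimension columns.
    private final Map<Long, HyperLogLog> counters = new HashMap<>();

    public CuboidSizeEstimatorSketch(int numDims) {
        this.numDims = numDims;
        for (long cuboid = 1; cuboid < (1L << numDims); cuboid++) {
            counters.put(cuboid, new HyperLogLog(14));   // log2m = 14, roughly 1% error
        }
    }

    // Feed one input row, already projected to its dimension columns.
    public void addRow(String[] dimValues) {
        for (Map.Entry<Long, HyperLogLog> e : counters.entrySet()) {
            long cuboid = e.getKey();
            StringBuilder key = new StringBuilder();
            for (int i = 0; i < numDims; i++) {
                if ((cuboid & (1L << i)) != 0) {
                    key.append(dimValues[i]).append('\u0001');  // \u0001 as a field separator
                }
            }
            e.getValue().offer(key.toString());
        }
    }

    // Estimated distinct dimension-value combinations, i.e. estimated rows, per cuboid.
    public Map<Long, Long> estimatedRowsPerCuboid() {
        Map<Long, Long> result = new HashMap<>();
        for (Map.Entry<Long, HyperLogLog> e : counters.entrySet()) {
            result.put(e.getKey(), e.getValue().cardinality());
        }
        return result;
    }
}

Presumably these estimated row counts are what #3 persists and what #5 consults to size the HTable regions that the #6 reducers map onto.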
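
And a similarly rough sketch of the shuffle point in the discussion above. In the current build the mapper emits one record per raw row per cuboid, so everything crosses the network before any aggregation; in the new build the mapper aggregates in memory first, so only the already-aggregated (key, measure) pairs are shuffled and the reducer merely merges partial results where mapper key spaces overlap. The class below is illustrative only (String keys, a single SUM measure), not Kylin's mapper.

import java.util.HashMap;
import java.util.Map;

// Illustrative only: in-mapper aggregation, so only aggregated records reach the shuffle.
public class InMapperAggregationSketch {

    // Per-mapper buffer: cuboid row key -> partially aggregated measure (a simple SUM here).
    private final Map<String, Long> buffer = new HashMap<>();

    // Called for every (cuboid key, measure value) derived from a raw row in this split.
    public void collect(String cuboidRowKey, long measureValue) {
        buffer.merge(cuboidRowKey, measureValue, Long::sum);
    }

    // At the end of the split, only one record per distinct key leaves the mapper.
    // Reducers then merge partial aggregates from different mappers instead of
    // aggregating every raw record, which is why the shuffle volume drops so much.
    public Map<String, Long> flushToShuffle() {
        return new HashMap<>(buffer);
    }
}

In the real #6 the mapper goes further and builds what is effectively a whole cube segment for its split in memory, but the shuffle-volume argument is the same.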
