We've got good test results for the new cubing algorithm: roughly 50%
reduced build time according to initial tests. The implementation
(KYLIN-607) is on branch 0.8.
- Test 1, 128 MB input, cube size 3.3 GB, expansion rate 26, build time
reduced from 112 min to 49 min.
- Test 2, 50 GB input, cube size 1.6 GB, expansion rate 0.03, build time
reduced from 75 min to 45 min.

The new cube build has the following steps.

#1 Create Intermediate Flat Hive Table
   - Same as before
#2 Extract Fact Table Distinct Columns
   - One round of MR that scans the full input
   - Extract the distinct values of each column that requires a dictionary
(same as before)
   - Estimate the size of each cuboid using HyperLogLog (a rough sketch of
the idea follows the step list)
#3 Save Cuboid Statistics
   - Save the stats collected in #2 to the metadata store
#4 Build Dimension Dictionary
   - Same as before
#5 Create HTable
   - Create the HTable according to the stats collected in #2
#6 Build Cube
   - One round of MR to calculate the whole cube from the input.
   - Each mapper takes a split and calculates all cuboids in memory. The
output is like a cube segment built from that split. (A rough sketch of
the in-memory cubing also follows the step list.)
   - Each reducer corresponds to an HTable region; it takes input from the
mappers and does a final round of aggregation where the mappers' key
spaces overlap.
#7 Load HFile to HBase Table
#8 Update Cube Info
#9 Garbage Collection
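
For those curious, here is a minimal sketch of how the HyperLogLog
estimation in #2 can work. This is not the actual branch 0.8 code; it
assumes the stream-lib HyperLogLog class and a toy dimension-array row
format, and keeps one counter per cuboid so each counter's cardinality
approximates that cuboid's row count:

import com.clearspring.analytics.stream.cardinality.HyperLogLog;
import java.util.HashMap;
import java.util.Map;

// Toy estimator: one HLL counter per cuboid. Offering the concatenated
// dimension values of each row to a cuboid's counter makes its cardinality
// an estimate of that cuboid's row count (and hence its size).
public class CuboidSizeEstimator {
    private final int numDims;
    private final Map<Long, HyperLogLog> counters = new HashMap<>();

    public CuboidSizeEstimator(int numDims) {
        this.numDims = numDims;
        // Enumerate all 2^numDims - 1 non-empty cuboids (bit i set => dim i included).
        for (long cuboid = 1; cuboid < (1L << numDims); cuboid++) {
            counters.put(cuboid, new HyperLogLog(14)); // log2m = 14, roughly 1% error
        }
    }

    // Called once per input row while the distinct-column job scans the fact table.
    public void addRow(String[] dims) {
        for (Map.Entry<Long, HyperLogLog> e : counters.entrySet()) {
            StringBuilder key = new StringBuilder();
            for (int i = 0; i < numDims; i++) {
                if ((e.getKey() & (1L << i)) != 0) {
                    key.append(dims[i]).append('\u0001'); // simple field separator
                }
            }
            e.getValue().offer(key.toString());
        }
    }

    // Estimated distinct row count of one cuboid; multiply by an average row
    // size to get an estimated cuboid size in bytes for planning #5.
    public long estimatedRows(long cuboid) {
        return counters.get(cuboid).cardinality();
    }
}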
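
And a rough illustration of the in-memory cubing in #6, again just a
sketch under simplifying assumptions rather than the real mapper/reducer
code: each mapper aggregates its whole split into a map keyed by cuboid id
plus dimension values and emits only the aggregated records; the reducer
then merges records whose keys overlap across mappers.

import java.util.HashMap;
import java.util.Map;

// Sketch of per-split in-memory cubing with a single SUM measure: aggregate
// first, emit aggregated records only, so the shuffle carries far less data
// than the raw rows would.
public class InMemCubingSketch {

    // One input row: dimension values plus one numeric measure.
    static class Row {
        String[] dims;
        long measure;
        Row(String[] dims, long measure) { this.dims = dims; this.measure = measure; }
    }

    // "Mapper" side: build all cuboids of a split in memory.
    // Key layout: cuboid id + the included dimension values.
    static Map<String, Long> cubeSplit(Iterable<Row> split, int numDims) {
        Map<String, Long> agg = new HashMap<>();
        for (Row row : split) {
            for (long cuboid = 1; cuboid < (1L << numDims); cuboid++) {
                StringBuilder key = new StringBuilder(Long.toString(cuboid));
                for (int i = 0; i < numDims; i++) {
                    if ((cuboid & (1L << i)) != 0) {
                        key.append('\u0001').append(row.dims[i]);
                    }
                }
                agg.merge(key.toString(), row.measure, Long::sum);
            }
        }
        return agg; // what the mapper would emit, already aggregated
    }

    // "Reducer" side: a final round of aggregation where mapper key spaces overlap.
    static Map<String, Long> mergeMapperOutputs(Iterable<Map<String, Long>> outputs) {
        Map<String, Long> merged = new HashMap<>();
        for (Map<String, Long> out : outputs) {
            for (Map.Entry<String, Long> e : out.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return merged;
    }
}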

Some discussion of why the new build is faster:

   - The new algorithm reduces shuffling a lot, because aggregation happens
first in the mapper and only the aggregated result is shuffled to the
reducer. In the current algorithm, it is the records BEFORE aggregation
that get shuffled. (A toy illustration follows this list.)
   - Only two MR jobs are needed, which saves some MR overhead, especially
if your cluster is busy.
   - The mapper becomes a CPU-intensive task because it does the in-memory
cubing.
   - Mapper splits are cut very small, say 10 MB each, because of cube
expansion: 10 MB of input may already yield 2.5 GB of output bytes without
compression.
   - The new build creates HFiles directly in #6, whereas in the current
build, cuboids are saved to HDFS first and an additional step is required
to convert them to HFiles.
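
As a toy back-of-envelope for the first bullet (all numbers below are made
up for illustration, not Kylin measurements):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;

// Compares shuffling records BEFORE aggregation with shuffling only the
// aggregated result, on a synthetic key distribution.
public class ShuffleComparison {
    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        Random rnd = new Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            // many rows, relatively few distinct cuboid keys
            keys.add("cuboidKey-" + rnd.nextInt(5_000));
        }

        // Current build: every pre-aggregation record crosses the network.
        long shuffledBefore = keys.size();

        // New build: aggregation happens in the mapper, so only one record
        // per distinct key is shuffled to the reducer.
        long shuffledAfter = new HashSet<>(keys).size();

        System.out.println("records shuffled before aggregation: " + shuffledBefore);
        System.out.println("records shuffled after aggregation:  " + shuffledAfter);
    }
}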

We will continue more thorough testing of the new algorithm with larger
data sets.

Cheers
Yang
