Awesome result, saved a lot time and computing resources.
Best Regards! --------------------- Luke Han On Wed, Jul 1, 2015 at 6:38 PM, liyang (JIRA) <[email protected]> wrote: > > [ > https://issues.apache.org/jira/browse/KYLIN-607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609904#comment-14609904 > ] > > liyang commented on KYLIN-607: > ------------------------------ > > We've got good test result about the new cubing algorithm, ~50% reduced > build time according to initial test. The implementation is on branch 0.8. > - Test 1, 128 MB input, cube size 3.3 GB, expansion rate 26, build time > reduced from 112 min to 49 min. > - Test 2, 50 GB input, cube size 1.6 GB, expansion rate 0.03, build time > reduced from 75 min to 45 min. > > The new cube build has following steps. > > #1 Create Intermediate Flat Hive Table > - Same as before > #2 Extract Fact Table Distinct Columns > - One round MR that scans full input > - Extract distinct values of each column that requires dictionary (same > as before) > - Estimate cuboid size using HyperLogLog > #3 Save Cuboid Statistics > - Saves stats collected in #2 to metadata store > #4 Build Dimension Dictionary > - Same as before > #5 Create HTable > - Create HTable according to stats collected in #2 > #6 Build Cube > - One round MR to calculate whole cube from input. > - Each mapper takes a split and calculate all cuboids in memory. The > output is like a cube segment built from the split. > - Each reducer corresponds to a HTable region, get inputs from mappers > and do a final round of aggregate if mapper key space overlaps. > #7 Step Name: Load HFile to HBase Table > #8 Step Name: Update Cube Info > #9 Step Name: Garbage Collection > > Discussions on why new build is faster. > > - The new algorithm reduces shuffling a lot, because aggregation first > happens in mapper and then the aggregated result is shuffled to reducer. In > the current algorithm, it's the records BEFORE aggregation gets shuffled. > - Only two MR jobs, saves some MR overhead especially if your cluster is > busy. > - Mapper becomes CPU intensive task because it does the in-mem cubing. > - Mapper splits are cut very small, say 10 MB each. Because of cube > expansion, 10 MB input may already yield 2.5 GB output bytes without > compression. > - The new build creates HFile directly in #6, versus in the current build, > cuboids are saved in HDFS first and an additional step is required to > convert HFile. > > Will continue to do more thorough testing on the new algorithm with larger > data sets. > > > > More efficient cube building > > ---------------------------- > > > > Key: KYLIN-607 > > URL: https://issues.apache.org/jira/browse/KYLIN-607 > > Project: Kylin > > Issue Type: New Feature > > Components: Job Engine > > Reporter: liyang > > Assignee: liyang > > Fix For: v0.8.1 > > > > > > Right now cube building is by layer of spanning trees. The algorithm > results a total shuffle size around [Avg Cardinality] * [Total Cube Size]. > This is the current biggest bottleneck of cube building in eBay deployment. > > Propose a different algorithm: > > 1. Each mapper builds a cube segment independent, and output. > > 2. One round of shuffle merge sorts the segments. > > 3. Reducer outputs the final merged cube. > > This could achieve 1 * [Total Cube Size] shuffling when there's a > mandatory dimension and each mapper takes a different piece on the > dimension. E.g. month is mandatory and each mapper is assign a different > month data. > > This algorithm is also more friendly to streaming. > > > > -- > This message was sent by Atlassian JIRA > (v6.3.4#6332) >
