[ https://issues.apache.org/jira/browse/KYLIN-607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609904#comment-14609904 ]

liyang commented on KYLIN-607:
------------------------------

We've got good test results for the new cubing algorithm: ~50% reduced build 
time according to initial tests. The implementation is on branch 0.8.
- Test 1: 128 MB input, cube size 3.3 GB, expansion rate 26; build time reduced 
from 112 min to 49 min.
- Test 2: 50 GB input, cube size 1.6 GB, expansion rate 0.03; build time 
reduced from 75 min to 45 min.

The new cube build has the following steps.

#1 Create Intermediate Flat Hive Table
   - Same as before
#2 Extract Fact Table Distinct Columns
   - One round of MR that scans the full input
   - Extract distinct values of each column that requires a dictionary (same as 
before)
   - Estimate cuboid sizes using HyperLogLog (see the sketch after this step list)
#3 Save Cuboid Statistics
   - Saves the stats collected in #2 to the metadata store
#4 Build Dimension Dictionary
   - Same as before
#5 Create HTable
   - Create the HTable according to the stats collected in #2
#6 Build Cube
   - One round of MR to calculate the whole cube from the input.
   - Each mapper takes a split and calculates all cuboids in memory. The output 
is like a cube segment built from the split.
   - Each reducer corresponds to an HTable region; it gets input from the mappers 
and does a final round of aggregation where mapper key spaces overlap.
#7 Load HFile to HBase Table
#8 Update Cube Info
#9 Garbage Collection
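
For illustration only, here is a minimal sketch (not Kylin's actual code; class 
and method names are hypothetical) of the estimation idea behind #2/#3: for each 
cuboid, count the distinct dimension-value combinations seen in the input and use 
that as the cuboid's estimated row count. A real implementation would feed the 
hash into a HyperLogLog counter so memory stays bounded regardless of 
cardinality; a HashSet of hashes stands in here purely to keep the sketch short.

{code:java}
import java.util.*;

public class CuboidSizeEstimator {
    // One "counter" per cuboid, keyed by the cuboid's dimension bitmask.
    // A real implementation would keep a HyperLogLog counter per cuboid
    // instead of a HashSet of hashes.
    private final Map<Long, Set<Integer>> counters = new HashMap<>();

    /** Feed one input row; dims[i] is the value of dimension i. */
    public void addRow(String[] dims, long[] cuboidMasks) {
        for (long mask : cuboidMasks) {
            // Build the dimension-value combination for this cuboid.
            StringBuilder key = new StringBuilder();
            for (int i = 0; i < dims.length; i++) {
                if ((mask & (1L << i)) != 0) {
                    key.append(dims[i]).append('\u0001');
                }
            }
            // HyperLogLog would simply add this hash to the counter.
            counters.computeIfAbsent(mask, m -> new HashSet<>())
                    .add(key.toString().hashCode());
        }
    }

    /** Estimated distinct rows (and hence size) of one cuboid. */
    public long estimatedRows(long cuboidMask) {
        Set<Integer> c = counters.get(cuboidMask);
        return c == null ? 0 : c.size();
    }
}
{code}

The stats saved in #3 are then enough to size HTable regions in #5 before the 
cube is actually built.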

Discussion of why the new build is faster.

- The new algorithm greatly reduces shuffling, because aggregation first happens 
in the mapper and only the aggregated result is shuffled to the reducer. In the 
current algorithm, it is the records BEFORE aggregation that get shuffled. (See 
the mapper/reducer sketch after this list.)
- Only two MR jobs, which saves some MR overhead, especially if your cluster is busy.
- The mapper becomes a CPU-intensive task because it does the in-memory cubing.
- Mapper splits are cut very small, say 10 MB each, because of cube expansion: 
10 MB of input may already yield 2.5 GB of output bytes without compression.
- The new build creates HFiles directly in #6, whereas in the current build, 
cuboids are saved to HDFS first and an additional step is required to convert 
them to HFiles.
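
To make the first point concrete, below is a hedged sketch of the mapper-side 
aggregation pattern in plain Hadoop MR. The class names and the tab-separated 
"<cuboidKey>\t<measure>" input format are hypothetical, and Kylin's real mapper 
builds all cuboids of the split with an in-memory cubing engine rather than a 
single sum; the point is only that aggregation happens before the shuffle, so 
the reducer merely merges partial results where mapper key spaces overlap.

{code:java}
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Mapper: aggregates all records of its split in memory, then emits one
 *  already-aggregated record per key in cleanup(). Only the aggregated
 *  result is shuffled. */
public class InMemAggMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private final Map<String, Long> aggregated = new HashMap<>();

    @Override
    protected void map(LongWritable offset, Text line, Context context) {
        // hypothetical input format: <cuboidKey>\t<measure>
        String[] parts = line.toString().split("\t");
        aggregated.merge(parts[0], Long.parseLong(parts[1]), Long::sum);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit the pre-aggregated result of the whole split.
        for (Map.Entry<String, Long> e : aggregated.entrySet()) {
            context.write(new Text(e.getKey()), new LongWritable(e.getValue()));
        }
    }
}

/** Reducer: final round of aggregation, needed only where the key spaces
 *  of different mappers overlap. */
class FinalAggReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> partials, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable p : partials) {
            sum += p.get();
        }
        context.write(key, new LongWritable(sum));
    }
}
{code}

By contrast, the current layer-by-layer build shuffles the un-aggregated records, 
which is where most of the I/O goes.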

Will continue to do more thorough testing of the new algorithm with larger data 
sets.


> More efficient cube building
> ----------------------------
>
>                 Key: KYLIN-607
>                 URL: https://issues.apache.org/jira/browse/KYLIN-607
>             Project: Kylin
>          Issue Type: New Feature
>          Components: Job Engine
>            Reporter: liyang
>            Assignee: liyang
>             Fix For: v0.8.1
>
>
> Right now cube building is done layer by layer along the cuboid spanning tree. 
> The algorithm results in a total shuffle size of around [Avg Cardinality] * 
> [Total Cube Size]. This is currently the biggest bottleneck of cube building in 
> the eBay deployment.
> Propose a different algorithm:
> 1. Each mapper builds a cube segment independently, and outputs it.
> 2. One round of shuffle merge-sorts the segments.
> 3. The reducer outputs the final merged cube.
> This could achieve 1 * [Total Cube Size] shuffling when there's a mandatory 
> dimension and each mapper takes a different piece of that dimension, e.g. month 
> is mandatory and each mapper is assigned a different month's data.
> This algorithm is also more friendly to streaming.



