Re: [jira] [Commented] (KYLIN-607) More efficient cube building

Luke Han Wed, 01 Jul 2015 07:13:53 -0700

Awesome result, saved a lot time and computing resources.


Best Regards!
---------------------

Luke Han

On Wed, Jul 1, 2015 at 6:38 PM, liyang (JIRA) <[email protected]> wrote:

>
>     [
> https://issues.apache.org/jira/browse/KYLIN-607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609904#comment-14609904
> ]
>
> liyang commented on KYLIN-607:
> ------------------------------
>
> We've got good test result about the new cubing algorithm, ~50% reduced
> build time according to initial test. The implementation is on branch 0.8.
> - Test 1, 128 MB input, cube size 3.3 GB, expansion rate 26, build time
> reduced from 112 min to 49 min.
> - Test 2, 50 GB input, cube size 1.6 GB, expansion rate 0.03, build time
> reduced from 75 min to 45 min.
>
> The new cube build has following steps.
>
> #1 Create Intermediate Flat Hive Table
>    - Same as before
> #2 Extract Fact Table Distinct Columns
>    - One round MR that scans full input
>    - Extract distinct values of each column that requires dictionary (same
> as before)
>    - Estimate cuboid size using HyperLogLog
> #3 Save Cuboid Statistics
>    - Saves stats collected in #2 to metadata store
> #4 Build Dimension Dictionary
>    - Same as before
> #5 Create HTable
>    - Create HTable according to stats collected in #2
> #6 Build Cube
>    - One round MR to calculate whole cube from input.
>    - Each mapper takes a split and calculate all cuboids in memory. The
> output is like a cube segment built from the split.
>    - Each reducer corresponds to a HTable region, get inputs from mappers
> and do a final round of aggregate if mapper key space overlaps.
> #7 Step Name: Load HFile to HBase Table
> #8 Step Name: Update Cube Info
> #9 Step Name: Garbage Collection
>
> Discussions on why new build is faster.
>
> - The new algorithm reduces shuffling a lot, because aggregation first
> happens in mapper and then the aggregated result is shuffled to reducer. In
> the current algorithm, it's the records BEFORE aggregation gets shuffled.
> - Only two MR jobs, saves some MR overhead especially if your cluster is
> busy.
> - Mapper becomes CPU intensive task because it does the in-mem cubing.
> - Mapper splits are cut very small, say 10 MB each. Because of cube
> expansion, 10 MB input may already yield 2.5 GB output bytes without
> compression.
> - The new build creates HFile directly in #6, versus in the current build,
> cuboids are saved in HDFS first and an additional step is required to
> convert HFile.
>
> Will continue to do more thorough testing on the new algorithm with larger
> data sets.
>
>
> > More efficient cube building
> > ----------------------------
> >
> >                 Key: KYLIN-607
> >                 URL: https://issues.apache.org/jira/browse/KYLIN-607
> >             Project: Kylin
> >          Issue Type: New Feature
> >          Components: Job Engine
> >            Reporter: liyang
> >            Assignee: liyang
> >             Fix For: v0.8.1
> >
> >
> > Right now cube building is by layer of spanning trees. The algorithm
> results a total shuffle size around [Avg Cardinality] * [Total Cube Size].
> This is the current biggest bottleneck of cube building in eBay deployment.
> > Propose a different algorithm:
> > 1. Each mapper builds a cube segment independent, and output.
> > 2. One round of shuffle merge sorts the segments.
> > 3. Reducer outputs the final merged cube.
> > This could achieve 1 * [Total Cube Size] shuffling when there's a
> mandatory dimension and each mapper takes a different piece on the
> dimension. E.g. month is mandatory and each mapper is assign a different
> month data.
> > This algorithm is also more friendly to streaming.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>

Re: [jira] [Commented] (KYLIN-607) More efficient cube building

Reply via email to