[GitHub] spark pull request #17000: [SPARK-18946][ML] sliceAggregate which is a new a...

ZunwenYou Mon, 20 Feb 2017 06:47:26 -0800

GitHub user ZunwenYou opened a pull request:

    https://github.com/apache/spark/pull/17000


    [SPARK-18946][ML] sliceAggregate which is a new aggregate operator for 
high-dimensional data

    In many machine learning cases, driver has to aggregate high-dimensional 
vectors/arrays from executors.
    TreeAggregate is good solution for aggregating vectors to driver, and you 
can increase depth of tree when data is large.
    However, treeAggregate would still failed, when the parition number of RDD 
and the dimension of vector grows up.
    
    We propose a new operator of RDD, named sliceAggregate, which split the 
vector into **_n_** slices and each slice is assigned a key(from 0 to 
**_n-1_**). The RDD[key, slice] will be transform to RDD[slice] by using 
reduceByKey operator.
    Finally driver will collect and compose the **_n_** slices to obtain result.
    
    ![qq 
20170220214746](https://cloud.githubusercontent.com/assets/4936059/23129541/0f462474-f7be-11e6-997e-e494b98eee69.png)
    
    
    I run an experiment which calculate the statistic values of features.
    The number of samples is 1000. The feature dimension ranges from 10k to 
20m, the comparition of time cost between treeAggregate and sliceAggregate is 
shown as follows. When feature dimension reach 20 million, treeAggregate was 
failed.
    
    The table of time cost(ms) between sliceAggregate and treeAggregate.
    
    | feature dimension |       sliceAggregate |        treeAggregate |
    | ---- | ---- |---- | 
    | 10K |     617 |   607 |
    |100K |     1470  |     967  | 
    |1M    |    4019 |  4731 |
    |2.5M |     7679 |  13348 |
    |5M  |      14722 | 22858 |
    |7.5M |     20821 | 36472 |
    |10M |      28526 | 50184 |
    |20M |       47014 |  | |
     
    
![image](https://cloud.githubusercontent.com/assets/4936059/23129580/32e50260-f7be-11e6-9152-badc6423c356.png)
    
    The code relate to this experiment is 
[here](https://github.com/ZunwenYou/spark/blob/slice-aggregate-experiment/mllib/src/main/scala/org/apache/spark/ml/classification/SliceAggregate.scala)
 .

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ZunwenYou/spark slice-aggregate1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17000.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17000
    
----
commit 0e1ef6fdbc8e89c9eefecdc22d7957959117f747
Author: kevinzwyou <[email protected]>
Date:   2017-02-14T12:38:02Z

    slice aggregte

commit d480c4fc9d0f70e6ff93aeb691ce4afb3f28c4c4
Author: kevinzwyou <[email protected]>
Date:   2017-02-20T11:49:29Z

    remove sample files.

commit a86892b3cf05ba6959491fbd3409c978bb8afe8b
Author: kevinzwyou <[email protected]>
Date:   2017-02-20T11:51:41Z

    add an end line

commit f39cfad04cabd414bb48126fced9560f2e3883c0
Author: kevinzwyou <[email protected]>
Date:   2017-02-20T14:12:20Z

    fix an import style.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request #17000: [SPARK-18946][ML] sliceAggregate which is a new a...

Reply via email to