GitHub user ZunwenYou opened a pull request:
https://github.com/apache/spark/pull/17000
[SPARK-18946][ML] sliceAggregate which is a new aggregate operator for
high-dimensional data
In many machine learning cases, driver has to aggregate high-dimensional
vectors/arrays from executors.
TreeAggregate is good solution for aggregating vectors to driver, and you
can increase depth of tree when data is large.
However, treeAggregate would still failed, when the parition number of RDD
and the dimension of vector grows up.
We propose a new operator of RDD, named sliceAggregate, which split the
vector into **_n_** slices and each slice is assigned a key(from 0 to
**_n-1_**). The RDD[key, slice] will be transform to RDD[slice] by using
reduceByKey operator.
Finally driver will collect and compose the **_n_** slices to obtain result.

I run an experiment which calculate the statistic values of features.
The number of samples is 1000. The feature dimension ranges from 10k to
20m, the comparition of time cost between treeAggregate and sliceAggregate is
shown as follows. When feature dimension reach 20 million, treeAggregate was
failed.
The table of time cost(ms) between sliceAggregate and treeAggregate.
| feature dimension | sliceAggregate | treeAggregate |
| ---- | ---- |---- |
| 10K | 617 | 607 |
|100K | 1470 | 967 |
|1M | 4019 | 4731 |
|2.5M | 7679 | 13348 |
|5M | 14722 | 22858 |
|7.5M | 20821 | 36472 |
|10M | 28526 | 50184 |
|20M | 47014 | | |

The code relate to this experiment is
[here](https://github.com/ZunwenYou/spark/blob/slice-aggregate-experiment/mllib/src/main/scala/org/apache/spark/ml/classification/SliceAggregate.scala)
.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ZunwenYou/spark slice-aggregate1
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/17000.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #17000
----
commit 0e1ef6fdbc8e89c9eefecdc22d7957959117f747
Author: kevinzwyou <[email protected]>
Date: 2017-02-14T12:38:02Z
slice aggregte
commit d480c4fc9d0f70e6ff93aeb691ce4afb3f28c4c4
Author: kevinzwyou <[email protected]>
Date: 2017-02-20T11:49:29Z
remove sample files.
commit a86892b3cf05ba6959491fbd3409c978bb8afe8b
Author: kevinzwyou <[email protected]>
Date: 2017-02-20T11:51:41Z
add an end line
commit f39cfad04cabd414bb48126fced9560f2e3883c0
Author: kevinzwyou <[email protected]>
Date: 2017-02-20T14:12:20Z
fix an import style.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]