GitHub user yinxusen opened a pull request:
https://github.com/apache/spark/pull/268
[WIP] [SPARK-1328] Add vector statistics
With the new vector system in MLlib, it is useful to add some new APIs to
process `RDD[Vector]`. Besides, the former implementation of `computeStat` is
not numerically stable: it can lose precision and can produce `NaN` in
scientific computing, as described in
[SPARK-1328](https://spark-project.atlassian.net/browse/SPARK-1328).
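To illustrate the kind of instability meant here, the sketch below compares a one-pass "mean of squares minus squared mean" variance with Welford's online update on plain Scala collections. It is not the code in this PR, and it is an assumption that the former `computeStat` failed in this particular way; the object and method names are hypothetical.

```scala
object StableVarianceSketch {
  // Naive one-pass formula: E[x^2] - (E[x])^2.
  // Subtracting two huge, nearly equal numbers cancels most significant digits.
  def naiveVariance(xs: Seq[Double]): Double = {
    val n = xs.length.toDouble
    val sum = xs.sum
    val sumSq = xs.map(x => x * x).sum
    sumSq / n - (sum / n) * (sum / n)
  }

  // Welford's online update: track the running mean and the running sum of
  // squared deviations from it, so no large intermediate values are subtracted.
  def welfordVariance(xs: Seq[Double]): Double = {
    var n = 0L
    var mean = 0.0
    var m2 = 0.0
    for (x <- xs) {
      n += 1
      val delta = x - mean
      mean += delta / n
      m2 += delta * (x - mean)
    }
    if (n == 0) Double.NaN else m2 / n // population variance
  }

  def main(args: Array[String]): Unit = {
    // Large offset, small spread: the true population variance is 22.5.
    val xs = Seq(1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16)
    println(s"naive   = ${naiveVariance(xs)}")
    println(s"welford = ${welfordVariance(xs)}")
    // If the naive result comes out negative, its square root is NaN.
    println(s"naive sd = ${math.sqrt(naiveVariance(xs))}")
  }
}
```

With inputs like these, the naive result can be far from 22.5 and may even be negative, in which case taking the square root for a standard deviation yields `NaN`.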
The new APIs are:
* rowMeans(): RDD[Double]
* rowNorm2(): RDD[Double]
* rowSDs(): RDD[Double]
* colMeans(): Vector
* colMeans(size: Int): Vector
* colNorm2(): Vector
* colNorm2(size: Int): Vector
* colSDs(): Vector
* colSDs(size: Int): Vector
* maxOption((Vector, Vector) => Boolean): Option[Vector]
* minOption((Vector, Vector) => Boolean): Option[Vector]
* rowShrink(): RDD[Vector]
* colShrink(): RDD[Vector]
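As a rough guide to the intended semantics of a few of these methods, here is a minimal local sketch that uses `Seq[Array[Double]]` as a stand-in for `RDD[Vector]`. The behavior shown is an assumption read off the method names and signatures, not the implementation in this PR. In particular, the `(Vector, Vector) => Boolean` argument is assumed to be a "less than" test, and the `size: Int` overloads and the shrink methods are not covered.

```scala
object VectorStatsSketch {
  // Stand-in for MLlib's Vector (assumption for this sketch only).
  type Vec = Array[Double]

  // Mean of each row: one Double per input vector.
  def rowMeans(rows: Seq[Vec]): Seq[Double] =
    rows.map(r => if (r.isEmpty) Double.NaN else r.sum / r.length)

  // Mean of each column: a single vector with one entry per column.
  def colMeans(rows: Seq[Vec]): Vec = {
    require(rows.nonEmpty, "need at least one row")
    val size = rows.head.length
    val sums = new Array[Double](size)
    for (r <- rows; j <- 0 until size) sums(j) += r(j)
    sums.map(_ / rows.length)
  }

  // Largest vector under a caller-supplied "less than" test; None when empty.
  def maxOption(rows: Seq[Vec])(lt: (Vec, Vec) => Boolean): Option[Vec] =
    rows.reduceOption((a, b) => if (lt(a, b)) b else a)

  def main(args: Array[String]): Unit = {
    val data = Seq(Array(1.0, 2.0, 3.0), Array(4.0, 5.0, 6.0))
    println(rowMeans(data).mkString(", "))
    println(colMeans(data).mkString(", "))
    // Compare vectors by the sum of their entries (an arbitrary choice).
    println(maxOption(data)((a, b) => a.sum < b.sum).map(_.mkString(", ")))
    println(maxOption(Seq.empty[Vec])((a, b) => a.sum < b.sum))
  }
}
```

Returning `Option[Vector]` rather than `Vector` lets the caller handle an empty input without an exception, which matches the `maxOption`/`minOption` naming.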
This is a work in progress, and more APIs will be added to
`LabeledPoint`. Moreover, the implicit declaration will move from `MLUtils` to
`MLContext` later.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/yinxusen/spark vector-statistics
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/268.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #268
----
commit cae6c9e0a9307c9102fddd864f879ef1f11407b2
Author: Xusen Yin <[email protected]>
Date: 2014-03-28T03:40:43Z
add basic statistics
commit 317f2c1e52b3f3eb91ecf685faeb30790045b803
Author: Xusen Yin <[email protected]>
Date: 2014-03-28T10:23:54Z
add new API to shrink RDD[Vector]
commit 6243332579a33fee58aed1cd6c35c525aef5b90c
Author: Xusen Yin <[email protected]>
Date: 2014-03-29T01:25:35Z
fix error of column means
commit 6f07e17f680d4d8e4d190e265f058178b90138d0
Author: Xusen Yin <[email protected]>
Date: 2014-03-29T01:42:39Z
pass all tests
commit 2a5ed37cf5b2b26dbf1c94203e419171d664b86d
Author: Xusen Yin <[email protected]>
Date: 2014-03-29T02:48:44Z
add scala docs and refine shrink method
commit 95dbc6e1288914234bafa43b6ace662847b9242c
Author: Xusen Yin <[email protected]>
Date: 2014-03-29T03:08:33Z
add shrink test
commit ed6fdf836275832e167025d848eaeb28a2538cfa
Author: Xusen Yin <[email protected]>
Date: 2014-03-29T03:40:03Z
refine the code style
----