GitHub user yinxusen opened a pull request:

    https://github.com/apache/spark/pull/268

    [WIP] [SPARK-1328] Add vector statistics

    With the new vector system in MLlib, it is useful to add some 
new APIs to process `RDD[Vector]`. In addition, the former implementation of 
`computeStat` is not numerically stable: it can lose precision and may 
produce `NaN` in scientific computing, as described in 
[SPARK-1328](https://spark-project.atlassian.net/browse/SPARK-1328).
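
For illustration only, a minimal sketch of one well-known way to avoid this kind of precision loss: a one-pass mean/variance accumulator using Welford's update, with a merge step for combining partial results. The `RunningStat` name and its methods are hypothetical and do not come from this patch. Whether the instability in `computeStat` has the same cause is an assumption; the sketch only shows the general technique.

```scala
// Hypothetical helper, not part of the pull request: a one-pass,
// numerically stable mean/variance accumulator (Welford's update).
final case class RunningStat(n: Long = 0L, mean: Double = 0.0, m2: Double = 0.0) {
  // Fold one more observation into the running mean and the
  // running sum of squared deviations from the mean (m2).
  def add(x: Double): RunningStat = {
    val n1 = n + 1
    val delta = x - mean
    val mean1 = mean + delta / n1
    RunningStat(n1, mean1, m2 + delta * (x - mean1))
  }

  // Combine two partial results, e.g. computed on two partitions.
  def merge(that: RunningStat): RunningStat =
    if (n == 0L) that
    else if (that.n == 0L) this
    else {
      val total = n + that.n
      val delta = that.mean - mean
      RunningStat(
        total,
        mean + delta * that.n / total,
        m2 + that.m2 + delta * delta * n * that.n / total)
    }

  // Population variance; defined as 0.0 for an empty accumulator.
  def variance: Double = if (n == 0L) 0.0 else m2 / n
}

object RunningStat {
  // Accumulate a whole local sequence, one value at a time.
  def of(xs: Seq[Double]): RunningStat = xs.foldLeft(RunningStat())(_ add _)
}
```

Because the variance is built from squared deviations rather than from the difference of two large sums, it does not go negative through cancellation, so taking its square root for a standard deviation does not yield `NaN`.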
    
    The new APIs are:
    
    * rowMeans(): RDD[Double]
    * rowNorm2(): RDD[Double]
    * rowSDs(): RDD[Double]
    * colMeans(): Vector
    * colMeans(size: Int): Vector
    * colNorm2(): Vector
    * colNorm2(size: Int): Vector
    * colSDs(): Vector
    * colSDs(size: Int): Vector
    * maxOption((Vector, Vector) => Boolean): Option[Vector]
    * minOption((Vector, Vector) => Boolean): Option[Vector]
    * rowShrink(): RDD[Vector]
    * colShrink(): RDD[Vector]
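
To make the row/column distinction concrete, here is a rough sketch on local collections. It assumes `Seq[Array[Double]]` as a stand-in for `RDD[Vector]`, and the semantics are inferred from the method names rather than taken from the patch. The `VectorStatsSketch` object is hypothetical.

```scala
// Illustrative only: local collections stand in for RDD[Vector], and the
// behaviour is inferred from the method names, not read from the patch.
object VectorStatsSketch {
  type Row = Array[Double]

  // One mean per row.
  def rowMeans(rows: Seq[Row]): Seq[Double] =
    rows.map(r => r.sum / r.length)

  // One Euclidean (L2) norm per row.
  def rowNorm2(rows: Seq[Row]): Seq[Double] =
    rows.map(r => math.sqrt(r.map(x => x * x).sum))

  // Element-wise sum of two rows of equal length.
  private def plus(a: Row, b: Row): Row =
    a.zip(b).map { case (x, y) => x + y }

  // One mean per column, returned as a single row-shaped array.
  def colMeans(rows: Seq[Row]): Row = {
    require(rows.nonEmpty, "need at least one row")
    rows.reduce(plus).map(_ / rows.length)
  }

  // One Euclidean (L2) norm per column.
  def colNorm2(rows: Seq[Row]): Row = {
    require(rows.nonEmpty, "need at least one row")
    rows.map(_.map(x => x * x)).reduce(plus).map(x => math.sqrt(x))
  }
}
```

The row-wise methods return one value per input vector, while the column-wise methods aggregate across all vectors into a single result, which matches the `RDD[Double]` versus `Vector` return types listed above.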
    
    This is a work in progress, and more APIs will be added to 
`LabeledPoint`. Moreover, the implicit declaration will later move from `MLUtils` to 
`MLContext`.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yinxusen/spark vector-statistics

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/268.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #268
    
----
commit cae6c9e0a9307c9102fddd864f879ef1f11407b2
Author: Xusen Yin <[email protected]>
Date:   2014-03-28T03:40:43Z

    add basic statistics

commit 317f2c1e52b3f3eb91ecf685faeb30790045b803
Author: Xusen Yin <[email protected]>
Date:   2014-03-28T10:23:54Z

    add new API to shrink RDD[Vector]

commit 6243332579a33fee58aed1cd6c35c525aef5b90c
Author: Xusen Yin <[email protected]>
Date:   2014-03-29T01:25:35Z

    fix error of column means

commit 6f07e17f680d4d8e4d190e265f058178b90138d0
Author: Xusen Yin <[email protected]>
Date:   2014-03-29T01:42:39Z

    pass all tests

commit 2a5ed37cf5b2b26dbf1c94203e419171d664b86d
Author: Xusen Yin <[email protected]>
Date:   2014-03-29T02:48:44Z

    add scala docs and refine shrink method

commit 95dbc6e1288914234bafa43b6ace662847b9242c
Author: Xusen Yin <[email protected]>
Date:   2014-03-29T03:08:33Z

    add shrink test

commit ed6fdf836275832e167025d848eaeb28a2538cfa
Author: Xusen Yin <[email protected]>
Date:   2014-03-29T03:40:03Z

    refine the code style

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
