GitHub user mengxr opened a pull request:
https://github.com/apache/spark/pull/245
[SPARK-1212, Part II] [WIP] Support sparse data in MLlib
In PR https://github.com/apache/spark/pull/117, we added a dense/sparse
vector data model and updated KMeans to support sparse input. This PR
replaces all other `Array[Double]` usage with `Vector` in generalized linear
models and Naive Bayes. Major changes:
1. `LabeledPoint` becomes `LabeledPoint(Double, Vector)`.
2. Methods that accepted `RDD[Array[Double]]` now accept `RDD[Vector]`. We
cannot cleanly overload both because the two signatures are identical after
type erasure.
3. `createModel` and `predictPoint` are now marked protected because they are
not intended for end users.
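
To illustrate item 2, here is a minimal sketch of the erasure conflict. It is not code from this PR: `Seq` stands in for `RDD` so the snippet runs without Spark, and `Vec`, `DenseVec`, `SparseVec`, and `ErasureSketch` are hypothetical names standing in for the MLlib vector types.

```scala
// Hypothetical stand-ins for illustration only; these are not the MLlib classes.
sealed trait Vec {
  def size: Int
  def toArray: Array[Double]
}

final case class DenseVec(values: Array[Double]) extends Vec {
  def size: Int = values.length
  def toArray: Array[Double] = values.clone()
}

final case class SparseVec(size: Int, indices: Array[Int], values: Array[Double]) extends Vec {
  // Expand to a dense array, leaving unstored entries at 0.0.
  def toArray: Array[Double] = {
    val out = new Array[Double](size)
    var k = 0
    while (k < indices.length) {
      out(indices(k)) = values(k)
      k += 1
    }
    out
  }
}

object ErasureSketch {
  // Declaring both overloads below in the same object does not compile:
  // after erasure both become sum(Seq), which scalac rejects as a
  // double definition.
  //   def sum(data: Seq[Array[Double]]): Double = ...
  //   def sum(data: Seq[Vec]): Double = ...

  // Keeping only the Vec-based method avoids the clash.
  def sum(data: Seq[Vec]): Double = data.map(_.toArray.sum).sum
}
```

Under these assumptions, a caller that still holds arrays wraps them first, for example `ErasureSketch.sum(arrays.map(DenseVec(_)))`.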
TODO:
1. Use axpy when possible.
2. Optimize Naive Bayes.
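
For reference on TODO item 1: axpy is the BLAS level-1 operation `y := a * x + y`. The PR does not say where it will be applied, so the sketch below only shows the operation itself, in a dense and a sparse form. `AxpySketch` and both method names are hypothetical, not MLlib or BLAS APIs.

```scala
// Hypothetical helper for illustration; not part of MLlib.
object AxpySketch {
  /** Dense axpy: y(i) += a * x(i) for every index i. Updates y in place. */
  def axpyDense(a: Double, x: Array[Double], y: Array[Double]): Unit = {
    require(x.length == y.length, "x and y must have the same length")
    var i = 0
    while (i < x.length) {
      y(i) += a * x(i)
      i += 1
    }
  }

  /** Sparse axpy: only the stored entries of x touch y. Updates y in place. */
  def axpySparse(a: Double, indices: Array[Int], values: Array[Double], y: Array[Double]): Unit = {
    require(indices.length == values.length, "indices and values must have the same length")
    var k = 0
    while (k < indices.length) {
      y(indices(k)) += a * values(k)
      k += 1
    }
  }
}
```

The sparse form does work proportional to the number of stored entries rather than the full dimension, which is the usual motivation for preferring axpy-style in-place updates with sparse input.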
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/mengxr/spark vector
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/245.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #245
----
commit 3f346baec3424fd5ec58716dacbb144dd85d2429
Author: Xiangrui Meng <[email protected]>
Date: 2014-03-26T05:47:23Z
update some ml algorithms to use Vector
commit d7f629f902aab81cf3637f07f9eb9f7119d9230c
Author: Xiangrui Meng <[email protected]>
Date: 2014-03-26T06:05:35Z
fix a bug in GLM when intercept is not used
commit 0e57aa43f61a62a70faf27aed58dea201b494809
Author: Xiangrui Meng <[email protected]>
Date: 2014-03-26T18:44:48Z
update Lasso and RidgeRegression to parse the weights correctly from GLM
mark createModel protected
mark predictPoint protected
commit 135ab72f1f715e71a4982fca66eb1556bbb43986
Author: Xiangrui Meng <[email protected]>
Date: 2014-03-26T19:26:57Z
merge glm
commit 834ada23f66e871576ab8e3f38a4929f0c913a12
Author: Xiangrui Meng <[email protected]>
Date: 2014-03-26T20:49:49Z
optimized MLUtils.computeStats
update some ml algorithms to use Vector (cont.)
commit 18597011768fa857747ab809302c6df351d24cb6
Author: Xiangrui Meng <[email protected]>
Date: 2014-03-26T22:10:33Z
passed compile
commit 75c83a4697f17db00eb877b2d9fd741ec708ee23
Author: Xiangrui Meng <[email protected]>
Date: 2014-03-26T22:34:13Z
passed test compile
commit befa5929b2ecb7a7e966d1d88a8e9f94e0234cd8
Author: Xiangrui Meng <[email protected]>
Date: 2014-03-27T00:24:53Z
passed scala/java tests
commit db808a156d1a298597ae4590987d20c984a14e49
Author: Xiangrui Meng <[email protected]>
Date: 2014-03-27T00:50:37Z
update JavaLR example
----