GitHub user jkbradley opened a pull request:
https://github.com/apache/spark/pull/1878
[SPARK-2850] [mllib] MLlib stats examples + small fixes
Added examples for statistical summarization:
* Scala: StatisticalSummary.scala
** Tests: correlation, MultivariateOnlineSummarizer
* Python: statistical_summary.py
** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)
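For readers unfamiliar with the statistic these examples exercise, here is a minimal sketch of Pearson correlation in plain Python. It is not the MLlib implementation and does not use Spark; the function name `pearson` and the choice to return NaN for zero-variance input are assumptions made for illustration only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (illustrative only)."""
    if len(xs) != len(ys):
        raise ValueError("inputs must have the same length")
    n = len(xs)
    if n == 0:
        raise ValueError("inputs must not be empty")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products and squares.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # Assumption: a zero-variance input (for example a single observation)
    # has no defined correlation, so this sketch returns NaN.
    if sxx == 0.0 or syy == 0.0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


perfect = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
degenerate = pearson([5.0], [7.0])
```

The degenerate case matters in practice: with a single observation both variances are zero, so the quotient is undefined and the caller has to decide how to report it.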
Added examples for random and sampled RDDs:
* Scala: RandomAndSampledRDDs.scala
* Python: random_and_sampled_rdds.py
* Both test:
** RandomRDDGenerators.normalRDD, normalVectorRDD
** RDD.sample, takeSample, sampleByKey
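As a rough illustration of what per-key sampling means, the following plain-Python sketch keeps each (key, value) pair independently with a probability chosen for its key. It is not Spark code: `sample_by_key`, its parameters, and the treatment of missing keys as fraction 0.0 are hypothetical, and the real RDD.sampleByKey may differ in detail.

```python
import random


def sample_by_key(pairs, fractions, seed=None):
    """Keep each (key, value) pair with probability fractions[key].

    Hypothetical helper: keys absent from `fractions` are treated as
    fraction 0.0 (an assumption of this sketch).
    """
    rng = random.Random(seed)
    return [(k, v) for k, v in pairs if rng.random() < fractions.get(k, 0.0)]


pairs = [("a", 1), ("a", 2), ("b", 3), ("b", 4)]
# Fraction 1.0 keeps every "a" pair; fraction 0.0 drops every "b" pair.
kept = sample_by_key(pairs, {"a": 1.0, "b": 0.0}, seed=42)
```

Because each pair is an independent coin flip, intermediate fractions give a sample whose per-key size is only approximately fraction times count.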
Added sc.stop() to all examples.
CorrelationSuite.scala
* Added 1 test for RDDs with only 1 value
RowMatrix.scala
* numCols(): Added check for numRows = 0, with error message.
* computeCovariance(): Added check for numRows <= 1, with error message.
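To show why a row-count check is useful before computing a covariance, here is a small plain-Python sketch. It assumes the unbiased estimator that divides by (numRows - 1), which is undefined for a single row; the function name and error wording are hypothetical and are not taken from RowMatrix.scala.

```python
def compute_covariance(rows):
    """Sample covariance matrix of a list of equal-length rows (illustrative)."""
    n = len(rows)
    # Guard first: with n <= 1 the (n - 1) divisor below is zero or negative.
    if n <= 1:
        raise ValueError(
            "Cannot compute the covariance of a matrix with %d row(s)." % n
        )
    m = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    return [
        [
            sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
            for j in range(m)
        ]
        for i in range(m)
    ]


cov = compute_covariance([[1.0, 2.0], [3.0, 6.0]])
```

Failing early with a descriptive message is easier to debug than a division-by-zero error or a matrix of NaN values surfacing later.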
Python SparseVector (pyspark/mllib/linalg.py)
* Added toDense() function
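A sparse-to-dense conversion can be sketched as follows. The class `SimpleSparseVector` and its `to_dense` method are hypothetical stand-ins written for this note, not the code in pyspark/mllib/linalg.py, and the return type here (a plain list) is an assumption.

```python
class SimpleSparseVector:
    """Hypothetical sparse vector: a size plus parallel index/value lists."""

    def __init__(self, size, indices, values):
        if len(indices) != len(values):
            raise ValueError("indices and values must have the same length")
        self.size = size
        self.indices = list(indices)
        self.values = list(values)

    def to_dense(self):
        """Return a dense list, with zeros wherever no value is stored."""
        dense = [0.0] * self.size
        for i, v in zip(self.indices, self.values):
            dense[i] = float(v)
        return dense


sv = SimpleSparseVector(4, [1, 3], [1.0, 5.5])
dense = sv.to_dense()
```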
python/run-tests script
* Added stat.py (doc test)
CC: @mengxr @dorx Main changes were examples to show usage across APIs.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jkbradley/spark mllib-stats-api-check
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1878.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1878
----
commit ee918e9e165a02dc55235877484502baaaf906e0
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-07T21:34:11Z
Added examples for statistical summarization:
* Scala: StatisticalSummary.scala
** Tests: correlation, MultivariateOnlineSummarizer
* Python: statistical_summary.py
** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)
Added sc.stop() to all examples.
CorrelationSuite.scala
* Added 1 test for RDDs with only 1 value
Python SparseVector (pyspark/mllib/linalg.py)
* Added toDense() function
python/run-tests script
* Added stat.py (doc test)
commit 064985bd59b854bbca70290256348177415b5bda
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-07T23:34:38Z
Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
commit 8195c78a312087ee18375b745600946e47fcdd46
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-08T01:42:52Z
Added examples for random and sampled RDDs:
* Scala: RandomAndSampledRDDs.scala
* Python: random_and_sampled_rdds.py
* Both test:
** RandomRDDGenerators.normalRDD, normalVectorRDD
** RDD.sample, takeSample, sampleByKey
commit 65e4ebc8c07c7fb4bf76f80c11b28f790362533e
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-10T17:36:10Z
Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
commit ab48f6eb01541309ffa2d86febb0a039f435a60a
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-10T18:26:03Z
RowMatrix.scala
* numCols(): Added check for numRows = 0, with error message.
* computeCovariance(): Added check for numRows <= 1, with error message.
----