GitHub user jkbradley opened a pull request:

    https://github.com/apache/spark/pull/1878

    [SPARK-2850] [mllib] MLlib stats examples + small fixes

    Added examples for statistical summarization:
    * Scala: StatisticalSummary.scala
    ** Tests: correlation, MultivariateOnlineSummarizer
    * Python: statistical_summary.py
    ** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)
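
For readers unfamiliar with online summarization, the following is a minimal pure-Python sketch of a column-wise running mean and variance (Welford's update). The class `OnlineSummarizer` and its methods are hypothetical names chosen for illustration. It is not the MLlib MultivariateOnlineSummarizer and makes no claim about how that class is implemented. The choice of the sample variance (dividing by n - 1) is an assumption of this sketch.

```python
from typing import List, Sequence


class OnlineSummarizer:
    """Hypothetical column-wise running summary (illustration only, not MLlib)."""

    def __init__(self, num_cols: int) -> None:
        self.count = 0
        self.mean = [0.0] * num_cols
        self.m2 = [0.0] * num_cols  # running sum of squared deviations per column

    def add(self, row: Sequence[float]) -> "OnlineSummarizer":
        """Fold one row into the running statistics (Welford's update)."""
        if len(row) != len(self.mean):
            raise ValueError("row length does not match number of columns")
        self.count += 1
        for j, x in enumerate(row):
            delta = x - self.mean[j]
            self.mean[j] += delta / self.count
            self.m2[j] += delta * (x - self.mean[j])
        return self

    def variance(self) -> List[float]:
        """Sample variance per column; assumed to require at least 2 rows."""
        if self.count < 2:
            raise ValueError("variance needs at least 2 rows")
        return [m / (self.count - 1) for m in self.m2]


summary = OnlineSummarizer(num_cols=2)
for row in [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]:
    summary.add(row)
```

Each row is seen once and only the count, mean, and squared-deviation totals are kept, which is why this style of summary suits data that arrives in a single pass.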
    
    Added examples for random and sampled RDDs:
    * Scala: RandomAndSampledRDDs.scala
    * Python: random_and_sampled_rdds.py
    * Both test:
    ** RandomRDDGenerators.normalRDD, normalVectorRDD
    ** RDD.sample, takeSample, sampleByKey
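
As a rough illustration of what per-key sampling means, here is a minimal pure-Python sketch that keeps each (key, value) pair with a probability looked up by key. The function `sample_by_key` is a hypothetical stand-in written for this note. It samples without replacement using an independent coin flip per element, and it is not Spark's RDD.sampleByKey implementation.

```python
import random
from typing import Dict, Hashable, Iterable, List, Tuple


def sample_by_key(
    pairs: Iterable[Tuple[Hashable, object]],
    fractions: Dict[Hashable, float],
    seed: int,
) -> List[Tuple[Hashable, object]]:
    """Keep each pair with probability fractions[key] (hypothetical helper)."""
    rng = random.Random(seed)  # seeded so repeated runs give the same sample
    return [(k, v) for k, v in pairs if rng.random() < fractions[k]]


data = [("a", i) for i in range(100)] + [("b", i) for i in range(100)]
# Keep every "a" element and drop every "b" element.
sampled = sample_by_key(data, {"a": 1.0, "b": 0.0}, seed=42)
```

With fractions strictly between 0 and 1 the sample size per key is only approximately fraction times the key's count, since each element is drawn independently.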
    
    Added sc.stop() to all examples.
    
    CorrelationSuite.scala
    * Added 1 test for RDDs with only 1 value
    
    RowMatrix.scala
    * numCols(): Added check for numRows = 0, with error message.
    * computeCovariance(): Added check for numRows <= 1, with error message.
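
The reason for the numRows <= 1 check can be sketched in a few lines: the sample covariance divides by n - 1, so it is undefined for a single row. The function below is a hypothetical pure-Python illustration on a list of rows. It is not the RowMatrix code, and the error message text is invented for this sketch.

```python
from typing import List, Sequence


def covariance(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Sample covariance matrix of the columns (hypothetical helper)."""
    n = len(rows)
    if n <= 1:
        # Dividing by n - 1 below would fail or be meaningless, so fail early.
        raise ValueError(f"cannot compute covariance with only {n} row(s)")
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [
        [
            sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
            for j in range(d)
        ]
        for i in range(d)
    ]


cov = covariance([[1.0, 2.0], [3.0, 6.0]])
```

Failing with an explicit message is the design choice being illustrated: a clear error is easier to diagnose than a division by zero or a matrix of NaN values.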
    
    Python SparseVector (pyspark/mllib/linalg.py)
    * Added toDense() function
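
A sparse-to-dense conversion can be sketched as follows. The function `to_dense` and its (size, indices, values) arguments are assumptions made for illustration. This is not the code added to pyspark/mllib/linalg.py, and it returns a plain Python list rather than whatever array type the real method returns.

```python
from typing import List, Sequence


def to_dense(size: int, indices: Sequence[int], values: Sequence[float]) -> List[float]:
    """Expand a sparse (indices, values) representation into a full list."""
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    dense = [0.0] * size  # entries not listed in indices stay zero
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


dense = to_dense(4, [1, 3], [1.0, 5.5])
```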
    
    python/run-tests script
    * Added stat.py (doc test)
    
    CC: @mengxr @dorx. The main changes are examples that show usage across the APIs.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark mllib-stats-api-check

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1878.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1878
    
----
commit ee918e9e165a02dc55235877484502baaaf906e0
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-07T21:34:11Z

    Added examples for statistical summarization:
    * Scala: StatisticalSummary.scala
    ** Tests: correlation, MultivariateOnlineSummarizer
    * python: statistical_summary.py
    ** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)
    
    Added sc.stop() to all examples.
    
    CorrelationSuite.scala
    * Added 1 test for RDDs with only 1 value
    
    Python SparseVector (pyspark/mllib/linalg.py)
    * Added toDense() function
    
    python/run-tests script
    * Added stat.py (doc test)

commit 064985bd59b854bbca70290256348177415b5bda
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-07T23:34:38Z

    Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check

commit 8195c78a312087ee18375b745600946e47fcdd46
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-08T01:42:52Z

    Added examples for random and sampled RDDs:
    * Scala: RandomAndSampledRDDs.scala
    * python: random_and_sampled_rdds.py
    * Both test:
    ** RandomRDDGenerators.normalRDD, normalVectorRDD
    ** RDD.sample, takeSample, sampleByKey

commit 65e4ebc8c07c7fb4bf76f80c11b28f790362533e
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-10T17:36:10Z

    Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check

commit ab48f6eb01541309ffa2d86febb0a039f435a60a
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-10T18:26:03Z

    RowMatrix.scala
    * numCols(): Added check for numRows = 0, with error message.
    * computeCovariance(): Added check for numRows <= 1, with error message.

----



