Github user mengxr commented on a diff in the pull request:
https://github.com/apache/spark/pull/2123#discussion_r16695698
--- Diff: docs/mllib-stats.md ---
@@ -99,69 +99,336 @@ v = u.map(lambda x: 1.0 + 2.0 * x)
</div>
-## Stratified Sampling
+## Correlation Calculation
+
+Calculating the correlation between two series of data is a common operation in Statistics. In MLlib
+we provide the flexibility to calculate correlation between many series. The supported correlation
+methods are currently Pearson's and Spearman's correlation.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to
+calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or
+an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
+
+{% highlight scala %}
+import org.apache.spark.SparkContext
+import org.apache.spark.mllib.linalg._
+import org.apache.spark.mllib.stat.Statistics
+
+val sc: SparkContext = ...
+
+val seriesX: RDD[Double] = ... // a series
+val seriesY: RDD[Double] = ... // must have the same number of partitions and cardinality as seriesX
+
+// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a
+// method is not specified, Pearson's method will be used by default.
+val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
+
+val data: RDD[Vector] = ... // note that each Vector is a row and not a column
+
+// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
+// If a method is not specified, Pearson's method will be used by default.
+val correlMatrix: Matrix = Statistics.corr(data, "pearson")
+
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to
+calculate correlations between series. Depending on the type of input, two `JavaDoubleRDD`s or
+a `JavaRDD<Vector>`, the output will be a `Double` or the correlation `Matrix` respectively.
+
+{% highlight java %}
+import org.apache.spark.api.java.JavaDoubleRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.mllib.linalg.*;
+import org.apache.spark.mllib.stat.Statistics;
+
+JavaSparkContext jsc = ...
+
+JavaDoubleRDD seriesX = ... // a series
+JavaDoubleRDD seriesY = ... // must have the same number of partitions and cardinality as seriesX
+
+// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a
+// method is not specified, Pearson's method will be used by default.
+Double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), "pearson");
+
+JavaRDD<Vector> data = ... // note that each Vector is a row and not a column
+
+// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
+// If a method is not specified, Pearson's method will be used by default.
+Matrix correlMatrix = Statistics.corr(data.rdd(), "pearson");
+
+{% endhighlight %}
+</div>
+
+<div data-lang="python" markdown="1">
+[`Statistics`](api/python/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to
--- End diff --
The Python link doesn't work on my computer.
`api/python/pyspark.mllib.stat.Statistics-class.html` works. Could you verify
the link on your computer?
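
For reference while reading the examples in this diff, here is a rough plain-Python sketch of what the two supported methods compute. It is not MLlib's implementation and does not use Spark. The helper names `pearson`, `average_ranks`, and `spearman` are made up for illustration, and the tie handling (average ranks) is an assumption about the usual definition rather than something stated in the diff.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    if len(xs) != len(ys):
        raise ValueError("series must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to the end of the run of values equal to values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

On a monotonic but non-linear pair such as `[1, 2, 3, 4, 5]` and its squares, `spearman` gives 1 while `pearson` is below 1, which is the practical difference between the two `method` strings.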