Github user mengxr commented on a diff in the pull request:
https://github.com/apache/spark/pull/2123#discussion_r16695722
--- Diff: docs/mllib-stats.md ---
@@ -99,69 +99,336 @@ v = u.map(lambda x: 1.0 + 2.0 * x)
</div>
-## Stratified Sampling
+## Correlation Calculation
+
+Calculating the correlation between two series of data is a common operation in Statistics. MLlib
+provides the flexibility to calculate pairwise correlations among many series. The supported
+correlation methods are currently Pearson's and Spearman's correlation.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to
+calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or
+an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
+
+{% highlight scala %}
+import org.apache.spark.SparkContext
+import org.apache.spark.mllib.linalg._
+import org.apache.spark.mllib.stat.Statistics
+import org.apache.spark.rdd.RDD
+
+val sc: SparkContext = ...
+
+val seriesX: RDD[Double] = ... // a series
+val seriesY: RDD[Double] = ... // must have the same number of partitions and cardinality as seriesX
+
+// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a
+// method is not specified, Pearson's method will be used by default.
+val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
+
+val data: RDD[Vector] = ... // note that each Vector is a row and not a column
+
+// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
+// If a method is not specified, Pearson's method will be used by default.
+val correlMatrix: Matrix = Statistics.corr(data, "pearson")
+
+{% endhighlight %}
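+
+As a minimal, hypothetical instantiation of the placeholders above (the data values are made up
+purely for illustration, assuming the imports and `sc` from the example):
+
+{% highlight scala %}
+val seriesX = sc.parallelize(Array(1.0, 2.0, 3.0, 4.0, 5.0))
+val seriesY = sc.parallelize(Array(11.0, 22.0, 33.0, 44.0, 55.0)) // perfectly linear in seriesX
+
+println(Statistics.corr(seriesX, seriesY, "pearson")) // ~1.0, since the series are linearly related
+
+val data = sc.parallelize(Seq(
+  Vectors.dense(1.0, 10.0, 100.0),
+  Vectors.dense(2.0, 20.0, 200.0),
+  Vectors.dense(5.0, 33.0, 366.0))) // each Vector is one row
+
+println(Statistics.corr(data, "pearson")) // a 3x3 correlation Matrix
+{% endhighlight %}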
+</div>
+
+<div data-lang="java" markdown="1">
+[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to
+calculate correlations between series. Depending on the type of input, two `JavaDoubleRDD`s or
+a `JavaRDD<Vector>`, the output will be a `Double` or the correlation `Matrix` respectively.
+
+{% highlight java %}
+import org.apache.spark.api.java.JavaDoubleRDD;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.mllib.linalg.*;
+import org.apache.spark.mllib.stat.Statistics;
+
+JavaSparkContext jsc = ...
+
+JavaDoubleRDD seriesX = ... // a series
+JavaDoubleRDD seriesY = ... // must have the same number of partitions and cardinality as seriesX
+
+// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a
+// method is not specified, Pearson's method will be used by default.
+Double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), "pearson");
+
+JavaRDD<Vector> data = ... // note that each Vector is a row and not a column
+
+// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
+// If a method is not specified, Pearson's method will be used by default.
+Matrix correlMatrix = Statistics.corr(data.rdd(), "pearson");
+
+{% endhighlight %}
+</div>
+
+<div data-lang="python" markdown="1">
+[`Statistics`](api/python/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to
+calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or
+an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
+
+Support for `RowMatrix` operations in Python does not currently exist, but will be added in future
+releases.
+
+{% highlight python %}
+from pyspark.mllib.stat import Statistics
+
+sc = ... # SparkContext
+
+seriesX = ... # a series
+seriesY = ... # must have the same number of partitions and cardinality as seriesX
+
+# Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a
+# method is not specified, Pearson's method will be used by default.
+print Statistics.corr(seriesX, seriesY, method="pearson")
+
+data = ... # an RDD of Vectors
+# calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
+# If a method is not specified, Pearson's method will be used by default.
+print Statistics.corr(data, method="pearson")
+
+{% endhighlight %}
+</div>
+
+</div>
+
+## Stratified Sampling
+
+Unlike the other statistics functions, which reside in MLlib, the stratified sampling methods
+`sampleByKey` and `sampleByKeyExact` can be accessed in `PairRDDFunctions` in core, as stratified
+sampling is tightly coupled with the pair RDD data type and the function signature conforms to the
+other *ByKey* methods in `PairRDDFunctions`. A separate method for exact sample size support exists
+as it requires significantly more resources than the per-stratum simple random sampling used in
+`sampleByKey`.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+[`sampleByKeyExact()`](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions) allows users to
+sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired
+fraction for key $k$, and $n_k$ is the number of key-value pairs for key $k$.
+Sampling without replacement requires one additional pass over the RDD to guarantee sample
+size, whereas sampling with replacement requires two additional passes.
+
+{% highlight scala %}
+import org.apache.spark.SparkContext
+import org.apache.spark.rdd.PairRDDFunctions
+
+val sc: SparkContext = ...
+
+val data = ... // an RDD[(K,V)] of any key value pairs
+val fractions: Map[K, Double] = ... // specify the exact fraction desired from each key
+
+// Get an exact sample from each stratum
+val sample = data.sampleByKeyExact(withReplacement=false, fractions)
+
+{% endhighlight %}
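+
+For comparison, a minimal sketch of the approximate variant `sampleByKey`, which samples each
+stratum in a single pass without the extra passes needed to guarantee an exact sample size (the
+keys and fractions below are made up for illustration):
+
+{% highlight scala %}
+val approxData = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3), ("b", 4)))
+val approxFractions = Map("a" -> 0.5, "b" -> 0.5) // desired fraction per key
+
+// each stratum is sampled independently; per-key sample sizes are only approximate
+val approxSample = approxData.sampleByKey(withReplacement = false, approxFractions)
+{% endhighlight %}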
+</div>
+
+<div data-lang="java" markdown="1">
+[`sampleByKeyExact()`](api/java/org/apache/spark/api/java/JavaPairRDD.html) allows users to
+sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired
+fraction for key $k$, and $n_k$ is the number of key-value pairs for key $k$.
+Sampling without replacement requires one additional pass over the RDD to guarantee sample
+size, whereas sampling with replacement requires two additional passes.
+
+{% highlight java %}
+import org.apache.spark.api.java.JavaPairRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+
+JavaSparkContext jsc = ...
+
+JavaPairRDD<K,V> data = ... // an RDD of any key value pairs
+java.util.Map<K,Object> fractions = ... // specify the exact fraction desired from each key
+
+// Get an exact sample from each stratum
+JavaPairRDD<K,V> sample = data.sampleByKeyExact(false, fractions);
+
+{% endhighlight %}
+</div>
+
+</div>
## Summary Statistics
### Multivariate summary statistics
-We provide column summary statistics for `RowMatrix` (note: this functionality is not currently supported in `IndexedRowMatrix` or `CoordinateMatrix`).
-If the number of columns is not large, e.g., on the order of thousands, then the
-covariance matrix can also be computed as a local matrix, which requires $\mathcal{O}(n^2)$ storage where $n$ is the
-number of columns. The total CPU time is $\mathcal{O}(m n^2)$, where $m$ is the number of rows,
-and is faster if the rows are sparse.
+We provide column summary statistics for `RDD[Vector]` through the static function `colStats`
+available in `Statistics`.
<div class="codetabs">
<div data-lang="scala" markdown="1">
-[`computeColumnSummaryStatistics()`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) returns an instance of
+[`colStats()`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) returns an instance of
[`MultivariateStatisticalSummary`](api/scala/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary),
which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
total count.
{% highlight scala %}
-import org.apache.spark.mllib.linalg.Matrix
-import org.apache.spark.mllib.linalg.distributed.RowMatrix
-import org.apache.spark.mllib.stat.MultivariateStatisticalSummary
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
+import org.apache.spark.rdd.RDD
-val mat: RowMatrix = ... // a RowMatrix
+val mat: RDD[Vector] = ... // an RDD of Vectors
// Compute column summary statistics.
-val summary: MultivariateStatisticalSummary = mat.computeColumnSummaryStatistics()
+val summary: MultivariateStatisticalSummary = Statistics.colStats(mat)
println(summary.mean) // a dense vector containing the mean value for each column
println(summary.variance) // column-wise variance
println(summary.numNonzeros) // number of nonzeros in each column
-// Compute the covariance matrix.
-val cov: Matrix = mat.computeCovariance()
{% endhighlight %}
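+
+As a minimal, hypothetical instantiation of the placeholder above (the vectors are made up for
+illustration, assuming an existing `SparkContext` named `sc`):
+
+{% highlight scala %}
+import org.apache.spark.mllib.linalg.Vectors
+
+val observations = sc.parallelize(Seq(
+  Vectors.dense(1.0, 10.0, 100.0),
+  Vectors.dense(2.0, 20.0, 200.0),
+  Vectors.dense(3.0, 30.0, 300.0)))
+
+val stats = Statistics.colStats(observations)
+println(stats.mean) // [2.0,20.0,200.0]
+println(stats.count) // 3
+{% endhighlight %}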
</div>
<div data-lang="java" markdown="1">
-[`RowMatrix#computeColumnSummaryStatistics`](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html#computeColumnSummaryStatistics()) returns an instance of
+[`colStats()`](api/java/org/apache/spark/mllib/stat/Statistics.html) returns an instance of
[`MultivariateStatisticalSummary`](api/java/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html),
which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
total count.
{% highlight java %}
-import org.apache.spark.mllib.linalg.Matrix;
-import org.apache.spark.mllib.linalg.distributed.RowMatrix;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;
+import org.apache.spark.mllib.stat.Statistics;
+
+JavaSparkContext jsc = ...
-RowMatrix mat = ... // a RowMatrix
+JavaRDD<Vector> mat = ... // an RDD of Vectors
// Compute column summary statistics.
-MultivariateStatisticalSummary summary = mat.computeColumnSummaryStatistics();
+MultivariateStatisticalSummary summary = Statistics.colStats(mat.rdd());
System.out.println(summary.mean()); // a dense vector containing the mean value for each column
System.out.println(summary.variance()); // column-wise variance
System.out.println(summary.numNonzeros()); // number of nonzeros in each column
-// Compute the covariance matrix.
-Matrix cov = mat.computeCovariance();
{% endhighlight %}
</div>
+
+<div data-lang="python" markdown="1">
+[`colStats()`](api/python/index.html#org.apache.spark.mllib.stat.Statistics$) returns an instance of
+[`MultivariateStatisticalSummary`](api/python/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary$),
+which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
+total count.
+
+{% highlight python %}
+from pyspark.mllib.stat import Statistics
+
+sc = ... # SparkContext
+
+mat = ... # an RDD of Vectors
+
+# Compute column summary statistics.
+summary = Statistics.colStats(mat)
+print summary.mean()
+print summary.variance()
+print summary.numNonzeros()
+
+{% endhighlight %}
</div>
+</div>
+
+
+## Hypothesis Testing
+
+Hypothesis testing is a powerful tool in statistics to determine whether a result is statistically
+significant, i.e., whether this result occurred by chance or not. MLlib currently supports Pearson's
+chi-squared ($\chi^2$) tests for goodness of fit and independence. The input data types determine
+whether the goodness of fit or the independence test is conducted. The goodness of fit test requires
+an input type of `Vector`, whereas the independence test requires a `Matrix` as input.
+
+MLlib also supports the input type `RDD[LabeledPoint]` to enable feature selection via chi-squared
+independence tests.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to
+run Pearson's chi-squared tests. The following example demonstrates how to run and interpret
+hypothesis tests.
-## Hypothesis Testing
+{% highlight scala %}
+import org.apache.spark.SparkContext
+import org.apache.spark.mllib.linalg._
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.stat.Statistics
+import org.apache.spark.mllib.stat.test.ChiSqTestResult
+import org.apache.spark.rdd.RDD
+
+val sc: SparkContext = ...
+
+val vec: Vector = ... // a vector composed of the frequencies of events
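+// for illustration, a hypothetical frequency vector might be
+// val vec = Vectors.dense(1.0, 2.0, 3.0, 4.0) // observed counts for four categories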
+
+// compute the goodness of fit. If a second vector to test against is not supplied as a parameter,
+// the test runs against a uniform distribution.
+val goodnessOfFitTestResult = Statistics.chiSqTest(vec)
+println(goodnessOfFitTestResult) // summary of the test including the p-value, degrees of freedom,
+                                 // test statistic, the method used, and the null hypothesis.
+
+val mat: Matrix = ... // a contingency matrix
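+// for illustration, a hypothetical 3x2 contingency matrix could be built with
+// val mat = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0)) // column-major values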
+
+// conduct Pearson's independence test on the input contingency matrix
+val independenceTestResult = Statistics.chiSqTest(mat)
+println(independenceTestResult) // summary of the test including the p-value, degrees of freedom...
+
+val obs: RDD[LabeledPoint] = ... // (feature, label) pairs.
+
+// The contingency table is constructed from the raw (feature, label) pairs and used to conduct
+// the independence test. Returns an array containing the ChiSqTestResult for every feature
+// against the label.
+val featureTestResults: Array[ChiSqTestResult] = Statistics.chiSqTest(obs)
+var i: Integer = 1
--- End diff --
remove `Integer`