Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2130#discussion_r16728034
  
    --- Diff: docs/mllib-stats.md ---
    @@ -99,69 +180,277 @@ v = u.map(lambda x: 1.0 + 2.0 * x)
     
     </div>
     
    -## Stratified Sampling 
    +## Correlation calculation
     
    -## Summary Statistics 
    +Calculating the correlation between two series of data is a common operation in Statistics. In MLlib
    +we provide the flexibility to calculate pairwise correlations among many series. The supported
    +correlation methods are currently Pearson's and Spearman's correlation.
    + 
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    
+[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) 
provides methods to 
    +calculate correlations between series. Depending on the type of input, two 
`RDD[Double]`s or 
    +an `RDD[Vector]`, the output will be a `Double` or the correlation 
`Matrix` respectively.
     
    -### Multivariate summary statistics
    +{% highlight scala %}
    +import org.apache.spark.SparkContext
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.mllib.stat.Statistics
    +
    +val sc: SparkContext = ...
    +
    +val seriesX: RDD[Double] = ... // a series
    +val seriesY: RDD[Double] = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a
    +// method is not specified, Pearson's method will be used by default.
    +val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
    +
    +val data: RDD[Vector] = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +val correlMatrix: Matrix = Statistics.corr(data, "pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to
    +calculate correlations between series. Depending on the type of input, two `JavaDoubleRDD`s or
    +a `JavaRDD<Vector>`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight java %}
    +import org.apache.spark.api.java.JavaDoubleRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.mllib.linalg.*;
    +import org.apache.spark.mllib.stat.Statistics;
    +
    +JavaSparkContext jsc = ...
    +
    +JavaDoubleRDD seriesX = ... // a series
    +JavaDoubleRDD seriesY = ... // must have the same number of partitions and cardinality as seriesX
    +
    +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a
    +// method is not specified, Pearson's method will be used by default.
    +Double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), "pearson");
    +
    +JavaRDD<Vector> data = ... // note that each Vector is a row and not a column
    +
    +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +// If a method is not specified, Pearson's method will be used by default. 
    +Matrix correlMatrix = Statistics.corr(data.rdd(), "pearson");
    +
    +{% endhighlight %}
    +</div>
     
    -We provide column summary statistics for `RowMatrix` (note: this functionality is not currently supported in `IndexedRowMatrix` or `CoordinateMatrix`).
    -If the number of columns is not large, e.g., on the order of thousands, then the
    -covariance matrix can also be computed as a local matrix, which requires $\mathcal{O}(n^2)$ storage where $n$ is the
    -number of columns. The total CPU time is $\mathcal{O}(m n^2)$, where $m$ is the number of rows,
    -and is faster if the rows are sparse.
    +<div data-lang="python" markdown="1">
    +[`Statistics`](api/python/pyspark.mllib.stat.Statistics-class.html) provides methods to
    +calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or
    +an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
    +
    +{% highlight python %}
    +from pyspark.mllib.stat import Statistics
    +
    +sc = ... # SparkContext
    +
    +seriesX = ... # a series
    +seriesY = ... # must have the same number of partitions and cardinality as seriesX
    +
    +# Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a
    +# method is not specified, Pearson's method will be used by default.
    +print Statistics.corr(seriesX, seriesY, method="pearson")
    +
    +data = ... # an RDD of Vectors
    +# calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    +# If a method is not specified, Pearson's method will be used by default. 
    +print Statistics.corr(data, method="pearson")
    +
    +{% endhighlight %}
    +</div>
    +
    +</div>
    +
    +## Stratified sampling
    +
    +Unlike the other statistics functions, which reside in MLlib, stratified sampling methods,
    +`sampleByKey` and `sampleByKeyExact`, can be performed on RDDs of key-value pairs. For stratified
    +sampling, the keys can be thought of as a label and the value as a specific attribute. For example,
    +the key can be man or woman, or document ids, and the respective values can be the list of ages
    +of the people in the population or the list of words in the documents. A separate method for exact
    +sample size support exists as it requires significantly more resources than the per-stratum simple
    +random sampling used in `sampleByKey`. `sampleByKeyExact` is currently not supported in Python.
     
     <div class="codetabs">
     <div data-lang="scala" markdown="1">
    -
    -[`computeColumnSummaryStatistics()`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) returns an instance of
    -[`MultivariateStatisticalSummary`](api/scala/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary),
    -which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
    -total count.
    +[`sampleByKeyExact()`](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions) allows users to
    +sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired
    +fraction for key $k$, and $n_k$ is the number of key-value pairs for key $k$.
    +Sampling without replacement requires one additional pass over the RDD to guarantee sample
    +size, whereas sampling with replacement requires two additional passes.
     
     {% highlight scala %}
    -import org.apache.spark.mllib.linalg.Matrix
    -import org.apache.spark.mllib.linalg.distributed.RowMatrix
    -import org.apache.spark.mllib.stat.MultivariateStatisticalSummary
    +import org.apache.spark.SparkContext
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.rdd.PairRDDFunctions
     
    -val mat: RowMatrix = ... // a RowMatrix
    +val sc: SparkContext = ...
     
    -// Compute column summary statistics.
    -val summary: MultivariateStatisticalSummary = mat.computeColumnSummaryStatistics()
    -println(summary.mean) // a dense vector containing the mean value for each column
    -println(summary.variance) // column-wise variance
    -println(summary.numNonzeros) // number of nonzeros in each column
    +val data = ... // an RDD[(K, V)] of any key value pairs
    +val fractions: Map[K, Double] = ... // specify the exact fraction desired from each key
    +
    +// Get an exact sample from each stratum
    +val sample = data.sampleByKeyExact(withReplacement = false, fractions)
     
    -// Compute the covariance matrix.
    -val cov: Matrix = mat.computeCovariance()
     {% endhighlight %}
     </div>
     
     <div data-lang="java" markdown="1">
    -
    -[`RowMatrix#computeColumnSummaryStatistics`](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html#computeColumnSummaryStatistics()) returns an instance of
    -[`MultivariateStatisticalSummary`](api/java/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html),
    -which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
    -total count.
    +[`sampleByKeyExact()`](api/java/org/apache/spark/api/java/JavaPairRDD.html) allows users to
    +sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired
    +fraction for key $k$, and $n_k$ is the number of key-value pairs for key $k$.
    +Sampling without replacement requires one additional pass over the RDD to guarantee sample
    +size, whereas sampling with replacement requires two additional passes.
     
     {% highlight java %}
    -import org.apache.spark.mllib.linalg.Matrix;
    -import org.apache.spark.mllib.linalg.distributed.RowMatrix;
    -import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;
    +import java.util.Map;
     
    -RowMatrix mat = ... // a RowMatrix
    +import org.apache.spark.api.java.JavaPairRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
     
    -// Compute column summary statistics.
    -MultivariateStatisticalSummary summary = mat.computeColumnSummaryStatistics();
    -System.out.println(summary.mean()); // a dense vector containing the mean value for each column
    -System.out.println(summary.variance()); // column-wise variance
    -System.out.println(summary.numNonzeros()); // number of nonzeros in each column
    +JavaSparkContext jsc = ...
    +
    +JavaPairRDD<K, V> data = ... // an RDD of any key value pairs
    +Map<K, Object> fractions = ... // specify the exact fraction desired from each key
    +
    +// Get an exact sample from each stratum
    +JavaPairRDD<K, V> sample = data.sampleByKeyExact(false, fractions);
    +
    +{% endhighlight %}
    +</div>
    +<div data-lang="python" markdown="1">
    +[`sampleByKey()`](api/python/pyspark.rdd.RDD-class.html#sampleByKey) allows users to
    +sample approximately $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the
    +desired fraction for key $k$, and $n_k$ is the number of key-value pairs for key $k$.
    +Sampling without replacement requires one additional pass over the RDD to guarantee sample
    +size, whereas sampling with replacement requires two additional passes.
    +
    +*Note:* `sampleByKeyExact()` is currently not supported in Python.
    +
    +
    +{% highlight python %}
    +
    +sc = ... # SparkContext
    +
    +data = ... # an RDD of any key value pairs
    +fractions = ... # specify the exact fraction desired from each key as a dictionary
    +
    +sample = data.sampleByKeyExact(False, fractions);
    --- End diff --
    
    `approxSample = data.sampleByKey(False, fractions)`
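    
    For context, a minimal self-contained sketch of how the suggested call could be exercised end to end; the sample data, fractions, and app name below are illustrative only, not part of the PR:
    
    ```python
    from pyspark import SparkContext
    
    sc = SparkContext(appName="StratifiedSamplingSketch")
    
    # An RDD of (key, value) pairs; the keys act as the strata labels.
    data = sc.parallelize([('a', 1), ('a', 2), ('b', 3), ('b', 4), ('b', 5)])
    
    # Desired sampling fraction for each key.
    fractions = {'a': 0.5, 'b': 0.5}
    
    # sampleByKey draws an approximate per-stratum sample in a single pass;
    # it does not guarantee the exact per-key sample sizes that sampleByKeyExact does.
    approxSample = data.sampleByKey(False, fractions)
    print approxSample.collect()
    ```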

