Repository: spark
Updated Branches:
  refs/heads/branch-1.1 b5f808398 -> cffd9bb8d


SPARK-2830 [MLlib]: re-organize mllib documentation

As per discussions with Xiangrui, I've reorganized and edited the mllib 
documentation.

Author: Ameet Talwalkar <[email protected]>

Closes #1908 from atalwalkar/master and squashes the following commits:

fe6938a [Ameet Talwalkar] made xiangruis suggested changes
840028b [Ameet Talwalkar] made xiangruis suggested changes
7ec366a [Ameet Talwalkar] reorganize and edit mllib documentation

(cherry picked from commit c235b83e2782cce0626ecc403c0a67e442be52c1)
Signed-off-by: Xiangrui Meng <[email protected]>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cffd9bb8
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cffd9bb8
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cffd9bb8

Branch: refs/heads/branch-1.1
Commit: cffd9bb8d3d025ac2008b54822ee772ec3b28127
Parents: b5f8083
Author: Ameet Talwalkar <[email protected]>
Authored: Tue Aug 12 17:15:21 2014 -0700
Committer: Xiangrui Meng <[email protected]>
Committed: Tue Aug 12 17:15:33 2014 -0700

----------------------------------------------------------------------
 docs/mllib-basics.md                    | 117 ++++++-----------------
 docs/mllib-classification-regression.md |  37 ++++++++
 docs/mllib-clustering.md                |  15 +--
 docs/mllib-collaborative-filtering.md   |  21 +++--
 docs/mllib-dimensionality-reduction.md  |  44 +++++----
 docs/mllib-feature-extraction.md        |  12 +++
 docs/mllib-guide.md                     |  30 +++---
 docs/mllib-linear-methods.md            | 134 ++++++++++++++-------------
 docs/mllib-naive-bayes.md               |  32 +++----
 docs/mllib-stats.md                     |  95 +++++++++++++++++++
 10 files changed, 317 insertions(+), 220 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/cffd9bb8/docs/mllib-basics.md
----------------------------------------------------------------------
diff --git a/docs/mllib-basics.md b/docs/mllib-basics.md
index f958525..8752df4 100644
--- a/docs/mllib-basics.md
+++ b/docs/mllib-basics.md
@@ -9,17 +9,17 @@ displayTitle: <a href="mllib-guide.html">MLlib</a> - Basics
 
 MLlib supports local vectors and matrices stored on a single machine, 
 as well as distributed matrices backed by one or more RDDs.
-In the current implementation, local vectors and matrices are simple data 
models 
-to serve public interfaces. The underlying linear algebra operations are 
provided by
+Local vectors and local matrices are simple data models 
+that serve as public interfaces. The underlying linear algebra operations are 
provided by
 [Breeze](http://www.scalanlp.org/) and [jblas](http://jblas.org/).
-A training example used in supervised learning is called "labeled point" in 
MLlib.
+A training example used in supervised learning is called a "labeled point" in 
MLlib.
 
 ## Local vector
 
 A local vector has integer-typed and 0-based indices and double-typed values, 
stored on a single
 machine.  MLlib supports two types of local vectors: dense and sparse.  A 
dense vector is backed by
 a double array representing its entry values, while a sparse vector is backed 
by two parallel
-arrays: indices and values.  For example, a vector $(1.0, 0.0, 3.0)$ can be 
represented in dense
+arrays: indices and values.  For example, a vector `(1.0, 0.0, 3.0)` can be 
represented in dense
 format as `[1.0, 0.0, 3.0]` or in sparse format as `(3, [0, 2], [1.0, 3.0])`, 
where `3` is the size
 of the vector.
 
@@ -44,8 +44,7 @@ val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 
3.0))
 val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))
 {% endhighlight %}
 
-***Note***
-
+***Note:***
 Scala imports `scala.collection.immutable.Vector` by default, so you have to 
import
 `org.apache.spark.mllib.linalg.Vector` explicitly to use MLlib's `Vector`.
 
@@ -110,8 +109,8 @@ sv2 = sps.csc_matrix((np.array([1.0, 3.0]), np.array([0, 
2]), np.array([0, 2])),
 A labeled point is a local vector, either dense or sparse, associated with a 
label/response.
 In MLlib, labeled points are used in supervised learning algorithms.
 We use a double to store a label, so we can use labeled points in both 
regression and classification.
-For binary classification, label should be either $0$ (negative) or $1$ 
(positive).
-For multiclass classification, labels should be class indices staring from 
zero: $0, 1, 2, \ldots$.
+For binary classification, a label should be either `0` (negative) or `1` 
(positive).
+For multiclass classification, labels should be class indices starting from 
zero: `0, 1, 2, ...`.
 
 <div class="codetabs">
 
@@ -172,7 +171,7 @@ neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))
 It is very common in practice to have sparse training data.  MLlib supports 
reading training
 examples stored in `LIBSVM` format, which is the default format used by
 [`LIBSVM`](http://www.csie.ntu.edu.tw/~cjlin/libsvm/) and
-[`LIBLINEAR`](http://www.csie.ntu.edu.tw/~cjlin/liblinear/).  It is a text 
format.  Each line
+[`LIBLINEAR`](http://www.csie.ntu.edu.tw/~cjlin/liblinear/).  It is a text 
format in which each line
 represents a labeled sparse feature vector using the following format:
 
 ~~~
@@ -226,7 +225,7 @@ examples = MLUtils.loadLibSVMFile(sc, 
"data/mllib/sample_libsvm_data.txt")
 ## Local matrix
 
 A local matrix has integer-typed row and column indices and double-typed 
values, stored on a single
-machine.  MLlib supports dense matrix, whose entry values are stored in a 
single double array in
+machine.  MLlib supports dense matrices, whose entry values are stored in a 
single double array in
 column major.  For example, the following matrix `\[ \begin{pmatrix}
 1.0 & 2.0 \\
 3.0 & 4.0 \\
@@ -234,7 +233,6 @@ column major.  For example, the following matrix `\[ 
\begin{pmatrix}
 \end{pmatrix}
 \]`
 is stored in a one-dimensional array `[1.0, 3.0, 5.0, 2.0, 4.0, 6.0]` with the 
matrix size `(3, 2)`.
-We are going to add sparse matrix in the next release.
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
@@ -242,7 +240,7 @@ We are going to add sparse matrix in the next release.
 The base class of local matrices is
 [`Matrix`](api/scala/index.html#org.apache.spark.mllib.linalg.Matrix), and we 
provide one
 implementation: 
[`DenseMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.DenseMatrix).
-Sparse matrix will be added in the next release.  We recommend using the 
factory methods implemented
+We recommend using the factory methods implemented
 in [`Matrices`](api/scala/index.html#org.apache.spark.mllib.linalg.Matrices) 
to create local
 matrices.
 
@@ -259,7 +257,7 @@ val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 
2.0, 4.0, 6.0))
 The base class of local matrices is
 [`Matrix`](api/java/org/apache/spark/mllib/linalg/Matrix.html), and we provide 
one
 implementation: 
[`DenseMatrix`](api/java/org/apache/spark/mllib/linalg/DenseMatrix.html).
-Sparse matrix will be added in the next release.  We recommend using the 
factory methods implemented
+We recommend using the factory methods implemented
 in [`Matrices`](api/java/org/apache/spark/mllib/linalg/Matrices.html) to 
create local
 matrices.
 
@@ -279,28 +277,30 @@ Matrix dm = Matrices.dense(3, 2, new double[] {1.0, 3.0, 
5.0, 2.0, 4.0, 6.0});
 A distributed matrix has long-typed row and column indices and double-typed 
values, stored
 distributively in one or more RDDs.  It is very important to choose the right 
format to store large
 and distributed matrices.  Converting a distributed matrix to a different 
format may require a
-global shuffle, which is quite expensive.  We implemented three types of 
distributed matrices in
-this release and will add more types in the future.
+global shuffle, which is quite expensive.  Three types of distributed matrices 
have been implemented
+so far.
 
 The basic type is called `RowMatrix`. A `RowMatrix` is a row-oriented 
distributed
 matrix without meaningful row indices, e.g., a collection of feature vectors.
 It is backed by an RDD of its rows, where each row is a local vector.
-We assume that the number of columns is not huge for a `RowMatrix`.
+We assume that the number of columns is not huge for a `RowMatrix` so that a 
single
+local vector can be reasonably communicated to the driver and can also be 
stored /
+operated on using a single node. 
 An `IndexedRowMatrix` is similar to a `RowMatrix` but with row indices,
-which can be used for identifying rows and joins.
-A `CoordinateMatrix` is a distributed matrix stored in [coordinate list 
(COO)](https://en.wikipedia.org/wiki/Sparse_matrix) format,
+which can be used for identifying rows and executing joins.
+A `CoordinateMatrix` is a distributed matrix stored in [coordinate list 
(COO)](https://en.wikipedia.org/wiki/Sparse_matrix#Coordinate_list_.28COO.29) 
format,
 backed by an RDD of its entries.
 
 ***Note***
 
 The underlying RDDs of a distributed matrix must be deterministic, because we 
cache the matrix size.
-It is always error-prone to have non-deterministic RDDs.
+In general the use of non-deterministic RDDs can lead to errors.
 
 ### RowMatrix
 
 A `RowMatrix` is a row-oriented distributed matrix without meaningful row 
indices, backed by an RDD
-of its rows, where each row is a local vector.  This is similar to `data 
matrix` in the context of
-multivariate statistics.  Since each row is represented by a local vector, the 
number of columns is
+of its rows, where each row is a local vector.
+Since each row is represented by a local vector, the number of columns is
 limited by the integer range but it should be much smaller in practice.
 
 <div class="codetabs">
@@ -344,70 +344,10 @@ long n = mat.numCols();
 </div>
 </div>
 
-#### Multivariate summary statistics
-
-We provide column summary statistics for `RowMatrix`. 
-If the number of columns is not large, say, smaller than 3000, you can also 
compute
-the covariance matrix as a local matrix, which requires $\mathcal{O}(n^2)$ 
storage where $n$ is the
-number of columns. The total CPU time is $\mathcal{O}(m n^2)$, where $m$ is 
the number of rows,
-which could be faster if the rows are sparse.
-
-<div class="codetabs">
-<div data-lang="scala" markdown="1">
-
-[`RowMatrix#computeColumnSummaryStatistics`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix)
 returns an instance of
-[`MultivariateStatisticalSummary`](api/scala/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary),
-which contains the column-wise max, min, mean, variance, and number of 
nonzeros, as well as the
-total count.
-
-{% highlight scala %}
-import org.apache.spark.mllib.linalg.Matrix
-import org.apache.spark.mllib.linalg.distributed.RowMatrix
-import org.apache.spark.mllib.stat.MultivariateStatisticalSummary
-
-val mat: RowMatrix = ... // a RowMatrix
-
-// Compute column summary statistics.
-val summary: MultivariateStatisticalSummary = 
mat.computeColumnSummaryStatistics()
-println(summary.mean) // a dense vector containing the mean value for each 
column
-println(summary.variance) // column-wise variance
-println(summary.numNonzeros) // number of nonzeros in each column
-
-// Compute the covariance matrix.
-val cov: Matrix = mat.computeCovariance()
-{% endhighlight %}
-</div>
-
-<div data-lang="java" markdown="1">
-
-[`RowMatrix#computeColumnSummaryStatistics`](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html#computeColumnSummaryStatistics())
 returns an instance of
-[`MultivariateStatisticalSummary`](api/java/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html),
-which contains the column-wise max, min, mean, variance, and number of 
nonzeros, as well as the
-total count.
-
-{% highlight java %}
-import org.apache.spark.mllib.linalg.Matrix;
-import org.apache.spark.mllib.linalg.distributed.RowMatrix;
-import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;
-
-RowMatrix mat = ... // a RowMatrix
-
-// Compute column summary statistics.
-MultivariateStatisticalSummary summary = mat.computeColumnSummaryStatistics();
-System.out.println(summary.mean()); // a dense vector containing the mean 
value for each column
-System.out.println(summary.variance()); // column-wise variance
-System.out.println(summary.numNonzeros()); // number of nonzeros in each column
-
-// Compute the covariance matrix.
-Matrix cov = mat.computeCovariance();
-{% endhighlight %}
-</div>
-</div>
-
 ### IndexedRowMatrix
 
 An `IndexedRowMatrix` is similar to a `RowMatrix` but with meaningful row 
indices.  It is backed by
-an RDD of indexed rows, which each row is represented by its index 
(long-typed) and a local vector.
+an RDD of indexed rows, so that each row is represented by its index 
(long-typed) and a local vector.
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
@@ -467,7 +407,7 @@ RowMatrix rowMat = mat.toRowMatrix();
 
 A `CoordinateMatrix` is a distributed matrix backed by an RDD of its entries.  
Each entry is a tuple
 of `(i: Long, j: Long, value: Double)`, where `i` is the row index, `j` is the 
column index, and
-`value` is the entry value.  A `CoordinateMatrix` should be used only in the 
case when both
+`value` is the entry value.  A `CoordinateMatrix` should be used only when both
 dimensions of the matrix are huge and the matrix is very sparse.
 
 <div class="codetabs">
@@ -477,9 +417,9 @@ A
 
[`CoordinateMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.CoordinateMatrix)
 can be created from an `RDD[MatrixEntry]` instance, where
 
[`MatrixEntry`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.MatrixEntry)
 is a
-wrapper over `(Long, Long, Double)`.  A `CoordinateMatrix` can be converted to 
a `IndexedRowMatrix`
-with sparse rows by calling `toIndexedRowMatrix`.  In this release, we do not 
provide other
-computation for `CoordinateMatrix`.
+wrapper over `(Long, Long, Double)`.  A `CoordinateMatrix` can be converted to 
an `IndexedRowMatrix`
+with sparse rows by calling `toIndexedRowMatrix`.  Other computations for 
+`CoordinateMatrix` are not currently supported.
 
 {% highlight scala %}
 import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, 
MatrixEntry}
@@ -503,8 +443,9 @@ A
 
[`CoordinateMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/CoordinateMatrix.html)
 can be created from a `JavaRDD<MatrixEntry>` instance, where
 
[`MatrixEntry`](api/java/org/apache/spark/mllib/linalg/distributed/MatrixEntry.html)
 is a
-wrapper over `(long, long, double)`.  A `CoordinateMatrix` can be converted to 
a `IndexedRowMatrix`
-with sparse rows by calling `toIndexedRowMatrix`.
+wrapper over `(long, long, double)`.  A `CoordinateMatrix` can be converted to 
an `IndexedRowMatrix`
+with sparse rows by calling `toIndexedRowMatrix`. Other computations for 
+`CoordinateMatrix` are not currently supported.
 
 {% highlight java %}
 import org.apache.spark.api.java.JavaRDD;

http://git-wip-us.apache.org/repos/asf/spark/blob/cffd9bb8/docs/mllib-classification-regression.md
----------------------------------------------------------------------
diff --git a/docs/mllib-classification-regression.md 
b/docs/mllib-classification-regression.md
new file mode 100644
index 0000000..719cc95
--- /dev/null
+++ b/docs/mllib-classification-regression.md
@@ -0,0 +1,37 @@
+---
+layout: global
+title: Classification and Regression - MLlib
+displayTitle: <a href="mllib-guide.html">MLlib</a> - Classification and 
Regression
+---
+
+MLlib supports various methods for 
+[binary classification](http://en.wikipedia.org/wiki/Binary_classification),
+[multiclass
+classification](http://en.wikipedia.org/wiki/Multiclass_classification), and
+[regression analysis](http://en.wikipedia.org/wiki/Regression_analysis). The 
table below outlines
+the supported algorithms for each type of problem.
+
+<table class="table">
+  <thead>
+    <tr><th>Problem Type</th><th>Supported Methods</th></tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td>Binary Classification</td><td>linear SVMs, logistic regression, 
decision trees, naive Bayes</td>
+    </tr>
+    <tr>
+      <td>Multiclass Classification</td><td>decision trees, naive Bayes</td>
+    </tr>
+    <tr>
+      <td>Regression</td><td>linear least squares, Lasso, ridge regression, 
decision trees</td>
+    </tr>
+  </tbody>
+</table>
+
+More details for these methods can be found here:
+
+* [Linear models](mllib-linear-methods.html)
+  * [binary classification (SVMs, logistic 
regression)](mllib-linear-methods.html#binary-classification)
+  * [linear regression (least squares, Lasso, 
ridge)](mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression)
+* [Decision trees](mllib-decision-tree.html)
+* [Naive Bayes](mllib-naive-bayes.html)

http://git-wip-us.apache.org/repos/asf/spark/blob/cffd9bb8/docs/mllib-clustering.md
----------------------------------------------------------------------
diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md
index 561de48..dfd9cd5 100644
--- a/docs/mllib-clustering.md
+++ b/docs/mllib-clustering.md
@@ -38,7 +38,7 @@ a given dataset, the algorithm returns the best clustering 
result).
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Following code snippets can be executed in `spark-shell`.
+The following code snippets can be executed in `spark-shell`.
 
 In the following example after loading and parsing data, we use the
 [`KMeans`](api/scala/index.html#org.apache.spark.mllib.clustering.KMeans) 
object to cluster the data
@@ -70,7 +70,7 @@ All of MLlib's methods use Java-friendly types, so you can 
import and call them
 way you do in Scala. The only caveat is that the methods take Scala RDD 
objects, while the
 Spark Java API uses a separate `JavaRDD` class. You can convert a Java RDD to 
a Scala one by
 calling `.rdd()` on your `JavaRDD` object. A standalone application example
-that is equivalent to the provided example in Scala is given bellow:
+that is equivalent to the provided example in Scala is given below:
 
 {% highlight java %}
 import org.apache.spark.api.java.*;
@@ -113,14 +113,15 @@ public class KMeansExample {
 }
 {% endhighlight %}
 
-In order to run the above standalone application using Spark framework make
-sure that you follow the instructions provided at section [Standalone
-Applications](quick-start.html) of the quick-start guide. What is more, you
-should include to your build file *spark-mllib* as a dependency.
+In order to run the above standalone application, follow the instructions
+provided in the [Standalone
+Applications](quick-start.html#standalone-applications) section of the Spark
+quick-start guide. Be sure to also include *spark-mllib* to your build file as
+a dependency.
 </div>
 
 <div data-lang="python" markdown="1">
-Following examples can be tested in the PySpark shell.
+The following examples can be tested in the PySpark shell.
 
 In the following example after loading and parsing data, we use the KMeans 
object to cluster the
 data into two clusters. The number of desired clusters is passed to the 
algorithm. We then compute

http://git-wip-us.apache.org/repos/asf/spark/blob/cffd9bb8/docs/mllib-collaborative-filtering.md
----------------------------------------------------------------------
diff --git a/docs/mllib-collaborative-filtering.md 
b/docs/mllib-collaborative-filtering.md
index 0d28b5f..ab10b2f 100644
--- a/docs/mllib-collaborative-filtering.md
+++ b/docs/mllib-collaborative-filtering.md
@@ -14,13 +14,13 @@ is commonly used for recommender systems.  These techniques 
aim to fill in the
 missing entries of a user-item association matrix.  MLlib currently supports
 model-based collaborative filtering, in which users and products are described
 by a small set of latent factors that can be used to predict missing entries.
-In particular, we implement the [alternating least squares
+MLlib uses the [alternating least squares
 (ALS)](http://dl.acm.org/citation.cfm?id=1608614)
 algorithm to learn these latent factors. The implementation in MLlib has the
 following parameters:
 
 * *numBlocks* is the number of blocks used to parallelize computation (set to 
-1 to auto-configure).
-* *rank* is the number of latent factors in our model.
+* *rank* is the number of latent factors in the model.
 * *iterations* is the number of iterations to run.
 * *lambda* specifies the regularization parameter in ALS.
 * *implicitPrefs* specifies whether to use the *explicit feedback* ALS variant 
or one adapted for
@@ -86,8 +86,8 @@ val MSE = ratesAndPreds.map { case ((user, product), (r1, 
r2)) =>
 println("Mean Squared Error = " + MSE)
 {% endhighlight %}
 
-If the rating matrix is derived from other source of information (i.e., it is 
inferred from
-other signals), you can use the trainImplicit method to get better results.
+If the rating matrix is derived from another source of information (e.g., it 
is inferred from
+other signals), you can use the `trainImplicit` method to get better results.
 
 {% highlight scala %}
 val alpha = 0.01
@@ -174,10 +174,11 @@ public class CollaborativeFiltering {
 }
 {% endhighlight %}
 
-In order to run the above standalone application using Spark framework make
-sure that you follow the instructions provided at section [Standalone
-Applications](quick-start.html) of the quick-start guide. What is more, you
-should include to your build file *spark-mllib* as a dependency.
+In order to run the above standalone application, follow the instructions
+provided in the [Standalone
+Applications](quick-start.html#standalone-applications) section of the Spark
+quick-start guide. Be sure to also include *spark-mllib* to your build file as
+a dependency.
 </div>
 
 <div data-lang="python" markdown="1">
@@ -219,5 +220,5 @@ model = ALS.trainImplicit(ratings, rank, numIterations, 
alpha = 0.01)
 
 ## Tutorial
 
-[AMP Camp](http://ampcamp.berkeley.edu/) provides a hands-on tutorial for
-[personalized movie recommendation with 
MLlib](http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html).
+The [training 
exercises](https://databricks-training.s3.amazonaws.com/index.html) from the 
Spark Summit 2014 include a hands-on tutorial for
+[personalized movie recommendation with 
MLlib](https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html).

http://git-wip-us.apache.org/repos/asf/spark/blob/cffd9bb8/docs/mllib-dimensionality-reduction.md
----------------------------------------------------------------------
diff --git a/docs/mllib-dimensionality-reduction.md 
b/docs/mllib-dimensionality-reduction.md
index 8e43499..065d646 100644
--- a/docs/mllib-dimensionality-reduction.md
+++ b/docs/mllib-dimensionality-reduction.md
@@ -9,9 +9,9 @@ displayTitle: <a href="mllib-guide.html">MLlib</a> - 
Dimensionality Reduction
 
 [Dimensionality 
reduction](http://en.wikipedia.org/wiki/Dimensionality_reduction) is the 
process 
 of reducing the number of variables under consideration.
-It is used to extract latent features from raw and noisy features,
+It can be used to extract latent features from raw and noisy features
 or compress data while maintaining the structure.
-In this release, we provide preliminary support for dimensionality reduction 
on tall-and-skinny matrices.
+MLlib provides support for dimensionality reduction on tall-and-skinny 
matrices.
 
 ## Singular value decomposition (SVD)
 
@@ -30,17 +30,17 @@ where
 * $V$ is an orthonormal matrix, whose columns are called right singular 
vectors.
  
 For large matrices, usually we don't need the complete factorization but only 
the top singular
-values and its associated singular vectors.  This can save storage, and more 
importantly, de-noise
+values and its associated singular vectors.  This can save storage, de-noise
 and recover the low-rank structure of the matrix.
 
-If we keep the top $k$ singular values, then the dimensions of the return will 
be:
+If we keep the top $k$ singular values, then the dimensions of the resulting 
low-rank matrix will be:
 
 * `$U$`: `$m \times k$`,
 * `$\Sigma$`: `$k \times k$`,
 * `$V$`: `$n \times k$`.
  
-In this release, we provide SVD computation to row-oriented matrices that have 
only a few columns,
-say, less than $1000$, but many rows, which we call *tall-and-skinny*.
+MLlib provides SVD functionality to row-oriented matrices that have only a few 
columns,
+say, less than $1000$, but many rows, i.e., *tall-and-skinny* matrices.
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
@@ -58,15 +58,10 @@ val s: Vector = svd.s // The singular values are stored in 
a local dense vector.
 val V: Matrix = svd.V // The V factor is a local dense matrix.
 {% endhighlight %}
 
-Same code applies to `IndexedRowMatrix`.
-The only difference that the `U` matrix becomes an `IndexedRowMatrix`.
+The same code applies to `IndexedRowMatrix` if `U` is defined as an
+`IndexedRowMatrix`.
 </div>
 <div data-lang="java" markdown="1">
-In order to run the following standalone application using Spark framework make
-sure that you follow the instructions provided at section [Standalone
-Applications](quick-start.html) of the quick-start guide. What is more, you
-should include to your build file *spark-mllib* as a dependency.
-
 {% highlight java %}
 import java.util.LinkedList;
 
@@ -104,8 +99,16 @@ public class SVD {
   }
 }
 {% endhighlight %}
-Same code applies to `IndexedRowMatrix`.
-The only difference that the `U` matrix becomes an `IndexedRowMatrix`.
+
+The same code applies to `IndexedRowMatrix` if `U` is defined as an
+`IndexedRowMatrix`.
+
+In order to run the above standalone application, follow the instructions
+provided in the [Standalone
+Applications](quick-start.html#standalone-applications) section of the Spark
+quick-start guide. Be sure to also include *spark-mllib* to your build file as
+a dependency.
+
 </div>
 </div>
 
@@ -116,7 +119,7 @@ statistical method to find a rotation such that the first 
coordinate has the lar
 possible, and each succeeding coordinate in turn has the largest variance 
possible. The columns of
 the rotation matrix are called principal components. PCA is used widely in 
dimensionality reduction.
 
-In this release, we implement PCA for tall-and-skinny matrices stored in 
row-oriented format.
+MLlib supports PCA for tall-and-skinny matrices stored in row-oriented format.
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
@@ -180,9 +183,10 @@ public class PCA {
 }
 {% endhighlight %}
 
-In order to run the above standalone application using Spark framework make
-sure that you follow the instructions provided at section [Standalone
-Applications](quick-start.html) of the quick-start guide. What is more, you
-should include to your build file *spark-mllib* as a dependency.
+In order to run the above standalone application, follow the instructions
+provided in the [Standalone
+Applications](quick-start.html#standalone-applications) section of the Spark
+quick-start guide. Be sure to also include *spark-mllib* to your build file as
+a dependency.
 </div>
 </div>

http://git-wip-us.apache.org/repos/asf/spark/blob/cffd9bb8/docs/mllib-feature-extraction.md
----------------------------------------------------------------------
diff --git a/docs/mllib-feature-extraction.md b/docs/mllib-feature-extraction.md
new file mode 100644
index 0000000..21453cb
--- /dev/null
+++ b/docs/mllib-feature-extraction.md
@@ -0,0 +1,12 @@
+---
+layout: global
+title: Feature Extraction - MLlib
+displayTitle: <a href="mllib-guide.html">MLlib</a> - Feature Extraction 
+---
+
+* Table of contents
+{:toc}
+
+## Word2Vec 
+
+## TFIDF

http://git-wip-us.apache.org/repos/asf/spark/blob/cffd9bb8/docs/mllib-guide.md
----------------------------------------------------------------------
diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
index 95ee6bc..23d5a0c 100644
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@@ -3,18 +3,19 @@ layout: global
 title: Machine Learning Library (MLlib)
 ---
 
-MLlib is a Spark implementation of some common machine learning algorithms and 
utilities,
+MLlib is Spark's scalable machine learning library consisting of common 
learning algorithms and utilities,
 including classification, regression, clustering, collaborative
-filtering, dimensionality reduction, as well as underlying optimization 
primitives:
+filtering, dimensionality reduction, as well as underlying optimization 
primitives, as outlined below:
 
-* [Basics](mllib-basics.html)
-  * data types 
+* [Data types](mllib-basics.html)
+* [Basic statistics](mllib-stats.html)
+  * data generators  
+  * stratified sampling
   * summary statistics
-* Classification and regression
-  * [linear support vector machine 
(SVM)](mllib-linear-methods.html#linear-support-vector-machine-svm)
-  * [logistic regression](mllib-linear-methods.html#logistic-regression)
-  * [linear least squares, Lasso, and ridge 
regression](mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression)
-  * [decision tree](mllib-decision-tree.html)
+  * hypothesis testing
+* [Classification and regression](mllib-classification-regression.html)
+  * [linear models (SVMs, logistic regression, linear 
regression)](mllib-linear-methods.html)
+  * [decision trees](mllib-decision-tree.html)
   * [naive Bayes](mllib-naive-bayes.html)
 * [Collaborative filtering](mllib-collaborative-filtering.html)
   * alternating least squares (ALS)
@@ -23,17 +24,18 @@ filtering, dimensionality reduction, as well as underlying 
optimization primitiv
 * [Dimensionality reduction](mllib-dimensionality-reduction.html)
   * singular value decomposition (SVD)
   * principal component analysis (PCA)
-* [Optimization](mllib-optimization.html)
+* [Feature extraction and transformation](mllib-feature-extraction.html)
+* [Optimization (developer)](mllib-optimization.html)
   * stochastic gradient descent
   * limited-memory BFGS (L-BFGS)
 
-MLlib is a new component under active development.
+MLlib is under active development.
 The APIs marked `Experimental`/`DeveloperApi` may change in future releases, 
-and we will provide migration guide between releases.
+and the migration guide below will explain all changes between releases.
 
 # Dependencies
 
-MLlib uses linear algebra packages [Breeze](http://www.scalanlp.org/), which 
depends on
+MLlib uses the linear algebra package [Breeze](http://www.scalanlp.org/), 
which depends on
 [netlib-java](https://github.com/fommil/netlib-java), and
 [jblas](https://github.com/mikiobraun/jblas). 
 `netlib-java` and `jblas` depend on native Fortran routines.
@@ -56,7 +58,7 @@ To use MLlib in Python, you will need 
[NumPy](http://www.numpy.org) version 1.4
 
 In MLlib v1.0, we support both dense and sparse input in a unified way, which 
introduces a few
 breaking changes.  If your data is sparse, please store it in a sparse format 
instead of dense to
-take advantage of sparsity in both storage and computation.
+take advantage of sparsity in both storage and computation. Details are 
described below.
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">

http://git-wip-us.apache.org/repos/asf/spark/blob/cffd9bb8/docs/mllib-linear-methods.md
----------------------------------------------------------------------
diff --git a/docs/mllib-linear-methods.md b/docs/mllib-linear-methods.md
index 2542011..e504cd7 100644
--- a/docs/mllib-linear-methods.md
+++ b/docs/mllib-linear-methods.md
@@ -33,24 +33,24 @@ the task of finding a minimizer of a convex function `$f$` 
that depends on a var
 Formally, we can write this as the optimization problem `$\min_{\wv \in\R^d} 
\; f(\wv)$`, where
 the objective function is of the form
 `\begin{equation}
-    f(\wv) := 
-    \frac1n \sum_{i=1}^n L(\wv;\x_i,y_i) +
-    \lambda\, R(\wv_i)
+    f(\wv) := \lambda\, R(\wv) +
+    \frac1n \sum_{i=1}^n L(\wv;\x_i,y_i)
     \label{eq:regPrimal}
     \ .
 \end{equation}`
 Here the vectors `$\x_i\in\R^d$` are the training data examples, for `$1\le 
i\le n$`, and
 `$y_i\in\R$` are their corresponding labels, which we want to predict. 
 We call the method *linear* if $L(\wv; \x, y)$ can be expressed as a function 
of $\wv^T x$ and $y$.
-Several MLlib's classification and regression algorithms fall into this 
category,
+Several of MLlib's classification and regression algorithms fall into this 
category,
 and are discussed here.
 
 The objective function `$f$` has two parts:
-the loss that measures the error of the model on the training data, 
-and the regularizer that measures the complexity of the model.
-The loss function `$L(\wv;.)$` must be a convex function in `$\wv$`.
-The fixed regularization parameter `$\lambda \ge 0$` (`regParam` in the code) 
defines the trade-off
-between the two goals of small loss and small model complexity.
+the regularizer that controls the complexity of the model,
+and the loss that measures the error of the model on the training data.
+The loss function `$L(\wv;.)$` is typically a convex function in `$\wv$`.  The
+fixed regularization parameter `$\lambda \ge 0$` (`regParam` in the code)
+defines the trade-off between the two goals of minimizing the loss (i.e.,
+training error) and minimizing model complexity (i.e., to avoid overfitting).
 
 ### Loss functions
 
@@ -80,10 +80,10 @@ methods MLlib supports:
 
 ### Regularizers
 
-The purpose of the 
[regularizer](http://en.wikipedia.org/wiki/Regularization_(mathematics)) is to
-encourage simple models, by punishing the complexity of the model `$\wv$`, in 
order to e.g. avoid
-over-fitting.
-We support the following regularizers in MLlib:
+The purpose of the
+[regularizer](http://en.wikipedia.org/wiki/Regularization_(mathematics)) is to
+encourage simple models and avoid overfitting.  We support the following
+regularizers in MLlib:
 
 <table class="table">
   <thead>
@@ -106,27 +106,28 @@ Here `$\mathrm{sign}(\wv)$` is the vector consisting of 
the signs (`$\pm1$`) of
 of `$\wv$`.
 
 L2-regularized problems are generally easier to solve than L1-regularized due 
to smoothness.
-However, L1 regularization can help promote sparsity in weights, leading to 
simpler models, which is
-also used for feature selection.  It is not recommended to train models 
without any regularization,
+However, L1 regularization can help promote sparsity in weights leading to 
smaller and more interpretable models, the latter of which can be useful for 
feature selection.
+It is not recommended to train models without any regularization,
 especially when the number of training examples is small.
 
 ## Binary classification
 
-[Binary classification](http://en.wikipedia.org/wiki/Binary_classification) is 
to divide items into
-two categories: positive and negative.  MLlib supports two linear methods for 
binary classification:
-linear support vector machine (SVM) and logistic regression.  The training 
data set is represented
-by an RDD of [LabeledPoint](mllib-data-types.html) in MLlib.  Note that, in 
the mathematical
-formulation, a training label $y$ is either $+1$ (positive) or $-1$ 
(negative), which is convenient
-for the formulation.  *However*, the negative label is represented by $0$ in 
MLlib instead of $-1$,
-to be consistent with multiclass labeling.
+[Binary classification](http://en.wikipedia.org/wiki/Binary_classification)
+aims to divide items into two categories: positive and negative.  MLlib
+supports two linear methods for binary classification: linear support vector
+machines (SVMs) and logistic regression. For both methods, MLlib supports
+L1 and L2 regularized variants. The training data set is represented by an RDD
+of [LabeledPoint](mllib-data-types.html) in MLlib.  Note that, in the
+mathematical formulation in this guide, a training label $y$ is denoted as
+either $+1$ (positive) or $-1$ (negative), which is convenient for the
+formulation.  *However*, the negative label is represented by $0$ in MLlib
+instead of $-1$, to be consistent with multiclass labeling.
 
-### Linear support vector machine (SVM)
+### Linear support vector machines (SVMs)
 
 The [linear 
SVM](http://en.wikipedia.org/wiki/Support_vector_machine#Linear_SVM)
-has become a standard choice for large-scale classification tasks.
-The name "linear SVM" is actually ambiguous.
-By "linear SVM", we mean specifically the linear method with the loss function 
in formulation
-`$\eqref{eq:regPrimal}$` given by the hinge loss
+is a standard method for large-scale classification tasks. It is a linear 
method as described above in equation `$\eqref{eq:regPrimal}$`, with the loss 
function in the formulation given by the hinge loss:
+
 `\[
 L(\wv;\x,y) := \max \{0, 1-y \wv^T \x \}.
 \]`
@@ -134,39 +135,44 @@ By default, linear SVMs are trained with an L2 
regularization.
 We also support alternative L1 regularization. In this case,
 the problem becomes a [linear 
program](http://en.wikipedia.org/wiki/Linear_programming).
 
-Linear SVM algorithm outputs a SVM model, which makes predictions based on the 
value of $\wv^T \x$.
-By the default, if $\wv^T \x \geq 0$, the outcome is positive, or negative 
otherwise.
-However, quite often in practice, the default threshold $0$ is not a good 
choice.
-The threshold should be determined via model evaluation.
+The linear SVMs algorithm outputs an SVM model. Given a new data point,
+denoted by $\x$, the model makes predictions based on the value of $\wv^T \x$.
+By the default, if $\wv^T \x \geq 0$ then the outcome is positive, and negative
+otherwise.
 
 ### Logistic regression
 
 [Logistic regression](http://en.wikipedia.org/wiki/Logistic_regression) is 
widely used to predict a
-binary response.  It is a linear method with the loss function in formulation
-`$\eqref{eq:regPrimal}$` given by the logistic loss
+binary response. 
+It is a linear method as described above in equation `$\eqref{eq:regPrimal}$`, 
with the loss
+function in the formulation given by the logistic loss:
 `\[
 L(\wv;\x,y) :=  \log(1+\exp( -y \wv^T \x)).
 \]`
 
-Logistic regression algorithm outputs a logistic regression model, which makes 
predictions by
+The logistic regression algorithm outputs a logistic regression model.  Given a
+new data point, denoted by $\x$, the model makes predictions by
 applying the logistic function
 `\[
 \mathrm{f}(z) = \frac{1}{1 + e^{-z}}
 \]`
 where $z = \wv^T \x$.
-By default, if $\mathrm{f}(\wv^T x) > 0.5$, the outcome is positive, or 
negative otherwise.
-For the same reason mentioned above, quite often in practice, this default 
threshold is not a good choice.
-The threshold should be determined via model evaluation.
+By default, if $\mathrm{f}(\wv^T x) > 0.5$, the outcome is positive, or
+negative otherwise, though unlike linear SVMs, the raw output of the logistic 
regression
+model, $\mathrm{f}(z)$, has a probabilistic interpretation (i.e., the 
probability
+that $\x$ is positive).
 
 ### Evaluation metrics
 
-MLlib supports common evaluation metrics for binary classification (not 
available in Python).  This
+MLlib supports common evaluation metrics for binary classification (not 
available in PySpark). 
+This
 includes precision, recall, [F-measure](http://en.wikipedia.org/wiki/F1_score),
 [receiver operating characteristic 
(ROC)](http://en.wikipedia.org/wiki/Receiver_operating_characteristic),
 precision-recall curve, and
 [area under the curves 
(AUC)](http://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve).
-Among the metrics, area under ROC is commonly used to compare models and 
precision/recall/F-measure
-can help determine the threshold to use.
+AUC is commonly used to compare the performance of various models while
+precision/recall/F-measure can help determine the appropriate threshold to use
+for prediction purposes. 
 
 ### Examples
 
@@ -233,8 +239,7 @@ svmAlg.optimizer.
 val modelL1 = svmAlg.run(training)
 {% endhighlight %}
 
-Similarly, you can use replace `SVMWithSGD` by
-[`LogisticRegressionWithSGD`](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithSGD).
+[`LogisticRegressionWithSGD`](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithSGD)
 can be used in a similar fashion as `SVMWithSGD`.
 
 </div>
 
@@ -318,10 +323,11 @@ svmAlg.optimizer()
 final SVMModel modelL1 = svmAlg.run(training.rdd());
 {% endhighlight %}
 
-In order to run the above standalone application using Spark framework make
-sure that you follow the instructions provided at section [Standalone
-Applications](quick-start.html) of the quick-start guide. What is more, you
-should include to your build file *spark-mllib* as a dependency.
+In order to run the above standalone application, follow the instructions
+provided in the [Standalone
+Applications](quick-start.html#standalone-applications) section of the Spark
+quick-start guide. Be sure to also include *spark-mllib* to your build file as
+a dependency.
 </div>
 
 <div data-lang="python" markdown="1">
@@ -354,24 +360,22 @@ print("Training Error = " + str(trainErr))
 
 ## Linear least squares, Lasso, and ridge regression
 
-Linear least squares is a family of linear methods with the loss function in 
formulation
-`$\eqref{eq:regPrimal}$` given by the squared loss
 
+Linear least squares is the most common formulation for regression problems. 
+It is a linear method as described above in equation `$\eqref{eq:regPrimal}$`, 
with the loss
+function in the formulation given by the squared loss:
 `\[
 L(\wv;\x,y) :=  \frac{1}{2} (\wv^T \x - y)^2.
 \]`
 
-Depending on the regularization type, we call the method
-[*ordinary least 
squares*](http://en.wikipedia.org/wiki/Ordinary_least_squares) or simply
-[*linear least 
squares*](http://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)) if 
there
-is no regularization, [*ridge 
regression*](http://en.wikipedia.org/wiki/Ridge_regression) if L2
-regularization is used, and 
[*Lasso*](http://en.wikipedia.org/wiki/Lasso_(statistics)) if L1
-regularization is used.  This average loss $\frac{1}{n} \sum_{i=1}^n (\wv^T 
x_i - y_i)^2$ is also
+Various related regression methods are derived by using different types of 
regularization:
+[*ordinary least 
squares*](http://en.wikipedia.org/wiki/Ordinary_least_squares) or 
+[*linear least 
squares*](http://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)) uses 
+ no regularization; [*ridge 
regression*](http://en.wikipedia.org/wiki/Ridge_regression) uses L2
+regularization; and [*Lasso*](http://en.wikipedia.org/wiki/Lasso_(statistics)) 
uses L1
+regularization.  For all of these models, the average loss or training error, 
$\frac{1}{n} \sum_{i=1}^n (\wv^T x_i - y_i)^2$, is
 known as the [mean squared 
error](http://en.wikipedia.org/wiki/Mean_squared_error).
 
-Note that the squared loss is sensitive to outliers. 
-Regularization or a robust alternative (e.g., $\ell_1$ regression) is usually 
necessary in practice.
-
 ### Examples
 
 <div class="codetabs">
@@ -379,7 +383,7 @@ Regularization or a robust alternative (e.g., $\ell_1$ 
regression) is usually ne
 <div data-lang="scala" markdown="1">
 The following example demonstrate how to load training data, parse it as an 
RDD of LabeledPoint.
 The example then uses LinearRegressionWithSGD to build a simple linear model 
to predict label 
-values. We compute the Mean Squared Error at the end to evaluate
+values. We compute the mean squared error at the end to evaluate
 [goodness of fit](http://en.wikipedia.org/wiki/Goodness_of_fit).
 
 {% highlight scala %}
@@ -407,9 +411,8 @@ val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - 
p), 2)}.mean()
 println("training Mean Squared Error = " + MSE)
 {% endhighlight %}
 
-Similarly you can use
 
[`RidgeRegressionWithSGD`](api/scala/index.html#org.apache.spark.mllib.regression.RidgeRegressionWithSGD)
-and 
[`LassoWithSGD`](api/scala/index.html#org.apache.spark.mllib.regression.LassoWithSGD).
+and 
[`LassoWithSGD`](api/scala/index.html#org.apache.spark.mllib.regression.LassoWithSGD)
 can be used in a similar fashion as `LinearRegressionWithSGD`.
 
 </div>
 
@@ -479,16 +482,17 @@ public class LinearRegression {
 }
 {% endhighlight %}
 
-In order to run the above standalone application using Spark framework make
-sure that you follow the instructions provided at section [Standalone
-Applications](quick-start.html) of the quick-start guide. What is more, you
-should include to your build file *spark-mllib* as a dependency.
+In order to run the above standalone application, follow the instructions
+provided in the [Standalone
+Applications](quick-start.html#standalone-applications) section of the Spark
+quick-start guide. Be sure to also include *spark-mllib* to your build file as
+a dependency.
 </div>
 
 <div data-lang="python" markdown="1">
 The following example demonstrate how to load training data, parse it as an 
RDD of LabeledPoint.
 The example then uses LinearRegressionWithSGD to build a simple linear model 
to predict label 
-values. We compute the Mean Squared Error at the end to evaluate
+values. We compute the mean squared error at the end to evaluate
 [goodness of fit](http://en.wikipedia.org/wiki/Goodness_of_fit).
 
 {% highlight python %}

http://git-wip-us.apache.org/repos/asf/spark/blob/cffd9bb8/docs/mllib-naive-bayes.md
----------------------------------------------------------------------
diff --git a/docs/mllib-naive-bayes.md b/docs/mllib-naive-bayes.md
index b1650c8..86d94ae 100644
--- a/docs/mllib-naive-bayes.md
+++ b/docs/mllib-naive-bayes.md
@@ -4,23 +4,23 @@ title: Naive Bayes - MLlib
 displayTitle: <a href="mllib-guide.html">MLlib</a> - Naive Bayes
 ---
 
-Naive Bayes is a simple multiclass classification algorithm with the 
assumption of independence
-between every pair of features. Naive Bayes can be trained very efficiently. 
Within a single pass to
-the training data, it computes the conditional probability distribution of 
each feature given label,
-and then it applies Bayes' theorem to compute the conditional probability 
distribution of label
-given an observation and use it for prediction. For more details, please visit 
the Wikipedia page
-[Naive Bayes classifier](http://en.wikipedia.org/wiki/Naive_Bayes_classifier).
-
-In MLlib, we implemented multinomial naive Bayes, which is typically used for 
document
-classification. Within that context, each observation is a document, each 
feature represents a term,
-whose value is the frequency of the term. For its formulation, please visit 
the Wikipedia page
-[Multinomial Naive 
Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes)
-or the section
-[Naive Bayes text 
classification](http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html)
-from the book Introduction to Information
-Retrieval. [Additive 
smoothing](http://en.wikipedia.org/wiki/Lidstone_smoothing) can be used by
+[Naive Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) is a simple
+multiclass classification algorithm with the assumption of independence between
+every pair of features. Naive Bayes can be trained very efficiently. Within a
+single pass to the training data, it computes the conditional probability
+distribution of each feature given label, and then it applies Bayes' theorem to
+compute the conditional probability distribution of label given an observation
+and use it for prediction.
+
+MLlib supports [multinomial naive
+Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes),
+which is typically used for [document
+classification](http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html).
+Within that context, each observation is a document and each
+feature represents a term whose value is the frequency of the term. 
+[Additive smoothing](http://en.wikipedia.org/wiki/Lidstone_smoothing) can be 
used by
 setting the parameter $\lambda$ (default to $1.0$). For document 
classification, the input feature
-vectors are usually sparse. Please supply sparse vectors as input to take 
advantage of
+vectors are usually sparse, and sparse vectors should be supplied as input to 
take advantage of
 sparsity. Since the training data is only used once, it is not necessary to 
cache it.
 
 ## Examples

http://git-wip-us.apache.org/repos/asf/spark/blob/cffd9bb8/docs/mllib-stats.md
----------------------------------------------------------------------
diff --git a/docs/mllib-stats.md b/docs/mllib-stats.md
new file mode 100644
index 0000000..ca9ef46
--- /dev/null
+++ b/docs/mllib-stats.md
@@ -0,0 +1,95 @@
+---
+layout: global
+title: Statistics Functionality - MLlib
+displayTitle: <a href="mllib-guide.html">MLlib</a> - Statistics Functionality 
+---
+
+* Table of contents
+{:toc}
+
+
+`\[
+\newcommand{\R}{\mathbb{R}}
+\newcommand{\E}{\mathbb{E}} 
+\newcommand{\x}{\mathbf{x}}
+\newcommand{\y}{\mathbf{y}}
+\newcommand{\wv}{\mathbf{w}}
+\newcommand{\av}{\mathbf{\alpha}}
+\newcommand{\bv}{\mathbf{b}}
+\newcommand{\N}{\mathbb{N}}
+\newcommand{\id}{\mathbf{I}} 
+\newcommand{\ind}{\mathbf{1}} 
+\newcommand{\0}{\mathbf{0}} 
+\newcommand{\unit}{\mathbf{e}} 
+\newcommand{\one}{\mathbf{1}} 
+\newcommand{\zero}{\mathbf{0}}
+\]`
+
+## Data Generators 
+
+## Stratified Sampling 
+
+## Summary Statistics 
+
+### Multivariate summary statistics
+
+We provide column summary statistics for `RowMatrix` (note: this functionality 
is not currently supported in `IndexedRowMatrix` or `CoordinateMatrix`). 
+If the number of columns is not large, e.g., on the order of thousands, then 
the 
+covariance matrix can also be computed as a local matrix, which requires 
$\mathcal{O}(n^2)$ storage where $n$ is the
+number of columns. The total CPU time is $\mathcal{O}(m n^2)$, where $m$ is 
the number of rows,
+and is faster if the rows are sparse.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+[`computeColumnSummaryStatistics()`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix)
 returns an instance of
+[`MultivariateStatisticalSummary`](api/scala/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary),
+which contains the column-wise max, min, mean, variance, and number of 
nonzeros, as well as the
+total count.
+
+{% highlight scala %}
+import org.apache.spark.mllib.linalg.Matrix
+import org.apache.spark.mllib.linalg.distributed.RowMatrix
+import org.apache.spark.mllib.stat.MultivariateStatisticalSummary
+
+val mat: RowMatrix = ... // a RowMatrix
+
+// Compute column summary statistics.
+val summary: MultivariateStatisticalSummary = 
mat.computeColumnSummaryStatistics()
+println(summary.mean) // a dense vector containing the mean value for each 
column
+println(summary.variance) // column-wise variance
+println(summary.numNonzeros) // number of nonzeros in each column
+
+// Compute the covariance matrix.
+val cov: Matrix = mat.computeCovariance()
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+
+[`RowMatrix#computeColumnSummaryStatistics`](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html#computeColumnSummaryStatistics())
 returns an instance of
+[`MultivariateStatisticalSummary`](api/java/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html),
+which contains the column-wise max, min, mean, variance, and number of 
nonzeros, as well as the
+total count.
+
+{% highlight java %}
+import org.apache.spark.mllib.linalg.Matrix;
+import org.apache.spark.mllib.linalg.distributed.RowMatrix;
+import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;
+
+RowMatrix mat = ... // a RowMatrix
+
+// Compute column summary statistics.
+MultivariateStatisticalSummary summary = mat.computeColumnSummaryStatistics();
+System.out.println(summary.mean()); // a dense vector containing the mean 
value for each column
+System.out.println(summary.variance()); // column-wise variance
+System.out.println(summary.numNonzeros()); // number of nonzeros in each column
+
+// Compute the covariance matrix.
+Matrix cov = mat.computeCovariance();
+{% endhighlight %}
+</div>
+</div>
+
+
+## Hypothesis Testing 


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to