Repository: spark
Updated Branches:
  refs/heads/master 2e0c08452 -> cc56c8729


[SPARK-5806] re-organize sections in mllib-clustering.md

Put example code close to the algorithm description.

Author: Xiangrui Meng <m...@databricks.com>

Closes #4598 from mengxr/SPARK-5806 and squashes the following commits:

a137872 [Xiangrui Meng] re-organize sections in mllib-clustering.md


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cc56c872
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cc56c872
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cc56c872

Branch: refs/heads/master
Commit: cc56c8729a76af85aa6eb5d2f99787cca5e5b38f
Parents: 2e0c084
Author: Xiangrui Meng <m...@databricks.com>
Authored: Fri Feb 13 15:09:27 2015 -0800
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Fri Feb 13 15:09:27 2015 -0800

----------------------------------------------------------------------
 docs/mllib-clustering.md | 156 +++++++++++++++++++-----------------------
 docs/mllib-guide.md      |   8 ++-
 2 files changed, 77 insertions(+), 87 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/cc56c872/docs/mllib-clustering.md
----------------------------------------------------------------------
diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md
index 0450e6f..09b5657 100644
--- a/docs/mllib-clustering.md
+++ b/docs/mllib-clustering.md
@@ -4,12 +4,6 @@ title: Clustering - MLlib
 displayTitle: <a href="mllib-guide.html">MLlib</a> - Clustering
 ---
 
-* Table of contents
-{:toc}
-
-
-## Clustering
-
 Clustering is an unsupervised learning problem whereby we aim to group subsets
 of entities with one another based on some notion of similarity.  Clustering is
 often used for exploratory analysis and/or as a component of a hierarchical
@@ -18,7 +12,10 @@ models are trained for each cluster).
 
 MLlib supports the following models:
 
-### k-means
+* Table of contents
+{:toc}
+
+## K-means
 
 [k-means](http://en.wikipedia.org/wiki/K-means_clustering) is one of the
 most commonly used clustering algorithms that clusters the data points into a
@@ -37,72 +34,7 @@ a given dataset, the algorithm returns the best clustering 
result).
 * *initializationSteps* determines the number of steps in the k-means\|\| 
algorithm.
 * *epsilon* determines the distance threshold within which we consider k-means 
to have converged.
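The roles of *k*, *maxIterations*, and *epsilon* can be illustrated with a minimal plain-Python sketch of Lloyd's algorithm. This is illustrative only: it runs on a local list rather than an RDD, and it uses naive random initialization instead of the k-means\|\| scheme; all names are made up for the example.

```python
import random

def kmeans(points, k, max_iterations=100, epsilon=1e-4, seed=0):
    """Toy Lloyd's k-means on a list of tuples; MLlib distributes this over an RDD."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # naive init; MLlib defaults to k-means||
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        # Update step: recompute each center as the mean of its cluster.
        new_centers = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
                       for i, cl in enumerate(clusters)]
        # Converged when no center moved farther than epsilon.
        shift = max(sum((a - b) ** 2 for a, b in zip(c0, c1)) ** 0.5
                    for c0, c1 in zip(centers, new_centers))
        centers = new_centers
        if shift < epsilon:
            break
    return centers
```

Running several times with different seeds and keeping the best result is essentially what the *runs* parameter automates.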
 
-### Gaussian mixture
-
-A [Gaussian Mixture 
Model](http://en.wikipedia.org/wiki/Mixture_model#Multivariate_Gaussian_mixture_model)
-represents a composite distribution whereby points are drawn from one of *k* 
Gaussian sub-distributions,
-each with its own probability.  The MLlib implementation uses the
-[expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
- algorithm to induce the maximum-likelihood model given a set of samples.  The 
implementation
-has the following parameters:
-
-* *k* is the number of desired clusters.
-* *convergenceTol* is the maximum change in log-likelihood at which we 
consider convergence achieved.
-* *maxIterations* is the maximum number of iterations to perform without 
reaching convergence.
-* *initialModel* is an optional starting point from which to start the EM 
algorithm. If this parameter is omitted, a random starting point will be 
constructed from the data.
-
-### Power Iteration Clustering
-
-Power iteration clustering is a scalable and efficient algorithm for 
clustering points given pointwise mutual affinity values.  Internally the 
algorithm:
-
-* accepts a [Graph](api/graphx/index.html#org.apache.spark.graphx.Graph) that 
represents a  normalized pairwise affinity between all input points.
-* calculates the principal eigenvalue and eigenvector
-* Clusters each of the input points according to their principal eigenvector 
component value
-
-Details of this algorithm are found within [Power Iteration Clustering, Lin 
and Cohen]{www.icml2010.org/papers/387.pdf}
-
-Example outputs for a dataset inspired by the paper - but with five clusters 
instead of three- have he following output from our implementation:
-
-<p style="text-align: center;">
-  <img src="img/PIClusteringFiveCirclesInputsAndOutputs.png"
-       title="The Property Graph"
-       alt="The Property Graph"
-       width="50%" />
-  <!-- Images are downsized intentionally to improve quality on retina 
displays -->
-</p>
-
-### Latent Dirichlet Allocation (LDA)
-
-[Latent Dirichlet Allocation 
(LDA)](http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)
-is a topic model which infers topics from a collection of text documents.
-LDA can be thought of as a clustering algorithm as follows:
-
-* Topics correspond to cluster centers, and documents correspond to examples 
(rows) in a dataset.
-* Topics and documents both exist in a feature space, where feature vectors 
are vectors of word counts.
-* Rather than estimating a clustering using a traditional distance, LDA uses a 
function based
- on a statistical model of how text documents are generated.
-
-LDA takes in a collection of documents as vectors of word counts.
-It learns clustering using 
[expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
-on the likelihood function. After fitting on the documents, LDA provides:
-
-* Topics: Inferred topics, each of which is a probability distribution over 
terms (words).
-* Topic distributions for documents: For each document in the training set, 
LDA gives a probability distribution over topics.
-
-LDA takes the following parameters:
-
-* `k`: Number of topics (i.e., cluster centers)
-* `maxIterations`: Limit on the number of iterations of EM used for learning
-* `docConcentration`: Hyperparameter for prior over documents' distributions 
over topics. Currently must be > 1, where larger values encourage smoother 
inferred distributions.
-* `topicConcentration`: Hyperparameter for prior over topics' distributions 
over terms (words). Currently must be > 1, where larger values encourage 
smoother inferred distributions.
-* `checkpointInterval`: If using checkpointing (set in the Spark 
configuration), this parameter specifies the frequency with which checkpoints 
will be created.  If `maxIterations` is large, using checkpointing can help 
reduce shuffle file sizes on disk and help with failure recovery.
-
-*Note*: LDA is a new feature with some missing functionality.  In particular, 
it does not yet
-support prediction on new documents, and it does not have a Python API.  These 
will be added in the future.
-
-### Examples
-
-#### k-means
+**Examples**
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
@@ -216,7 +148,21 @@ print("Within Set Sum of Squared Error = " + str(WSSSE))
 
 </div>
 
-#### GaussianMixture
+## Gaussian mixture
+
+A [Gaussian Mixture 
Model](http://en.wikipedia.org/wiki/Mixture_model#Multivariate_Gaussian_mixture_model)
+represents a composite distribution whereby points are drawn from one of *k* 
Gaussian sub-distributions,
+each with its own probability.  The MLlib implementation uses the
+[expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
+ algorithm to induce the maximum-likelihood model given a set of samples.  The 
implementation
+has the following parameters:
+
+* *k* is the number of desired clusters.
+* *convergenceTol* is the maximum change in log-likelihood at which we 
consider convergence achieved.
+* *maxIterations* is the maximum number of iterations to perform without 
reaching convergence.
+* *initialModel* is an optional starting point from which to start the EM 
algorithm. If this parameter is omitted, a random starting point will be 
constructed from the data.
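The interplay of *k*, *maxIterations*, and *convergenceTol* can be sketched with a toy one-dimensional EM loop. This is not the MLlib implementation (which is multivariate and distributed); it is a hand-rolled illustration with made-up names and a crude initialization standing in for *initialModel*.

```python
import math

def em_gmm_1d(data, k=2, max_iterations=100, convergence_tol=1e-3):
    """Toy 1-D EM for a Gaussian mixture; assumes well-separated data."""
    n = len(data)
    # Crude starting point: equal weights, evenly spread means, unit variances.
    weights = [1.0 / k] * k
    means = [min(data) + (i + 0.5) * (max(data) - min(data)) / k for i in range(k)]
    variances = [1.0] * k
    prev_ll = float("-inf")
    for _ in range(max_iterations):
        # E-step: responsibility of each component for each point.
        resp, ll = [], 0.0
        for x in data:
            dens = [w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            ll += math.log(total)
            resp.append([d / total for d in dens])
        # Convergence test mirrors convergenceTol: change in log-likelihood.
        if abs(ll - prev_ll) < convergence_tol:
            break
        prev_ll = ll
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = (sum(r[j] * (x - means[j]) ** 2
                                for r, x in zip(resp, data)) / nj + 1e-9)
    return weights, means, variances
```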
+
+**Examples**
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
@@ -322,7 +268,56 @@ for i in range(2):
 
 </div>
 
-#### Latent Dirichlet Allocation (LDA) Example
+## Power iteration clustering (PIC)
+
+Power iteration clustering (PIC) is a scalable and efficient algorithm for 
clustering points given pointwise mutual affinity values.  Internally the 
algorithm:
+
+* accepts a [Graph](api/graphx/index.html#org.apache.spark.graphx.Graph) that 
holds the normalized pairwise affinities between all input points;
+* calculates the principal eigenvalue and eigenvector;
+* clusters each of the input points according to its principal eigenvector 
component value.
+
+Details of this algorithm are described in [Power Iteration Clustering, Lin 
and Cohen](http://www.icml2010.org/papers/387.pdf).
+
+For a dataset inspired by the paper, but with five clusters instead of three, 
our implementation produces the following output:
+
+<p style="text-align: center;">
+  <img src="img/PIClusteringFiveCirclesInputsAndOutputs.png"
+       title="The Property Graph"
+       alt="The Property Graph"
+       width="50%" />
+  <!-- Images are downsized intentionally to improve quality on retina 
displays -->
+</p>
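The steps above can be sketched in plain Python on a small dense affinity matrix. This is only a local illustration of the idea, not the MLlib API (which runs on a distributed GraphX graph and clusters the resulting embedding with k-means); the two-cluster median split and all names are assumptions of the sketch.

```python
def power_iteration_split(affinity, iterations=50):
    """Toy PIC for two balanced clusters on a dense affinity matrix."""
    n = len(affinity)
    # Row-normalize the affinities into a stochastic matrix W.
    w = [[a / sum(row) for a in row] for row in affinity]
    # Start from an arbitrary non-uniform vector so the iteration has
    # structure to amplify (the uniform vector is already stationary).
    v = [(i + 1) / (n * (n + 1) / 2) for i in range(n)]
    for _ in range(iterations):
        v = [sum(w[i][j] * v[j] for j in range(n)) for i in range(n)]
        s = sum(abs(x) for x in v)
        v = [x / s for x in v]
    # Stop well before full convergence: within-cluster components have
    # smoothed out while the between-cluster gap remains visible.
    # Split at the median gap (assumes two balanced clusters; MLlib
    # instead runs k-means on these component values).
    order = sorted(v)
    cut = (order[n // 2 - 1] + order[n // 2]) / 2
    return [0 if x < cut else 1 for x in v]
```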
+
+## Latent Dirichlet allocation (LDA)
+
+[Latent Dirichlet allocation 
(LDA)](http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)
+is a topic model which infers topics from a collection of text documents.
+LDA can be thought of as a clustering algorithm as follows:
+
+* Topics correspond to cluster centers, and documents correspond to examples 
(rows) in a dataset.
+* Topics and documents both exist in a feature space, where feature vectors 
are vectors of word counts.
+* Rather than estimating a clustering using a traditional distance, LDA uses a 
function based
+ on a statistical model of how text documents are generated.
+
+LDA takes in a collection of documents as vectors of word counts.
+It learns a clustering using 
[expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
+on the likelihood function. After fitting on the documents, LDA provides:
+
+* Topics: Inferred topics, each of which is a probability distribution over 
terms (words).
+* Topic distributions for documents: For each document in the training set, 
LDA gives a probability distribution over topics.
+
+LDA takes the following parameters:
+
+* `k`: Number of topics (i.e., cluster centers)
+* `maxIterations`: Limit on the number of iterations of EM used for learning
+* `docConcentration`: Hyperparameter for prior over documents' distributions 
over topics. Currently must be > 1, where larger values encourage smoother 
inferred distributions.
+* `topicConcentration`: Hyperparameter for prior over topics' distributions 
over terms (words). Currently must be > 1, where larger values encourage 
smoother inferred distributions.
+* `checkpointInterval`: If using checkpointing (set in the Spark 
configuration), this parameter specifies the frequency with which checkpoints 
will be created.  If `maxIterations` is large, using checkpointing can help 
reduce shuffle file sizes on disk and help with failure recovery.
+
+*Note*: LDA is a new feature with some missing functionality.  In particular, 
it does not yet
+support prediction on new documents, and it does not have a Python API.  These 
will be added in the future.
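The expected input shape, documents as vectors of word counts over a shared vocabulary, can be shown with a small hand-rolled sketch (the function name is made up; MLlib expects an RDD of such vectors paired with document IDs):

```python
def count_vectors(documents):
    """Toy conversion of tokenized documents into word-count vectors
    over a shared, sorted vocabulary -- one vector per document."""
    vocab = sorted({word for doc in documents for word in doc})
    index = {word: i for i, word in enumerate(vocab)}
    vectors = []
    for doc in documents:
        v = [0] * len(vocab)
        for word in doc:
            v[index[word]] += 1  # bump the count for this word's slot
        vectors.append(v)
    return vocab, vectors
```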
+
+**Examples**
 
 In the following example, we load word count vectors representing a corpus of 
documents.
 We then use [LDA](api/scala/index.html#org.apache.spark.mllib.clustering.LDA)
@@ -419,14 +414,7 @@ public class JavaLDAExample {
 
 </div>
 
-
-In order to run the above application, follow the instructions
-provided in the [Self-Contained 
Applications](quick-start.html#self-contained-applications)
-section of the Spark
-Quick Start guide. Be sure to also include *spark-mllib* to your build file as
-a dependency.
-
-## Streaming clustering
+## Streaming k-means
 
 When data arrive in a stream, we may want to estimate clusters dynamically,
 updating them as new data arrive. MLlib provides support for streaming k-means 
clustering,
@@ -454,7 +442,7 @@ at time `t`, its contribution by time `t + halfLife` will 
have dropped to 0.5.
 The unit of time can be specified either as `batches` or `points` and the 
update rule
 will be adjusted accordingly.
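The half-life convention can be checked numerically: a per-unit decay factor of `0.5 ** (1 / halfLife)` makes a point's contribution fall to exactly one half after `halfLife` units of time. A tiny sketch (plain Python, not the MLlib API; names are illustrative):

```python
def contribution_after(units, half_life):
    """Remaining weight of a point after `units` batches (or points),
    assuming exponential decay with the given half-life."""
    decay = 0.5 ** (1.0 / half_life)  # applied once per unit of time
    return decay ** units
```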
 
-### Examples
+**Examples**
 
 This example shows how to estimate clusters on streaming data.
 

http://git-wip-us.apache.org/repos/asf/spark/blob/cc56c872/docs/mllib-guide.md
----------------------------------------------------------------------
diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
index 3d32d03..5091dbf 100644
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@@ -24,9 +24,11 @@ filtering, dimensionality reduction, as well as underlying 
optimization primitiv
 * [Collaborative filtering](mllib-collaborative-filtering.html)
   * alternating least squares (ALS)
 * [Clustering](mllib-clustering.html)
-  * k-means
-  * Gaussian mixture
-  * power iteration
+  * [k-means](mllib-clustering.html#k-means)
+  * [Gaussian mixture](mllib-clustering.html#gaussian-mixture)
+  * [power iteration clustering 
(PIC)](mllib-clustering.html#power-iteration-clustering-pic)
+  * [latent Dirichlet allocation 
(LDA)](mllib-clustering.html#latent-dirichlet-allocation-lda)
+  * [streaming k-means](mllib-clustering.html#streaming-k-means)
 * [Dimensionality reduction](mllib-dimensionality-reduction.html)
   * singular value decomposition (SVD)
   * principal component analysis (PCA)

