spark git commit: [SPARK-6518][MLLIB][EXAMPLE][DOC] Add example code and user guide for bisecting k-means

2015-12-16 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/branch-1.6 e5b85713d -> e1adf6d7d


[SPARK-6518][MLLIB][EXAMPLE][DOC] Add example code and user guide for bisecting 
k-means

This PR includes only an example code in order to finish it quickly.
I'll send another PR for the docs soon.

Author: Yu ISHIKAWA 

Closes #9952 from yu-iskw/SPARK-6518.

(cherry picked from commit 7b6dc29d0ebbfb3bb941130f8542120b6bc3e234)
Signed-off-by: Joseph K. Bradley 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e1adf6d7
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e1adf6d7
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e1adf6d7

Branch: refs/heads/branch-1.6
Commit: e1adf6d7d1c755fb16a0030e66ce9cff348c3de8
Parents: e5b8571
Author: Yu ISHIKAWA 
Authored: Wed Dec 16 10:55:42 2015 -0800
Committer: Joseph K. Bradley 
Committed: Wed Dec 16 10:55:54 2015 -0800

--
 docs/mllib-clustering.md| 35 ++
 docs/mllib-guide.md |  1 +
 .../mllib/JavaBisectingKMeansExample.java   | 69 
 .../examples/mllib/BisectingKMeansExample.scala | 60 +
 4 files changed, 165 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e1adf6d7/docs/mllib-clustering.md
--
diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md
index 48d64cd..93cd0c1 100644
--- a/docs/mllib-clustering.md
+++ b/docs/mllib-clustering.md
@@ -718,6 +718,41 @@ sameModel = LDAModel.load(sc, "myModelPath")
 
 
 
+## Bisecting k-means
+
+Bisecting K-means can often be much faster than regular K-means, but it will 
generally produce a different clustering.
+
+Bisecting k-means is a kind of [hierarchical 
clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering).
+Hierarchical clustering is one of the most commonly used  method of cluster 
analysis which seeks to build a hierarchy of clusters.
+Strategies for hierarchical clustering generally fall into two types:
+
+- Agglomerative: This is a "bottom up" approach: each observation starts in 
its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
+- Divisive: This is a "top down" approach: all observations start in one 
cluster, and splits are performed recursively as one moves down the hierarchy.
+
+Bisecting k-means algorithm is a kind of divisive algorithms.
+The implementation in MLlib has the following parameters:
+
+* *k*: the desired number of leaf clusters (default: 4). The actual number 
could be smaller if there are no divisible leaf clusters.
+* *maxIterations*: the max number of k-means iterations to split clusters 
(default: 20)
+* *minDivisibleClusterSize*: the minimum number of points (if >= 1.0) or the 
minimum proportion of points (if < 1.0) of a divisible cluster (default: 1)
+* *seed*: a random seed (default: hash value of the class name)
+
+**Examples**
+
+
+
+Refer to the [`BisectingKMeans` Scala 
docs](api/scala/index.html#org.apache.spark.mllib.clustering.BisectingKMeans) 
and [`BisectingKMeansModel` Scala 
docs](api/scala/index.html#org.apache.spark.mllib.clustering.BisectingKMeansModel)
 for details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/mllib/BisectingKMeansExample.scala %}
+
+
+
+Refer to the [`BisectingKMeans` Java 
docs](api/java/org/apache/spark/mllib/clustering/BisectingKMeans.html) and 
[`BisectingKMeansModel` Java 
docs](api/java/org/apache/spark/mllib/clustering/BisectingKMeansModel.html) for 
details on the API.
+
+{% include_example 
java/org/apache/spark/examples/mllib/JavaBisectingKMeansExample.java %}
+
+
+
 ## Streaming k-means
 
 When data arrive in a stream, we may want to estimate clusters dynamically,

http://git-wip-us.apache.org/repos/asf/spark/blob/e1adf6d7/docs/mllib-guide.md
--
diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
index 7fef6b5..680ed48 100644
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@@ -49,6 +49,7 @@ We list major functionality from both below, with links to 
detailed guides.
   * [Gaussian mixture](mllib-clustering.html#gaussian-mixture)
   * [power iteration clustering 
(PIC)](mllib-clustering.html#power-iteration-clustering-pic)
   * [latent Dirichlet allocation 
(LDA)](mllib-clustering.html#latent-dirichlet-allocation-lda)
+  * [bisecting k-means](mllib-clustering.html#bisecting-kmeans)
   * [streaming k-means](mllib-clustering.html#streaming-k-means)
 * [Dimensionality reduction](mllib-dimensionality-reduction.html)
   * [singular value decomposition 
(SVD)](mllib-dimensionality-reduction.html#singular-value-decomposition-svd)

http://git-wip-us.apache.org/repos/asf/spark/blob/e1adf6d7/examples/src/main/java/org/apache

spark git commit: [SPARK-6518][MLLIB][EXAMPLE][DOC] Add example code and user guide for bisecting k-means

2015-12-16 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/master ad8c1f0b8 -> 7b6dc29d0


[SPARK-6518][MLLIB][EXAMPLE][DOC] Add example code and user guide for bisecting 
k-means

This PR includes only an example code in order to finish it quickly.
I'll send another PR for the docs soon.

Author: Yu ISHIKAWA 

Closes #9952 from yu-iskw/SPARK-6518.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7b6dc29d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7b6dc29d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7b6dc29d

Branch: refs/heads/master
Commit: 7b6dc29d0ebbfb3bb941130f8542120b6bc3e234
Parents: ad8c1f0
Author: Yu ISHIKAWA 
Authored: Wed Dec 16 10:55:42 2015 -0800
Committer: Joseph K. Bradley 
Committed: Wed Dec 16 10:55:42 2015 -0800

--
 docs/mllib-clustering.md| 35 ++
 docs/mllib-guide.md |  1 +
 .../mllib/JavaBisectingKMeansExample.java   | 69 
 .../examples/mllib/BisectingKMeansExample.scala | 60 +
 4 files changed, 165 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7b6dc29d/docs/mllib-clustering.md
--
diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md
index 48d64cd..93cd0c1 100644
--- a/docs/mllib-clustering.md
+++ b/docs/mllib-clustering.md
@@ -718,6 +718,41 @@ sameModel = LDAModel.load(sc, "myModelPath")
 
 
 
+## Bisecting k-means
+
+Bisecting K-means can often be much faster than regular K-means, but it will 
generally produce a different clustering.
+
+Bisecting k-means is a kind of [hierarchical 
clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering).
+Hierarchical clustering is one of the most commonly used  method of cluster 
analysis which seeks to build a hierarchy of clusters.
+Strategies for hierarchical clustering generally fall into two types:
+
+- Agglomerative: This is a "bottom up" approach: each observation starts in 
its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
+- Divisive: This is a "top down" approach: all observations start in one 
cluster, and splits are performed recursively as one moves down the hierarchy.
+
+Bisecting k-means algorithm is a kind of divisive algorithms.
+The implementation in MLlib has the following parameters:
+
+* *k*: the desired number of leaf clusters (default: 4). The actual number 
could be smaller if there are no divisible leaf clusters.
+* *maxIterations*: the max number of k-means iterations to split clusters 
(default: 20)
+* *minDivisibleClusterSize*: the minimum number of points (if >= 1.0) or the 
minimum proportion of points (if < 1.0) of a divisible cluster (default: 1)
+* *seed*: a random seed (default: hash value of the class name)
+
+**Examples**
+
+
+
+Refer to the [`BisectingKMeans` Scala 
docs](api/scala/index.html#org.apache.spark.mllib.clustering.BisectingKMeans) 
and [`BisectingKMeansModel` Scala 
docs](api/scala/index.html#org.apache.spark.mllib.clustering.BisectingKMeansModel)
 for details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/mllib/BisectingKMeansExample.scala %}
+
+
+
+Refer to the [`BisectingKMeans` Java 
docs](api/java/org/apache/spark/mllib/clustering/BisectingKMeans.html) and 
[`BisectingKMeansModel` Java 
docs](api/java/org/apache/spark/mllib/clustering/BisectingKMeansModel.html) for 
details on the API.
+
+{% include_example 
java/org/apache/spark/examples/mllib/JavaBisectingKMeansExample.java %}
+
+
+
 ## Streaming k-means
 
 When data arrive in a stream, we may want to estimate clusters dynamically,

http://git-wip-us.apache.org/repos/asf/spark/blob/7b6dc29d/docs/mllib-guide.md
--
diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
index 7fef6b5..680ed48 100644
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@@ -49,6 +49,7 @@ We list major functionality from both below, with links to 
detailed guides.
   * [Gaussian mixture](mllib-clustering.html#gaussian-mixture)
   * [power iteration clustering 
(PIC)](mllib-clustering.html#power-iteration-clustering-pic)
   * [latent Dirichlet allocation 
(LDA)](mllib-clustering.html#latent-dirichlet-allocation-lda)
+  * [bisecting k-means](mllib-clustering.html#bisecting-kmeans)
   * [streaming k-means](mllib-clustering.html#streaming-k-means)
 * [Dimensionality reduction](mllib-dimensionality-reduction.html)
   * [singular value decomposition 
(SVD)](mllib-dimensionality-reduction.html#singular-value-decomposition-svd)

http://git-wip-us.apache.org/repos/asf/spark/blob/7b6dc29d/examples/src/main/java/org/apache/spark/examples/mllib/JavaBisectingKMeansExample.java
--