Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/2906#discussion_r22641223
--- Diff: docs/mllib-clustering.md ---
@@ -154,6 +156,175 @@ section of the Spark
Quick Start guide. Be sure to also include *spark-mllib* to your build file as
a dependency.
+
+### Hierarchical Clustering
+
+MLlib supports [hierarchical clustering](http://en.wikipedia.org/wiki/Hierarchical_clustering), one of the most commonly used clustering algorithms, which seeks to build a hierarchy of clusters.
+Strategies for hierarchical clustering generally fall into two types.
+One is agglomerative clustering, a "bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
+The other is divisive clustering, a "top-down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
+The MLlib implementation includes only a divisive hierarchical clustering algorithm.
+
+The implementation in MLlib has the following parameters:
+
+* *k* is the maximum number of desired clusters.
+* *subIterations* is the maximum number of iterations used when splitting a cluster into its two sub-clusters.
+* *numRetries* is the maximum number of retries when a split does not work as expected.
+* *epsilon* is the convergence threshold used to decide that the splitting of a cluster has converged.
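+
+Below is a minimal sketch of how these parameters might be configured together.
+Only `train(data, k)` appears in the examples that follow; the setter names here are hypothetical, modeled on other MLlib estimators such as `KMeans`.
+
+{% highlight scala %}
+import org.apache.spark.mllib.clustering.HierarchicalClustering
+
+// Hypothetical builder-style configuration; the setter names are
+// illustrative assumptions, not a confirmed API.
+val algo = new HierarchicalClustering()
+  .setK(10)              // maximum number of desired clusters
+  .setSubIterations(20)  // max iterations when splitting a cluster in two
+  .setNumRetries(10)     // max retries for a failed split
+  .setEpsilon(1e-4)      // convergence threshold for a split
+{% endhighlight %}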
+
+
+
+### Hierarchical Clustering Example
+
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+The following code snippets can be executed in `spark-shell`.
+
+In the following example, after loading and parsing the data, we use the hierarchical clustering object to cluster the sample data into ten clusters.
+The desired number of clusters is passed to the algorithm.
+However, the clustering may stop with fewer than *k* clusters if none of them can be split any further.
+
+{% highlight scala %}
+import org.apache.spark.mllib.clustering.HierarchicalClustering
+import org.apache.spark.mllib.linalg.Vectors
+
+// Load and parse the data
+val data = sc.textFile("data/mllib/sample_hierarchical_data.csv")
+val parsedData = data.map(s => Vectors.dense(s.split(',').map(_.toDouble))).cache()
+
+// Cluster the data into ten clusters using the HierarchicalClustering object
+val numClusters = 10
+val model = HierarchicalClustering.train(parsedData, numClusters)
+println(s"# Clusters: ${model.getClusters().size}")
+
+// Show the cluster centers
+model.getCenters.foreach(println)
+
+// Evaluate clustering by computing the sum of variance of the clusters
+val variance = model.getClusters.map(_.getVariance.get).sum
+println(s"Sum of Variance of the Clusters = ${variance}")
+
+// Cut the cluster tree by height
+val cutModel = model.cut(4.0)
+println(s"# Clusters: ${cutModel.getClusters.size}")
+val cutVariance = cutModel.getClusters.map(_.getVariance.get).sum
+println(s"Sum of Variance of the Clusters = ${cutVariance}")
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+All of MLlib's methods use Java-friendly types, so you can import and call them there the same way you do in Scala.
+The only caveat is that the methods take Scala RDD objects, while the Spark Java API uses a separate `JavaRDD` class.
+You can convert a Java RDD to a Scala one by calling `.rdd()` on your `JavaRDD` object.
+A self-contained application example that is equivalent to the provided example in Scala is given below:
+
+{% highlight java %}
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.function.Function;
+import org.apache.spark.mllib.clustering.HierarchicalClustering;
+import org.apache.spark.mllib.clustering.HierarchicalClusteringModel;
+import org.apache.spark.mllib.linalg.Vector;
+import org.apache.spark.mllib.linalg.Vectors;
+
+public class JavaHierarchicalClustering {
--- End diff --
The other example code I see forgoes a lot of the boilerplate here of declaring a class, a main method, System.out, etc. The indentation here is also significantly deeper than the 2-space indent used in the other code. Addressing these might make the example easier to scan on the web page.