Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/2906#discussion_r22641223
--- Diff: docs/mllib-clustering.md ---
@@ -154,6 +156,175 @@ section of the Spark
Quick Start guide. Be sure to also include *spark-mllib* to your build file as
a dependency.
+
+### Hierarchical Clustering
+
+MLlib supports [hierarchical clustering](http://en.wikipedia.org/wiki/Hierarchical_clustering), one of the most commonly used clustering algorithms, which seeks to build a hierarchy of clusters.
+Strategies for hierarchical clustering generally fall into two types.
+One is agglomerative clustering, a "bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
+The other is divisive clustering, a "top-down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
+The MLlib implementation includes only a divisive hierarchical clustering algorithm.
+
+The implementation in MLlib has the following parameters:
+
+* *k* is the maximum number of desired clusters.
+* *subIterations* is the maximum number of iterations used when splitting a cluster into its two sub-clusters.
+* *numRetries* is the maximum number of retries when a split does not work as expected.
+* *epsilon* is the convergence threshold used to decide that the splitting of a cluster has converged.
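+
+Below is a minimal sketch of how these parameters might be configured together.
+Only `train(data, k)` appears in the examples that follow; the setter names here are hypothetical, modeled on other MLlib estimators such as `KMeans`.
+
+{% highlight scala %}
+import org.apache.spark.mllib.clustering.HierarchicalClustering
+
+// Hypothetical builder-style configuration; the setter names are
+// illustrative assumptions, not a confirmed API.
+val algo = new HierarchicalClustering()
+  .setK(10)              // maximum number of desired clusters
+  .setSubIterations(20)  // max iterations when splitting a cluster in two
+  .setNumRetries(10)     // max retries for a failed split
+  .setEpsilon(1e-4)      // convergence threshold for a split
+{% endhighlight %}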
+
+
+
+### Hierarchical Clustering Example
+
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+The following code snippets can be executed in `spark-shell`.
+
+In the following example, after loading and parsing the data, we use the hierarchical clustering object to cluster the sample data into ten clusters.
+The desired number of clusters is passed to the algorithm.
+However, the clustering may stop with fewer than *k* clusters if none of them can be split any further.
+
+{% highlight scala %}
+import org.apache.spark.mllib.clustering.HierarchicalClustering
+import org.apache.spark.mllib.linalg.Vectors
+
+// Load and parse the data
+val data = sc.textFile("data/mllib/sample_hierarchical_data.csv")
+val parsedData = data.map(s => Vectors.dense(s.split(',').map(_.toDouble))).cache()
+
+// Cluster the data into ten clusters using the HierarchicalClustering object
+val numClusters = 10
+val model = HierarchicalClustering.train(parsedData, numClusters)
+println(s"# Clusters: ${model.getClusters().size}")
+
+// Show the cluster centers
+model.getCenters.foreach(println)
+
+// Evaluate clustering by computing the sum of variance of the clusters
+val variance = model.getClusters.map(_.getVariance.get).sum
+println(s"Sum of Variance of the Clusters = ${variance}")
+
+// Cut the cluster tree by height
+val cutModel = model.cut(4.0)
+println(s"# Clusters: ${cutModel.getClusters.size}")
+val cutVariance = cutModel.getClusters.map(_.getVariance.get).sum
+println(s"Sum of Variance of the Clusters = ${cutVariance}")
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+All of MLlib's methods use Java-friendly types, so you can import and call them there the same way you do in Scala.
+The only caveat is that the methods take Scala RDD objects, while the Spark Java API uses a separate `JavaRDD` class.
+You can convert a Java RDD to a Scala one by calling `.rdd()` on your `JavaRDD` object.
+A self-contained application example that is equivalent to the provided example in Scala is given below:
+
+{% highlight java %}
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.function.Function;
+import org.apache.spark.mllib.clustering.HierarchicalClustering;
+import org.apache.spark.mllib.clustering.HierarchicalClusteringModel;
+import org.apache.spark.mllib.linalg.Vector;
+import org.apache.spark.mllib.linalg.Vectors;
+
+public class JavaHierarchicalClustering {
--- End diff --
The other example code I see forgoes a lot of the boilerplate here of declaring a class, a main method, System.out, etc. The indentation here is also significantly deeper than the 2-space indent used in the other code. Addressing these might make the example easier to scan on the web page.