Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r19288355

--- Diff: docs/mllib-clustering.md ---
@@ -153,3 +157,152 @@ provided in the [Self-Contained Applications](quick-start.html#self-contained-ap
 section of the Spark Quick Start guide. Be sure to also include *spark-mllib* to your build file as a dependency.
+
+
+### Hierarchical Clustering
+
+MLlib supports
+[hierarchical clustering](http://en.wikipedia.org/wiki/Hierarchical_clustering), one of the most commonly used clustering algorithm which seeks to build a hierarchy of clusters.
+Strategies for hierarchical clustering generally fall into two types.
+One is the agglomerative clustering which is a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
+The other is the divisive clustering which is a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
+The MLlib implementation only includes a divisive hierarchical clustering algorithm.
+
+The implementation in MLlib has the following parameters:
+
+* *k* is the number of maximum desired clusters.
+* *subIterations* is the maximum number of iterations to split a cluster to its 2 sub clusters.
+* *numRetries* is the maximum number of retries if a splitting doesn't work as expected.
+* *epsilon* determines the saturate threshold to consider the splitting to have converged.
+
+
+
+### Hierarchical Clustering Example
+
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+The following code snippets can be executed in `spark-shell`.
+
+In the following example after loading and parsing data,
+we use the hierarchical clustering object to cluster the sample data into three clusters.
+The number of desired clusters is passed to the algorithm.
+Hoerver, even though the number of clusters is less than *k* in the middle of the clustering,
--- End diff --

Hoerver -> However, and 'not be splitted' -> 'not be split'