[ https://issues.apache.org/jira/browse/SPARK-4510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14243900#comment-14243900 ]
Fan Jiang commented on SPARK-4510: ---------------------------------- [~mengxr] 1) The PAM K-Medoids complexity is: O( (1 + Beta) * K^2 * (N - K)^2) where K = Number of Medoids N = Number of datapoints Beta = Mean # of successful swaps of DataPoints with existing Medoids This is basically a factor of N greater than K-Means - due to recalculation of the Costs as part of considering every swap. This characteristic of K-Medoids has been widely discussed in the academia. And there have been some comparisons (vs K-Means) and possible improvement suggestions made. Seems like it's a tradeoff between performance and complexity when choosing from these 2 algorithms. 2) Regarding the performance: I added a dataset as an example (data/mllib/sample_pam_data.txt). This was found on a paper comparing K-Means and K-Medoids. You can plot the datapoints to see the clusters. If user runs examples/mllib/DenseKMeans.scala examples/mllib/Kmedoids.scala on this dataset for K = 3, the correct clustering can be visualized by the results with a smaller cost. K-Medoids has is more robust in terms of correctness. The papers we referred to are: Article KMedoids complexity http://www.researchgate.net/profile/Dr_Velmurugan_T/publication/47554407_Computational_Complexity_between_K-Means_and_K-Medoids_Clustering_Algorithms_for_Normal_and_Uniform_Distributions_of_Data_Points/links/0fcfd50dc5cd08aa29000000 Article on efficient K-Medoids (Springer) http://link.springer.com/chapter/10.1007%2F3-540-46145-0_7#page-1 Clustering for Large Applications CLARA (used in one of Medoid papers) http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/CLARA > Add k-medoids Partitioning Around Medoids (PAM) algorithm > --------------------------------------------------------- > > Key: SPARK-4510 > URL: https://issues.apache.org/jira/browse/SPARK-4510 > Project: Spark > Issue Type: New Feature > Components: MLlib > Reporter: Fan Jiang > Assignee: Fan Jiang > Labels: features > Original Estimate: 0h > Remaining Estimate: 0h > > PAM (k-medoids) is more robust to noise and outliers as compared to k-means > because it minimizes a sum of pairwise dissimilarities instead of a sum of > squared Euclidean distances. A medoid can be defined as the object of a > cluster, whose average dissimilarity to all the objects in the cluster is > minimal i.e. it is a most centrally located point in the cluster. > The most common realisation of k-medoid clustering is the Partitioning Around > Medoids (PAM) algorithm and is as follows: > Initialize: randomly select (without replacement) k of the n data points as > the medoids > Associate each data point to the closest medoid. ("closest" here is defined > using any valid distance metric, most commonly Euclidean distance, Manhattan > distance or Minkowski distance) > For each medoid m > For each non-medoid data point o > Swap m and o and compute the total cost of the configuration > Select the configuration with the lowest cost. > Repeat steps 2 to 4 until there is no change in the medoid. > The new feature for MLlib will contain 5 new files > /main/scala/org/apache/spark/mllib/clustering/PAM.scala > /main/scala/org/apache/spark/mllib/clustering/PAMModel.scala > /main/scala/org/apache/spark/mllib/clustering/LocalPAM.scala > /test/scala/org/apache/spark/mllib/clustering/PAMSuite.scala > /main/scala/org/apache/spark/examples/mllib/KMedoids.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org