[ https://issues.apache.org/jira/browse/SPARK-4510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14243900#comment-14243900 ]

Fan Jiang edited comment on SPARK-4510 at 12/12/14 9:20 AM:
------------------------------------------------------------

[~mengxr]

1)
The complexity of the PAM K-Medoids algorithm is:

O( (1 + Beta) * K^2 * (N - K)^2 )

where K    = number of medoids
      N    = number of data points
      Beta = mean number of successful swaps of data points with existing medoids

This is basically a factor of N greater than K-Means, because the total 
clustering cost has to be recomputed for every candidate swap that is 
considered. This characteristic of K-Medoids has been widely discussed in 
academia, and several comparisons against K-Means, along with suggested 
improvements, have been published. Choosing between these two algorithms is 
essentially a tradeoff between clustering quality and computational cost.
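As an illustrative back-of-the-envelope comparison (the numbers here are 
ours, not from the papers): with N = 1000 points and K = 3, the formula 
above gives on the order of K^2 * (N - K)^2 ~= 9 * 10^6 distance operations, 
while a single K-Means assignment pass over the same data needs only 
N * K = 3000 distance computations, i.e. a gap of roughly a factor of N.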

2)
Regarding the performance:
I added a dataset as an example (data/mllib/sample_pam_data.txt). It comes 
from a paper comparing K-Means and K-Medoids; plotting the data points makes 
the clusters visible. If a user runs

examples/mllib/DenseKMeans.scala
examples/mllib/KMedoids.scala

on this dataset with K = 3, the correct clustering is the one produced with 
the smaller cost, which can be confirmed by plotting the results. K-Medoids 
is more robust in terms of correctness.
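As a rough illustration of how the two results can be compared on cost, here 
is a minimal sketch of a cost function. clusteringCost and dist are 
hypothetical helpers, not part of the proposed API, and Euclidean distance 
is assumed:

{code:scala}
import org.apache.spark.SparkContext._   // sum() on RDD[Double] in older Spark
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Plain Euclidean distance between two MLlib vectors.
def dist(a: Vector, b: Vector): Double = {
  val (x, y) = (a.toArray, b.toArray)
  math.sqrt(x.zip(y).map { case (u, v) => (u - v) * (u - v) }.sum)
}

// Total clustering cost: the sum over all points of the distance to the
// nearest center (for K-Means) or medoid (for K-Medoids). Whichever run
// yields the smaller value found the better clustering on this dataset.
def clusteringCost(data: RDD[Vector], centers: Array[Vector]): Double =
  data.map(p => centers.map(c => dist(p, c)).min).sum()
{code}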


The papers we referred to are:

K-Medoids complexity: "Computational Complexity between K-Means and 
K-Medoids Clustering Algorithms for Normal and Uniform Distributions of 
Data Points"
http://www.researchgate.net/profile/Dr_Velmurugan_T/publication/47554407_Computational_Complexity_between_K-Means_and_K-Medoids_Clustering_Algorithms_for_Normal_and_Uniform_Distributions_of_Data_Points/links/0fcfd50dc5cd08aa29000000

Efficient K-Medoids (Springer):
http://link.springer.com/chapter/10.1007%2F3-540-46145-0_7#page-1

CLARA (Clustering LARge Applications), used in one of the medoid papers:
http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/CLARA




> Add k-medoids Partitioning Around Medoids (PAM) algorithm
> ---------------------------------------------------------
>
>                 Key: SPARK-4510
>                 URL: https://issues.apache.org/jira/browse/SPARK-4510
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Fan Jiang
>            Assignee: Fan Jiang
>              Labels: features
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> PAM (k-medoids) is more robust to noise and outliers than k-means because 
> it minimizes a sum of pairwise dissimilarities instead of a sum of squared 
> Euclidean distances. A medoid can be defined as the object of a cluster 
> whose average dissimilarity to all the objects in the cluster is minimal, 
> i.e. it is the most centrally located point in the cluster.
> The most common realisation of k-medoid clustering is the Partitioning 
> Around Medoids (PAM) algorithm, which is as follows (a minimal Scala 
> sketch of these steps is given after the list):
> 1. Initialize: randomly select (without replacement) k of the n data 
>    points as the medoids.
> 2. Associate each data point to the closest medoid. ("closest" here is 
>    defined using any valid distance metric, most commonly Euclidean 
>    distance, Manhattan distance or Minkowski distance.)
> 3. For each medoid m and each non-medoid data point o: swap m and o and 
>    compute the total cost of the configuration.
> 4. Select the configuration with the lowest cost.
> 5. Repeat steps 2 to 4 until there is no change in the medoids.
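>
> As illustration only, here is a minimal local (single-machine) sketch of 
> these steps; it is not the proposed MLlib implementation, and the names 
> pam and totalCost are hypothetical. It uses a simple greedy variant of the 
> swap search: each pass tries every (medoid, non-medoid) swap, keeps any 
> swap that lowers the total cost, and repeats until no swap helps.
> {code:scala}
> // Total cost of a configuration: sum of each point's distance to its
> // nearest medoid (the cost of step 3).
> def totalCost(points: Array[Array[Double]],
>               medoids: Array[Array[Double]],
>               dist: (Array[Double], Array[Double]) => Double): Double =
>   points.map(p => medoids.map(m => dist(p, m)).min).sum
>
> def pam(points: Array[Array[Double]],
>         k: Int,
>         dist: (Array[Double], Array[Double]) => Double): Array[Array[Double]] = {
>   val rnd = new scala.util.Random(42)
>   // Step 1: randomly select k of the n points as the initial medoids.
>   var medoids = rnd.shuffle(points.toSeq).take(k).toArray
>   var improved = true
>   while (improved) {                    // Step 5: repeat until no change.
>     improved = false
>     var bestCost = totalCost(points, medoids, dist)
>     // Steps 2-4: consider swapping each medoid with each non-medoid and
>     // keep the configuration if it lowers the cost. (Reference comparison
>     // in contains() is fine: medoids always hold elements of points.)
>     for (i <- medoids.indices; o <- points if !medoids.contains(o)) {
>       val candidate = medoids.updated(i, o)
>       val cost = totalCost(points, candidate, dist)
>       if (cost < bestCost) {
>         bestCost = cost; medoids = candidate; improved = true
>       }
>     }
>   }
>   medoids
> }
> {code}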
> The new feature for MLlib will contain 5 new files:
> /main/scala/org/apache/spark/mllib/clustering/PAM.scala
> /main/scala/org/apache/spark/mllib/clustering/PAMModel.scala
> /main/scala/org/apache/spark/mllib/clustering/LocalPAM.scala
> /test/scala/org/apache/spark/mllib/clustering/PAMSuite.scala
> /main/scala/org/apache/spark/examples/mllib/KMedoids.scala


