[
https://issues.apache.org/jira/browse/SPARK-4510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14243900#comment-14243900
]
Fan Jiang edited comment on SPARK-4510 at 12/12/14 9:19 AM:
------------------------------------------------------------
[~mengxr]
1)
The PAM K-Medoids complexity is:
O((1 + Beta) * K^2 * (N - K)^2)
where K = number of medoids
N = number of data points
Beta = mean number of successful swaps of data points with existing medoids
This is basically a factor of N greater than K-Means, because the total cost
is recomputed as part of considering every swap. This characteristic of
K-Medoids has been widely discussed in academia, and there have been
comparisons against K-Means as well as proposed improvements. Choosing
between these two algorithms is essentially a tradeoff between runtime cost
and robustness.
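To make the gap concrete, a back-of-the-envelope count (illustrative
numbers, not a benchmark): for N = 1000 points and K = 3, one K-Means
iteration needs roughly N * K = 3,000 distance evaluations, while one PAM
pass considers K * (N - K) ~ 3,000 candidate swaps and recomputes the
O(N * K) total cost for each, i.e. on the order of 10^7 distance
evaluations per pass - the extra factor of N.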
2)
Regarding the performance:
I added a dataset as an example (data/mllib/sample_pam_data.txt). This was
found on a paper comparing K-Means and K-Medoids. You can plot the datapoints
to see the clusters. If user runs
examples/mllib/DenseKMeans.scala
examples/mllib/Kmedoids.scala
on this dataset for K = 3, the correct clustering can be visualized by the
results with a smaller cost. K-Medoids has is more robust in terms of
correctness.
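To compare the two runs on equal footing, here is a minimal sketch of a
cost check. The K-Means calls are the existing MLlib API; the PAM/KMedoids
API is the one proposed in this ticket and does not exist yet, so it
appears only in comments, and names such as CompareClusteringCosts and
pamCost are illustrative:
{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

object CompareClusteringCosts {
  // Euclidean distance between two MLlib vectors.
  def dist(a: Vector, b: Vector): Double =
    math.sqrt(a.toArray.zip(b.toArray).map { case (x, y) => (x - y) * (x - y) }.sum)

  // PAM-style objective: sum of dissimilarities to the nearest center.
  // Scoring both models under this one objective makes the costs comparable.
  def pamCost(data: RDD[Vector], centers: Array[Vector]): Double =
    data.map(p => centers.map(c => dist(p, c)).min).sum()

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CompareClusteringCosts"))
    // Assumes one whitespace-separated point per line, as in sample_pam_data.txt.
    val data = sc.textFile("data/mllib/sample_pam_data.txt")
      .map(line => Vectors.dense(line.trim.split("\\s+").map(_.toDouble)))
      .cache()

    val kmeansModel = KMeans.train(data, 3, 20)   // k = 3, 20 iterations
    println("K-Means result, PAM-style cost: " +
      pamCost(data, kmeansModel.clusterCenters))

    // val pamModel = PAM.train(data, 3)          // proposed API, not in MLlib yet
    // println("K-Medoids result, cost: " + pamCost(data, pamModel.medoids))
    sc.stop()
  }
}
{code}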
The papers we referred to are:
Computational Complexity between K-Means and K-Medoids Clustering Algorithms
for Normal and Uniform Distributions of Data Points (K-Medoids complexity):
http://www.researchgate.net/profile/Dr_Velmurugan_T/publication/47554407_Computational_Complexity_between_K-Means_and_K-Medoids_Clustering_Algorithms_for_Normal_and_Uniform_Distributions_of_Data_Points/links/0fcfd50dc5cd08aa29000000
Article on an efficient K-Medoids method (Springer):
http://link.springer.com/chapter/10.1007%2F3-540-46145-0_7#page-1
Clustering Large Applications (CLARA), used in one of the medoid papers:
http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/CLARA
> Add k-medoids Partitioning Around Medoids (PAM) algorithm
> ---------------------------------------------------------
>
> Key: SPARK-4510
> URL: https://issues.apache.org/jira/browse/SPARK-4510
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Fan Jiang
> Assignee: Fan Jiang
> Labels: features
> Original Estimate: 0h
> Remaining Estimate: 0h
>
> PAM (k-medoids) is more robust to noise and outliers than k-means because
> it minimizes a sum of pairwise dissimilarities instead of a sum of squared
> Euclidean distances. A medoid can be defined as the object of a cluster
> whose average dissimilarity to all the objects in the cluster is minimal,
> i.e. it is the most centrally located point in the cluster.
> The most common realisation of k-medoids clustering is the Partitioning
> Around Medoids (PAM) algorithm, which proceeds as follows (a minimal
> Scala sketch follows the steps):
> 1. Initialize: randomly select (without replacement) k of the n data
>    points as the medoids.
> 2. Associate each data point with the closest medoid ("closest" here is
>    defined using any valid distance metric, most commonly Euclidean,
>    Manhattan or Minkowski distance).
> 3. For each medoid m and each non-medoid data point o, swap m and o and
>    compute the total cost of the configuration.
> 4. Select the configuration with the lowest cost.
> 5. Repeat steps 2 to 4 until there is no change in the medoids.
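> As a reference for the intended semantics, here is a minimal local
> (single-machine) sketch of the loop above; PamSketch and its members are
> illustrative names, not the proposed MLlib code:
> {code:scala}
> import scala.util.Random
>
> object PamSketch {
>   type Point = Array[Double]
>
>   // Step 2's "closest": Euclidean distance as an example metric.
>   def dist(a: Point, b: Point): Double =
>     math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
>
>   // Total cost of a configuration: each point's dissimilarity to its
>   // nearest medoid (the objective evaluated in steps 3 and 4).
>   def cost(points: IndexedSeq[Point], medoids: Seq[Int]): Double =
>     points.map(p => medoids.map(i => dist(p, points(i))).min).sum
>
>   // Steps 1-5; medoids are kept as indices into points.
>   def pam(points: IndexedSeq[Point], k: Int, seed: Long = 42L): Seq[Int] = {
>     // Step 1: random initial medoids, chosen without replacement.
>     var medoids: Seq[Int] = new Random(seed).shuffle(points.indices.toList).take(k)
>     var improved = true
>     while (improved) {                        // Step 5: until no change
>       improved = false
>       var best = cost(points, medoids)
>       for (pos <- 0 until k; o <- points.indices if !medoids.contains(o)) {
>         val swapped = medoids.updated(pos, o) // Step 3: try one swap
>         val c = cost(points, swapped)
>         if (c < best) {                       // Step 4: keep the cheapest
>           best = c
>           medoids = swapped
>           improved = true
>         }
>       }
>     }
>     medoids
>   }
> }
> {code}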
> The new feature for MLlib will contain 5 new files:
> /main/scala/org/apache/spark/mllib/clustering/PAM.scala
> /main/scala/org/apache/spark/mllib/clustering/PAMModel.scala
> /main/scala/org/apache/spark/mllib/clustering/LocalPAM.scala
> /test/scala/org/apache/spark/mllib/clustering/PAMSuite.scala
> /main/scala/org/apache/spark/examples/mllib/KMedoids.scala