[
https://issues.apache.org/jira/browse/SPARK-4510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14247389#comment-14247389
]
Fan Jiang commented on SPARK-4510:
----------------------------------
We propose include a user configurable parameter that adjusts the ratio of
Swaps attempted in the innermost loop.
Call "Alpha" the ratio of #Data Points Recalculated / N. Under the standard
K-Medoids Alpha would be 1.0 (default value as it is now) We propose to allow
the Alpha to be user configurable between 0.0 (no recalculations ) to 1.0
(full recalculations).
The effect of this sampling:
a. Consider if the K-medoids were run against 10 million points. If we set the
sampling Alpha to 0.01 then the inner loop would be reduced to 100000 points
and naturally the complexity is reduced by two orders of magnitude.
b. The tradeoff is that the final answer would not in general be the absolute
optimal one.
> Add k-medoids Partitioning Around Medoids (PAM) algorithm
> ---------------------------------------------------------
>
> Key: SPARK-4510
> URL: https://issues.apache.org/jira/browse/SPARK-4510
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Fan Jiang
> Assignee: Fan Jiang
> Labels: features
> Original Estimate: 0h
> Remaining Estimate: 0h
>
> PAM (k-medoids) is more robust to noise and outliers as compared to k-means
> because it minimizes a sum of pairwise dissimilarities instead of a sum of
> squared Euclidean distances. A medoid can be defined as the object of a
> cluster, whose average dissimilarity to all the objects in the cluster is
> minimal i.e. it is a most centrally located point in the cluster.
> The most common realisation of k-medoid clustering is the Partitioning Around
> Medoids (PAM) algorithm and is as follows:
> Initialize: randomly select (without replacement) k of the n data points as
> the medoids
> Associate each data point to the closest medoid. ("closest" here is defined
> using any valid distance metric, most commonly Euclidean distance, Manhattan
> distance or Minkowski distance)
> For each medoid m
> For each non-medoid data point o
> Swap m and o and compute the total cost of the configuration
> Select the configuration with the lowest cost.
> Repeat steps 2 to 4 until there is no change in the medoid.
> The new feature for MLlib will contain 5 new files
> /main/scala/org/apache/spark/mllib/clustering/PAM.scala
> /main/scala/org/apache/spark/mllib/clustering/PAMModel.scala
> /main/scala/org/apache/spark/mllib/clustering/LocalPAM.scala
> /test/scala/org/apache/spark/mllib/clustering/PAMSuite.scala
> /main/scala/org/apache/spark/examples/mllib/KMedoids.scala
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]