[ 
https://issues.apache.org/jira/browse/SPARK-3439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14293526#comment-14293526
 ] 

Muhammad-Ali A'rabi commented on SPARK-3439:
--------------------------------------------

Input is a collection of vectors (an RDD) and two thresholds, T1 and T2. The 
output is a map from vectors to clusters, each of which has a center. It could 
also be just the set of centers, or both.
Vectors are processed sequentially. For each one, the distances to the other 
vectors are calculated (in parallel), and then some of those vectors are mapped 
to the current vector as their cluster center (also in parallel). Very simple.
The code is attached because it is somewhat clearer than the pseudocode. :D
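
For readers without the attachment, the steps described above can be sketched 
roughly as follows. This is not the attached code: it is a minimal 
single-machine sketch in plain Python with no Spark. It assumes the usual canopy 
convention that T1 is the loose threshold and T2 the tight one (T1 >= T2), and 
that the distance is Euclidean. The names canopy and euclidean are hypothetical. 
The two list/dict comprehensions stand in for the steps that would run in 
parallel over the RDD.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy(vectors, t1, t2, distance=euclidean):
    """Return a list of (center, members) pairs.

    Assumption: t1 is the loose threshold and t2 the tight one, 0 <= t2 <= t1.
    """
    if not 0 <= t2 <= t1:
        raise ValueError("expected 0 <= t2 <= t1")

    remaining = list(range(len(vectors)))  # indices still eligible as centers
    canopies = []

    # Vectors are processed sequentially: each pass picks one center.
    while remaining:
        center = vectors[remaining[0]]

        # Distance step (the part that would be a parallel map over the RDD).
        dists = {i: distance(center, vectors[i]) for i in remaining}

        # Mapping step: everything within t1 joins this center's canopy.
        members = [vectors[i] for i in remaining if dists[i] <= t1]
        canopies.append((center, members))

        # Vectors within t2 (including the center itself, at distance 0)
        # are dropped so they cannot become centers later.
        remaining = [i for i in remaining if dists[i] > t2]

    return canopies
```

Calling canopy(points, t1, t2) on a list of tuples returns both the centers and 
the vectors assigned to each, which matches the "just a set of centers, or both" 
output described above. A vector that lies between T2 and T1 of a center stays 
in the pool, so it can belong to more than one canopy.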

> Add Canopy Clustering Algorithm
> -------------------------------
>
>                 Key: SPARK-3439
>                 URL: https://issues.apache.org/jira/browse/SPARK-3439
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Yu Ishikawa
>            Assignee: Muhammad-Ali A'rabi
>            Priority: Minor
>
> The canopy clustering algorithm is an unsupervised pre-clustering algorithm. 
> It is often used as a preprocessing step for the K-means algorithm or the 
> Hierarchical clustering algorithm. It is intended to speed up clustering 
> operations on large data sets, where using another algorithm directly may be 
> impractical due to the size of the data set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)