[jira] [Comment Edited] (SPARK-3439) Add Canopy Clustering Algorithm

Muhammad-Ali A'rabi (JIRA) Sat, 24 Jan 2015 06:43:39 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-3439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14290634#comment-14290634
 ]


Muhammad-Ali A'rabi edited comment on SPARK-3439 at 1/24/15 2:41 PM:
---------------------------------------------------------------------

Possible implementation:

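{code:java}
        import org.apache.spark.mllib.linalg._
        import java.util.HashMap
        
        // sample data: three small groups, one near each coordinate axis
        val vas = Array(Array(1.0, 0, 0), Array(1.1, 0, 0), Array(0.9, 0, 0), 
                Array(0, 1.0, 0), Array(0.01, 1.01, 0), Array(0, 0, 1.0), 
                Array(0, 0, 1.1))
        val vs = vas.map(Vectors.dense(_))
        
        val t1 = 1.0 // loose threshold: closer points join the canopy
        val t2 = 0.5 // tight threshold: closer points stop being candidate centers
        
        // starting canopy
        // maps each point to its canopy center (a later canopy overwrites an earlier one)
        val map = new HashMap[Vector, Vector]
        // candidate centers: true while a point may still start a canopy
        val set = new HashMap[Vector, Boolean]
        for(v <- vs) set.put(v, true)
        for(v <- vs) {
                if(set.get(v)) {
                        // Vectors.sqdist returns the squared distance,
                        // so it is compared against the squared thresholds
                        val dists = vs.map{ x => (x, Vectors.sqdist(x, v)) }
                        dists.foreach { case (x, d) =>
                                if(d < t1 * t1) map.put(x, v)
                                if(d < t2 * t2) set.put(x, false)
                        }
                }
        }
{code}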

This version works on local arrays and hash maps, but all of them could be 
converted to RDDs.
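
For comparison, here is a minimal Spark-free sketch of the same procedure using only Scala collections. It is not the proposed MLlib implementation: the names CanopySketch, dist and canopies are made up for illustration, plain Euclidean distance is assumed, and canopies are returned as (center index, member indices) pairs so that overlapping canopies are kept. The single pass that computes distances from the current center to every point is the step that would most naturally become an RDD transformation.

```scala
object CanopySketch {
  type Point = Array[Double]

  // Plain Euclidean distance between two points of equal dimension.
  def dist(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  /** Returns one (center index, member indices) pair per canopy.
    * t1 is the loose threshold and t2 the tight one (hypothetical helper).
    */
  def canopies(points: IndexedSeq[Point], t1: Double, t2: Double): Vector[(Int, Vector[Int])] = {
    require(t2 > 0 && t1 >= t2, "need t1 >= t2 > 0")
    var candidates = points.indices.toVector // points that may still become centers
    var result = Vector.empty[(Int, Vector[Int])]
    while (candidates.nonEmpty) {
      val c = candidates.head
      // One pass over all points; this is the part that could run as an RDD map.
      val d = points.map(p => dist(p, points(c)))
      // Everything inside the loose radius belongs to this canopy.
      val members = points.indices.filter(i => d(i) < t1).toVector
      result = result :+ (c -> members)
      // Everything inside the tight radius (including c itself) stops being a candidate.
      candidates = candidates.filter(i => d(i) >= t2)
    }
    result
  }

  def main(args: Array[String]): Unit = {
    val points = Vector(
      Array(1.0, 0.0, 0.0), Array(1.1, 0.0, 0.0), Array(0.9, 0.0, 0.0),
      Array(0.0, 1.0, 0.0), Array(0.01, 1.01, 0.0),
      Array(0.0, 0.0, 1.0), Array(0.0, 0.0, 1.1))
    canopies(points, t1 = 1.0, t2 = 0.5).foreach { case (c, members) =>
      println(s"center $c -> members ${members.mkString(", ")}")
    }
  }
}
```

On the sample data from the comment this should give three canopies, centered on the first point of each group. Returning index lists instead of a point-to-center map is a deliberate choice here, since a point may fall inside more than one canopy.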



> Add Canopy Clustering Algorithm
> -------------------------------
>
>                 Key: SPARK-3439
>                 URL: https://issues.apache.org/jira/browse/SPARK-3439
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Yu Ishikawa
>            Assignee: Muhammad-Ali A'rabi
>            Priority: Minor
>
> The canopy clustering algorithm is an unsupervised pre-clustering algorithm. 
> It is often used as a preprocessing step for the K-means algorithm or the 
> Hierarchical clustering algorithm. It is intended to speed up clustering 
> operations on large data sets, where using another algorithm directly may be 
> impractical due to the size of the data set.


