[ 
https://issues.apache.org/jira/browse/SPARK-3439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313580#comment-14313580
 ] 

Muhammad-Ali A'rabi commented on SPARK-3439:
--------------------------------------------

I have implemented the canopy algorithm on pure scala, and it is very slow. The 
part that it calculates distances between every two nodes is very expensive. 
The idea to solve this problem is this: First, implement a regular canopy, as 
below. Then in another method, make a partition of the data into parts of size 
sqrt(n), where n is the size of data, and run the regular canopy on each one of 
them, and another regular canopy on the centers, and merge the two result like 
a function composition.

{code:java}
        def run[K](vs: Iterable[KeyedVector[K]], f: (KeyedVector[K], 
KeyedVector[K]) => Double): HashMap[KeyedVector[K], KeyedVector[K]] = {
                val map = new HashMap[KeyedVector[K], KeyedVector[K]] // map 
from data to clusters
                val set: Set[KeyedVector[K]] = Set() // the set
                var count = 0
                for(v <- vs) set += v
                for(v <- vs) {
                        count += 1
                        print(count + " ")
                        if(set.contains(v)) {
                                vs
                                        .map{ x => (x, f(x, v)) }
                                        .foreach { case (x, d) =>
                                                if(d < t1) map.put(x, v)
                                                if(d < t2) set -= x
                                        }
                        }
                }
                
                println()
                map
        }

        def mergedRun[K](vs: List[KeyedVector[K]], f: (KeyedVector[K], 
KeyedVector[K]) => Double) = {
                val partSize: Int = Math.sqrt(vs.length).ceil.toInt
                
                def runPartially(vs: List[KeyedVector[K]]): 
HashMap[KeyedVector[K], KeyedVector[K]] = {
                        if(vs.drop(partSize).isEmpty) run(vs.take(partSize), f)
                        else run(vs.take(partSize), f) ++ 
runPartially(vs.drop(partSize))
                }
                
                val map = runPartially(vs)
                val centers = map.map(_._2).toSet
                
                println("Merge run")
                val merge = run(centers, f)
                
                println("Merging")
                map mapValues merge
        }
{code}

> Add Canopy Clustering Algorithm
> -------------------------------
>
>                 Key: SPARK-3439
>                 URL: https://issues.apache.org/jira/browse/SPARK-3439
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Yu Ishikawa
>            Assignee: Muhammad-Ali A'rabi
>            Priority: Minor
>
> The canopy clustering algorithm is an unsupervised pre-clustering algorithm. 
> It is often used as a preprocessing step for the K-means algorithm or the 
> Hierarchical clustering algorithm. It is intended to speed up clustering 
> operations on large data sets, where using another algorithm directly may be 
> impractical due to the size of the data set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to