[
https://issues.apache.org/jira/browse/SPARK-3439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313580#comment-14313580
]
Muhammad-Ali A'rabi commented on SPARK-3439:
--------------------------------------------
I have implemented the canopy algorithm on pure scala, and it is very slow. The
part that it calculates distances between every two nodes is very expensive.
The idea to solve this problem is this: First, implement a regular canopy, as
below. Then in another method, make a partition of the data into parts of size
sqrt(n), where n is the size of data, and run the regular canopy on each one of
them, and another regular canopy on the centers, and merge the two result like
a function composition.
{code:java}
def run[K](vs: Iterable[KeyedVector[K]], f: (KeyedVector[K],
KeyedVector[K]) => Double): HashMap[KeyedVector[K], KeyedVector[K]] = {
val map = new HashMap[KeyedVector[K], KeyedVector[K]] // map
from data to clusters
val set: Set[KeyedVector[K]] = Set() // the set
var count = 0
for(v <- vs) set += v
for(v <- vs) {
count += 1
print(count + " ")
if(set.contains(v)) {
vs
.map{ x => (x, f(x, v)) }
.foreach { case (x, d) =>
if(d < t1) map.put(x, v)
if(d < t2) set -= x
}
}
}
println()
map
}
def mergedRun[K](vs: List[KeyedVector[K]], f: (KeyedVector[K],
KeyedVector[K]) => Double) = {
val partSize: Int = Math.sqrt(vs.length).ceil.toInt
def runPartially(vs: List[KeyedVector[K]]):
HashMap[KeyedVector[K], KeyedVector[K]] = {
if(vs.drop(partSize).isEmpty) run(vs.take(partSize), f)
else run(vs.take(partSize), f) ++
runPartially(vs.drop(partSize))
}
val map = runPartially(vs)
val centers = map.map(_._2).toSet
println("Merge run")
val merge = run(centers, f)
println("Merging")
map mapValues merge
}
{code}
> Add Canopy Clustering Algorithm
> -------------------------------
>
> Key: SPARK-3439
> URL: https://issues.apache.org/jira/browse/SPARK-3439
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Yu Ishikawa
> Assignee: Muhammad-Ali A'rabi
> Priority: Minor
>
> The canopy clustering algorithm is an unsupervised pre-clustering algorithm.
> It is often used as a preprocessing step for the K-means algorithm or the
> Hierarchical clustering algorithm. It is intended to speed up clustering
> operations on large data sets, where using another algorithm directly may be
> impractical due to the size of the data set.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]