[
https://issues.apache.org/jira/browse/SPARK-3439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14290634#comment-14290634
]
Muhammad-Ali A'rabi edited comment on SPARK-3439 at 1/24/15 2:41 PM:
---------------------------------------------------------------------
Possible implementation:
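{code:java}
import org.apache.spark.mllib.linalg._
import java.util.HashMap

val vas = Array(Array(1.0, 0, 0), Array(1.1, 0, 0), Array(0.9, 0, 0),
  Array(0, 1.0, 0), Array(0.01, 1.01, 0), Array(0, 0, 1.0),
  Array(0, 0, 1.1))
val vs = vas.map(Vectors.dense(_))
// Both thresholds are compared against squared distances (Vectors.sqdist)
val t1 = 1.0 // loose threshold: points closer than this join the canopy
val t2 = 0.5 // tight threshold: points closer than this stop being candidate centers

// starting canopy
val map = new HashMap[Vector, Vector] // map from data to clusters
val set = new HashMap[Vector, Boolean] // the set of remaining candidate centers
for (v <- vs) set.put(v, true)
for (v <- vs) {
  if (set.get(v)) { // v is still a candidate, so it starts a new canopy
    val dists = vs.map { x => (x, Vectors.sqdist(x, v)) }
    dists.foreach { case (x, d) =>
      if (d < t1) map.put(x, v) // overwrites any earlier assignment of x
      if (d < t2) set.put(x, false)
    }
  }
}
{code}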
The algorithm currently works on local arrays and hash maps, but all of them
could be converted to RDDs.
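For readers who want to try the idea without a Spark installation, here is a minimal plain-Scala sketch of the same loop. It is an illustration under stated assumptions, not the proposed MLlib implementation: the names `CanopySketch`, `Point`, and `canopies` are hypothetical, points are plain `Array[Double]` values addressed by index, and the local `sqdist` helper stands in for `Vectors.sqdist`. Unlike the snippet above, it returns every member of each canopy, so a point may appear in more than one canopy instead of being overwritten by the last match.

```scala
// Plain-Scala sketch of canopy clustering; no Spark dependency.
// CanopySketch, Point and canopies are hypothetical names, not MLlib API.
object CanopySketch {
  type Point = Array[Double]

  // Squared Euclidean distance, standing in for Vectors.sqdist.
  def sqdist(a: Point, b: Point): Double = {
    require(a.length == b.length, "points must have the same dimension")
    var sum = 0.0
    var i = 0
    while (i < a.length) {
      val d = a(i) - b(i)
      sum += d * d
      i += 1
    }
    sum
  }

  // Returns (centerIndex, memberIndices) for each canopy, in creation order.
  // t1 is the loose threshold and t2 the tight one, both on squared distance.
  def canopies(points: IndexedSeq[Point], t1: Double, t2: Double): Vector[(Int, Vector[Int])] = {
    require(t1 > t2, "the loose threshold t1 must exceed the tight threshold t2")
    val isCandidate = Array.fill(points.length)(true)
    val result = Vector.newBuilder[(Int, Vector[Int])]
    for (c <- points.indices) {
      if (isCandidate(c)) { // c has not been absorbed by an earlier canopy
        val members = Vector.newBuilder[Int]
        for (i <- points.indices) {
          val d = sqdist(points(i), points(c))
          if (d < t1) members += i           // i belongs to the canopy around c
          if (d < t2) isCandidate(i) = false // i can no longer start a canopy
        }
        result += ((c, members.result()))
      }
    }
    result.result()
  }
}
```

Calling `CanopySketch.canopies(points, 1.0, 0.5)` on the seven sample vectors from the snippet above should group them into three canopies, one per axis. A distributed version would still have to decide how to share the candidate set across partitions, which this sketch does not address.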
was (Author: angellandros):
The same comment and code as above, with the code block tagged as scala rather
than java.
> Add Canopy Clustering Algorithm
> -------------------------------
>
> Key: SPARK-3439
> URL: https://issues.apache.org/jira/browse/SPARK-3439
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Yu Ishikawa
> Assignee: Muhammad-Ali A'rabi
> Priority: Minor
>
> The canopy clustering algorithm is an unsupervised pre-clustering algorithm.
> It is often used as a preprocessing step for the K-means algorithm or the
> Hierarchical clustering algorithm. It is intended to speed up clustering
> operations on large data sets, where using another algorithm directly may be
> impractical due to the size of the data set.