[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14190166#comment-14190166
 ] 

Yu Ishikawa commented on SPARK-2429:
------------------------------------

I compared the elapsed training and prediction times of the hierarchical 
clustering with those of KMeans.
In theory, the computational complexity of cluster assignment is lower for 
hierarchical clustering than for KMeans.
However, both the training time and the prediction time of the hierarchical 
clustering are longer than those of KMeans.
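
As a rough illustration of the complexity claim (this is not the code used in 
the experiment), the sketch below counts distance computations per point. A 
flat assignment over k centers needs k distance computations, while descending 
a balanced binary tree of centers needs about 2 * log2(k). The tree here is a 
hypothetical one built by halving the list of centers, which is an assumption 
made only to get a balanced shape; all function names are illustrative.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def flat_assign(point, centers):
    """Compare against every center: returns (center index, distances computed)."""
    dists = [sq_dist(point, c) for c in centers]
    return dists.index(min(dists)), len(dists)


def build_tree(centers, ids=None):
    """Hypothetical balanced binary tree over the centers (assumption: split by halving)."""
    if ids is None:
        ids = list(range(len(centers)))
    node = {"center": mean([centers[i] for i in ids]), "ids": ids, "children": []}
    if len(ids) > 1:
        mid = len(ids) // 2
        node["children"] = [build_tree(centers, ids[:mid]),
                            build_tree(centers, ids[mid:])]
    return node


def tree_assign(point, root):
    """Descend to the closer child at each level: returns (leaf index, distances computed)."""
    node, n_dists = root, 0
    while node["children"]:
        n_dists += len(node["children"])
        node = min(node["children"], key=lambda ch: sq_dist(point, ch["center"]))
    return node["ids"][0], n_dists


random.seed(0)
k, dim = 50, 5  # k matches numClusters above; dim is kept small for the sketch
centers = [[random.random() for _ in range(dim)] for _ in range(k)]
root = build_tree(centers)
point = [random.random() for _ in range(dim)]

flat_index, flat_cost = flat_assign(point, centers)
leaf_index, tree_cost = tree_assign(point, root)
print("flat distance computations:", flat_cost)
print("tree distance computations:", tree_cost)
```

For k = 50 the tree descent needs at most 12 distance computations against 50 
for the flat scan. The two methods may pick different clusters, since the tree 
descent is greedy.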

I used the program at the URL below for this experiment.
https://github.com/yu-iskw/hierarchical-clustering-with-spark/blob/37488e306d583d0e1743bff432165e8c1bf4465e/src/main/scala/CompareWithKMeansApp.scala

{noformat}
{"maxCores" : "160", "numClusters" : "50", "dimension" : "500", "rows" : "1000000", "numPartitions" : "160"}
KMeans Training Elappsed Time: 28.179 [sec]
KMeans Predicting Elappsed Time: 0.011 [sec]
Hierarchical Training Elappsed Time: 46.539 [sec]
Hierarchical Predicting Elappsed Time: 0.3076923076923077 [sec]

{"maxCores" : "160", "numClusters" : "50", "dimension" : "500", "rows" : "5000000", "numPartitions" : "160"}
KMeans Training Elappsed Time: 55.187 [sec]
KMeans Predicting Elappsed Time: 0.008 [sec]
Hierarchical Training Elappsed Time: 210.238 [sec]
Hierarchical Predicting Elappsed Time: 0.3906093906093906 [sec]
{noformat}
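
To make the gap easier to read, the following sketch computes the ratio of the 
hierarchical times to the KMeans times, taking the figures reported above at 
face value. The dictionary keys and the `slowdown` helper are names made up for 
this illustration.

```python
# Elapsed times in seconds, copied from the benchmark output above.
runs = [
    {"rows": 1_000_000,
     "kmeans_train": 28.179, "kmeans_predict": 0.011,
     "hier_train": 46.539, "hier_predict": 0.3076923076923077},
    {"rows": 5_000_000,
     "kmeans_train": 55.187, "kmeans_predict": 0.008,
     "hier_train": 210.238, "hier_predict": 0.3906093906093906},
]


def slowdown(run, phase):
    """Hierarchical time divided by KMeans time for 'train' or 'predict'."""
    return run["hier_" + phase] / run["kmeans_" + phase]


for run in runs:
    print("rows=%d: training x%.2f, predicting x%.1f"
          % (run["rows"], slowdown(run, "train"), slowdown(run, "predict")))
```

With these numbers, hierarchical training is roughly 1.7 times slower at 
1,000,000 rows and roughly 3.8 times slower at 5,000,000 rows.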


> Hierarchical Implementation of KMeans
> -------------------------------------
>
>                 Key: SPARK-2429
>                 URL: https://issues.apache.org/jira/browse/SPARK-2429
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: RJ Nowling
>            Assignee: Yu Ishikawa
>            Priority: Minor
>         Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
> Result of Benchmarking a Hierarchical Clustering.pdf, 
> benchmark-result.2014-10-29.html, benchmark2.html
>
>
> Hierarchical clustering algorithms are widely used and would make a nice 
> addition to MLlib.  Clustering algorithms are useful for determining 
> relationships between clusters as well as offering faster assignment. 
> Discussion on the dev list suggested the following possible approaches:
> * Top down, recursive application of KMeans
> * Reuse DecisionTree implementation with different objective function
> * Hierarchical SVD
> It was also suggested that support for distance metrics other than Euclidean, 
> such as negative dot or cosine, is necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
