[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227371#comment-14227371
 ] 

Yu Ishikawa commented on SPARK-2429:
------------------------------------

Hi [~rnowling], 

Thank you for replying.

{quote}
I'm having trouble finding the function to cut a dendrogram – I see the tests 
but not the implementation.
{quote}

I'm very sorry. 
[Here|https://github.com/yu-iskw/spark/blob/hierarchical/mllib%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2Fmllib%2Fclustering%2FHierarchicalClustering.scala#L390]
 is the implementation. I think the function is specialized to SciPy dendrogram 
much.

{quote}
If clusters at the same levels in the hierarchy do not overlap, you should be 
able to choose the closest cluster at each level until you find a leaf. I'm 
assuming that the children of a given cluster are contained within that cluster 
(spacially) – can you show this or find a reference for this? If so, then 
assignment should be faster for a larger number of clusters as Jun was saying 
above.
{quote}

Exactly. I agree with you. 

I implemented the function for assignment as a recursive function in O(log N) 
time.
[https://github.com/yu-iskw/spark/blob/hierarchical/mllib%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2Fmllib%2Fclustering%2FHierarchicalClustering.scala#L482]

Although I checked the performance for assignment  of my implementation and 
that of KMenas, the elapsed time this assignment implementation is slower than 
that of KMeans. The result is 
[here|https://issues.apache.org/jira/browse/SPARK-2429?focusedCommentId=14190166&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14190166].
The elapsed time for assingment (predct) is not long from the beginning. For 
example, the time of assingnment of KMenas in O(N) is 0.011\[sec\], and that of 
my implementation in O(log N) is 0.307 \[sec\].

I'm very sorry if I misunderstand your question.
thanks,

> Hierarchical Implementation of KMeans
> -------------------------------------
>
>                 Key: SPARK-2429
>                 URL: https://issues.apache.org/jira/browse/SPARK-2429
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: RJ Nowling
>            Assignee: Yu Ishikawa
>            Priority: Minor
>         Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
> Result of Benchmarking a Hierarchical Clustering.pdf, 
> benchmark-result.2014-10-29.html, benchmark2.html
>
>
> Hierarchical clustering algorithms are widely used and would make a nice 
> addition to MLlib.  Clustering algorithms are useful for determining 
> relationships between clusters as well as offering faster assignment. 
> Discussion on the dev list suggested the following possible approaches:
> * Top down, recursive application of KMeans
> * Reuse DecisionTree implementation with different objective function
> * Hierarchical SVD
> It was also suggested that support for distance metrics other than Euclidean 
> such as negative dot or cosine are necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to