[jira] Commented: (MAHOUT-19) Hierarchial clusterer

Karl Wettin (JIRA) Fri, 18 Apr 2008 16:01:37 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590640#action_12590640
 ]


Karl Wettin commented on MAHOUT-19:
-----------------------------------

This is working quite well now. Quite a number of instances have to be added 
before it makes sense to distribute the calculations. As a simple form of 
pruning to avoid very deep trees, a leaf can now contain many instances if 
they are similar enough. There are a couple of TODOs in the code that could be 
optimized. 
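
A minimal sketch of that leaf pruning, written in Python rather than taken 
from the attached patch. The `Leaf` class, the `max_leaf_distance` threshold 
and the Euclidean `distance` are assumptions made for illustration only; the 
patch may well do this differently.

```python
import math


def distance(a, b):
    """Euclidean distance; a stand-in for whatever measure the clusterer uses."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class Leaf:
    """Hypothetical leaf that may hold several near-identical instances."""

    def __init__(self, instance):
        self.instances = [instance]
        self.mean = list(instance)

    def try_absorb(self, instance, max_leaf_distance):
        """Add the instance to this leaf if it is close enough to the leaf mean.

        Returns True when absorbed, False when the caller has to create a
        new branch instead (which is what makes the tree deeper).
        """
        if distance(self.mean, instance) > max_leaf_distance:
            return False
        self.instances.append(instance)
        n = len(self.instances)
        # Incremental update of the running mean.
        self.mean = [m + (x - m) / n for m, x in zip(self.mean, instance)]
        return True


leaf = Leaf([1.0, 1.0])
absorbed_near = leaf.try_absorb([1.0, 1.2], max_leaf_distance=0.5)
absorbed_far = leaf.try_absorb([5.0, 5.0], max_leaf_distance=0.5)
```

Here the first instance is absorbed into the existing leaf while the second 
is rejected and would have to go into a new branch.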

All that's left before it is usable is to persist the tree so other 
applications can access it.

This is a rather slow algorithm once the tree grows big. Perhaps it would work 
to keep track of all clusters and their mean instances, look for the n 
clusters with the closest mean instance, and then in a second iteration look 
for the closest instances within those clusters. But I don't know, this might 
have effects similar to top feeding the tree.
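
The two-pass idea could look roughly like the Python sketch below. Everything 
in it is hypothetical: clusters are plain lists of feature vectors, the 
distance is Euclidean, and the means are recomputed per query, whereas a real 
implementation would presumably keep them up to date as instances are added.

```python
import heapq
import math


def distance(a, b):
    """Euclidean distance; a stand-in for the real distance measure."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_instance(instances):
    """Component-wise mean of a non-empty list of feature vectors."""
    n = len(instances)
    return [sum(column) / n for column in zip(*instances)]


def closest_instance(clusters, query, n_clusters=2):
    """Two-pass nearest-instance search.

    Pass 1 picks the n_clusters whose mean is closest to the query.
    Pass 2 scans only the instances of those clusters.
    The result is approximate: the true nearest instance can sit in a
    cluster whose mean was not among the n closest.
    """
    means = [mean_instance(c) for c in clusters]
    candidates = heapq.nsmallest(
        n_clusters,
        range(len(clusters)),
        key=lambda i: distance(means[i], query),
    )
    return min(
        (inst for i in candidates for inst in clusters[i]),
        key=lambda inst: distance(inst, query),
    )


clusters = [
    [[0.0, 0.0], [0.0, 1.0]],
    [[5.0, 5.0], [6.0, 5.0]],
    [[10.0, 0.0], [11.0, 1.0]],
]
nearest = closest_instance(clusters, [5.4, 4.8], n_clusters=2)
```

The approximation is the same kind of trade-off as with top feeding: the 
search only descends into the clusters that look best from a distance.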

I'm also not sure that I bottom feed in the best way right now. Once the 
closest instance is found, I look for the closest node in the chain of parents 
towards the root. It might actually be good to also check whether sibling 
nodes along the way up are closer to the new instance.
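
As a sketch of the walk towards the root, with and without the sibling check 
(Python, with a hypothetical `Node` class and Euclidean distance standing in 
for the real ones; this is not the code from the patch):

```python
import math


def distance(a, b):
    """Euclidean distance; a stand-in for the real distance measure."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class Node:
    """Hypothetical tree node; `mean` is the mean feature vector of its subtree."""

    def __init__(self, mean, children=()):
        self.mean = mean
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self


def best_attachment_node(closest_leaf, instance, check_siblings=False):
    """Walk from the closest leaf towards the root and return the node
    whose mean is closest to the new instance.

    With check_siblings=True the siblings of every node on the path are
    considered as well.
    """
    best, best_dist = closest_leaf, distance(closest_leaf.mean, instance)
    node = closest_leaf
    while node is not None:
        candidates = [node]
        if check_siblings and node.parent is not None:
            candidates += [s for s in node.parent.children if s is not node]
        for cand in candidates:
            d = distance(cand.mean, instance)
            if d < best_dist:
                best, best_dist = cand, d
        node = node.parent
    return best


# Small example tree: root -> (p -> (a, b), q -> (c, d))
a, b = Node([0.0, 0.0]), Node([-3.0, 0.0])
c, d = Node([2.0, 3.0]), Node([2.0, -3.0])
p, q = Node([-1.5, 0.0], [a, b]), Node([2.0, 0.0], [c, d])
root = Node([0.25, 0.0], [p, q])

new_instance = [1.2, 0.0]  # its closest leaf is `a`
parents_only = best_attachment_node(a, new_instance)
with_siblings = best_attachment_node(a, new_instance, check_siblings=True)
```

In this constructed example the parents-only walk ends at the root, while the 
sibling check finds the branch `q`, whose mean is closer to the new instance 
than any node on the path. Whether that helps on real data still has to be 
tested.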

I don't know. I'll have to test some.



> Hierarchial clusterer
> ---------------------
>
>                 Key: MAHOUT-19
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-19
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Minor
>         Attachments: MAHOUT-19.txt, MAHOUT-19.txt, TestBottomFeed.test.png, 
> TestTopFeed.test.png
>
>
> In a hierarchical clusterer the instances are the leaf nodes in a tree where 
> branch nodes contain the mean features of and the distance between their 
> children.
> For performance reasons I always trained trees from the top->down. I have 
> been told that it can cause various effects I never encountered. And I 
> believe Huffman solved his problem by training bottom->up? The thing is, I 
> don't think it is possible to train the tree top->down using map reduce. I do 
> however think it is possible to train it bottom->up. I would very much 
> appreciate any thoughts on this.
> Once this tree is trained one can extract clusters in various ways. The mean 
> distance between all instances is usually a good maximum distance to allow 
> between nodes when navigating the tree in search for a cluster. 
> Navigating the tree and gathering nodes that are not too far away from each 
> other is usually instant if the tree is available in memory or persisted in a 
> smart way. In my experience there is not much to win from extracting all 
> clusters from start. Also, it usually makes sense to allow for the user to 
> modify the cluster boundary variables in real time using a slider or perhaps 
> present the named summary of neighbouring clusters, blacklist paths in the 
> tree, etc. It is also not too bad to use secondary classification on the 
> instances to create worm holes in the tree. I always thought it would be cool 
> to visualize it using Touchgraph.
> My focus is on clustering text documents for an instant "more like this" 
> feature in search engines, using Tanimoto similarity on the vector spaces to 
> calculate the distance.
> See LUCENE-1025 for a single-threaded, all-in-memory proof of concept of a 
> hierarchical clusterer.
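
For reference, the Tanimoto distance and the mean-distance threshold mentioned 
in the description above could be sketched like this in Python. The sparse 
`{term: weight}` dict representation and the function names are assumptions 
for illustration; they are not taken from the patch or from LUCENE-1025.

```python
from itertools import combinations


def tanimoto_distance(a, b):
    """1 - Tanimoto coefficient for sparse vectors given as {term: weight} dicts.

    The coefficient is dot(a, b) / (|a|^2 + |b|^2 - dot(a, b)).
    """
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = sum(w * w for w in a.values())
    norm_b = sum(w * w for w in b.values())
    denom = norm_a + norm_b - dot
    if denom == 0:
        return 0.0  # both vectors are empty: treat them as identical
    return 1.0 - dot / denom


def mean_pairwise_distance(instances, dist):
    """Mean distance over all pairs; a candidate maximum distance to allow
    between nodes when gathering a cluster. Quadratic in the number of
    instances, so only practical on a sample for large collections."""
    pairs = list(combinations(instances, 2))
    return sum(dist(x, y) for x, y in pairs) / len(pairs)


docs = [
    {"apache": 1.0, "mahout": 1.0},
    {"apache": 1.0, "lucene": 1.0},
    {"tree": 1.0},
]
threshold = mean_pairwise_distance(docs, tanimoto_distance)
```

With binary term weights, as in this example, the Tanimoto coefficient is the 
same as the Jaccard coefficient on the term sets.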

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
