[
https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663319#action_12663319
]
Ankur commented on MAHOUT-19:
-----------------------------
Hi Karl, welcome back :-)
Could you share a few things about this patch?
1. Assuming you are training the tree top-down, what division criterion are
you using?
2. How well does it scale?
3. Was the data this was tried on sparse?
4. What distance metric is used?
Basically, I have a use case wherein I have a set of 5-10 million URLs with an
inherent hierarchical relationship, and a set of user clicks on them. I would
like to cluster them into a tree and use the model to answer near-neighborhood
queries, i.e. which URLs are related to which other URLs. I did implement a
sequential bottom-up hierarchical clustering algorithm, but its complexity is
too bad for my data set. I then thought about implementing a top-down
hierarchical clustering algorithm using the Jaccard coefficient as my distance
measure, and came across this patch.
Can you suggest whether this patch will help?
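Since the Jaccard coefficient is the distance measure proposed above, here is a minimal sketch of the set-based Jaccard distance over clicked URLs. The class name and sample data are illustrative only, not part of the patch:

```java
import java.util.HashSet;
import java.util.Set;

public class JaccardDistance {

    // Jaccard distance = 1 - |A ∩ B| / |A ∪ B|.
    // Identical sets give 0.0, disjoint sets give 1.0.
    static double distance(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0; // convention: two empty sets are identical
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return 1.0 - (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> clicksA = Set.of("url1", "url2", "url3");
        Set<String> clicksB = Set.of("url2", "url3", "url4");
        System.out.println(distance(clicksA, clicksB)); // 1 - 2/4 = 0.5
    }
}
```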
> Hierarchial clusterer
> ---------------------
>
> Key: MAHOUT-19
> URL: https://issues.apache.org/jira/browse/MAHOUT-19
> Project: Mahout
> Issue Type: New Feature
> Components: Clustering
> Reporter: Karl Wettin
> Assignee: Karl Wettin
> Priority: Minor
> Attachments: MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt,
> MAHOUT-19.txt, MAHOUT-19.txt, TestBottomFeed.test.png, TestTopFeed.test.png
>
>
> In a hierarchical clusterer the instances are the leaf nodes of a tree, where
> branch nodes contain the mean features of, and the distance between, their
> children.
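A minimal sketch of the node structure the quoted description refers to, assuming binary branches. The class and field names are illustrative, not the patch's actual code, and the branch mean is an unweighted average of the two children's means for simplicity (a size-weighted mean would be more accurate for unbalanced subtrees):

```java
public class ClusterNode {
    double[] mean;           // mean feature vector of the subtree
    double childDistance;    // distance between left and right child
    ClusterNode left, right; // both null for a leaf node

    ClusterNode(double[] mean) { // leaf: holds a single instance's features
        this.mean = mean;
    }

    ClusterNode(ClusterNode l, ClusterNode r, double d) { // branch
        this.left = l;
        this.right = r;
        this.childDistance = d;
        // Unweighted average of the children's means (illustrative choice).
        this.mean = new double[l.mean.length];
        for (int i = 0; i < mean.length; i++) {
            mean[i] = (l.mean[i] + r.mean[i]) / 2.0;
        }
    }
}
```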
> For performance reasons I always trained trees from the top down. I have
> been told that this can cause various effects I never encountered, and I
> believe Huffman solved his problem by training bottom-up? The thing is, I
> don't think it is possible to train the tree top-down using MapReduce. I do,
> however, think it is possible to train it bottom-up. I would very much
> appreciate any thoughts on this.
> Once this tree is trained, one can extract clusters in various ways. The mean
> distance between all instances is usually a good maximum distance to allow
> between nodes when navigating the tree in search of a cluster.
> Navigating the tree and gathering nodes that are not too far from each other
> is usually instant if the tree is available in memory or persisted in a smart
> way. In my experience there is not much to gain from extracting all clusters
> from the start. Also, it usually makes sense to let the user modify the
> cluster boundary variables in real time using a slider, or perhaps to present
> a named summary of neighbouring clusters, blacklist paths in the tree, etc.
> It is also not too bad to use secondary classification on the instances to
> create wormholes in the tree. I always thought it would be cool to visualize
> it using TouchGraph.
> My focus is on clustering text documents for an instant "more like this"
> feature in search engines, using Tanimoto similarity on the vector spaces to
> calculate the distance.
> See LUCENE-1025 for a single-threaded, all-in-memory proof of concept of a
> hierarchical clusterer.
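The Tanimoto similarity mentioned in the quoted description can be sketched as below; for binary (0/1) vectors it reduces to the Jaccard coefficient, which connects the two measures discussed in this thread. The class name and sample vectors are illustrative, not the patch's code:

```java
public class TanimotoDistance {

    // Tanimoto similarity for real-valued vectors:
    //   T(a, b) = a·b / (|a|^2 + |b|^2 - a·b)
    // Distance = 1 - T(a, b). For binary vectors this equals Jaccard distance.
    static double distance(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        double denom = normA + normB - dot;
        return denom == 0.0 ? 0.0 : 1.0 - dot / denom;
    }

    public static void main(String[] args) {
        double[] docA = {1, 1, 0, 1}; // term-presence vectors for two documents
        double[] docB = {0, 1, 1, 1};
        System.out.println(distance(docA, docB)); // 1 - 2/(3+3-2) = 0.5
    }
}
```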
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.