[
https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663319#action_12663319
]
Ankur commented on MAHOUT-19:
-----------------------------
Hi Karl, welcome back :-)
Could you share a few things about this patch?
1. Assuming you are training the tree top-down, what division criterion are
you using?
2. How well does it scale?
3. Was the data this was tried on sparse?
4. What distance metric is used?
Basically, I have a use case wherein I have a set of 5-10 million URLs with an
inherent hierarchical relationship, and a set of user clicks on them. I would
like to cluster them into a tree and use the model to answer near-neighborhood
queries, i.e. which URLs are related to which other URLs. I did implement a
sequential bottom-up hierarchical clustering algorithm, but its complexity is
too bad for my data set. I then thought about implementing a top-down
hierarchical clustering algorithm using the Jaccard coefficient as my distance
measure, and came across this patch.
Can you suggest whether this patch will help?
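Since the Jaccard coefficient is the distance measure proposed above, here is a minimal sketch of the set-based Jaccard distance over clicked URLs. The class name and sample data are illustrative only, not part of the patch:

```java
import java.util.HashSet;
import java.util.Set;

public class JaccardDistance {

    // Jaccard distance = 1 - |A ∩ B| / |A ∪ B|.
    // Identical sets give 0.0, disjoint sets give 1.0.
    static double distance(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0; // convention: two empty sets are identical
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return 1.0 - (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> clicksA = Set.of("url1", "url2", "url3");
        Set<String> clicksB = Set.of("url2", "url3", "url4");
        System.out.println(distance(clicksA, clicksB)); // 1 - 2/4 = 0.5
    }
}
```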
> Hierarchial clusterer
> ---------------------
>
> Key: MAHOUT-19
> URL: https://issues.apache.org/jira/browse/MAHOUT-19
> Project: Mahout
> Issue Type: New Feature
> Components: Clustering
> Reporter: Karl Wettin
> Assignee: Karl Wettin
> Priority: Minor
> Attachments: MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt,
> MAHOUT-19.txt, MAHOUT-19.txt, TestBottomFeed.test.png, TestTopFeed.test.png
>
>
> In a hierarchical clusterer the instances are the leaf nodes of a tree, where
> branch nodes contain the mean features of, and the distance between, their
> children.
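A minimal sketch of the node structure the quoted description refers to, assuming binary branches. The class and field names are illustrative, not the patch's actual code, and the branch mean is an unweighted average of the two children's means for simplicity (a size-weighted mean would be more accurate for unbalanced subtrees):

```java
public class ClusterNode {
    double[] mean;           // mean feature vector of the subtree
    double childDistance;    // distance between left and right child
    ClusterNode left, right; // both null for a leaf node

    ClusterNode(double[] mean) { // leaf: holds a single instance's features
        this.mean = mean;
    }

    ClusterNode(ClusterNode l, ClusterNode r, double d) { // branch
        this.left = l;
        this.right = r;
        this.childDistance = d;
        // Unweighted average of the children's means (illustrative choice).
        this.mean = new double[l.mean.length];
        for (int i = 0; i < mean.length; i++) {
            mean[i] = (l.mean[i] + r.mean[i]) / 2.0;
        }
    }
}
```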
> For performance reasons I always trained trees from the top down. I have
> been told that this can cause various effects I never encountered, and I
> believe Huffman solved his problem by training bottom-up? The thing is, I
> don't think it is possible to train the tree top-down using MapReduce. I do,
> however, think it is possible to train it bottom-up. I would very much
> appreciate any thoughts on this.
> Once this tree is trained, one can extract clusters in various ways. The mean
> distance between all instances is usually a good maximum distance to allow
> between nodes when navigating the tree in search of a cluster.
> Navigating the tree and gathering nodes that are not too far from each other
> is usually instant if the tree is available in memory or persisted in a smart
> way. In my experience there is not much to gain from extracting all clusters
> from the start. Also, it usually makes sense to let the user modify the
> cluster boundary variables in real time using a slider, or perhaps to present
> a named summary of neighbouring clusters, blacklist paths in the tree, etc.
> It is also not too bad to use secondary classification on the instances to
> create wormholes in the tree. I always thought it would be cool to visualize
> it using TouchGraph.
> My focus is on clustering text documents for an instant "more like this"
> feature in search engines, using Tanimoto similarity on the vector spaces to
> calculate the distance.
> See LUCENE-1025 for a single-threaded, all-in-memory proof of concept of a
> hierarchical clusterer.
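The Tanimoto similarity mentioned in the quoted description can be sketched as below; for binary (0/1) vectors it reduces to the Jaccard coefficient, which connects the two measures discussed in this thread. The class name and sample vectors are illustrative, not the patch's code:

```java
public class TanimotoDistance {

    // Tanimoto similarity for real-valued vectors:
    //   T(a, b) = a·b / (|a|^2 + |b|^2 - a·b)
    // Distance = 1 - T(a, b). For binary vectors this equals Jaccard distance.
    static double distance(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        double denom = normA + normB - dot;
        return denom == 0.0 ? 0.0 : 1.0 - dot / denom;
    }

    public static void main(String[] args) {
        double[] docA = {1, 1, 0, 1}; // term-presence vectors for two documents
        double[] docB = {0, 1, 1, 1};
        System.out.println(distance(docA, docB)); // 1 - 2/(3+3-2) = 0.5
    }
}
```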
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.