Ted,

Some time back I had thought about this idea, but I sensed one potential problem with this approach: the resulting co-occurrence will be bi-directional. For documents this property is fine, but for terms it may not be desirable in some cases.

For example, if "Roger Federer" is the keyword, the co-occurring terms will be "Tennis", "Grand Slam", "Wimbledon", etc. But for "Tennis", the list of top co-occurring terms may not include "Roger Federer". Is there a way to identify the directional relationship among terms?

Of course, this was just a thought and no real code was written to verify the assertion.

--shashi

On Wed, Sep 30, 2009 at 2:43 AM, Ted Dunning <[email protected]> wrote:

> Another way to do this through the back door is to transpose the document
> set so that you have a list of documents for each term. Index this and
> cluster it just as if it were normal documents and you will have a form of
> term clustering.
>
> On Tue, Sep 29, 2009 at 1:05 PM, Grant Ingersoll <[email protected]> wrote:
>
>> The LDA implementation kind of clusters on terms to generate topics. It
>> sounds like you want some co-occurrence analysis, I'm not sure that the
>> clustering algorithms are best for that, but perhaps others have insight.
>> I could imagine doing this with HBase or Pig and just keeping a matrix
>> where each cell kept track of the number of times both terms appear in a
>> document (or even within some window in a document).
>>
>> On Sep 29, 2009, at 8:57 AM, Ole-Martin Mørk wrote:
>>
>>> Hi.
>>> I have been using org.apache.mahout.utils.vectors.lucene.Driver
>>> and org.apache.mahout.clustering.kmeans.KMeansDriver to cluster documents in
>>> our Lucene index and it works great! I am wondering though, is it possible
>>> to use Mahout to cluster terms?
>>>
>>> I want to cluster terms that often appear in the same documents.
>>>
>>> Thank you.
>>>
>>> --
>>> Ole-Martin Mørk
>>> http://twitter.com/olemartin
>>> http://flickr.com/olemartin
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
>> Solr/Lucene:
>> http://www.lucidimagination.com/search
>
> --
> Ted Dunning, CTO
> DeepDyve

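To make the directional question above concrete, here is a minimal sketch on toy data. It is not Mahout code and it does not verify anything against a real index: the documents, the function names (cooccurrence, count, directional) and the choice of an estimated P(b | a) as the directional score are all assumptions made for illustration. The idea is that raw co-occurrence counts are symmetric, but dividing by the document frequency of the source term gives a score that depends on direction.

```python
from collections import Counter
from itertools import combinations

# Toy corpus, made up for illustration: each document is a set of terms.
docs = [
    {"Roger Federer", "Tennis", "Wimbledon"},
    {"Roger Federer", "Tennis", "Grand Slam"},
    {"Tennis", "Rafael Nadal", "Grand Slam"},
    {"Tennis", "Serena Williams", "Wimbledon"},
    {"Tennis", "Racket"},
]


def cooccurrence(docs):
    """Count the documents containing each term and each unordered term pair."""
    doc_freq = Counter()
    pair_freq = Counter()
    for terms in docs:
        doc_freq.update(terms)
        # Store each pair once, in sorted order, so (a, b) and (b, a) share a key.
        for a, b in combinations(sorted(terms), 2):
            pair_freq[(a, b)] += 1
    return doc_freq, pair_freq


def count(pair_freq, a, b):
    """Raw co-occurrence count; symmetric by construction."""
    return pair_freq[tuple(sorted((a, b)))]


def directional(doc_freq, pair_freq, a, b):
    """Estimate P(b | a): the share of documents containing a that also contain b."""
    if doc_freq[a] == 0:
        return 0.0  # unseen source term: no evidence in either direction
    return count(pair_freq, a, b) / doc_freq[a]


doc_freq, pair_freq = cooccurrence(docs)

# The raw counts are identical in both directions.
print(count(pair_freq, "Roger Federer", "Tennis"))
print(count(pair_freq, "Tennis", "Roger Federer"))

# The normalized scores differ by direction.
print(directional(doc_freq, pair_freq, "Roger Federer", "Tennis"))
print(directional(doc_freq, pair_freq, "Tennis", "Roger Federer"))
```

On this toy data the raw count is 2 in both directions, while the directional score is 1.0 from "Roger Federer" to "Tennis" and 0.4 in reverse, because "Tennis" appears in five documents and "Roger Federer" in only two of them.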