Re: Using mahout to cluster terms in Lucene

Ted Dunning Tue, 29 Sep 2009 14:18:30 -0700

Yes.  Transposing is exactly what I was suggesting but in the context of,
say, k-means.
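
As a rough illustration of the transposing idea, here is a minimal sketch in
plain Python. It is not Mahout code: the toy matrix, the term list, the seed
choice, and the helper names (transpose, normalize, kmeans) are all made up
for the example. Each term becomes a vector over documents, and k-means then
groups terms that occur in the same documents.

```python
import math


def transpose(matrix):
    """Turn a document-by-term matrix into a term-by-document matrix."""
    return [list(col) for col in zip(*matrix)]


def normalize(vec):
    """Scale a vector to unit length so frequent terms do not dominate."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, init_centroids, iterations=10):
    """Plain Lloyd-style k-means; returns one cluster label per point."""
    centroids = [list(c) for c in init_centroids]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Made-up toy data: rows are documents, columns are term counts.
terms = ["lucene", "index", "search", "cluster", "kmeans", "centroid"]
doc_term = [
    [2, 1, 1, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [1, 1, 2, 0, 0, 0],
    [0, 0, 0, 2, 1, 1],
    [0, 0, 0, 1, 2, 1],
    [0, 0, 0, 1, 1, 2],
]

# After the transpose, each row describes one term by the documents it occurs in.
term_vectors = [normalize(row) for row in transpose(doc_term)]

# Seeding with the first and fourth term vectors is an arbitrary choice.
labels = kmeans(term_vectors, [term_vectors[0], term_vectors[3]])

term_clusters = {}
for term, label in zip(terms, labels):
    term_clusters.setdefault(label, []).append(term)
print(term_clusters)
```

The only difference from document clustering is the transpose: the same
k-means routine runs unchanged on the rows of the flipped matrix.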


LDA has the equivalent of U and V matrices lying around that should allow
clustering of terms and documents in the same space.  That is an interesting
thing to be able to do in any case.  The words give you a description of the
content and the documents give you examples.
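
To make the "same space" point concrete, here is a small sketch in plain
Python. It assumes a hypothetical LDA output (the numbers, the document names,
and the helper functions are invented for illustration and do not come from
Mahout). Terms and documents are both expressed as weights over topics, and
each is crudely grouped by its dominant topic. Turning P(term | topic) into
per-term topic weights by row normalisation assumes a uniform topic prior.

```python
# Hypothetical LDA output; all numbers are made up for illustration.
terms = ["lucene", "index", "kmeans", "centroid"]

# topic_term[k][j] = P(term j | topic k); each row sums to 1.
topic_term = [
    [0.45, 0.45, 0.05, 0.05],
    [0.05, 0.05, 0.45, 0.45],
]

# doc_topic[d][k] = P(topic k | document d).
doc_topic = {
    "doc-a": [0.9, 0.1],
    "doc-b": [0.2, 0.8],
}


def term_topic_weights(matrix):
    """Transpose to term-by-topic and normalise each row to sum to 1.

    Assumption: topics are equally likely a priori, so the normalised
    row can be read as P(topic | term).
    """
    rows = [list(col) for col in zip(*matrix)]
    return [[w / sum(row) for w in row] for row in rows]


def dominant_topic(weights):
    """Index of the topic with the largest weight (a crude hard assignment)."""
    return max(range(len(weights)), key=lambda k: weights[k])


term_topic = term_topic_weights(topic_term)

# Terms and documents now live in the same topic space and can be grouped alike.
term_groups = {t: dominant_topic(w) for t, w in zip(terms, term_topic)}
doc_groups = {d: dominant_topic(w) for d, w in doc_topic.items()}

print(term_groups)
print(doc_groups)
```

Grouping by dominant topic is only the simplest option; the same term and
document vectors could equally be fed to k-means.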

On Tue, Sep 29, 2009 at 2:14 PM, Jake Mannix <[email protected]> wrote:

> Clustering documents by term (a la LDA or SVD) also leads to a nice
> clustering of terms by just looking at "the transpose", right?  This is
> literally the case for SVD: if M = U S V' is your SVD, where M is
> represented as a row matrix and U and V are column matrices (document by
> reduced-dimension and term by reduced dimension, respectively), then
> typically you just keep V and S around.  In this case the transpose of V
> has, as row vectors, the projection of each term onto the reduced
> dimensional space, and doing clustering on that set of reduced vectors
> performs "concept-aware" term clustering (and if you just want the system to
> run as a search engine [find me the top terms "close" to a given term], you
> just sort by descending dot-product on the rows of V).
>
> For our LDA implementation, I'm not sure, but given the set of all topics,
> each topic has a probability of producing each term, so the transpose of
> this matrix gives, for any given term, the probability of its being
> produced by each of the topics.  I'm not sure if our current implementation
> has methods you can easily use to get access to this information and
> thereby cluster the terms, however.
>
> On Tue, Sep 29, 2009 at 1:05 PM, Grant Ingersoll <[email protected]> wrote:
>
> > The LDA implementation kind of clusters on terms to generate topics.  It
> > sounds like you want some co-occurrence analysis.  I'm not sure that the
> > clustering algorithms are best for that, but perhaps others have insight.
> > I could imagine doing this with HBase or Pig and just keeping a matrix
> > where each cell keeps track of the number of times both terms appear in a
> > document (or even within some window in a document).
> >
> >
> >
> > On Sep 29, 2009, at 8:57 AM, Ole-Martin Mørk wrote:
> >
> >> Hi.
> >> I have been using org.apache.mahout.utils.vectors.lucene.Driver
> >> and org.apache.mahout.clustering.kmeans.KMeansDriver to cluster documents
> >> in our Lucene index and it works great! I am wondering though, is it
> >> possible to use Mahout to cluster terms?
> >>
> >> I want to cluster terms that often appear in the same documents.
> >>
> >> Thank you.
> >>
> >> --
> >> Ole-Martin Mørk
> >> http://twitter.com/olemartin
> >> http://flickr.com/olemartin
> >>
> >
> > --------------------------
> > Grant Ingersoll
> > http://www.lucidimagination.com/
> >
> > Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> > Solr/Lucene:
> > http://www.lucidimagination.com/search
> >
> >
>



-- 
Ted Dunning, CTO
DeepDyve
